Background

Costaceae Nakai, commonly known as the spiral ginger family, comprises more than 120 species that are primarily native to the tropical regions of South America, Africa, and Southeast Asia [1,2,3,4,5,6]. It is one of the most easily recognizable families within the order Zingiberales, distinguished by its well-developed and sometimes branched aerial shoots with a characteristic spiral phyllotaxy and by a petaloid labellum formed by the fusion of five sterile staminodes [1,2,3,4,5,6]. Some species of Costaceae can be used as garden ornamental plants and cut flowers [4,

Results

General characteristics of thirteen chloroplast genomes

In this study, a total of 13 complete chloroplast genomes of 10 species covering three clades in Costaceae were analyzed, including 8 newly sequenced genomes and 5 published ones (Table 1). The 8 sequenced samples produced 5.97 to 12.47 Gb clean reads each after removal of adapters and low-quality reads (Table S1). The 8 complete chloroplast genomes of Costaceae generated in this study were deposited in the GenBank with accession numbers OP712648 to OP712655 (Table 1). All 13 chloroplast genomes exhibited a typical quadripartite structure containing a pair of inverted repeat (IR) regions (27,982 − 29,203 bp), an LSC region (90,802 − 92,189 bp) and an SSC region (18,363 − 20,124 bp) (Fig. 1; Table 1). The full-length variation of Costaceae was about 2.6 kb (genome size: 166,360 − 168,966 bp). The overall guanine-cytosine (GC) content varied slightly, from 36.16 to 36.55% (Table 1). The IR regions accounted for the highest GC content, followed by the LSC region, while the SSC region had the lowest GC content (Table 1). The GC content of the protein-coding gene sequences ranged from 37.57 to 37.76% (Table 1).
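As a minimal sketch of how the GC percentages in Table 1 and the GC skew track in Fig. 1 can be derived from a sequence, the following Python functions count bases directly. The function names and the toy sequence are illustrative assumptions and not part of the study's pipeline; the skew formula is the one stated in the Methods, (G − C)/(G + C).

```python
def gc_content(seq: str) -> float:
    """Return GC content as a percentage of unambiguous A/C/G/T bases."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    total = g + c + seq.count("A") + seq.count("T")
    return 100.0 * (g + c) / total if total else 0.0


def gc_skew(seq: str) -> float:
    """Return GC skew, (G - C) / (G + C); a positive value means G > C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


# Toy sequence (hypothetical, not taken from any Costaceae genome).
toy = "ATGGGCATTA"
print(f"GC content: {gc_content(toy):.2f}%  GC skew: {gc_skew(toy):+.2f}")
```

Applied separately to the LSC, SSC and IR segments of a genome, the same function would give region-level values comparable to those in Table 1.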

Table 1 Basic characteristics of thirteen complete chloroplast genomes of the Costaceae family

Herein, 134 − 135 genes were annotated in these 13 genomes of Costaceae, consisting of 88 protein-coding genes, 8 ribosomal RNA genes (rRNAs) and 38 − 39 transfer RNA genes (tRNAs) (Table 1, Table S2). After annotation and manual checking, each chloroplast genome contained 111 − 113 different genes, comprising 79 different protein-coding genes, 28 − 30 different tRNAs and 4 different rRNAs (Fig. 1; Tables 1 and 2, Table S2). Among all 13 genomes, the numbers of different protein-coding genes and different rRNAs were the same, but slight differences were found in tRNAs (Table 2, Table S2).

Table 2 Gene contents in thirteen complete chloroplast genomes of the Costaceae family

Among these 111 − 113 different genes, 21 genes were duplicated within the IR regions, including 9 protein-coding genes, 8 tRNAs, and 4 rRNAs (Fig. 1; Table 2, Table S2). In 12 of the chloroplast genomes, sixteen genes contained one intron, while clpP and ycf3 each contained two introns (Table 2, Table S2). The exception was the genome of C. beckii, which contained only 17 intron-containing genes because trnG-UCC has lost its intron (Table 2, Table S2).

Fig. 1

Chloroplast genome map of C. barbatus (GenBank accession number: OP712648; the outermost three rings) and CGView comparison of thirteen complete chloroplast genomes in the Costaceae family (the inner rings with different colors). Genes shown on the outside of the outermost first ring are transcribed counter-clockwise and those on the inside clockwise. The outermost second ring, in darker gray, corresponds to the GC content, whereas the outermost third ring, in lighter gray, corresponds to the AT content of the C. barbatus chloroplast genome as drawn by OGDRAW. The gray arrowheads indicate the direction of the genes. LSC, large single copy region; IR, inverted repeat; SSC, small single copy region. The innermost first black ring indicates the chloroplast genome size of C. barbatus. The innermost second and third rings indicate GC content and GC skew deviations in the chloroplast genome of C. barbatus, respectively: GC skew + indicates G > C, and GC skew − indicates G < C. The CGView comparison of the thirteen complete chloroplast genomes in Costaceae is displayed from the innermost fourth color ring outwards to the 16th ring, in order: C. barbatus OP712648, C. beckii OP712653, C. dubius OP712651, C. speciosus Guangdong OP712649, C. speciosus var. marginatus OP712652, C. tonkinensis Yunnan OP712650, C. viridis MK262733, C. woodsonii OP712654, H. speciosa Guizhou OK641589, M. uniflorus OP712655, H. lacera ON598391, H. speciosa Yunnan ON598392, and C. tonkinensis ON598393; similar and highly divergent locations in the chloroplast genomes are represented by continuous and interrupted track lines, respectively. The species in bold were sequenced in this study

Long repeats and SSRs analyses

Four types of long repeats, including forward, complement, reverse and palindromic repeats, were detected in the 13 complete chloroplast genomes of Costaceae. Among these 13 genomes, H. lacera ON598391 contained the highest number of long repeats (254), and C. tonkinensis Yunnan OP712650 contained the lowest number (119) (Fig. 2A, Table S3). The number of forward repeats varied from 46 (C. tonkinensis Yunnan OP712650) to 108 (C. viridis MK262733), the number of palindromic repeats varied from 32 (C. tonkinensis Yunnan OP712650) to 69 (H. lacera ON598391), the number of reverse repeats varied from 23 (C. tonkinensis ON598393) to 70 (C. woodsonii OP712654), and the number of complement repeats varied from 4 (C. tonkinensis ON598393) to 27 (H. lacera ON598391) (Fig. 2A, Table S3). The lengths of the long repeats varied among the 13 genomes, and most fell within the range of 30 − 34 bp (Fig. 2B, Table S3). Long repeats with lengths of 35 − 39 bp and 40 − 44 bp were the second and third most common, respectively (Fig. 2B, Table S3).

Fig. 2

Analysis of long repeats in thirteen complete chloroplast genomes of the Costaceae family. (A), Total numbers and different types of long repeats in each chloroplast genome. (B), Numbers of long repeats more than 30 bp long in each chloroplast genome. * indicates chloroplast genome of the species sequenced in this study

Simple sequence repeats (SSRs) in these 13 complete chloroplast genomes of Costaceae were also detected (Fig. 3, Table S4). The number of SSRs detected among these 13 genomes ranged from 81 (C. tonkinensis ON598393) to 107 (C. viridis MK262733) (Fig. 3A, Table S4). Among these SSRs, only 2 chloroplast genomes (C. tonkinensis Yunnan OP712650 and C. tonkinensis ON598393) had no hexanucleotide repeats (Fig. 3A, Table S4). A/T (39.40%) were the most frequently observed repeats, followed by AT/AT (27.34%), AAAT/ATTT (9.87%) and AAT/ATT (7.77%), respectively (Fig. 3B, Table S4). Among the SSRs in these 13 genomes, each genome contained 55 to 75 SSRs in the LSC regions, 16 to 26 SSRs in the SSC regions, and 3 to 5 SSRs in the IRa and IRb regions, respectively (Fig. 3C, Table S4). Similarly, SSRs were analyzed in the protein-coding regions, intron regions and intergenic regions of these 13 genomes, indicating that each genome comprised 38 to 48 SSRs in intergenic regions, 12 to 14 SSRs in protein-coding regions, and 6 to 14 SSRs in introns (Fig. 3D, Table S4). Six genes, namely, ndhD, rpoB, rpoC2, rps14, ycf1 and ycf2 contained SSRs and their products longer than 150 bp in these 13 genomes, which can be used as potential DNA molecular markers for species identification in Costaceae (Table S4).

Fig. 3

Analysis of SSRs in thirteen complete chloroplast genomes of the Costaceae family. (A), Total numbers and different types of SSRs detected in each chloroplast genome. (B), Frequencies of the identified SSRs in different motifs. (C), Frequencies of the identified SSRs in the LSC, SSC and IR regions. (D), SSR distribution in protein-coding regions, introns and intergenic regions detected in each chloroplast genome. * indicates chloroplast genome of the species sequenced in this study

Codon usage analysis

The amino acid frequency, codon usage and relative synonymous codon usage (RSCU) were analyzed based on all 79 different protein-coding genes (Table S5). The total number of codons (excluding stop codons) in these 13 complete chloroplast genomes of Costaceae ranged from 26,531 to 27,373. Among these codons, those encoding leucine (Leu) were the most abundant, followed by those encoding isoleucine (Ile), whereas those encoding cysteine (Cys) were the least abundant (Table S5). The codons ATG and TGG, encoding methionine (Met) and tryptophan (Trp), respectively, showed no codon bias, both having RSCU values of 1.00 in these 13 genomes (Fig. 4, Table S5). The five codons with the lowest RSCU values (AGC, GAC, GGC, CTG and CGC) and the three with the highest RSCU values (AGA, GCT, and TTA) were the same across these 13 genomes (Fig. 4, Table S5). Twenty-nine codons showed codon usage bias with RSCU > 1.00 in these 13 genomes (Table S5). Interestingly, of these 29 codons, twenty-eight were A/T-ending codons. A higher usage frequency of A/T-ending than G/C-ending codons was also found in Aglaonema modestum [29], Phaseolus lunatus [32], and Zingiber montanum [33].

Fig. 4

Heat map analysis for relative synonymous codon usage (RSCU) values of all protein-coding genes of thirteen complete chloroplast genomes in the Costaceae family. Red indicates higher RSCU values and blue indicates lower RSCU values. The species in bold are sequenced in this study

IR expansion and contraction

Detailed comparisons of the LSC/IRs/SSC boundaries were made among the 13 complete chloroplast genomes of Costaceae (Fig. 5). Although the IR/LSC boundaries of these 13 genomes were highly conserved, variations were found in the IR/SSC boundaries. For IRa/LSC boundaries, the rpl22 and psbA genes were located at the boundaries in these 13 genomes. The distances between the ends of rpl22 and the IRa/LSC boundaries ranged from 290 to 362 bp, and the distances between the starts of psbA and the IRa/LSC boundaries ranged from 154 to 289 bp (Fig. 5). Among these 13 genomes, the rps3 and rpl22 genes were found at the boundaries of the LSC/IRb regions (Fig. 5). rps3 expanded into the IRb regions in these 13 genomes, with lengths ranging from 219 to 291 bp from the LSC/IRb boundaries, whereas the distances between the starts of rpl22 and the LSC/IRb boundaries ranged from 291 to 363 bp (Fig. 5).

For SSC/IRa boundaries, ycf1 was located at the boundaries in these 13 genomes and crossed into the IRa regions with lengths varying from 1239 to 2445 bp (Fig. 5). Regarding the IRb/SSC boundaries, the ycf1 and ndhF genes were located at the boundaries in these 13 genomes (Fig. 5). ycf1 expanded into the SSC regions by 3 to 87 bp in 10 genomes (Fig. 5). In contrast, the end of the ycf1 gene was located exactly at the IRb/SSC boundary in 2 genomes (H. lacera and H. speciosa Yunnan) (Fig. 5). In the remaining genome (C. tonkinensis ON598393), the distance between the end of ycf1 and the IRb/SSC boundary was 1 bp (Fig. 5). In 11 genomes, the distances between the starts of ndhF and the IRb/SSC boundaries ranged from 6 to 71 bp (Fig. 5). However, in the other 2 genomes (C. tonkinensis Yunnan OP712650 and C. tonkinensis ON598393), ndhF expanded into the IRb regions by 14 and 16 bp, respectively (Fig. 5).

Fig. 5

Comparisons of border distances between adjacent genes and junctions of the LSC, SSC and two IR regions among thirteen complete chloroplast genomes of the Costaceae family. Numbers above or near the colored genes indicate the distances between the genes and the boundary sites. The figure is not to scale for sequence length, and only shows relative changes at or near the IR/SC boundaries. The species in bold were sequenced in this study

Sequence divergence analysis and nucleotide diversity

Using the whole chloroplast genome of C. barbatus as the reference, a comparative analysis based on the mVISTA program was performed on the 13 complete chloroplast genomes of Costaceae (Fig. 6). The results indicated that the LSC and SSC regions were more divergent than the two IR regions (Fig. 6). Most protein-coding genes were highly conserved, except for rpl16, rpoC1, ccsA, ndhF, psaJ, rps3, rps15 and ycf1 (Fig. 6). The highly divergent regions among these 13 genomes were mainly located in the intergenic regions, including trnS-trnG, atpH-atpI, accD-psaI and rpl16-exon2-rpl16-exon1 in the LSC region as well as ndhF-rpl32, rpl32-trnL, ccsA-ndhD, psaC-ndhE and rps15-ycf1 in the SSC region (Fig. 6). The CGView result also revealed that the IR regions were less divergent than the LSC and SSC regions (innermost 4th color ring outwards to the 16th ring in Fig. 1). In comparison to the chloroplast genome of C. barbatus (innermost 4th color ring in Fig. 1), the other 12 genomes showed four divergent regions in the LSC (psbI-trnS, trnS-trnG, trnT-trnE, and rps3), one region in the SSC (ccsA-ndhD) and one region in IRa (rpl22-rps19).

Fig. 6

Visualized alignment of thirteen complete chloroplast genome sequences of the Costaceae family using mVISTA. The C. barbatus chloroplast genome sequence was used as the reference. Gray arrows and thick black lines indicate gene orientation. Purple bars represent exons, sky-blue bars represent untranslated regions (UTRs), red bars represent conserved non-coding sequences (CNS), gray bars represent mRNA and white regions represent sequence differences among all analyzed chloroplast genomes. The horizontal axis indicates the coordinates within the chloroplast genome. The vertical scale represents the identity percentage, ranging from 50 to 100%. The species in bold were sequenced in this study

Nucleotide diversity (Pi) and single nucleotide substitutions in the LSC, SSC, IRa, IRb and the complete chloroplast genomes were analyzed (Table 3). The thirteen complete chloroplast genomes of Costaceae were aligned into a matrix of 168,717 bp with 3,161 variable sites (1.87%) and 3,070 parsimony informative sites (1.82%). The Pi value of the complete chloroplast genome was 0.006 (Table 3). The SSC region had the highest Pi value (0.015) and the IRb region had the lowest Pi value (0.001) (Table 3). Additionally, Pi values were measured by DnaSP v. 6.12.03 to identify highly variable regions in these 13 genomes (Fig. 7, Table S6). Of the protein-coding regions, the Pi value for each gene ranged from 0 to 0.0598, and the average value was 0.0026. The rpl16-exon1 had the highest Pi value (0.0598), followed by the other nine gene regions of rpl36, trnK-exon2, ycf1-D2, rps15, ndhF, psaJ, rps3, rpoC1-exon1 and ccsA (Pi > 0.007) (Fig. 7A, Table S6). For the intergenic regions, the Pi values ranged from 0 to 0.0708 (psaC-ndhE) and had an average of 0.0081. The average Pi value of the intergenic regions was thus 3.11 times that of the protein-coding regions. Nine of these intergenic regions also showed remarkably high values (Pi > 0.025), including psaC-ndhE, ccsA-ndhD, rps15-ycf1-D2, atpH-atpI, accD-psaI, trnS-trnG-exon1, rpl32-trnL, rpl16-exon2-rpl16-exon1 and psbI-trnS (Fig. 7B). Four universal chloroplast DNA markers, namely, the trnL-F locus (trnL-exon2-trnF), trnL intron (trnL-exon1-trnL-exon2), trnK locus (matK-trnK-exon1) and trnK-rps16 intergenic spacer (trnK-exon1-rps16-exon2), were also tested for their variability. These four chloroplast DNA markers had Pi values of 0.0096, 0.0069, 0.0070 and 0.0079, respectively (Table S6). The Pi values of these four DNA markers were much lower than those of the newly identified highly variable intergenic regions.

Fig. 7

Comparisons of nucleotide diversity (Pi) values among thirteen complete chloroplast genomes of the Costaceae family. (A), Protein-coding genes. Protein-coding genes with Pi values > 0.007 are labeled with gene names. (B), Intergenic regions. Intergenic regions with Pi values > 0.025 are labeled with intergenic region names

Table 3 Variable site analyses of thirteen complete chloroplast genomes of the Costaceae family

By using region length > 250 bp and integrating the results of Pi, CGView and mVISTA, 18 regions, including 14 divergent regions and 4 universal chloroplast DNA markers, were extracted and constructed using the maximum likelihood (ML) trees to differentiate these 13 species/accessions of Costaceae (Additional file 7, Fig. S1). The basic topological structures of the ML trees, which were consistent with topological structures constructed by chloroplast genome data (Fig. 8), were selected for resolution power analysis. The resolution power depended on the number of discrimination successes in the ML trees. If the bootstrap value of the node between two species/accessions was more than 50, species/accessions in the ML tree were counted. Otherwise, species/accessions in the ML tree were not counted. The ML trees constructed by five divergent regions (ndhF, ycf1-D2, ccsA-ndhD, rps15-ycf1-D2 and rpl16-exon2-rpl16-exon1), and four universal chloroplast DNA markers (Fig. S1), were consistent with topological structures constructed by chloroplast genome data (Fig. 8). The four universal chloroplast DNA markers had resolution powers of trnL-exon1-trnL-exon2 at 46%, trnK-exon1-rps16-exon2 at 31%, matK-trnK-exon1 at 15% and trnL-exon2-trnF at 0, respectively (Fig. S1a, b, c, d). Comparative analysis of these five potential new markers revealed that ycf1-D2 had the highest resolution power of 69%, followed by ndhF at 46%, rpl16-exon2-rpl16-exon1 at 38%, ccsA-ndhD at 31%, and rps15-ycf1-D2 at 31% (Fig. S1f, i, l, m, r). Single candidate new marker with differentiation success of 100% was not found. These five regions (ndhF, ycf1-D2, ccsA-ndhD, rps15-ycf1-D2 and rpl16-exon2-rpl16-exon1) were combined as new potential markers. 
These five combined potential markers (ycf1-D2 + ndhF, ccsA-ndhD + rps15-ycf1-D2, ccsA-ndhD + rpl16-exon2-rpl16-exon1, rps15-ycf1-D2 + rpl16-exon2-rpl16-exon1, and ccsA-ndhD + rps15-ycf1-D2 + rpl16-exon2-rpl16-exon1) showed differentiation success ≥ 69%. In particular, the ML tree constructed from ccsA-ndhD + rps15-ycf1-D2 was well supported (bootstrap values > 65%, and resolution power at 92%), so this combination could be used as a candidate molecular marker in Costaceae (Fig. S1s, t, u, v, w).
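The resolution powers reported above are consistent with a simple proportion: the number of accessions counted as successfully discriminated divided by the 13 accessions analyzed (for example, 9/13 ≈ 69% and 12/13 ≈ 92%). The sketch below illustrates that arithmetic. The function name and the reading of the denominator as 13 are assumptions inferred from the reported percentages, not a description of the authors' own script.

```python
def resolution_power(n_discriminated: int, n_total: int = 13) -> float:
    """Percentage of accessions resolved in an ML tree (bootstrap > 50)."""
    if not 0 <= n_discriminated <= n_total:
        raise ValueError("n_discriminated must lie between 0 and n_total")
    return 100.0 * n_discriminated / n_total


# Hypothetical counts of resolved accessions mapped to rounded percentages.
for k in (0, 2, 4, 5, 6, 9, 12):
    print(k, round(resolution_power(k)))
```

Under this reading, every percentage quoted for the single and combined markers (0, 15, 31, 38, 46, 69 and 92%) corresponds to a whole number of accessions out of 13.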

Selective pressure analysis

The ratio (ω) of non-synonymous (dN) to synonymous (dS) substitution (dN/dS) for all 79 shared protein-coding genes was analyzed across the 13 complete chloroplast genomes in Costaceae. According to the M8 (β & ω > 1) model, a total of 8 protein-coding genes were under positive selection with posterior probability greater than 0.95 using the Bayes empirical Bayes (BEB) method (Table 4). Among these genes, ndhA harboured the highest number of positively selected amino acid sites (6), followed by rps12 (3), ycf1 (3), clpP (2), petB (2), psbD (2), cemA (1) and ndhF (1) (Table 4). However, the M2a model analysis revealed only 14 positively selected amino acid sites using the BEB method (Table 4). These results suggested that the M8 model was significantly better than the M2a model at identifying amino acid sites under positive selection.

Table 4 Positively selected sites detected in thirteen complete chloroplast genomes of the Costaceae family

Phylogenetic relationships

Two phylogenetic trees were constructed using chloroplast genome sequences by the ML and Bayesian inference (BI) methods, respectively (Fig. 8A and B). The species of Zingiberaceae were used as outgroups. Both ML and BI trees displayed similar topological structures (Fig. 8A and B). The analyzed Costaceae species were divided into three clades with strong support: a South American clade, an Asian clade and a Costus clade (bootstrap values = 99–100% for the ML tree and posterior probabilities = 1 for the BI tree nodes) (Fig. 8A and B).

In both trees, there were three subclades in the Asian clade with strong support (bootstrap values = 100%; posterior probabilities = 1), namely, Hellenia, Tapeinochilos and Parahellenia, which had nested relationships (Fig. 8A and B). Within Hellenia, H. speciosa Guizhou OK641589, C. speciosus Guangdong OP712649, H. speciosa OL688995, H. speciosa Yunnan ON598392 and C. speciosus var. marginatus OP712652 were clustered one by one, forming a cluster with moderate to strong support (bootstrap values = 83 − 100%; posterior probabilities = 0.84 − 1); H. lacera ON598391 and H. delinana OL689000 were clustered together, forming another cluster with strong support (bootstrap value = 100%; posterior probability = 1); then the two clusters, H. viridis OL688999 and H. oblonga OL688997 were clustered step by step (Fig. 8A and B). Within Parahellenia, three accessions of P. tonkinensis (OL688992, OL688993 and OL688994), P. malipoensis OL688996 and C. tonkinensis ON598393 were clustered together, forming a cluster with strong support (bootstrap values = 97 − 100%; posterior probabilities = 1); C. tonkinensis Yunnan OP712650 and P. yunanensis OL688998 were clustered together, forming another cluster with strong support (bootstrap value = 100%; posterior probability = 1.0); then the two clusters were clustered together with strong support (bootstrap value = 100%; posterior probability = 1.0) (Fig. 8A and B). In the Costus clade, C. pictus MH603409, C. barbatus OP712648, C. beckii OP712653 and C. viridis MK262733 were clustered together, forming a cluster with strong support (bootstrap values = 93 − 95%; posterior probabilities = 1); C. woodsonii OP712654, C. dubius OP712651 and C. dubius MH603406 were also clustered together, forming another cluster with strong support (bootstrap values = 97 − 100%; posterior probability = 1); then the two clusters, C. pulverulentus KF601573, C. osae MH603408 and C. gabonensis MH603407 were clustered one by one (Fig. 8A and B). 
In the South American clade, M. uniflorus OP712655 and M. uniflorus KF601572 were first clustered together with strong support (bootstrap value = 100%; posterior probability = 1), then clustered with Dimerocostus strobilaceus MH603413 with strong support (bootstrap value = 100%; posterior probability = 1), and finally clustered with Chamaecostus acaulis MH603404 with strong support (bootstrap value = 100%; posterior probability = 1) (Fig. 8A and B).

Fig. 8

Phylogenetic relationships of Costaceae species based on chloroplast genome sequences reconstructed using the maximum likelihood (ML) and Bayesian inference (BI) methods. (A), ML tree. (B), BI tree. The species in bold were sequenced in this study

Divergence time estimation

Divergence time estimation suggested that the common ancestor of Costaceae first split from Zingiberaceae at about 67.1 Mya (95% HPD: 63.3 − 73.2 Mya), and then split from the Musella-Ensete clade at approximately 56.5 Mya (95% HPD: 48.5 − 69.0 Mya) (Fig. 9). The crown node age of Costaceae was about 30.5 Mya (95% HPD: 14.9 − 49.3 Mya) (Fig. 9). The crown node age of the Costus clade and Asian clade was 23.8 Mya (95% HPD: 10.1 − 41.5 Mya). Diversification of the Costus clade and Asian clade occurred at 4.4 Mya (95% HPD: 1.5 − 10.8 Mya) and 10.7 Mya (95% HPD: 3.5 − 25.1 Mya), respectively. Within the Asian clade, diversification of Parahellenia and Hellenia took place at 3.9 Mya (95% HPD: 1.5 − 8.2 Mya) and 3.3 Mya (95% HPD: 1.5 − 6.2 Mya), respectively (Fig. 9).

Fig. 9

Divergence time estimation of Costaceae species based on nucleotide sequences of 75 single-copy protein-coding genes shared in 22 chloroplast genomes of Costaceae. The fossil and calibration taxa are indicated with red points on the corresponding nodes. Mean divergence times are shown in blue at the nodes. The numbers inside each blue bracket after the mean divergence time represent the 95% highest posterior density (HPD) of the estimated divergence time, with minimum and maximum values, respectively. The species in bold were sequenced in this study

Discussion

Chloroplast genome structure and sequence variation

In this study, 13 complete chloroplast genomes of Costaceae were comparatively analyzed. These 13 genomes revealed a typical quadripartite structure, with a single LSC region, a single SSC region and two IR regions (Fig. 1). They shared similar GC content, protein-coding genes, rRNAs and most of the tRNAs, which had also been found in other flowering plants [24,25,26, 28,29,

Methods

Plant materials and DNA extraction

Due to sample collection challenges, samples of eight Costaceae species or accessions, representing one Monocostus species from the South American clade (M. uniflorus), four Costus species (C. barbatus, C. beckii, C. dubius, and C. woodsonii) from the Costus clade, and three accessions (C. speciosus Guangdong, C. speciosus var. marginatus, and C. tonkinensis Yunnan) from the Asian clade (Fig. S2), were obtained from the resource garden of the Environmental Horticulture Research Institute (23°23′N, 113°26′E) at the Guangdong Academy of Agricultural Sciences, Guangzhou, China. Formal species identifications were made using the Flora of China [1], The Zingiberaceous resources in China [8], and Botanical paintings of Chinese Zingiberales [52], and were also checked against photos (available at https://www.gingersrus.com/Costus.php). Young and healthy leaves of seedlings were collected, quickly frozen in liquid nitrogen and stored at −80 °C until use. Total genomic DNA was extracted from young leaves using the sucrose gradient centrifugation method with minor modifications [53]. DNA integrity and quality were assessed with a NanoDrop 2000 microspectrometer (Wilmington, DE, USA) and checked by 1% (w/v) agarose gel electrophoresis. The other five published complete chloroplast genomes of Costaceae were downloaded from NCBI for the following comparative analyses.

Illumina sequencing, assembly and annotation

Each high-quality DNA sample was sheared into fragments of about 350 bp to construct a library according to the manufacturer’s instructions (New England Biolabs, Ipswich, MA, USA). Sequencing was carried out on an Illumina NovaSeq 6000 platform with 150 bp paired-end reads (Biozeron, Shanghai, China). The raw data were checked using FastQC v. 0.11.9 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and filtered by Trimmomatic v. 0.39 [54] with default parameters. Next, filtered reads were de novo assembled using GetOrganelle v. 1.7.6.1 [55] with default settings. Geneious Prime 2022 (Biomatters Ltd., Auckland, New Zealand) [56] was used to align the contigs, and the start and stop codons were manually edited against a reference chloroplast genome of C. viridis (GenBank accession number MK262733). Then, each assembled chloroplast genome was annotated in both GeSeq [57] and the online Dual Organellar Genome Annotator (DOGMA) [58] with default parameters. Additionally, tRNAscan-SE v. 2.0.5 [59] and BLAST v. 2.13.0 [60] were used to confirm the tRNA and rRNA genes. The annotation results were also validated by comparing them with NCBI’s non-redundant (Nr) protein database, Gene Ontology (GO), the Clusters of Orthologous Groups (COG) for eukaryotic complete genomes database, the Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) (http://www.genome.jp/kegg/kaas/) [61] and the SWISS-PROT database. The physical maps of the complete chloroplast genomes were drawn using Organellar Genome Draw (OGDRAW) v. 1.3.1 [62]. The eight newly annotated complete chloroplast genome sequences were first validated using the online tool GB2sequin [63]. Then, the annotation results were further validated and formatted using Sequin v. 15.50 from NCBI, and submitted to GenBank (see Table 1 for accession numbers).

Sequence analysis and statistics

Codon usage was analyzed using MEGA v. 7.0 [64], and the relative synonymous codon usage (RSCU) and amino acid frequencies were calculated with default parameters. An RSCU value larger than 1 indicates that the codon is used more often than expected, while a value less than 1 indicates that it is used less often than expected [65, 66]. The clustered heat map of RSCU values of the 13 complete Costaceae chloroplast genomes was generated in R v. 4.0.2 [67].
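To make the RSCU definition concrete, the following is a minimal Python sketch that divides each codon's observed count by the count expected if all synonymous codons for that amino acid were used equally. It is an illustration only and not the MEGA implementation used in this study; the function name `rscu` and the toy coding sequence are assumptions, and the standard genetic code is used here on the understanding that the plastid code (NCBI table 11) shares the same codon assignments.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks stops.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict:
    """Return {codon: RSCU} for an in-frame coding sequence (stops excluded)."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(family)  # equal usage of synonymous codons
        for c in family:
            values[c] = counts[c] / expected
    return values


# Toy CDS (hypothetical): Met, three TTA-Leu, one CTG-Leu, Trp.
toy_cds = "ATG" + "TTA" * 3 + "CTG" + "TGG"
for codon, value in sorted(rscu(toy_cds).items()):
    if value > 0:
        print(codon, round(value, 2))
```

Because Met and Trp are each encoded by a single codon, their RSCU is always 1.00 under this definition, which matches the values reported for ATG and TGG in the Results.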

The long repeat sequences, which included forward, palindromic, reverse and complement repeats, were detected using REPuter [68] with a minimal repeat size of 30 bp, a repeat identity of more than 90%, and a Hamming distance of 3. In this study, because the original sequencing data for the five published chloroplast genomes of Costaceae were difficult to obtain, the possible effects of different assembly methods on SSR detection were not considered. SSRs in the chloroplast genomes were detected via MISA-web [69] by setting the minimum number of repeats to 10, 5, 4, 3, 3 and 3 for mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide repeats, respectively.
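The SSR thresholds above can be illustrated with a short regular-expression scan. This is a simplified sketch, not a reimplementation of MISA-web: it reports only perfect repeats, does not merge compound SSRs, and the names `MIN_REPEATS`, `is_primitive` and `find_ssrs` as well as the toy sequence are hypothetical.

```python
import re

# Minimum repeat counts per motif length, as set for MISA-web in this study.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )


def find_ssrs(seq: str) -> list:
    """Return (motif, repeat_count, 0-based start) for each perfect SSR."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A unit followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((motif, len(m.group(0)) // unit_len, m.start()))
    return sorted(hits, key=lambda h: h[2])


# Toy sequence (hypothetical): a 12-bp poly-A run and six AT repeats.
toy = "CCG" + "A" * 12 + "GTC" + "AT" * 6 + "GGC"
print(find_ssrs(toy))
```

The primitive-motif filter prevents a homopolymer run from being counted a second time as a dinucleotide or trinucleotide repeat, which would otherwise inflate the per-class SSR counts.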

Genome comparison and sequence divergence analyses

The contraction and expansion of the IR regions were assessed by comparing the SC/IR borders and their adjacent genes among the 13 complete Costaceae chloroplast genomes using IRscope [70]. The mVISTA program in the Shuffle-LAGAN mode [71] was employed to compare sequence divergence among the 13 complete chloroplast genomes, with the annotated chloroplast genome of C. barbatus as the reference. Additionally, the chloroplast genome of C. barbatus was compared to the other 12 whole chloroplast genomes of Costaceae using CGView Server [72]. GC distributions were measured based on GC skew using the equation: GC skew = (G − C)/(G + C). To analyze the sequence divergence of complete chloroplast genomes in Costaceae, the protein-coding and intergenic regions among these 13 complete chloroplast genomes were extracted and aligned using MAFFT v. 7.458 [73] with default parameters. Then, nucleotide variability (Pi) values were analyzed using DnaSP v. 6.12.03 [74]. The step size was set to 200 bp, and the window length was set to 600 bp. The protein-coding regions with Pi > 0.007 and the intergenic regions with Pi > 0.025, each with a region length > 250 bp, together with 4 universal chloroplast DNA markers (trnL-exon1-trnL-exon2, trnK-exon1-rps16-exon2, matK-trnK-exon1 and trnL-exon2-trnF), were extracted and then analyzed individually to differentiate these Costaceae species (Additional file 7). The maximum likelihood (ML) trees were calculated using the Tamura-Nei nucleotide substitution model in MEGA v. 7.0 [64] with 1000 bootstrap replicates. Additionally, variable and parsimony informative sites of the LSC, SSC, IRa, IRb, and complete chloroplast genomes of these 13 genomes were also calculated using C. barbatus as the reference.
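The sliding-window Pi calculation described above can be sketched in a few lines of Python. This is an illustrative approximation rather than the DnaSP algorithm: it averages per-site pairwise differences over the sites in each window, skips any column containing a gap or ambiguous base, and defaults to the 600 bp window and 200 bp step stated in the text. The function names and the toy alignment are assumptions.

```python
from collections import Counter


def site_pi(column: str):
    """Per-site nucleotide diversity, or None if the column has gaps/ambiguity."""
    if any(base not in "ACGT" for base in column):
        return None
    n = len(column)
    same = sum(c * c for c in Counter(column).values())
    # Fraction of sequence pairs that differ at this site.
    return (n * n - same) / (n * (n - 1))


def sliding_pi(alignment, window=600, step=200):
    """Return (window midpoint, Pi) pairs along an aligned set of sequences."""
    seqs = [s.upper() for s in alignment]
    length = len(seqs[0])
    if len(seqs) < 2 or any(len(s) != length for s in seqs):
        raise ValueError("need at least two sequences of equal aligned length")
    per_site = [site_pi("".join(s[i] for s in seqs)) for i in range(length)]
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        values = [v for v in per_site[start:start + window] if v is not None]
        pi = sum(values) / len(values) if values else 0.0
        results.append((start + min(window, length - start) // 2, pi))
    return results


# Toy alignment (hypothetical): three sequences, four aligned sites.
toy = ["AAAA", "AAAT", "AATT"]
print(sliding_pi(toy, window=2, step=2))
```

Regions whose windowed Pi exceeds a chosen threshold (0.007 for protein-coding and 0.025 for intergenic regions in this study) would then be the candidates for marker development.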

Positive selection analysis

Selective pressure was analyzed for the 79 protein-coding genes shared among the 13 complete chloroplast genomes of Costaceae. The nonsynonymous (dN) and synonymous (dS) substitution rates were calculated using the CodeML program implemented in EasyCodeML [75]. First, each protein-coding gene was extracted, its stop codon was removed, and the sequences were aligned separately using ClustalW in MEGA v. 7.0 [64], followed by manual adjustment of abnormal alignments. Next, based on the alignments, an ML tree was constructed using MEGA v. 7.0 to serve as the input tree. Six models were investigated to calculate the dN/dS ratios (ω) and to perform the likelihood ratio tests (LRTs): M0 (one-ratio), M1a (nearly neutral), M2a (positive selection), M3 (discrete), M7 (β) and M8 (β & ω > 1). The positive selection models (M2a and M8) were used to detect positively selected sites based on both the ω values and the LRT results [76]. The Bayes empirical Bayes (BEB) method [77] was then used to calculate posterior probabilities. In the BEB analysis, posterior probabilities higher than 0.95 and 0.99 indicated sites that were under positive selection and strong positive selection, respectively.
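The likelihood ratio tests mentioned above compare twice the log-likelihood difference between nested models against a chi-square distribution; for the M1a versus M2a and M7 versus M8 comparisons, two degrees of freedom are conventionally used. The sketch below shows that calculation with only the Python standard library, using the closed-form chi-square tail for even degrees of freedom. The log-likelihood values are hypothetical placeholders, not results from this study, and the function names are assumptions.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable for even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form only covers positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


def likelihood_ratio_test(lnl_null: float, lnl_alt: float, df: int = 2):
    """Return (2 * delta lnL, p-value) for a pair of nested codon models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(stat, df)


# Hypothetical log-likelihoods for one gene (NOT values from this study).
lnl_m7, lnl_m8 = -1234.56, -1228.10
stat, p = likelihood_ratio_test(lnl_m7, lnl_m8, df=2)
print(f"2*dlnL = {stat:.2f}, p = {p:.4f}")
```

A gene would be carried forward to the BEB site analysis only when the test favours the model that allows ω > 1.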

Phylogenetic analysis

To reconstruct and confirm the phylogenetic relationships of Hellenia and Parahellenia in Costaceae, a total of 31 chloroplast genome sequences of Costaceae were analyzed, which included 13 complete and 18 incomplete chloroplast genomes (Table S7). Of these 31 genomes, 8 complete chloroplast genomes were generated in the present study, and the other 23 chloroplast genome sequences were obtained from the GenBank database and individuals (Table S7, Additional file 9). Twelve chloroplast genomes of Zingiberaceae species in GenBank were added as outgroups (Table S7). The chloroplast genome sequences were aligned using MAFFT v. 7.458 [73] with default parameters and manually checked when necessary. The best nucleotide substitution model (general time reversible with gamma distribution and invariable sites, GTR + G + I) was determined using the Akaike Information Criterion (AIC) in jModelTest v. 2.1.10 [78]. Subsequently, the ML tree was constructed using PhyML v. 3.0 [79], and a bootstrap test was performed with 1000 replicates to calculate the bootstrap values for all branch nodes. Bayesian inference (BI) analysis was carried out using MrBayes v. 3.2.6 [80]. Two Markov chain Monte Carlo (MCMC) runs were performed with 200,000 generations and four Markov chains, starting from random trees, sampling trees every 100 generations, and discarding the first 10% of samples as burn-in. The phylogenetic trees were edited and visualized using iTOL v. 3.4.3 (http://itol.embl.de/itol.cgi).

Divergence time estimation

As some published chloroplast genomes of Costaceae were missing large fragments, we selected only complete or nearly complete chloroplast genomes for divergence time estimation (Table S8). Divergence time estimation was performed on the dataset of 75 single-copy protein-coding genes shared in 22 chloroplast genomes of Costaceae using MCMCtree in PAML v. 4.4 [81]. First, the best nucleotide substitution model (GTR) was selected using jModelTest v. 2.1.10 [78] under AIC, and an ML tree was constructed from the chloroplast genome sequences using PhyML v. 3.0 [79]. Second, two fossil records and one calibration point were obtained and used in the divergence time estimation. Zingiberopsis attenuata [82] was used to calibrate the crown age of the family Zingiberaceae with a mean age of 65 million years ago (Mya). Ensete oregonense [83] was applied to calibrate the crown age of Ensete and Musella with a mean age of 43 Mya. Each fossil calibration point was assumed to follow a normal distribution with a standard deviation of 2 and an offset of 2, resulting in 95% intervals of 63.1 − 70.9 and 41.1 − 48.9 Mya, respectively. In addition, one calibration point from TimeTree (http://www.timetree.org/) was used in this analysis, namely the split between Zingiber and Kaempferia with a mean age of 6.86 Mya (3.0 − 10.0 Mya). Third, the new ML tree constructed from the chloroplast genome sequences was used as the starting tree for the MCMC run. The MCMC run was set to 400,000 generations, sampling every 100 generations and discarding the first 10% of generations as burn-in. Divergence times were estimated with the parameters clock = 2 and model = 0, with 95% highest posterior density (HPD) intervals, and the resulting divergence times were then mapped onto the ML tree.
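The 95% intervals quoted for the two fossil calibrations can be reproduced by reading the prior as a normal distribution centred at the mean age plus the offset, with a standard deviation of 2. The following Python sketch checks that arithmetic; this reading of how the offset enters is an assumption that happens to match the stated intervals, and the function name is hypothetical.

```python
from statistics import NormalDist


def calibration_interval(mean_age: float, sd: float = 2.0, offset: float = 2.0,
                         level: float = 0.95):
    """Central interval of a normal prior centred at mean_age + offset."""
    prior = NormalDist(mu=mean_age + offset, sigma=sd)
    tail = (1.0 - level) / 2.0
    return prior.inv_cdf(tail), prior.inv_cdf(1.0 - tail)


# Mean fossil ages (Mya) as given in the text.
for label, age in (("Zingiberopsis", 65.0), ("Ensete", 43.0)):
    low, high = calibration_interval(age)
    print(f"{label}: {low:.1f} to {high:.1f} Mya")
```

With these settings the intervals round to 63.1 to 70.9 Mya and 41.1 to 48.9 Mya, in agreement with the values given above.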