Background

Trapaceae, which contains the single genus Trapa, comprises annual floating-leaved aquatic herbs that are naturally distributed in tropical, subtropical and temperate regions of Eurasia and Africa and are invasive in North America and Australia [1]. APG II (The Angiosperm Phylogeny Group) [2] merged Trapaceae into Lythraceae. However, several morphological differences exist between the two families. For example, the flowers of Trapaceae are solitary, 4-merous and actinomorphic, with half-inferior and slightly perigynous ovaries, whereas Lythraceae has racemes or cymes, and its flowers are usually 4-, 6- or 8-merous, regular or irregular, with obviously perigynous ovaries. Therefore, Trapaceae is still used today by some researchers [1]. Trapa is valued as a food plant because of the high starch content of its seeds, and it has been widely cultivated as an important aquatic crop in China and India [3]. Trapa seed pericarps have been used as a traditional herbal medicine in China, and recent studies found that pericarp extracts contain bioactive components that act against cancer, atherosclerosis, inflammation and oxidation [4,5,6,7,8]. Additionally, Trapa plants can be used to purify water bodies because they absorb heavy metals and nutrients efficiently [9, 5).

Table 5 GenBank accession numbers of the 15 species used in the phylogenetic analysis

Fresh leaves were sampled and immediately dried in silica gel. Genomic DNA was extracted from the dried leaves following the CTAB protocol [65]. DNA concentration and quality were measured with a NanoDrop 2000 microspectrophotometer (Thermo Fisher Scientific).

Chloroplast genome sequencing and assembly

High-quality DNA was used to build the genomic libraries. Sequencing was performed on an Illumina NovaSeq 6000 platform with paired-end 150 bp reads (average insert size about 350 bp) at Beijing Novogene bio Mdt InfoTech Ltd (Beijing, China). To obtain high-quality clean data, fastp [39] was run with default settings to trim and filter the raw reads. For the 13 Trapa species/taxa sequenced, 5.22 Gb (T. mammillifera) to 6.06 Gb (T. bispinosa) of clean data were generated after removing adapters and low-quality reads. De novo assembly was carried out using GetOrganelle v1.7 [66] with default settings. Geneious Prime (Biomatters Ltd., Auckland, New Zealand) was used to align the contigs and determine the order of the newly assembled plastomes, with T. quadrispinosa (MT941481) as the reference. All annotated cp genome sequences reported here were deposited in GenBank under the accession numbers shown in Table 5.

Annotation and codon usage

We used the genome annotators PGA [67] and GeSeq [68] to annotate PCGs, tRNAs and rRNAs, with T. quadrispinosa (MT941481) as the reference. The start and stop codons and the exon-intron boundaries were corrected manually. The tRNA and rRNA genes were further confirmed with tRNAscan-SE v1.21 and BLASTN searches [69]. The physical maps of the cp genomes were generated with OGDRAW [70].

Relative synonymous codon usage (RSCU), the ratio of the observed frequency of a particular codon to the expected frequency of that codon, was calculated with DAMBE v6.04 [37]. An RSCU value greater than 1 indicates that the codon is used more often than expected, whereas a value below 1 indicates that it is used less often than expected [71].
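The RSCU ratio described above can be illustrated with a minimal Python sketch. It assumes the usual convention that the expected frequency corresponds to equal usage of all synonymous codons of an amino acid, and it uses the standard genetic code; the function name `rscu` and the toy sequence are hypothetical, and this is not a reproduction of how DAMBE computes the values.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: standard code).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for one in-frame coding sequence (stop codons skipped)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group codons into synonymous families, one family per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for codon in family:
            # observed count divided by the count expected under equal usage
            values[codon] = counts[codon] * len(family) / total
    return values


# Toy example: alanine encoded three times by GCT and once by GCA.
toy = rscu("GCTGCTGCTGCA")
print(toy["GCT"], toy["GCA"], toy["GCC"])
```

In the toy sequence, GCT is used more often than expected (RSCU above 1), GCA exactly as expected, and GCC and GCG not at all.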

Comparative genomic analyses

Comparative genomic analyses were carried out among 15 Trapa species/taxa, comprising the 13 species/taxa newly sequenced here and two (T. kozhevnikovirum and T. incisa) previously published by the same research team [62, 63]. Notably, among the 15 Trapa species/taxa studied, T. incisa and T. maximowiczii have small nuts (width, 9–14 mm; height, 9–12 mm), while the other 13 species/taxa have large nuts (width, 16–35 mm; height, 13–23 mm). The published cp genomes were downloaded from the National Center for Biotechnology Information (NCBI) organelle genome database (https://www.ncbi.nlm.nih.gov).

The complete cp genomes of the 15 Trapa species/taxa were compared using the mVISTA program in Shuffle-LAGAN mode, with the annotated genome of T. quadrispinosa (MT941481) as the reference. After multiple alignment with manual adjustment using MUSCLE [72] in MEGA X [73], all regions, both coding and non-coding, were extracted to detect hyper-variable sites. The nucleotide variability (Pi) was computed using DnaSP 5.10 [74].
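To make the Pi statistic concrete, the following Python sketch computes nucleotide diversity for one aligned region as the average number of pairwise differences per site. It assumes that alignment columns containing gaps or ambiguous bases are excluded, which may differ from the exact options used in DnaSP; the function name and the toy alignment are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(aligned_seqs):
    """Average pairwise differences per site (Pi) for equal-length aligned sequences."""
    seqs = [s.upper() for s in aligned_seqs]
    if len(seqs) < 2:
        return 0.0
    # Keep only columns where every sequence has an unambiguous base.
    columns = [col for col in zip(*seqs) if all(b in "ACGT" for b in col)]
    if not columns:
        return 0.0
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    differences = sum(
        sum(a != b for a, b in combinations(col, 2)) for col in columns
    )
    return differences / (n_pairs * len(columns))


# Toy alignment of one extracted region in three taxa (hypothetical data).
region = ["ACGT", "ACGA", "ACGT"]
print(round(nucleotide_diversity(region), 4))
```

Applying the same function to every extracted coding and non-coding region, then ranking the regions by Pi, is one simple way to shortlist hyper-variable candidates.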

Analysis of repeat sequences and SSRs

Repeat sequences, including forward, palindromic, reverse and complement repeats, were detected with REPuter [75]. The parameters were set to a minimum repeat size of 30 bp and a sequence identity of at least 90% (Hamming distance of 3).
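The relation between the two thresholds is that 3 mismatches in a 30 bp repeat leave 27 of 30 positions identical, that is, 90% identity. The Python sketch below checks whether two equal-length segments qualify as one of the four repeat types under these settings. It only illustrates the criteria and is not REPuter's search algorithm; the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def transform(seq, kind):
    """Orient a segment so that it can be compared directly with its partner."""
    if kind == "forward":
        return seq
    if kind == "reverse":
        return seq[::-1]
    if kind == "complement":
        return seq.translate(COMPLEMENT)
    if kind == "palindromic":  # reverse complement
        return seq[::-1].translate(COMPLEMENT)
    raise ValueError("unknown repeat type: " + kind)


def is_repeat(a, b, kind, min_len=30, max_mismatches=3):
    """True if segments a and b satisfy the length and Hamming-distance thresholds."""
    if len(a) < min_len or len(a) != len(b):
        return False
    return hamming(transform(a, kind), b) <= max_mismatches


# Toy 30 bp segment and its exact reverse complement (hypothetical data).
unit = "ACGTTGCAAC" * 3
partner = transform(unit, "palindromic")
print(is_repeat(unit, partner, "palindromic"))
```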

Simple sequence repeats (SSRs) were identified using the MISA Perl script [76], with the minimum number of repeats set to 10, 5, 4, 3, 3 and 3 for mono-, di-, tri-, tetra-, penta- and hexa-nucleotide SSRs, respectively.
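The thresholds above can be expressed as a small left-to-right scan. The following Python sketch is a simplified stand-in for MISA, not its actual implementation: it reports non-overlapping perfect SSRs only, prefers the shortest motif at each position, and ignores compound SSRs. The function names and the toy sequence are hypothetical.

```python
# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """False if the motif is itself a repetition of a shorter unit (e.g. 'AA', 'ATAT')."""
    for k in range(1, len(motif)):
        if len(motif) % k == 0 and motif[:k] * (len(motif) // k) == motif:
            return False
    return True


def find_ssrs(seq):
    """Return (1-based start, motif, repeat count) for each perfect SSR found."""
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        found = None
        for size, min_rep in MIN_REPEATS.items():
            motif = seq[i:i + size]
            if len(motif) < size or not set(motif) <= set("ACGT"):
                continue
            if not is_primitive(motif):
                continue
            # Count how many consecutive copies of the motif start at position i.
            n = 1
            while seq[i + n * size:i + (n + 1) * size] == motif:
                n += 1
            if n >= min_rep:
                found = (i + 1, motif, n)
                break
        if found:
            hits.append(found)
            i += len(found[1]) * found[2]  # jump past the reported SSR
        else:
            i += 1
    return hits


# Toy sequence with one mono-, one di- and one tri-nucleotide SSR (hypothetical data).
toy = "GGC" + "A" * 10 + "CG" + "AT" * 5 + "GG" + "AGC" * 4
print(find_ssrs(toy))
```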

Phylogenetic analyses

Phylogenetic analyses were carried out based on 22 complete chloroplast genomes, including 19 Trapa cp genomes and three outgroup cp genomes. Because of the close relationship between Trapaceae and Sonneratiaceae/Lythraceae [34], Sonneratia alba (Sonneratiaceae) and two Lagerstroemia species (L. calyculata and L. intermedia, Lythraceae) were used as outgroups. In addition to the 13 Trapa cp genomes generated in this study, six published Trapa cp genomes and the three outgroup cp genomes were downloaded from GenBank.

The sequences were aligned using MAFFT 7.0 [77] with default parameters. The phylogenetic trees were constructed using three methods: (1) A Maximum Likelihood (ML) tree was inferred using PhyML v.3.0 [78] with 5000 bootstrap replicates. The best-fit model of nucleotide substitution, JC + I + G, was selected with jModelTest 2 [79]. Given the close genetic relationships between Trapa and Sonneratia/Lagerstroemia shown in previous molecular studies [33, 34], the tree was rooted on the branch leading to the two Lagerstroemia species. The result was visualized with FigTree v1.4 (https://github.com/rambaut/figtree/releases); (2) A Maximum Parsimony (MP) tree was obtained using the Subtree-Pruning-Regrafting (SPR) algorithm in MEGA X [73] with 5000 bootstrap replicates; (3) A Bayesian Inference (BI) tree was built with MrBayes v. 3.2.6 [80], run for 2,000,000 generations with sampling every 5000 generations. The first 25% of all trees were regarded as “burn-in” and discarded, and the Bayesian posterior probabilities (PP) were calculated from the remaining trees.
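The burn-in step and the posterior probabilities of the BI analysis can be sketched in a few lines of Python. The sketch assumes each sampled tree is represented as a set of clades (each clade a frozenset of taxon labels) and that the burn-in count is rounded down; both are simplifications for illustration and do not reproduce MrBayes internals. The function names and the toy sample are hypothetical.

```python
def discard_burn_in(samples, fraction=0.25):
    """Drop the first `fraction` of the sampled trees (count rounded down)."""
    n_burn = int(len(samples) * fraction)
    return samples[n_burn:]


def clade_posterior(samples, clade, fraction=0.25):
    """Share of post-burn-in trees that contain the given clade."""
    kept = discard_burn_in(samples, fraction)
    if not kept:
        raise ValueError("no trees left after burn-in")
    return sum(clade in tree for tree in kept) / len(kept)


# Toy sample of eight trees over hypothetical taxa A, B and C.
ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})
sampled_trees = [{ac}, {ac}, {ab}, {ab}, {ac}, {ab}, {ac}, {ac}]
print(clade_posterior(sampled_trees, ab))
```

With 2,000,000 generations sampled every 5000 generations, a run yields roughly 400 sampled trees, so a 25% burn-in discards about the first 100 of them; the exact count depends on whether the starting tree is included in the sample.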