Introduction

Flavonoids are one of the most important groups of specialized plant metabolites. Their enormous chemical diversity results in a plethora of biological functions [1]. Most noticeable are the anthocyanins which can provide blue to red coloration to flowers. Flavonoids are distributed across the whole plant kingdom. Given the visual phenotype of several flavonoids, this pathway was established as a model system for plant metabolism and transcriptional regulation [2, 3]. Today, the flavonoid biosynthesis and its regulation are among the best studied processes in plants. Numerous studies are published every year which investigate the flavonoid biosynthesis and the corresponding regulators in different plant species.

One of the best studied enzymes in the flavonoid biosynthesis pathway is the chalcone synthase (CHS). This enzyme catalyzes the first committed step of the flavonoid biosynthesis, particularly the reaction leading to the formation of the naringenin chalcone from one p-coumaroyl-CoA and three malonyl-CoA [4]. CHS belongs to the type III polyketide synthases, a larger protein family that also harbors several closely related enzymes like the stilbene synthase (STS) [5]. CHS and STS differ by only two functionally important residues which are Q166 and Q167 in the Arabidopsis thaliana CHS sequence [6]. The gene structure of CHS comprises usually two coding exons [7].

Numerous plant genomes are sequenced every year [8] and the rapid development of long read sequencing technologies enables the generation of high quality genome sequences [9]. Since an automatic annotation of all genes in the flavonoid biosynthesis and many regulators is possible [10, 11], it is feasible to perform large scale analyses of gene families acting in this pathway. Analyses at this scale are inherently prone to errors concerning individual species and sequences that require human intervention at the final interpretation step. A systematic analysis of domains in the chalcone synthase revealed an apparent triplication in several species.

Here, we describe an investigation of several CHS gene models that appeared to encode a protein with a triplicated CHS domain. A manual inspection of multiple cases based on transcriptomic data suggests mis-annotation leading to artificially fused gene models.

Main text

Methods

Identification of sequences with potential CHS domain duplications

A BLASTp analysis with the Nymphae colorata CHS (XP_049936683.1) as query and nr (containing 545,546,009 sequences) as subject was performed through the NCBI webportal (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) on the 2nd of June 2023. The word size was set to 5 and an e-value cutoff of 10–30 was applied. Matched sequences longer than 500 amino acid residues were manually screened for CHS domain duplications or triplications. Redundant sequences were removed from the list.

Evaluation of gene models via RNA-seq read map**

The genome sequences and corresponding annotations of four species with apparent fusion of multiple CHS gene copies were selected for manual inspection: Musa balbisiana [12], Musa troglodytarum [13], Macadamia integrifolia [14], and Nymphaea colorata [15]. All available paired-end RNA-seq data sets of these species were retrieved from the Sequence Read Archive via fastq-dump [16] (Additional File 1 in [17]). Read pairs were aligned to the genome sequence with STAR v2.5.1b [18, 19] in 2-pass-mode using a minimal similarity of 95% and a minimal alignment length of 90% as previously described [20]. The resulting BAM files were processed with customized Python scripts [17]. All reads mapped to the locus of interest were extracted from each individual BAM file with samtools v1.15.1 [21]. These subset BAM files were merged to produce one BAM file per species that contains all reads belonging to the locus of interest (Additional File 2, Additional File 3, Additional File 4, Additional File 5 in [17]). The final BAM file was visualized with Integrative Genomics Viewer (IGV) [22].

Analysis and visualization of RNA-seq read map** coverage

A previously developed [23] Python script was applied to convert BAM files into coverage files that list the number of aligned reads per genomic position. These coverage files served as the basis for the generation of gene model-focused coverage plots [17]. These plots visualize the number of aligned reads per position around a given locus of interest. High coverage with RNA-seq reads indicates exon positions while introns are characterized by very low or no RNA-seq read coverage at all. Introns are not included in the final mRNAs and usually mostly mature mRNAs are extracted with standard RNA extraction protocols used for RNA-seq experiments.

Comparison of sequence similarity

Sequences around the CHS loci of interest were retrieved from the respective genome sequence using the Python script seqex3.py v0.5 [24]. Sequence similarity between the different domains predicted in the inspected sequences was analyzed by constructing dot plots with dotplotter v0.1 [25] using a k-mer size of 31. Dotplotter compares two sequences by generating k-mers from one given sequence and identifying matches in the other sequence. The corresponding k-mer positions in both sequences are visualized in a dotplot. This allows an intuitive visualization of repeats.

Results

An analysis of the CHS sequences retrieved from the NCBI revealed apparent fusion proteins harboring multiple (two or three) CHS domains in 28 protein sequences across different plant species (Additional File 6 in [17]). More specifically, 8 sequences showed a triplication of the CHS domain, while 20 sequences showed a duplication. Representative examples of predicted proteins with multiple CHS domains were identified in Macadamia integrifolia, Musa balbisiana, Musa troglodytarum, and Nymphae colorata (Fig. 1). A comparison of each CHS locus in the four species against itself revealed the expected high similarity (Additional File 7 in [17]).

Fig. 1
figure 1

Domain composition of protein sequences with apparent CHS domain triplication. CHS abbreviation in figure stands for Chalcone synthase domain, and Cond abbreviation stands for Chalcone and stilbene synthases, N-terminal domain. So-called catalytic triad consisting of conserved Cysteine (C), Histidine (H), and Asparagine (N) amino acid residues is always depicted (if preserved)

Loci with an unexpected CHS annotation were inspected in an RNA-seq read map** (Fig. 2). The annotation indicates a single CHS gene encoding three CHS domains, but the results of our RNA-seq read map** suggest that there are multiple individual CHS genes. The coverage continuously drops towards the ends of several exons. That is an indication of the end of a gene. In contrast, an abrupt drop in coverage to almost zero would indicate a splice site. There are almost no connecting reads between some of the annotated exons and individual exons show very different RNA-seq coverage. There are also substantial fractions of annotated exons without RNA-seq support. It is not expected that exons at the 5’- and 3’-end of a gene have high coverage, while the enclosed exons have low coverage. The underlying RNA-seq read map** files of Macadamia integrifolia (Additional File 2 in [17]), Musa balbisiana (Additional File 3 in [17]), Musa troglodytarum (Additional File 4 in [17]), and Nymphae colorata (Additional File 5 in [17]) are available for in depth inspection. Detailed instructions are provided to enable researchers to perform a similar investigation of other cases of potential annotation artifacts [17].

Fig. 2
figure 2

RNA-seq read coverage of CHS loci in Macadamia integrifolia(A), Musa balbisiana(B), Musa troglodytarum(C), and Nymphae colorata(D). The annotation suggests that a single gene is spanning the entire displayed locus, but the continuously drop** coverage towards the end of exons (black lines) in our RNA-seq read map** suggests that there are multiple individual genes

Discussion

We outlined how surprising results of a quick database search and in-depth inspection of these entries resulted in the identification of most likely annotation artifacts. We investigated the assembly and annotation process underlying the annotations of the four analyzed species. The Macadamia integrifolia assembly was produced with MaSuRCA v3.2.6 and ALLMAPS v.Jul-2019 [14, 26, 27]. The Musa balbisiana and Musa troglodytarum genome sequences were assembled with wtdbg v1.2.8 and NextDenovo v2.4.0, respectively [12, 13, 28, 29]. The Nymphae colorata assembly was produced by Canu v1.3 [15, 30]. All these assemblers are well established tools that have been deployed in numerous plant genome sequencing projects. The annotation was produced by Gnomon [31] which is the default annotation pipeline operated at the NCBI. There is no indication how any of these steps could have caused the mis-annotation. Gnomon was also applied on many other plant genome sequences and usually generated CHS annotations comprising only a single CHS domain [32,33,34]. The repetitive regions with multiple CHS copies in an array might contribute to the mis-annotation as suggested for repeats before [35].

To the best of our knowledge, there are no reports that validated a CHS protein with multiple repeated domains. Despite all efforts, it might not be possible to automatically maintain a perfect database free of any mis-annotations. Despite technological advances, manual inspection and correction might be required in some cases [36]. Swiss-Prot is an initiative to establish a collection of sequences that were curated by experts [37, 38]. However, such a manual inspection is not a scalable approach given the rapid growth of sequence collections due to improved sequencing technologies [8, 9]. As most records in the database are likely correct, users should carefully inspect any substantially deviating records. Especially sequences, which appear as exciting discoveries, should be considered as suspicious and thorough inspection is required.

Unexpected sequences should be carefully checked, because annotation artifacts are more likely than striking differences between closely related species. We describe our strategy for the gene model investigation in detail to enable repetition by others [17].

  1. (1)

    A comparison with orthologs in many other species is often a way to identify unexpected sequence properties. Sequences should only differ systematically if there are particularly striking events during evolution. If such events are not known, any major sequence differences could point to artifacts and should be considered as such.

  2. (2)

    Generally, the alignment of RNA-seq reads or other types of transcriptomic evidence should be used to gain insights into the structure of genes. It is important to use a proper split read aligner like STAR [18, 19] or HISAT2 [39] for this step. The coverage should be consistent or should drop continuously towards the ends of a gene. Splice sites within a gene are characterized by abrupt changes of the coverage to almost zero. In addition, introns should be spanned by a number of reads that is almost equivalent to the coverage at the border of the flanking exons.

  3. (3)

    If no suitable datasets are available, it is also possible to validate the gene structure and sequence through amplification via RT-PCR followed by Sanger sequencing [40]. As this approach is more time consuming and requires more financial resources, the aforementioned data-reuse approaches should be applied first.

Limitations

Our investigation was restricted to four examples of apparent CHS domain triplication events within a single CHS gene in four different plant species. The gene models in question were contradicted by the analyzed RNA-seq datasets. However, we cannot rule out the possibility that such modified CHS versions exist elsewhere.