Background

Viridiplantae, or green plants, are a clade of perhaps 500,000 species [1–6] that exhibit an astounding diversity of life forms, including some of the smallest and largest eukaryotes [3, 7]. Fossil evidence suggests the clade is at least 750 million years old [8–10], while divergence time estimates from molecular data suggest it may be more than one billion years old [11–14]. Reconstructing the phylogenetic relationships across green plants is challenging because of the age of the clade, the extinction of major lineages [15–17], and extreme molecular rate and compositional heterogeneity [18–22]. Most phylogenetic analyses of Viridiplantae have recovered two well-supported subclades, Chlorophyta and Streptophyta [23, 24]. Chlorophyta contain most of the traditionally recognized “green algae,” and Streptophyta contain the land plants (Embryophyta), as well as several other lineages also considered “green algae”. Land plants include the seed plants (gymnosperms and angiosperms; Spermatophyta), which consist of ~270,000 to ~450,000 species [1, 3].

While many of the major green plant clades are well defined, questions remain regarding the relationships among them. For example, the closest relatives of land plants have varied among analyses [23, 25–29], as have the relationships among the three bryophyte lineages (mosses, liverworts, and hornworts) [29–35]. The relationships among extant gymnosperms also remain contentious, particularly with respect to the placement of Gnetophyta [20, 36–43].

Most broad analyses of green plant relationships based on nuclear gene sequence data have relied largely on 18S/26S rDNA sequences [30, 37, 44, 45], although recent analyses have employed numerous nuclear genes [40, 46]. Some studies have used mitochondrial gene sequence data, often in combination with other data [29, 47, 48]. However, investigations of green plant phylogeny typically have either largely or exclusively employed chloroplast genes (e.g., [29, 49–52]). Sequence data from the plastid genome have transformed plant systematics and contributed greatly to the current view of plant relationships. With the plastid genome present in high copy numbers in each cell in most plants, and with relatively little variation in gene content and order [53], as well as few reported instances of gene duplication or horizontal gene transfer [54, 55], the plastid genome provides a wealth of phylogenetically informative data that are relatively easy to obtain and use [56, 57]. Although early phylogenetic studies using one or a few chloroplast loci provided fundamental insights into relationships within and among green plant clades, these analyses failed to resolve some backbone relationships [56–59]. These remaining enigmatic portions of the green plant tree of life ultimately motivated the use of entire, or nearly entire, plastid genome sequences for phylogenetic inference.

Complete sequencing of the relatively small (~150 kb) plastid genome has been technically feasible since the mid-1980s [60, 61], although few plastid genomes were sequenced prior to 2000 (see [62, 63]). Next-generation sequencing (NGS) technologies, such as 454 [62] and Illumina [64–67], greatly reduced the cost and difficulty of sequencing plastid genomes, and consequently, the number of plastid genomes available on GenBank increased nearly six-fold from 2006 to 2012 [68]. Phylogenetic analyses based on complete plastid genome sequences have provided valuable insights into relationships among and within subclades across the green plant tree of life (recently reviewed in [26, 35, 68, 69]). Still, studies employing complete plastid genomes generally have either focused on subclades of green plants or have had relatively low taxon sampling. Thus, they have not addressed the major relationships across all green plants simultaneously.

We assembled available plastid genome sequences to build a phylogenetic framework for Viridiplantae that reflects the wealth of new plastid genome sequence data. Furthermore, we highlight analytical challenges for resolving the green plant tree of life with this type of data. We performed phylogenetic analyses of protein-coding data on 78 genes from 360 taxa, exploring the effects of different partitioning and character-coding protocols for the entire data set as well as subsets of the data. While our analyses recover many well-supported relationships and reveal strong support for some contentious relationships, several factors, including base composition biases, can affect the results. We also highlight the challenges of using plastid genome data in deep-level phylogenomic analyses and provide suggestions for future analyses that will incorporate plastid genome data for thousands of species.

Results

Data set

We assembled plastid protein-coding sequences from 360 species (Additional file 1) for which complete or nearly complete plastid genome sequences were available on GenBank. Of the 360 species, there were 258 angiosperms (Angiospermae), 53 gymnosperms (Acrogymnospermae, including three Gnetophyta), seven monilophytes (Monilophyta), four lycophytes (Lycopodiophyta), three liverworts (Marchantiophyta), one hornwort (Anthocerotophyta), two mosses (Bryophyta), six taxa from the paraphyletic streptophytic algae, and 26 chlorophytic algae (Chlorophyta). The phylogenetic character matrices contained sequences from 78 genes and the following number of alignment positions: 58,347 bp for the matrix containing all nucleotide positions (ntAll) and the RY-coded (RY) version of the ntAll matrix; 38,898 bp in the matrix containing only the first and second codon positions (ntNo3rd), and 19,449 amino acids (AA). The number of genes present per taxon varied from 18 to 78 (mean = 70), while the number of taxa present per gene ranged from 228 to 356 (mean = 322; see Additional file 2). Taxa with few genes present, such as Helicosporidium (18 genes) and Rhizanthella (19 genes), represent highly modified complete plastid genomes of non-photosynthetic species [70, 71]. The percentage of missing data (gaps and ambiguous characters) was ~15.6% for each of the four data sets. The pattern of data across each of the four matrices is decisive, meaning that it can uniquely define a single tree for all taxa [72]. The data contain 100% of all possible triplets of taxa, and are decisive for 100% of all possible trees. All alignments have been deposited in the Dryad Data Repository [73].
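The four matrices described above are related by simple recodings of one in-frame nucleotide alignment. The following is a minimal sketch of those recodings, assuming each sequence is an aligned, in-frame coding sequence stored as a plain string; the function names are hypothetical and are not taken from the study's own pipeline. It also shows why 58,347 nucleotide positions correspond to 38,898 first and second codon positions.

```python
from __future__ import annotations

PURINES = set("AG")
PYRIMIDINES = set("CT")


def drop_third_positions(aligned_cds: str) -> str:
    """Keep codon positions 1 and 2 of an in-frame alignment (ntNo3rd-style)."""
    return "".join(base for i, base in enumerate(aligned_cds) if i % 3 != 2)


def third_positions_only(aligned_cds: str) -> str:
    """Keep only codon position 3 (nt3rdOnly-style)."""
    return aligned_cds[2::3]


def ry_code(aligned_cds: str) -> str:
    """Recode purines (A, G) as R and pyrimidines (C, T) as Y.

    Gaps and ambiguity codes are passed through unchanged.
    """
    recoded = []
    for base in aligned_cds.upper():
        if base in PURINES:
            recoded.append("R")
        elif base in PYRIMIDINES:
            recoded.append("Y")
        else:
            recoded.append(base)
    return "".join(recoded)


def missing_fraction(nt_matrix: dict[str, str]) -> float:
    """Fraction of cells in a nucleotide matrix that are '-', '?', or 'N'.

    Assumption: only these three symbols count as missing; partially
    ambiguous codes (e.g. 'R', 'Y') are treated as data here.
    """
    total = sum(len(seq) for seq in nt_matrix.values())
    missing = sum(
        seq.count("-") + seq.count("?") + seq.upper().count("N")
        for seq in nt_matrix.values()
    )
    return missing / total
```

RY coding keeps the alignment length unchanged, which is consistent with the ntAll and RY matrices sharing the same number of positions.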

GC bias

GC content varied considerably both among lineages and also within single genomes, and chi-square tests rejected the null hypothesis of homogeneous base frequencies (Table 1). The average GC content in the ntAll matrix was 38.9%, and it ranged from 54.3% in Selaginella uncinata to 27.5% in Helicosporidium sp. (Figure 1, Additional file 3). Also, the average GC content varied among first, second, and third codon positions, with by far the most variation among lineages at the third codon position (Figure 1, Additional file 3). Although there was extensive heterogeneity in GC content across all species, there was relatively little variation among the seed plant taxa (Figure 2). There also was significant correlation between nucleotide composition and amino acid composition. Plastid genomes that are GC-rich had a significantly higher percentage (Figure 3; p < 0.001) of amino acids that are encoded by GC-rich codons (i.e., G, A, R, and P). Similarly, GC-rich plastid genomes had a significantly lower percentage (Figure 4; p < 0.001) of amino acids that are coded by AT-rich codons (i.e., F, Y, M, I, N, and K).
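The GC summaries and the homogeneity test reported above can be illustrated with a small sketch. It assumes in-frame nucleotide strings and a standard Pearson chi-square statistic on a taxon-by-base contingency table; this section does not state which software performed the test, so the functions below are hypothetical stand-ins rather than the authors' implementation. The p-value lookup is omitted to keep the example free of third-party dependencies.

```python
from __future__ import annotations

from collections import Counter

BASES = "ACGT"


def gc_percent(seq: str) -> float:
    """Percent G+C among unambiguous bases; NaN if there are none."""
    counts = Counter(seq.upper())
    acgt = sum(counts[b] for b in BASES)
    if acgt == 0:
        return float("nan")
    return 100.0 * (counts["G"] + counts["C"]) / acgt


def gc_by_codon_position(aligned_cds: str) -> tuple[float, float, float]:
    """Percent GC at first, second, and third codon positions."""
    first, second, third = (gc_percent(aligned_cds[offset::3]) for offset in range(3))
    return first, second, third


def chi_square_homogeneity(nt_matrix: dict[str, str]) -> tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for base counts.

    Rows are taxa and columns are A, C, G, T. The null hypothesis is that
    all taxa share the same base frequencies.
    """
    table = [[seq.upper().count(b) for b in BASES] for seq in nt_matrix.values()]
    col_totals = [sum(row[j] for row in table) for j in range(len(BASES))]
    grand_total = sum(col_totals)
    statistic = 0.0
    for row in table:
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / grand_total
            if expected > 0:  # skip bases absent from every taxon
                statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(BASES) - 1)
    return statistic, degrees_of_freedom
```

A large statistic relative to its degrees of freedom points to heterogeneous base composition among taxa, which is the pattern summarized in Table 1.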

Table 1 Chi-square tests of nucleotide composition homogeneity among lineages

Figure 1

Box plots of percent GC content in the ntAll and ntNo3rd data sets as well as in the first, second, and third codon positions of the ntAll data set.

Figure 2

Box plots of percent GC content in seed plants (Spermatophyta) and in the data set as a whole (Viridiplantae) for the ntAll and ntNo3rd data sets as well as the first, second, and third codon positions of the ntAll data set. For each pair of box plots, values for seed plants are on the left, and values for all green plant taxa are on the right.

Figure 3

Correlation between percent GC nucleotide content in the ntAll matrix and percent of amino acids in the AA matrix that are coded for by GC-rich codons (G, A, R, and P).

Figure 4

Correlation between percent GC nucleotide content in the ntAll matrix and percent of amino acids in the AA matrix that are coded for by AT-rich codons (F, Y, M, I, N, and K).
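The quantities plotted in Figures 3 and 4 can be sketched as follows. The amino acid classes (G, A, R, P and F, Y, M, I, N, K) come from the text; the helper names, the decision to ignore gaps and non-standard symbols, and the use of a plain Pearson coefficient are assumptions made for illustration. The significance test behind the reported p-values is not reproduced here.

```python
from __future__ import annotations

import math

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
GC_RICH_AA = set("GARP")      # amino acids encoded by GC-rich codons
AT_RICH_AA = set("FYMINK")    # amino acids encoded by AT-rich codons


def aa_class_percent(protein: str, aa_class: set[str]) -> float:
    """Percent of standard residues in `protein` that belong to `aa_class`.

    Gaps, '?', 'X', and stop symbols are ignored. Returns NaN if no
    standard residues are present.
    """
    residues = [aa for aa in protein.upper() if aa in STANDARD_AA]
    if not residues:
        return float("nan")
    hits = sum(1 for aa in residues if aa in aa_class)
    return 100.0 * hits / len(residues)


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

In use, one would pair each taxon's percent GC in the ntAll matrix with its `aa_class_percent` value from the AA matrix and pass the two lists to `pearson_r`.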

Phylogenetic analyses

In the phylogenetic analyses of all data sets and partitioning schemes, the partitioning strategy with the most partitions consistently fit the data best based on the AICc (Table 2). These best-fit models partitioned the AA matrix by gene (78 partitions) and the nucleotide (ntAll, ntNo3rd) and RY matrices by codon position and gene (234 partitions). All a posteriori bootstopping analyses indicated that convergence of support values had been reached after 100 replicates, and thus our choice of 200 replicates was more than sufficient to obtain reliable bootstrap values.
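For readers unfamiliar with the criterion, the AICc comparison can be sketched with the standard small-sample formula, AICc = -2 lnL + 2k + 2k(k + 1)/(n - k - 1), where lnL is the maximized log-likelihood, k the number of free parameters, and n the sample size. Treating alignment length as n is a common convention but an assumption here, and the likelihoods and parameter counts in the usage example are invented purely for illustration; they are not values from Table 2.

```python
from __future__ import annotations

N_GENES = 78
N_CODON_POSITIONS = 3
# Gene-by-codon-position partitioning: 78 genes x 3 positions = 234 partitions.
N_GENE_CODON_PARTITIONS = N_GENES * N_CODON_POSITIONS


def aicc(log_likelihood: float, n_params: int, sample_size: int) -> float:
    """Corrected Akaike information criterion (lower is better)."""
    denominator = sample_size - n_params - 1
    if denominator <= 0:
        raise ValueError("AICc is undefined when sample_size <= n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + (2.0 * n_params * (n_params + 1)) / denominator


def best_scheme(schemes: dict[str, tuple[float, int]], sample_size: int) -> str:
    """Return the name of the scheme with the lowest AICc.

    `schemes` maps a scheme name to (log-likelihood, number of parameters).
    """
    return min(schemes, key=lambda name: aicc(*schemes[name], sample_size))


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    example = {"unpartitioned": (-1000.0, 10), "by_gene": (-950.0, 40)}
    print(best_scheme(example, sample_size=5000))
```

Because the penalty grows with k, a heavily partitioned scheme is preferred only when its likelihood gain outweighs the added parameters.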

Table 2 AICc scores for each of the phylogenetic matrix partitioning strategies

We will focus on reporting the relationships of major clades of Viridiplantae shown in the 50% maximum likelihood (ML) majority-rule bootstrap consensus summary trees for each data set: ntAll (Figure 5), ntNo3rd (Figure 6), RY (Figure 7), and AA (Figure 8). These summary trees collapse some clades for ease of viewing the major relationships within Viridiplantae. A summary of important results and conflicts among these four data sets is given in Table 3. We provide full majority-rule bootstrap consensus trees for the ntAll (Figures 9, 10, 11, 12, 13, and 14), ntNo3rd (Additional file 4), RY (Additional file 5), and AA (Additional file 6) data sets. ML trees with branch lengths and BS values are also provided: ntAll (Additional file 7), ntNo3rd (Additional file 8), RY (Additional file 9), and AA (Additional file 10). Average support values among all internal nodes in the ML trees were slightly higher in the ntAll phylogeny (~94% bootstrap support [BS]; Additional file 7) compared to the other data sets (~90-91% BS; Additional files 8, 9, and 10). The ntAll phylogeny also had the most clades resolved with ≥ 70% BS (92%; 327 bipartitions resolved out of 357 possible) while the ntNo3rd, RY, and AA data sets had 87%, 87%, and 86% of the possible bipartitions resolved at ≥ 70% BS, respectively. All resulting trees have been deposited in the Dryad Data Repository [73].
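The support summaries above rest on two simple counts: a fully resolved unrooted tree of n taxa has n - 3 internal bipartitions (357 for 360 taxa), and a majority-rule consensus keeps the bipartitions found in more than half of the bootstrap trees. The sketch below illustrates both, assuming each bootstrap tree has already been reduced to a set of splits, with each split written as a frozenset of the taxon names on one side. That representation and the function names are hypothetical conveniences, not the output format of any particular program.

```python
from __future__ import annotations

from collections import Counter

Split = frozenset  # one side of a bipartition, as a frozenset of taxon names


def max_internal_bipartitions(n_taxa: int) -> int:
    """Number of internal bipartitions in a fully resolved unrooted tree."""
    return n_taxa - 3


def percent_resolved(supports: list[float], n_taxa: int, threshold: float = 70.0) -> float:
    """Percent of possible bipartitions recovered with support >= threshold."""
    resolved = sum(1 for s in supports if s >= threshold)
    return 100.0 * resolved / max_internal_bipartitions(n_taxa)


def majority_rule_support(bootstrap_trees: list[set[Split]]) -> dict[Split, float]:
    """Bootstrap percentage for every split found in more than 50% of trees.

    Using a strict majority guarantees the retained splits are mutually
    compatible and can be drawn on a single consensus tree.
    """
    n_trees = len(bootstrap_trees)
    counts = Counter(split for tree in bootstrap_trees for split in tree)
    return {
        split: 100.0 * count / n_trees
        for split, count in counts.items()
        if 2 * count > n_trees
    }
```

With 327 of 357 possible bipartitions at or above 70% BS, `percent_resolved` gives roughly 92%, matching the figure reported for the ntAll data set.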

Figure 5

Fifty percent maximum likelihood majority-rule bootstrap consensus summary tree of Viridiplantae inferred from the all nucleotide positions (ntAll) analysis. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 58,347 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. Terminals with a triangle represent collapsed clades with > 2 taxa. Note position of Lycopodiophyta as sister to Spermatophyta is likely caused by base composition bias (see text). See Figures 9, 10, 11, 12, 13, and 14 for the complete tree and Additional file 1 for taxonomy. Lami. = Lamiidae; Campanuli. = Campanulidae; Lyco. = Lycopodiophyta.

Figure 6

Fifty percent maximum likelihood majority-rule bootstrap consensus summary tree of Viridiplantae inferred from the first and second codon positions (ntNo3rd) analysis. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 38,898 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. Terminals with a triangle represent collapsed clades with > 2 taxa. See Additional file 4 for the complete tree and Additional file 1 for taxonomy. Lami. = Lamiidae; Campanuli. = Campanulidae.

Figure 7

Fifty percent maximum likelihood majority-rule bootstrap consensus summary tree of Viridiplantae inferred from the RY-coded (RY) analysis. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 58,347 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. Terminals with a triangle represent collapsed clades with > 2 taxa. See Additional file 5 for the complete tree and Additional file 1 for taxonomy. Lami. = Lamiidae; Campanuli. = Campanulidae.

Figure 8

Fifty percent maximum likelihood majority-rule bootstrap consensus summary tree of Viridiplantae inferred from the amino acid (AA) analysis. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 19,449 AAs; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. Terminals with a triangle represent collapsed clades with > 2 taxa. See Additional file 6 for the complete tree and Additional file 1 for taxonomy. Lami. = Lamiidae; Campanuli. = Campanulidae.

Table 3 Summary of selected similarities and conflicts between bootstrap consensus topologies derived from the four data sets

Figure 9

Fifty percent maximum likelihood majority-rule bootstrap consensus tree of Viridiplantae inferred from the all nucleotide positions (ntAll) analysis. Portion of tree showing Chlorophyta, Chlorokybophyceae, Mesostigmatophyceae, Charophyceae, Coleochaetophyceae, Zygnematophyceae, Marchantiophyta, Bryophyta, and Anthocerotophyta. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 58,347 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. See also Figure 5 for a summary tree of major Viridiplantae clades and Additional file 1 for taxonomy. Tree continued in Figure 10.

Figure 10

Fifty percent maximum likelihood majority-rule bootstrap consensus tree of Viridiplantae inferred from the all nucleotide positions (ntAll) analysis. Portion of tree showing Monilophyta, Lycopodiophyta, and Acrogymnospermae. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 58,347 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. See also Figure 5 for a summary tree of major Viridiplantae clades and Additional file 1 for taxonomy. Note position of Lycopodiophyta as sister to Spermatophyta is likely caused by base composition bias (see text). Tree continued in Figures 9 and 11.

Figure 11

Fifty percent maximum likelihood majority-rule bootstrap consensus tree of Viridiplantae inferred from the all nucleotide positions (ntAll) analysis. Portion of tree showing Amborellales, Nymphaeales, Austrobaileyales, Chloranthales, and Magnoliidae. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 58,347 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. See also Figure 5 for a summary tree of major Viridiplantae clades and Additional file 1 for taxonomy. Tree continued in Figures 10 and 12.

Figure 12

Fifty percent maximum likelihood majority-rule bootstrap consensus tree of Viridiplantae inferred from the all nucleotide positions (ntAll) analysis. Portion of tree showing Monocotyledoneae. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 58,347 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. See also Figure 5 for a summary tree of major Viridiplantae clades and Additional file 1 for taxonomy. Tree continued in Figures 11 and 13.

Figure 13

Fifty percent maximum likelihood majority-rule bootstrap consensus tree of Viridiplantae inferred from the all nucleotide positions (ntAll) analysis. Portion of tree showing Ceratophyllales, Ranunculales, Sabiaceae, Proteales, Trochodendrales, Buxales, Gunnerales, and Superasteridae. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 58,347 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. See also Figure 5 for a summary tree of major Viridiplantae clades and Additional file 1 for taxonomy. Tree continued in Figures 12 and 14.

Figure 14

Fifty percent maximum likelihood majority-rule bootstrap consensus tree of Viridiplantae inferred from the all nucleotide positions (ntAll) analysis. Portion of tree showing Dilleniaceae and Superrosidae. Data set derived from 78 protein-coding genes of the plastid genome (ntax = 360; 58,347 bp; missing data ~15.6%). Bootstrap support values ≥ 50% are indicated. See also Figure 5 for a summary tree of major Viridiplantae clades and Additional file 1 for taxonomy. Tree continued in Figure 13.

The monophyly of Chlorophyta receives 100% BS in all analyses. Prasinophyceae are consistently not monophyletic. Instead, the prasinophyte Nephroselmis is sister to all other Chlorophyta (Figure 9; Additional files 4, 5, and 6), while remaining Prasinophyceae form a clade that is variously supported (ntAll 97% BS, ntNo3rd 78% BS, RY 93% BS, and AA 68% BS) and is sister to a clade of the remaining Chlorophyta. Chlorophyceae are monophyletic (100% BS in all analyses), but Trebouxiophyceae and Ulvophyceae are not monophyletic, and the relationship of Chlorophyceae to these lineages is unresolved.

We consistently recovered a single set of relationships among the streptophytic algae subtending the land plant clade. Zygnematophyceae are sister to land plants, Coleochaetophyceae are sister to Zygnematophyceae + Embryophyta, Charophyceae are sister to Coleochaetophyceae + (Zygnematophyceae + Embryophyta), and a clade of Mesostigmatophyceae + Chlorokybophyceae is sister to all other Streptophyta. Each of these relationships has ≥86% BS support (Figures 5, 6, 7, and 8).

The branching order of the non-vascular land plant lineages differs among analyses. In analyses of the ntAll and RY data sets, Marchantiophyta (liverworts), followed by Bryophyta (mosses), and then Anthocerotophyta (hornworts) are the earliest-branching land plant lineages, with Anthocerotophyta the immediate sister to the vascular plants (Tracheophyta; Figures 5 and 7). In the ntAll and RY analyses, these relationships had ≥89% BS support except for the Bryophyta + (Anthocerotophyta + Tracheophyta) relationship in the ntAll analysis, which received only 69% BS (Figure 5). In contrast, in the ntNo3rd and AA analyses, Bryophyta and Marchantiophyta formed a clade (78% BS [Figure 6] and 99% BS [Figure 8], respectively), followed by Anthocerotophyta as sister to Tracheophyta (94% [Figure 6] and 53% BS [Figure 8], respectively).

Within Tracheophyta, the ntNo3rd, RY, and AA data sets all place Lycopodiophyta sister to a Euphyllophyta clade (Monilophyta + Spermatophyta; ≥89% BS, Figures 6, 7, and 8). However, the analysis of the ntAll data set places Monilophyta sister to a clade of Lycopodiophyta + Spermatophyta (75% BS, Figures 5 and 10).

Our analyses of Monilophyta generally reveal strong support for a clade of Equisetales + Psilotales as sister to Marattiales + leptosporangiate ferns (represented by Cyatheales and Polypodiales). The lowest support obtained was for Equisetales + Psilotales in the ntNo3rd analysis (84% BS; Figure 6) and ntAll (89% BS; Figure 5); all other nodes in all analyses received > 90% BS, with Marattiales + leptosporangiate ferns receiving ≥ 99% BS.

Within Spermatophyta, all analyses place the extant gymnosperms (Acrogymnospermae) sister to Angiospermae with 100% BS. Within extant gymnosperms, Cycadales and Ginkgoales form a clade (≥ 98% BS in ntAll, ntNo3rd, and AA; 51% BS in RY) that is sister to a clade in which Gnetophyta (100% BS in all analyses) are nested within the paraphyletic conifers. There is generally high support (100% BS in ntAll [Figure 5], ntNo3rd [Figure 6], and AA [Figure 8]; 87% BS [Figure 7] in RY) placing Gnetophyta as sister to a clade of Araucariales + Cupressales. This “Gnecup” clade [sensu 16, 30, 41] is then sister to Pinales, which has 100% BS in all analyses.

In all analyses, Angiospermae receive 100% BS, and Amborella (Amborellales) is sister to all other angiosperms, followed by Nymphaeales, and then Austrobaileyales. These relationships are mostly supported by 100% BS. However, Nymphaeales + (Austrobaileyales + Mesangiospermae) receives 81% BS (Figure 6) in the ntNo3rd analyses and 70% BS (Figure 8) in the AA analyses. The remaining angiosperms (Mesangiospermae) receive 100% BS in all analyses. Within Mesangiospermae, the relationships among Monocotyledoneae, Magnoliidae, Eudicotyledoneae, and Ceratophyllum (Ceratophyllales) are not well supported and vary depending on the analysis. The strongest support for the placement of Ceratophyllales is 75% BS as sister to Eudicotyledoneae in the RY analysis (Figure 7).

Chloranthales receive 61-69% BS as sister to the well-supported (100% BS in ntAll, RY; 83% BS in ntNo3rd) Magnoliidae. However, Magnoliidae are not monophyletic in the AA analyses, where Piperales are sister to Ceratophyllales (67% BS; Figure 8).

Within the monocot clade (Monocotyledoneae), Acorales, followed by Alismatales, have 100% BS in all analyses as subsequent sisters to the remaining monocots. In three of our analyses (ntAll, ntNo3rd, and AA), a variously supported clade (72%, 69%, and 80% BS, respectively) of Liliales + (Pandanales + Dioscoreales) is sister to a clade (>95% BS in these three analyses) of the remaining monocots (Asparagales + Commelinidae). However, in the RY-coded analysis, Pandanales + Dioscoreales (100% BS) is sister to a clade of Liliales + (Asparagales + Commelinidae), which receives 69% BS (Figure 7). Here Asparagales + Commelinidae is supported by 80% BS.

Within the eudicots (Eudicotyledoneae), which receive 100% BS in all analyses, Ranunculales are sister to the remaining taxa. In the ntAll, ntNo3rd, RY, and AA analyses, the clade of these remaining taxa receives 100%, 85%, 100%, and 62% BS, respectively. Relationships vary among Sabiaceae, Proteales, and a clade of the remaining taxa, depending on the analysis. In the ntAll and ntNo3rd analyses, Proteales + Sabiaceae are supported as a clade, although with only 63% and 60% BS, respectively. However, in the RY analysis, Proteales are sister to a clade containing Sabiaceae plus the remaining taxa, which has 79% BS. In the AA analysis, relationships among these three clades are unresolved.

Among the remaining eudicots, we consistently recovered Trochodendrales as sister to Buxales + Pentapetalae and Gunnerales as sister to the remaining lineages of Pentapetalae: Dilleniaceae, Superrosidae, and Superasteridae. The placement of Dilleniaceae remains uncertain. The family is sister to Superrosidae in the ntAll (95% BS), ntNo3rd (77% BS), and RY (57% BS) analyses, but appears as sister to Superasteridae (70% BS) in the AA analysis.

Within Superrosidae, a clade of Vitales + Saxifragales is supported in the ntAll (75% BS), ntNo3rd (70% BS), and AA (78% BS) analyses. In the RY analysis, the relationship among Saxifragales, Vitales, and remaining Rosidae (Fabidae + Malvidae) is unresolved. Fabidae and Malvidae are both recovered with ≥ 99% BS in the ntAll and RY analyses. However, each clade receives only 70% BS in the ntNo3rd analysis. In the AA analysis neither clade is monophyletic; Zygophyllales are embedded (68% BS) within a clade of Malvidae taxa. The COM clade (Celastrales, Oxalidales, Malpighiales) is sister to a clade of Fagales, Cucurbitales, Rosales, and Fabales in Fabidae in the AA (69% BS; Figure 8), RY (82% BS; Figure 7), and ntAll (81% BS; Figure 5) trees and forms a trichotomy with Zygophyllales and the clade of Fagales, Cucurbitales, Rosales, and Fabales in the ntNo3rd tree (70% BS; Figure 6). Zygophyllales are sister to Geraniales (69% BS; Figure 8) in the AA tree and sister to all other Fabidae in the ntAll and RY trees (with 100% [Figure 5] and 99% BS [Figure 7], respectively).

Superasteridae (Santalales, Berberidopsidales, Caryophyllales, and Asteridae) are recovered in all analyses. This clade receives 100% BS in the ntAll and RY analyses, 95% BS in the ntNo3rd analysis, and 66% BS in the AA analysis. Santalales and Berberidopsidales are strongly supported as subsequent sisters to Caryophyllales + Asteridae. Within Asteridae, Cornales, followed by Ericales, are subsequent sisters to a strongly supported clade that comprises strongly supported Campanulidae and Lamiidae clades. Within Lamiidae, the placement of Boraginaceae is weak among the various analyses. Boraginaceae are sister to Gentianales (59% BS; Figure 8) in the AA tree, part of a trichotomy (100% BS; Figure 5) with Lamiales and Solanales + Gentianales in the ntAll tree, and sister to a weakly supported clade including Gentianales, Lamiales, and Solanales in the ntNo3rd (Figure 6) and RY (Figure 7) trees.

Analysis of only the third codon positions (nt3rdOnly, Additional file 11) resulted in several very strong conflicts along the backbone of Viridiplantae when compared to the topology from the ntNo3rd analyses. These conflicts include the backbone relationships within Chlorophyta, the placements of Cycadales and Lycopodiophyta, the relationships of the three major bryophyte lineages, and backbone relationships within Poales. Removal of four taxa (Epifagus, Helicosporidium, Neottia, and Rhizanthella) with elevated rates of molecular evolution and few genes present in the data sets did not significantly affect the resulting topologies.

Discussion

While the enormous phylogenetic data sets that result from new genome or transcriptome sequencing efforts can ameliorate the effects of random or stochastic error, they also may exacerbate the effects of systematic error, or error resulting from problems in the analysis, such as model inaccuracy. The high amount of agreement among our various analyses and strong support for results generally consistent with previous studies (many of which also used plastid genes) suggest that plastid genome sequence data hold much promise for resolving relationships throughout the green plants. However, several areas of conflict between analyses using different character-coding strategies demonstrate that plastid genome phylogenetics is also susceptible to systematic error. Here we evaluate the phylogenetic results, emphasizing areas of agreement and concern, and then address some of the methodological issues raised by our results.

Evaluation of phylogenetic relationships

Historically, Chlorophyta have been divided into Prasinophyceae, Trebouxiophyceae, Chlorophyceae, and Ulvophyceae based on the ultrastructure of the flagellar apparatus and features related to cytokinesis [74, 75]. The current status of green algae phylogenetics (Chlorophyta and streptophytic algae) has been reviewed recently [26, 76, 77]. The most comparable study to ours in terms of data and taxon sampling is by Lang and Nedelcu [26], who constructed a phylogeny of green algae with plastid genome sequence data. However, they analyzed only an amino acid data set using Bayesian inference and the CAT model [78, 79]. We found a paraphyletic Prasinophyceae (not including Pedinomonas; Figures 5, 6, 7, and 8), which agrees with previous molecular analyses [26, 76, 77]. However, Lang and Nedelcu [26] recovered a monophyletic Prasinophyceae, albeit with little support. Chlorophyceae are monophyletic (100% BS in all of our analyses), which agrees with the results of Lang and Nedelcu [26]. We also find that Trebouxiophyceae and Ulvophyceae are not monophyletic, and that the relationship of Chlorophyceae to these lineages is unresolved. The branching order of the various Trebouxiophyceae, Ulvophyceae, and Chlorophyceae lineages within Chlorophyta, unresolved in our analyses, was also uncertain in earlier analyses (reviewed in [26, 76, 77]). Similarly, in Lang and Nedelcu [26], Trebouxiophyceae and Ulvophyceae were not supported as monophyletic, although unlike our results, almost all nodes in their phylogeny were maximally supported.

Our analyses provide consistent, strong support for the relationships of streptophytic algae to land plants, and all analyses support Zygnematophyceae as the sister to land plants (Figures 5, 6, 7, and 8). Relationships among these lineages and the closest relatives of land plants have varied in previous studies depending on taxon sampling and gene choice. Some studies agree with our results placing Zygnematophyceae as sister to land plants [25, 27, 80–82], while other phylogenetic analyses indicate that Charophyceae [23, 83, 84] or Coleochaetophyceae [26, 40, 85, 86] occupy this position. Depending on the analysis, Zhong et al. [167]. All trees were rooted at the branch between Chlorophyta and Streptophyta [23, 24].

To further explore our data, we conducted the following phylogenetic analyses using the methods described above unless otherwise noted. To determine if there is conflict between the phylogenetic signal in the ntNo3rd data set and the data set containing only third positions (nt3rdOnly), we analyzed the nt3rdOnly data partitioned by gene region. We also conducted phylogenetic analyses on each of the four main data sets (ntAll, ntNo3rd, RY, and AA) with four taxa removed: Neottia nidus-avis and Rhizanthella gardneri (mycoheterotrophic orchids), Epifagus virginiana (a parasitic flowering plant), and Helicosporidium sp. (a parasitic green alga). These taxa have elevated rates of molecular evolution and relatively few genes present in the data sets (see Additional file 2). We removed them to ensure that their inclusion did not cause any phylogenetic artifacts.

Availability of supporting data

The data sets supporting the results of this article are available in the Dryad Digital Repository: http://doi.org/10.5061/dryad.k1t1f.