Introduction

Angiosperms, that is, flowering plants, are one of the most diverse and species-rich groups on Earth and are the major components of current terrestrial ecosystems1. The geologically sudden appearance of diverse angiosperm fossils could not be explained by Darwin’s evolutionary theory of gradual changes and prompted his reference to an ‘abominable mystery’2. Such angiosperm diversity has since been extended by recent fossil discoveries of the now extinct early angiosperm Archaefructus, waterlilies (Nymphaeales) and a relative of buttercup (Ranunculales, sister to all other eudicots) in the Early Cretaceous (~125 million years ago) or even earlier3,4,5. Decades of efforts have produced an angiosperm phylogeny that defines major groups and identifies small sister lineages to the vast majority of angiosperm diversity6. Among the estimated 350,000 angiosperm species (http://www.theplantlist.org/), only ~175 species form three small successive sister groups to other groups, Amborellales (a single species of understory bush found in New Caledonia, the South Pacific), Nymphaeales (waterlilies and related plants) and Austrobaileyales (star anise and relatives), collectively named the ANITA grade7. The remaining 99.95% of extant angiosperms form Mesangiospermae, a highly supported monophyletic group composed of five major lineages: eudicots, monocots, magnoliids, Chloranthaceae and Ceratophyllaceae8. Therefore, after a few early divergent branches in the ANITA grade, the highly diverse and species-rich Mesangiospermae represent the rapid expansion of early angiosperms and account for nearly all extant angiosperm diversity.

Within Mesangiospermae, eudicots and monocots are the two largest and most diversified groups, containing ~75% and ~20% of angiosperm species, respectively. Eudicots include many familiar fruits (for example, apple, orange and melons), beans, nuts (walnut and chestnut), vegetables (for example, tomato, lettuce and cabbage), spices and flowers (roses and carnations), whereas monocots include major grains (maize, rice and wheat) and flowers (orchids, tulip and lilies), as well as palm trees. Magnoliids, the third major group with ~9,000 species, contain some of the plants regarded as the most ‘early angiosperms’ in earlier studies, such as magnolia, as well as black pepper and avocado9. The other two groups, Chloranthaceae and Ceratophyllaceae, are small and morphologically unusual, with only 77 and 6 species, respectively; however, they represent separate ancient lineages with evolutionary significance. Chloranthaceae has the simplest flowers and was once considered the most ‘primitive’ group because of its extensive and early fossil record9,10. Ceratophyllaceae is a group of cosmopolitan aquatic plants with unusual morphologies, including inconspicuous flowers and greatly reduced roots, with an ancient origin supported by related fossils since the Early Cretaceous11.

Resolving the relationships among these five groups will inform the order of their divergence and identify the sister groups of eudicots and monocots, the two largest angiosperm groups. The divergence order is crucial for estimating the time of the rapid angiosperm radiation and identifying possibly relevant contributing factors; moreover, knowledge of the sisters of eudicots and monocots is vital for understanding the origin and evolutionary patterns of characters. In the widely accepted Angiosperm Phylogeny Group III (APG III) system6, Ceratophyllaceae is sister to eudicots and they together are sister to monocots; then, Chloranthaceae and magnoliids form a clade that is sister to the (eudicots–Ceratophyllaceae)–monocots clade (Fig. 1a)6. According to this hypothesis, monocots separated from the clade of eudicots and Ceratophyllaceae after the divergence of a series of small lineages (that is, the ANITA grade, magnoliids and Chloranthaceae)12. However, the relationships among the 5 mesangiosperm groups are far from resolved, with 15 proposed topologies having low-to-moderate support, including those hypothesizing sisterhood of monocots with either eudicots or magnoliids7,12,13,14,15,16,17,18 (Fig. 1 and Supplementary Fig. 1). Therefore, the relationship of the five mesangiosperm groups has long been regarded as one of the most difficult problems remaining in angiosperm phylogeny.

[…] Supplementary Fig. 14k), but the BS values decreased slightly, again suggesting that conflicting signals exist between single-copy genes and IR genes, and that IR genes and single-copy plastid genes possibly have different evolutionary histories. Therefore, when conflicting signals exist, simply increasing the number of genes could not resolve the difficult question of mesangiosperm phylogeny.

Morphological characters revisited using the new topology

Recently, Endress and Doyle43 reconstructed the angiosperm phylogeny using a morphological data set composed of 110 characters, and they proposed that Ceratophyllaceae might be sister to Chloranthaceae (Fig. 1d). The topology with high support values uncovered here provides a new opportunity to examine the evolution of morphological characters. Compared with the minimum of 775 steps required for the topology that Endress and Doyle43 obtained by morphological analyses (Fig. 1d), 777 steps were needed for our best maximum likelihood (ML) tree (Table 2), suggesting strong agreement between our topology and the morphological data set, whereas 785 steps were required for the topology accepted by APG III (Fig. 1a), 10 more steps than the most parsimonious one. The sisterhood of Chloranthaceae and Ceratophyllaceae was supported by six characters that originated before their diversification (Fig. 5 and Supplementary Fig. 15) in the context of our topology. In contrast, only one character (dry fruit wall) supported the alternative hypothesis of Ceratophyllaceae and eudicots being sisters43 and only the loss of cambium associated Ceratophyllaceae with monocots. The Chloranthaceae–Ceratophyllaceae sisterhood was also uncovered using other molecular data sets, albeit with low support values15,17,40. In addition, seven characters, including absence of cambium (4), parallel major venation (17), boat-shaped pollen (61) and one cotyledon (110), originated before the origin of monocots but after the divergence of monocots from Mesodicots, enabling relatively easy differentiation between these groups (Fig. 5 and Supplementary Fig. 16) and consistent with the well-known distinctive morphologies of monocots. Two, three and nine morphological characters could be interpreted as novel before the origin of, respectively, the eudicots, Chloranthaceae and Ceratophyllaceae (Fig. 5 and Supplementary Figs 17–19). In contrast, no novel character was found for the magnoliids, providing an explanation for the previous idea, based mainly on morphological characters in classical taxonomy, that magnoliids were early angiosperms.

Figure 5: Estimated divergence times of Mesangiospermae and shared morphological characters.

(a) The estimated divergence time for each node is shown relative to the geological time scale below the cladogram. Numbers in the rectangles on some branches indicate the numbers of morphological characters shared by the corresponding lineages. (b–y) Plant photographs show the diversity of angiosperms. Rosids: A. thaliana (b), Chaenomeles cathayensis (c), Wisteria sinensis (d), Juglans regia (e); Asterids: Coreopsis basalis (f), Solanum wrightii (g), Scutellaria baicalensis (h), Pharbitis purpurea (i); Chloranthaceae: Chloranthus japonicus (j), Hedyosmum orientale (k), Sarcandra glabra (l); Ceratophyllaceae: Ceratophyllum demersum (m); Magnoliids: Magnolia denudata (n), Aristolochiae heterophyllae (o), Houttuynia cordata (p), Chimonanthus praecox (q); Monocots: Paphiopedilum henryanum (r), Cocos nucifera (s), Hippeastrum rutilum (t), Setaria viridis (u); Basal angiosperms: A. trichopoda (v), Nuphar advena (w), Cabomba caroliniana (x), Schisandra sphenanthera (y).

The plicate and completely sealed postgenital carpel (75 and 76) (Supplementary Fig. 20) might have originated before the origin of mesangiosperms, which could have enabled mesangiosperm species to produce more seeds or better protect the developing seeds when compared with most ANITA species with only ascidiate and not postgenitally sealed carpels. On the other hand, in the context of the topologies obtained from morphological data (Fig. 1d) or plastid genome data sets (Fig. 1a), this morphological novelty would have originated not before but after the origin of mesangiosperms (Supplementary Figs 21 and 22). These differences suggest that the phylogenetic relationship is critical for inferring ancestral characters and the topology here provides a potential new framework to investigate the evolution of these and other morphological, developmental and physiological characters.

Possibly early origins of angiosperms and mesangiosperms

According to the well-supported mesangiosperm topology and other deep relationships, a framework is proposed for estimating the divergence times of angiosperms, particularly for the five mesangiosperm groups, providing possible geological contexts of their rapid radiation. In general, times inferred by r8s49 were somewhat earlier (~10 million years, hereafter Myr) than those obtained by BEAST50 and only small differences were detected when using different codon positions (Supplementary Table 6), suggesting that time estimations were robust to the methodology and data used. The origin of angiosperms was estimated to be 225–240 Myr, that is, in the Middle to Late Triassic (Figs 5 and 6, and Supplementary Table 6), considerably earlier than the previously accepted 140–180 Myr21, but in agreement with those recently reported independently20,51. An earlier angiosperm origin further expands the large gap between the origin of angiosperms and the earliest undisputed angiosperm fossil, found in the Hauterivian52; thus, the oldest crown angiosperm fossils are possibly yet to be discovered. Recently, angiosperm-like pollen grains were found in the Middle Triassic, consistent with our hypothesis of an earlier origin of flowering plants; however, such an early origin is still controversial because of the lack of unequivocal meso-macro fossils (for example, fossilized flowers) at that time53. The diversification of Mesangiospermae was estimated to have initiated in the Jurassic (154–191 Myr) (Figs 5 and 6, and Supplementary Table 6), ca. 60 Myr after the angiosperm origin, consistent with an early origin of Mesangiospermae tentatively proposed by Smith et al.20 and Magallon51, but earlier than the 144 Myr estimated using whole plastid genome data12. Since the origin of Mesangiospermae, during an ~20-Myr period (instead of the 4 Myr based on plastid genomes12), the five major groups diversified successively as indicated by the well-resolved topology here, making the radiation less rapid than previously thought.

Figure 6: A chronogram showing the angiosperm divergence times as estimated by BEAST using 59 genes.

Two fossil calibration points, (1) the earliest gymnosperm fossils (ca. 290–310 Myr) and (2) the earliest fossil tricolpate pollen (~125 Myr), are marked with two solid circles.

Strikingly, the time of angiosperm origin estimated here overlaps with the origin of several insect lineages. Curculionoidea (weevils and bark beetles) and Chrysomeloidea (leaf beetles and long-horned beetles), two important groups of plant feeders, were estimated to have originated ~230 Myr; meanwhile, fossils of Diptera (flies) and Hymenoptera (bees and wasps), the most important pollinators, were also found in the Late Triassic20,54. Moreover, the rapid expansion of Mesangiospermae is congruent with the radiation of Lepidoptera (butterflies and moths), Hymenoptera (bees, ants and pollen wasps) and many kinds of flies from the Late Jurassic to the Early Cretaceous20,23,54. The coincidence in geological time of the mesangiosperm lineages and pollinating insects might provide opportunities for investigating the diversification of Mesangiospermae. We also noted that the extant crown eudicots began to diversify ~35 Myr after their separation from the clade of Chloranthaceae and Ceratophyllaceae, suggesting potential extinctions of the stem relatives or a relatively long period of ‘stasis’ before their rapid diversification into the most successful plant group.

Discussion

In this study we showed that the combination of a moderate number of carefully evaluated nuclear genes with appropriate sampling could provide robust and highly supported relationships among deep lineages of Mesangiospermae. These relationships differed substantially from those accepted by the APG III system. The topology here provides a new phylogenetic framework for ancestral character reconstruction, molecular clock estimates of divergence times and other studies, suggestive of the necessity and importance of conserved low-copy nuclear genes for evolutionary studies.

The topology obtained here is different from the one inferred by using single-copy plastid genes but congruent with the one obtained by highly conserved genes from the plastid IR region, indicating that the evolutionary histories of IR and single-copy plastid genes might be different. The evidence presented here for conflicting signals among single-copy plastid genes and between single-copy and IR plastid genes further suggests that some of the single-copy plastid genes might be unsuitable for resolving the deep relationships of angiosperms, as almost all single-copy genes have been used and conflicting signals exist. Conflicts between plastid and nuclear genomes about the position of Malpighiales, Cornales and Ericales also suggested that evidence from nuclear genes is necessary15. With rapid advances in sequencing technologies and decreasing cost, nuclear genes will probably be used more and more in molecular phylogeny.

The early origin and diversification of angiosperms proposed here were also supported by two other recent independent studies20,51; all of these proposed origins were earlier than previous estimates, providing a new temporal framework for the evolution of angiosperms and ecologically related organisms. The possible coincidental origins and divergence of major angiosperm lineages with those of major pollinator insects provide a possible environmental factor that might have contributed to the rapid diversification of mesangiosperm lineages referred to by Darwin as the ‘abominable mystery’. In addition, our results demonstrate that ‘bushes’ in the tree of life from rapid radiations can be resolved by using a moderate number of nuclear genes55, which can be identified by careful screening of probable orthologues from transcriptomes of representative taxa, a strategy generally applicable to other phylogenetic questions.

Methods

Taxon sampling and data collection

Young leaves or flower buds of 26 species (see details in Supplementary Tables 1 and 2) were collected and frozen at −80 °C. Total RNA was extracted by a modified CTAB method15 and then paired-end reads of 2 × 100 were generated using the Illumina technology with HiSeq2000 (Table 1). Short reads were assembled into longer contigs de novo using Trinity56 (trinityrnaseq_r2012-06-08) with default parameters. To obtain longer and more complete complementary DNA sequences, TGICLv2.1 (ref. 57) was also used with the parameters –P=0.98, −l=40 and −v. Thirty sequenced genomes and five EST data sets were retrieved from http://www.phytozome.net/search.php and ftp://ftp.ncbi.nih.gov/repository/UniGene/, respectively (Supplementary Table 1).

Orthologue identification and gene selection

To identify probable orthologous genes for phylogenetic analyses, HaMStR was used, as it performs well in identification of orthologues from EST and RNA-seq data35, and its utility has been tested in previous phylogenomic studies of plants32 and animals30. First, we downloaded from the Deep Metazoan Phylogeny (http://www.deep-phylogeny.org/hamstr/) 4,180 OGs, which were previously generated by comparing 9 angiosperm species with sequenced genomes (A. thaliana, Glycine max, Medicago truncatula, Populus trichocarpa, Solanum lycopersicum, Vitis vinifera, Oryza sativa, Sorghum bicolor and Zea mays)35. These 4,180 OGs were then compared with 1,989 OGs identified from the analysis of seven whole-sequenced genomes using OrthoMCL (A. thaliana, P. trichocarpa, Prunus persica, Mimulus guttatus, V. vinifera, S. bicolor and O. sativa), resulting in 931 OGs that overlapped between these two data sets. HMM files of the 931 OGs distributed with HaMStR were used to search for corresponding orthologues from other species with the parameters -est, -hmmset=magnoliophyta_hmmer3, -relaxed and -eval_limit=0.01. To identify marker genes with sufficient coverage among the taxa, OGs were selected with putative orthologues found in at least 80% of the 26 species with newly generated RNA-seq data sets (Table 1); in addition, only sequences of coding regions with a length of at least 80% of the A. thaliana homologue were retained for further analyses, ultimately resulting in 349 OGs.
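The two 80% filters described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name, the dictionary layout and the reading of both thresholds as minimums applied in this order are assumptions.

```python
# Hypothetical sketch of the coverage and length filters (not the authors' code).
MIN_TAXON_FRACTION = 0.8   # assumed minimum share of species with an orthologue
MIN_LENGTH_FRACTION = 0.8  # assumed minimum share of the reference length


def filter_orthogroups(og_hits, reference_lengths, n_species):
    """Return {og_id: {species: length}} for OGs passing both filters.

    og_hits: {og_id: {species: coding-sequence length}}
    reference_lengths: {og_id: length of the A. thaliana homologue}
    """
    kept = {}
    for og_id, hits in og_hits.items():
        # Taxon coverage: skip OGs recovered in too few species.
        if len(hits) < MIN_TAXON_FRACTION * n_species:
            continue
        # Length: retain only sequences long enough relative to the reference.
        ref_len = reference_lengths[og_id]
        kept[og_id] = {
            species: length
            for species, length in hits.items()
            if length >= MIN_LENGTH_FRACTION * ref_len
        }
    return kept
```

With 26 species, the coverage filter in this sketch requires hits in at least 21 species (0.8 × 26 = 20.8).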

Angiosperms have probably experienced a number of WGDs and subsequent gene losses32,33, making it difficult to identify orthologues. Because of the recently identified WGD before the divergence of all extant angiosperms and all seed plants, strictly defined orthologues that never experienced any duplication probably do not exist in angiosperms. Nevertheless, those genes that experienced rapid loss of one paralogue before the divergence of the species of interest can be considered as orthologues operationally. Therefore, we searched for low-copy genes that follow the well-supported, established organismal history. To minimize the effect of hidden paralogues58 and identify the most probable orthologues, 349 single-gene trees were reconstructed using RAxML36 with protein sequences of 20 representative species with well-supported relationships (Fig. 2), with the evolutionary model for each gene estimated by ProtTest v2.4 (ref. 59). Next, these gene trees were compared with the species tree.

As the informative sites of one gene are limited, it was difficult to resolve relationships among low-level taxonomic hierarchies using only one gene. Therefore, if genes of the same OG from species of a monophyletic organismal group (that is, eudicots, monocots and magnoliids) form a monophyletic gene clade, as they should, the gene was selected for further analyses; in contrast, if genes of the same OG from species of different monophyletic groups are in a supported gene clade, then this OG was excluded (see examples of ‘selected’ and ‘excluded’ genes in Supplementary Fig. 3). After careful examination, 54 genes were selected for further analyses. Combined with five genes (SMC1, SMC2, MCM5, MSH1 and MLH1) effectively used previously15, a total of 59 genes was used for investigating the relationship among Mesangiospermae. Characteristics of these 59 genes, including functional annotation and percentages of PI sites are shown in Supplementary Table 2. Gene copy number detected in each species with whole sequenced genomes is listed in Supplementary Data 1; in species with more than one copy, the paralogues were found to represent terminal branches from recent duplications. The length of protein sequences encoded by orthologous genes in each species is listed in Supplementary Data 2.

Phylogenetic analyses

Amino acid sequences of each OG were aligned using MUSCLE v3.8.31 (ref. 60) with default settings. The alignments were manually inspected to delete sequences of low quality and the poorly aligned regions were then trimmed using trimAl v1.2 (ref. 61). Single-gene trees were reconstructed with RAxML using the best-fitting evolutionary model inferred by ProtTest v2.4 (ref. 59). In species with two or more copies in one OG, the paralogues from recent duplication formed adjacent terminal branches in the gene tree; thus, only the gene with the shortest branch was retained for further analyses. Finally, amino acid sequences of 59 genes from 61 species were concatenated by SeaView62 and the length of the concatenated 59-gene amino acid matrix reached 25,589 amino acids.

ML and Bayesian trees based on the 59 protein sequences of 61 species were inferred by RAxML and MrBayes 3.2.1 (ref. 37), respectively. For ML analysis, the model was specified as JTT+I+G based on the results of ProtTest and fast BS analyses were replicated 100 times. For Bayesian analysis, one cold and three incrementally heated Markov chain Monte Carlo chains were run simultaneously with the JTT model. The Markov chain Monte Carlo convergence in Bayesian phylogenetic inference was monitored by AWTY (http://ceb.csit.fsu.edu/awty)63. Trees were sampled every 100 generations. The first 25% of trees were discarded as burn-in, with the remaining trees being used for generating the consensus tree.
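The burn-in step can be illustrated with a short sketch, assuming the sampled trees are held in a list in sampling order. The function name is hypothetical and this is not MrBayes code; it only shows which samples would contribute to the consensus tree.

```python
def discard_burnin(sampled_trees, burnin_fraction=0.25):
    """Drop the first `burnin_fraction` of samples and return the rest."""
    n_burnin = int(len(sampled_trees) * burnin_fraction)
    return sampled_trees[n_burnin:]
```

For example, if a chain had stored 10,000 trees (a number chosen only for illustration), the last 7,500 would be kept.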

To determine statistical support for other possible alternative relationships among the five major groups of Mesangiospermae (eudicots, monocots, magnoliids, Chloranthaceae and Ceratophyllaceae), all 105 potential topologies were tested using our data set (Table 2 and Supplementary Data 3). First, per-site log likelihoods for each topology were estimated by RAxML under the JTT+I+G model, and then the approximately unbiased test was conducted using CONSEL v1.20 (ref. 42).
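The figure of 105 topologies is consistent with the standard count of rooted, fully resolved trees for n labelled groups, (2n − 3)!!, which for n = 5 gives 7 × 5 × 3 × 1 = 105. The sketch below computes this count; the function name is hypothetical, and treating the five groups as a rooted problem (rooted by the ANITA grade) is our reading of the text.

```python
def count_rooted_binary_topologies(n_groups):
    """Number of rooted, fully resolved topologies for n labelled groups: (2n - 3)!!"""
    if n_groups < 2:
        raise ValueError("at least two groups are required")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for odd in range(3, 2 * n_groups - 2, 2):
        count *= odd
    return count


print(count_rooted_binary_topologies(5))  # prints 105
```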

To explore the minimal number of genes needed to resolve the relationship among the five major clades of Mesangiospermae, the relationship between the number of genes and the proportion of gene trees supporting the topology shown in Fig. 3 was studied. The number of genes ranged from 2 to 58, with increments of 2; for each number, 20 replicates of randomly selected genes were generated using the sample function implemented in R, giving a total of 580 matrices (29 gene-set sizes × 20 replicates). A tree was inferred from each matrix by RAxML with the JTT+I+G model and 100 fast bootstrap replicates.
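The resampling design can be sketched as follows. The study used the sample function in R; this Python version with random.sample is a stand-in, and the gene labels, function name and seed are hypothetical.

```python
import random


def subsample_gene_sets(gene_ids, sizes, replicates, seed=0):
    """Return (size, replicate, genes) tuples of genes drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return [
        (size, rep, sorted(rng.sample(gene_ids, size)))
        for size in sizes
        for rep in range(replicates)
    ]


genes = [f"gene{i:02d}" for i in range(1, 60)]             # 59 placeholder labels
matrices = subsample_gene_sets(genes, range(2, 59, 2), 20)  # sizes 2, 4, ..., 58
print(len(matrices))  # prints 580
```

Each tuple names the genes that would be concatenated into one matrix before tree inference.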

As the phylogenetic information varies among the 59 genes, we also ranked them based on the extent of the congruence between the single-gene tree of 20 representative species and the corresponding species tree (Fig. 2). First, each single-gene tree was treated as a condensed tree with a BS cut-off of 50%; the position of each species in the single-gene tree was scored as 1, −1 or 0 if it was congruent with, in conflict with or uncertain relative to the species tree, respectively. Then, the scores of all nodes from each single-gene tree were summed and the 59 single-gene trees were ranked by their scores, with the gene with the highest score being considered the best (Supplementary Data 4). Starting with the 16 genes that had the highest scores, additional genes were added successively in order of decreasing total score, resulting in matrices composed of 16, 25, 33, 41, 46, 50 and 55 gene sequences (Supplementary Data 4); finally, ML and Bayesian trees were inferred using RAxML and MrBayes, respectively, with the same settings as described above.
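The scoring and ranking rule can be sketched as below, assuming the congruent/conflict/uncertain calls for each gene tree have already been made by comparison with the species tree. The labels, function names and input layout are hypothetical.

```python
POINTS = {"congruent": 1, "conflict": -1, "uncertain": 0}


def congruence_score(calls):
    """Sum the per-position scores (1, -1 or 0) for one single-gene tree."""
    return sum(POINTS[call] for call in calls)


def rank_genes(calls_by_gene):
    """Return (gene, score) pairs ordered from the highest to the lowest score."""
    scores = {gene: congruence_score(calls) for gene, calls in calls_by_gene.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Genes with equal scores keep their input order in this sketch; how ties were handled in the study is not stated.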

To evaluate the effects of different evolutionary models on the species topology, ML trees using models other than the best-fitting one (that is, JTT+CAT, JTT+G, WAG+CAT and DAYHOFF+CAT) were inferred by RAxML.

Detection of possible sequence biases

To investigate possible effects of sequence biases in our phylogenomic data set on the phylogenetic reconstruction, several related tests were performed using TreSpEx, which was recently developed for detection of possible sequence biases, including LBA and saturation46.

To detect long branches, this programme defined a parameter called LB score based on PD values between taxa in a gene tree, where PD is the sum of lengths of branches that link two nodes in a tree. For each gene, the mean pairwise PD of taxon i to all other taxa and the average pairwise PD across all taxa in the single-gene tree were estimated. The LB score of taxon i (LBi) in each gene was then determined by the relative value of the mean PD of taxon i to the average PD across all taxa. Fifty-nine single-gene trees were used as input files for TreSpEx. TreSpEx then provided two parameters derived from the LB score for each gene: the average of the upper quartile of LB scores and the s.d. of LB scores for comparisons between genes. Density plots (distributions) of these two parameters for the 59 genes were generated with the R programme and shoulder areas that deviated from the normal distribution were filled with red. Genes whose mean values of the upper quartile or the s.d. of LB scores were found in the right shoulders were considered as having long branches. Next, these long-branch genes were excluded from the 59-gene amino acid matrix and the ML tree was inferred by RAxML as mentioned above. In addition, to identify genes that behaved as long branches in a specific taxon, taxon-specific LB scores for each gene were calculated and a heatmap of these scores was generated with hierarchical clustering. Genes with the highest LB scores (above 150) were pruned from our data set and the remaining sequences were used to reconstruct the ML tree using RAxML with the JTT+I+G model.
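As an illustration of the LB score, the sketch below assumes the percentage-deviation form LBi = (mean PD of taxon i / average PD across all taxa − 1) × 100, which matches the verbal definition above and the scale of the 150 cut-off. The exact formula is an assumption here and the function name is hypothetical; this is not TreSpEx itself.

```python
def lb_scores(pd):
    """Compute LB scores from a symmetric patristic distance table.

    pd: {taxon: {other_taxon: patristic distance}} for one gene tree.
    """
    taxa = sorted(pd)
    # Mean pairwise PD from each taxon to all other taxa.
    mean_to_others = {
        t: sum(pd[t][u] for u in taxa if u != t) / (len(taxa) - 1)
        for t in taxa
    }
    # Average pairwise PD across all taxa (every taxon has the same number of partners).
    overall = sum(mean_to_others.values()) / len(taxa)
    return {t: (mean_to_others[t] / overall - 1) * 100 for t in taxa}
```

Under this form, a taxon on a long branch gets a positive score, a taxon closer than average to the others gets a negative score, and the scores within one gene tree sum to zero.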

The degree of saturation of each gene can be determined using the linear regression of PD and uncorrected distances p. The p distance is the proportion of sites that differ between two sequences, without correction for multiple changes47. The linear regression of each gene was evaluated by either the slope or the R² value. Fifty-nine single-gene trees and alignments of 61 species were used as input files for TreSpEx. First, the PD matrix and the p matrix of each gene were calculated and the linear regression between them was generated for each gene. Second, distributions of the slope or R² values were plotted with the aid of R. The rationale of TreSpEx is that the better the fit to the linear regression, that is, the larger the slope or R² value, the less saturated the data. Genes located in the left shoulders of the slope or R² graphs were pruned from the 59-gene data set and the ML tree was reconstructed with RAxML.
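The saturation statistic can be sketched with an ordinary least-squares fit. Placing PD on the x axis and the p distance on the y axis is an assumption, as are the function names; the sketch only illustrates how the slope and R² respond when p distances level off.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ (no correction for multiple changes)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def linear_fit(x, y):
    """Return (slope, R^2) of the least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, (sxy ** 2) / (sxx * syy)
```

When p distances plateau while patristic distances keep growing, both the slope and R² drop, which is the pattern flagged as saturation.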

Taxon sampling analyses

To evaluate the influence of taxon sampling on phylogenetic reconstruction, several analyses were performed using the concatenated 59 genes from different subsets of taxa. Basal lineages of major groups are known to be crucial for reconstructing phylogeny31,64; hence, three Piperales species in magnoliids or three basal monocot species were deleted. ML and Bayesian trees using the pruned matrices were reconstructed. To test whether additional representatives of small eudicot groups that are sisters to major eudicot groups could affect the topology, we added orthologous genes from the recently sequenced genome of N. nucifera (Proteales)65, the publicly available transcriptome of E. californica (Ranunculales) from NCBI (http://www.ncbi.nlm.nih.gov/SRA) and the transcriptome of M. parviflora (Sabiaceae) that we recently obtained.

Moreover, three Pinaceae species (P. sitchensis, P. glauca and P. taeda), representatives of the largest clade of gymnosperms, were included. EST data sets of these three species were downloaded from NCBI (ftp://ftp.ncbi.nih.gov/repository/UniGene/). Orthologues of the 59 genes from these three species were obtained by HaMStR as described above, then ML and Bayesian trees were inferred using RAxML and MrBayes, respectively.

Reconstruction of angiosperm phylogeny using plastid genomes

To investigate possible factors underlying the difference between the APG III topology from chloroplast markers for the five major groups and the hypothesis presented in this study, we re-examined the data sets from 83 plastid genes of 86 species and 40 genes from the plastid IR of 244 species. First, the percentage of PI sites of each gene was estimated by PAUP66. Next, ML trees using plastid genes with different percentages of informative sites were inferred by RAxML with the GTRCAT model, and the BS analysis was repeated 100 times.

Re-analysis of 110 morphological characters

One hundred and ten morphological characters sorted by Endress and Doyle43 were reinvestigated with Mesquite (version 2.75)67. As the five major groups of Mesangiospermae were also sampled in their study, we simply tested alternative relationships among the five major lineages based on their sampling. First, the minimum number of steps required for each of the 105 possible topologies was inferred using the parsimony model. Second, ancestral states of each character were inferred using the likelihood model in the context of our topology (Fig. 3) to detect potential synapomorphies. As polymorphic or uncertain taxa are not supported by the likelihood model, those taxa with uncertain characters and missing data were removed from the matrix for each character. Then, ancestral states with probabilities were obtained. To detect a novel character specific to one group, for example, the eudicots, the ancestral state of the eudicots was compared with that of the upper node connecting both eudicots and the clade of Chloranthaceae and Ceratophyllaceae. If the ancestral states of these two nodes were supported significantly and represented different states, then a novel character was inferred to have arisen before the origin of eudicots. To test whether different topologies could affect the ancestral reconstruction of morphological characters, the ancestral state of each character was also inferred in the context of two other topologies, that is, the one obtained by Endress and Doyle43 based on morphological data (Fig. 1d) and the one accepted by APG III (Fig. 1a).

Divergence time estimation

For divergence time estimation, besides the 61 species used for phylogenetic analyses, Selaginella moellendorffii, a basal vascular plant with a sequenced genome, was also included. S. moellendorffii is helpful for determining the crown node of seed plants, which is critical as a deep fossil-constrained node and has been widely used in previous dating analyses12. The S. moellendorffii orthologues of the 59 genes were retrieved using HaMStR as described above.

We estimated the divergence times based on three data partitions, consisting of the first codon positions, the first and second codon positions, and all three codon positions, respectively. Penalized likelihood (PL) implemented in r8s (v.1.7.1)49 and a Bayesian relaxed clock in BEAST (v.1.7.5)50 were used to estimate the divergence times of extant angiosperm lineages, as a constant substitution rate across the phylogenetic tree was rejected (P<0.01) for all three data partitions by the likelihood ratio test conducted in PAUP 4.0 beta10 (ref. 66).

For the PL method, the ML tree with branch length generated by RAxML was used as the input tree. The topologies inferred by these three nucleotide matrices are different from the one shown in Fig. 3 (data not shown) regarding the relationships among the five major clades of Mesangiospermae, possibly because of mutation saturation and homoplasy. Therefore, when inferring the ML tree by RAxML, the constraint tree was given, which forced the relationships among the five major clades of Mesangiospermae as shown in Fig. 3.

The outgroup Selaginella was pruned as required by r8s. Cross-validation was used to determine the best smoothing value for our data. After testing a range of smoothing parameters from 0.01 to 320 (cvstart=−2; cvinc=0.5; cvnum=10), the smoothing parameter of all three codon positions, the first and second positions and the first codon position was set to 3.5, 10 and 10, respectively. These low smoothing values also indicate a large deviation from the strict molecular clock hypothesis. One hundred BS trees with branch lengths were also generated using RAxML and used as input trees to calculate the confidence time intervals. The s.e. and 95% confidence time interval of a few nodes of interest were estimated and summarized across the 100 BS trees. The TN algorithm was used and all other parameters were set to their defaults in all PL analyses above.
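The tested smoothing range can be reproduced with a short sketch, assuming the r8s cross-validation settings are read on a log10 scale, so that the i-th value is 10^(cvstart + i × cvinc). This reading and the function name are assumptions for illustration.

```python
def smoothing_grid(cvstart, cvinc, cvnum):
    """Smoothing values tested in cross-validation, assuming log10 spacing."""
    return [10 ** (cvstart + i * cvinc) for i in range(cvnum)]


grid = smoothing_grid(-2, 0.5, 10)
print(grid[0], grid[-1])  # smallest and largest smoothing values
```

Under this assumption the ten values run from 10^−2 = 0.01 to 10^2.5 ≈ 316, in line with the reported range of 0.01 to 320.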

For the BEAST analysis, an uncorrelated lognormal relaxed clock (UCLN) was used with the GTR+I+Γ nucleotide substitution model, and Yule speciation was specified for all three matrices. Two independent replications, each with 60,000,000 generations, were run with sampling every 5,000 generations. The stationarity of the chains and convergence of the two runs were monitored by Tracer (v. 1.5), determining whether the effective sample size of all parameters was larger than 200 as recommended. The files from the two independent runs were combined using LogCombiner (v. 1.7.5). The chronogram with nodal heights and 95% confidence time intervals was generated with TreeAnnotator (v. 1.7.5), with the first 5,000 trees being discarded as burn-in; finally, the chronogram was displayed by Figtree (v. 1.0).

The earliest gymnosperm fossils (ca. 290–310 Myr) assigned to cycads68 and conifers69, and the earliest fossil tricolpate pollen (~125 Myr) associated with eudicots70 were used as two calibration points. For the PL analysis, the node of crown seed plants was constrained with a minimum age of 290 Myr and a maximum age of 310 Myr, whereas the age of crown eudicots was treated as either a fixed or a minimum age of 125 Myr. For the BEAST analysis, the crown seed plant node was constrained using a uniform distribution with a lower bound of 290 Myr and an upper bound of 310 Myr; the node for crown extant eudicots was constrained using an exponential prior with an offset of 125 Myr and a mean value of 1. Our major fossil constraints are comparable to those used for calibrating the chloroplast genome data in previous studies, except that several different additional internal fossils were also included in previous studies12,16.
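To show how tightly an offset exponential prior concentrates probability near the fossil age, the sketch below computes its quantiles from the closed form offset − mean × ln(1 − p). The function name is hypothetical, and the calculation describes the prior only, not the posterior ages estimated by BEAST.

```python
import math


def offset_exponential_quantile(prob, offset, mean):
    """Quantile of an exponential distribution with the given mean, shifted by offset."""
    if not 0.0 <= prob < 1.0:
        raise ValueError("prob must be in [0, 1)")
    return offset - mean * math.log(1.0 - prob)


median_age = offset_exponential_quantile(0.5, offset=125.0, mean=1.0)
upper_95 = offset_exponential_quantile(0.95, offset=125.0, mean=1.0)
```

With an offset of 125 Myr and a mean of 1, the prior median is about 125.7 Myr and 95% of the prior mass lies below about 128 Myr, so this calibration acts almost like a hard minimum at 125 Myr.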

Additional information

Accession Codes: The nucleotide sequences of the 59 genes used in this study have been deposited in the GenBank nucleotide database, with the accession codes KM397373 to KM400584. In addition, the nucleotide and amino acid sequence alignments of the 59 genes have been deposited in TreeBASE under accession code S16175.

How to cite this article: Zeng, L. et al. Resolution of deep angiosperm phylogeny using conserved nuclear genes and estimates of early divergence times. Nat. Commun. 5:4956 doi: 10.1038/ncomms5956 (2014).