Background

Echinoidea Leske, 1778 is a clade of marine animals including species commonly known as sea urchins, heart urchins, sand dollars and sea biscuits. It constitutes one of the five main clades of extant Echinodermata, typically pentaradially symmetric animals, which also includes highly distinctive components of the marine fauna such as sea lilies and feather stars (crinoids), starfish (asteroids), brittle stars (ophiuroids) and sea cucumbers (holothuroids). Fossil evidence suggests that these lineages, as well as a huge diversity of extinct relatives, trace their origins to the early Paleozoic [1, 2]. Their deep and rapid divergence from one another, coupled with long stem groups leading to the origin of extant forms, for a long time impeded a robust resolution of their interrelationships. Nonetheless, a consensus has emerged supporting a close relationship between echinoids and holothuroids, as well as between asteroids and ophiuroids, with crinoids as sister to them all [3,4,5].

A little over 1000 extant species of echinoids have been described [6], comprising a radiation whose last common ancestor likely arose during the Permian [7, 8], although the stem of the group extends back to the Ordovician [9]. Extant echinoid species richness is vastly eclipsed by the more than 10,000 species that constitute the rich echinoid fossil record [10]. Nonetheless, it seems safe to assume that echinoid diversity at any given point in time has never exceeded that of the present day [11, 12]. Today, sea urchins are conspicuous occupants of the marine realm, inhabiting all benthic habitats from the poles to the Equator and from intertidal to abyssal zones [13]. As the main epifaunal grazers in many habitats, sea urchins contribute to the health and stability of key communities such as kelp forests [14] and coral reefs [15, 16]. Likewise, bioturbation associated with the feeding and burrowing activities of a large diversity of infaunal echinoids has a strong impact on the structure and function of marine sedimentary environments [17, 18]. Figure 1 provides a snapshot of their morphological diversity. Since the mid-nineteenth century, research on sea urchins has played a major role in modelling our understanding of animal fertilization and embryology [19, 20], with many species becoming model organisms in the field of developmental biology. This line of research was radically expanded recently through the application of massive sequencing methods, resulting in major breakthroughs in our understanding of the organization of deuterostome genomes and the gene regulatory networks that underlie embryogenesis [21, 22].

Fig. 1

Morphological and taxonomic diversity of echinoids included in this study. a Prionocidaris baculosa. b Lissodiadema lorioli. c Caenopedina hawaiiensis. d Asthenosoma varium. e Colobocentrotus atratus. f Strongylocentrotus purpuratus. g Pilematechinus sp. h Brissus obesus. i Dendraster excentricus. j Clypeaster subdepressus. k Conolampas sigsbei. l Current echinoid classification, modified from [6]. Clade width is proportional to the number of described extant species; clades shown in white have representatives included in this study (see Table 1). Colored pentagons are used to identify the clade to which each specimen belongs, and also correspond to the colors used in Fig. 2. Throughout, nomenclatural usage follows that of [6], in which full citations to authorities and dates for scientific names can be found. Photo credits: G.W. Rouse (a, c, e-i), FLMNH-IZ team (b), R. Mooi (d, j), H.A. Lessios (k)

The higher-level taxonomy and classification of both extant and extinct sea urchins have a long history of research (reviewed by [12, 23]). The impressive fossil record of the group, as well as the high complexity of their plated skeletons (or tests), have allowed lineages to be readily identified and their evolution tracked through geological time with a precision unlike that possible for other clades of animals (e.g., [24,25,26,27,28]). Morphological details of the test have also been used to build large matrices for phylogenetic analysis [9, 12, 25, 29,30,31,32]. The most comprehensive of these morphological phylogenetic analyses [12] has since served as a basis for the current taxonomy of the group (Fig. 1l). This analysis confirmed several key nodes of the echinoid tree of life that were also supported by previous efforts, such as the position of cidaroids (Fig. 1a) as sister to all other sea urchins (Fig. 1b-k), which are united in the clade Euechinoidea Bronn, 1860, and the subdivision of the latter into the predominantly deep-sea echinothurioids (Fig. 1d) and the remainder of euechinoid diversity (Acroechinoidea Smith, 1981). Likewise, Kroh and Smith [12] confirmed the monophyly of some major clades such as Echinacea Claus, 1876, including all the species currently used as model organisms and their close relatives (Fig. 1e, f); and Irregularia Latreille, 1825, a group easily identified by their antero-posterior axis and superimposed bilateral symmetry [33]. The irregular echinoids were shown to be further subdivided into the extant echinoneoids, atelostomates (Fig. 1g, h) (including the heart urchins) and neognathostomates (Fig. 1i-k) (including the sand dollars). Other relationships, however, proved more difficult to resolve. For example, the pattern of relationships among the main lineages of acroechinoids received little support and was susceptible to decisions regarding character weighting, revealing a less clear-cut picture [10, 12].

In stark contrast with these detailed morphological studies, molecular efforts have lagged. Next-generation sequencing (NGS) efforts have been applied to relatively small phylogenetic questions, concerned with the resolution of the relationships within Strongylocentrotidae Gregory, 1900 [34], a clade of model organisms, as well as among their closest relatives [35]. Although several studies have attempted to use molecular data to resolve the backbone of the sea urchin phylogeny [8, 10, 25, 30, 36, 37], all these have relied on just one to three genes, usually those encoding ribosomal RNA. The lack of comprehensive sampling of loci across the genome thus limits the robustness of these phylogenies. Furthermore, recent analyses have suggested that ribosomal genes lack sufficient phylogenetic signal to resolve the deepest nodes of the echinoid tree with confidence [8].

In light of this, it is not clear how to reconcile the few—yet critical—nodes for which molecular and morphological data offer contradicting resolutions. For example, most morphological phylogenies strongly supported the monophyly of sea biscuits and sand dollars (Clypeasteroida L. Agassiz, 1835), and their origin from a paraphyletic assemblage of lineages collectively known as “cassiduloids”, including Echinolampadoida Kroh & Smith, 2010 and Cassiduloida Claus, 1880 among extant clades, as well as a suite of extinct lineages [12, 29, 31, 38]. In contrast, all molecular phylogenies to date that incorporated representatives of both groups have resolved extant “cassiduloids” nested within clypeasteroids, sister to only one of its two main subdivisions, the scutelline sand dollars [8, 10, 25, 30]. This molecular topology not only undermines our understanding of the evolutionary history of one of the most ecologically and morphologically specialized clades of sea urchins [38, 39], it also implies a strong mismatch with the fossil record, requiring ghost ranges of the order of almost 100 Ma for some clypeasteroid lineages [10, 40]. Likewise, the earliest divergences among euechinoids, including the relative positions of echinothurioids and a collection of lineages collectively known as aulodonts (micropygoids, aspidodiadematoids, diadematoids and pedinoids [41]), have consistently differed based on morphological and molecular data, often with poor support provided by both [8, 10, 12, 25, 40]. Finally, previous studies have resolved different lineages of regular echinoids, including diadematoids, aspidodiadematoids, pedinoids, salenioids and salenioids + echinaceans, as sister to Irregularia [8, 12, 25, 37].

Given the outstanding quality of their fossil record and our thorough understanding of their development, sea urchins have the potential to provide a singular basis for addressing evolutionary questions in deep-time [42], providing access to the developmental and morphological underpinnings of evolutionary innovation (e.g., [8, 43]). However, uncertainties regarding the phylogenetic history of sea urchins propagate into all of these downstream comparative analyses, seriously limiting their potential in this regard. Here, we combine available genome-scale resources with de novo sequencing of transcriptomes to perform the first phylogenomic reconstruction of the echinoid tree of life. Our efforts provide a robust evolutionary tree for this ancient clade, made possible by gathering the first NGS data for many of its distinct lineages. We then explore some important morphological transformations across the evolutionary history of the clade.

Results

Several transcriptomic and genomic datasets are publicly available for sea urchins and their closest relatives, the products of multiple sequencing projects [44, 45] stretching back to the sequencing of the genome of the purple sea urchin [46]. A subset of these datasets was employed here and complemented with whole transcriptomic sequencing of 17 additional species, selected to cover as much taxonomic diversity as possible. In the end, 32 species were included in the analyses, including 28 echinoids plus 4 outgroups. A complete list of these, including details on specimen sampling for all newly generated data, as well as SRA and Genome accession numbers, is provided in Table 1. All analyses were performed on a 70% occupancy matrix composed of 1040 loci and 331,188 amino acid positions (Additional file 1: Figure S1), as well as constituent gene matrices.
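As a minimal sketch of what a 70% occupancy threshold implies, the snippet below keeps only loci that are sampled for at least 70% of the taxa. It assumes that occupancy is measured per locus as the fraction of taxa with a sequence; the function name, the data layout and the toy loci are hypothetical and do not reproduce the pipeline used in this study.

```python
def filter_by_occupancy(loci, n_taxa, min_occupancy=0.7):
    """Keep loci present in at least `min_occupancy` of the taxa.

    `loci` maps a locus name to the set of taxa for which a sequence exists
    (a hypothetical layout chosen for illustration).
    Returns a dict mapping each retained locus to its occupancy.
    """
    kept = {}
    for name, taxa in loci.items():
        occupancy = len(taxa) / n_taxa
        if occupancy >= min_occupancy:
            kept[name] = occupancy
    return kept


# Toy example with 10 taxa labelled 0..9.
toy_loci = {
    "locus_A": set(range(10)),  # sampled for all taxa
    "locus_B": set(range(7)),   # sampled for 7 of 10 taxa, exactly at the threshold
    "locus_C": set(range(5)),   # sampled for 5 of 10 taxa, below the threshold
}
retained = filter_by_occupancy(toy_loci, n_taxa=10)
print(sorted(retained))
```

Whether the threshold is inclusive, and whether outgroups count towards the number of taxa, are choices made here for illustration only.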

Table 1 Information on the species and sequences used in the analysis. Sampling locality is shown for newly sequenced taxa, citations for data obtained from the literature. For deep-sea specimens, sampling depth is also reported

Initial analyses were complicated by the problem of resolving the position of Arbacia punctulata; different methods resolved this species as either a member of Echinacea, as suggested by previous morphological and molecular studies [10, 12], or as the sister lineage to all remaining euechinoids. Further analyses suggested that this second, highly conflicting topology might be the consequence of sequence contamination (see Additional file 1: Figure S2). Topologies obtained after attempting to control this problem showed strong support for a monophyletic Echinacea—including Arbacia punctulata, Stomopneustes variolaris and camarodonts—although the relationships among these three lineages received only weak support (Additional file 1: Figure S2). Given the ad hoc nature of our approach, we regard this result as preliminary, and excluded Arbacia from all subsequent analyses.

Phylogenomic matrices are the product of complex evolutionary histories which are only partially captured by our current models of molecular evolution. This often results in fully supported yet incorrect topologies, as all methods are susceptible to systematic biases in various ways and to different degrees [47, 48]. In order to explore the effects of model selection, phylogenetic inference was performed on the concatenated alignment using a diversity of procedures, including maximum likelihood (ML) inference using two different mixture models and the best-fit partitioning scheme, as well as Bayesian inference (BI) under site-homogenous and site-heterogenous models (see Methods). All five methods of probabilistic inference recovered exactly the same phylogeny (Fig. 2a), showing the robustness of our results to the implementation of different approaches to model molecular evolution. Furthermore, support was maximum for almost all nodes across all methods, and no other tree was found in the credible set of topologies explored by either of the BI methods. This phylogeny shows strong agreement with the current higher-level classification of echinoids, supporting the monophyly of most previously recognized clades classified at or above the level of order. These include the position of Cidaroida as sister to all other echinoids, the monophyly and close relationship of Echinacea and Microstomata (including all sampled irregular echinoids), and the subdivision of the latter into atelostomates and neognathostomates (as labeled on the tree, Fig. 2a). Relationships at lower taxonomic levels are beyond the scope of this study, as only one or two species per major clade were sampled, with the exception of camarodonts and scutelline sand dollars. Internal relationships among camarodonts fully agree with recently published estimates based on mitochondrial genomes [35], even though our taxonomic sampling differs. For scutelline sand dollars, our phylogeny confirms a close relationship of Dendrasteridae to Echinarachniidae, as suggested by early DNA hybridization assays [49], rather than between Dendrasteridae and Mellitidae, as previously argued based on morphological evidence [9, 12, 38].

Fig. 2

a Maximum likelihood phylogram corresponding to the unpartitioned analysis. The topology was identical across all five probabilistic methods employed, and all nodes attained maximum support except for the node at the base of Scutellina, which received a bootstrap frequency of 97 and 98 in the maximum likelihood analyses under the LG4X and PMSF mixture models, respectively (see Methods). Circles represent number of genes per terminal. Numbered nodes denote novel taxon names proposed or nomenclatural amendments (see Discussion), and are defined on the top right corner. b Distance of each ingroup species to the most recent common ancestor of echinoids, which provides a metric for the relative rate of molecular evolution. Dots correspond to mean values out of 2000 estimates obtained by randomly sampling topologies from the post burn-in trees from PhyloBayes (using the CAT-Poisson model), which better accommodates scenarios of rate variation across lineages. Lines show the 95% confidence interval
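The summary shown in panel b can be illustrated with a short sketch: for each sampled tree, branch lengths are summed from a tip up to the root, and the resulting distances are summarised by their mean and an empirical 95% interval. The tree encoding (a child-to-parent map), the nearest-rank percentile and the toy trees are assumptions made for illustration; the sketch also assumes each tree has already been pruned so that its root is the most recent common ancestor of the ingroup.

```python
import statistics


def root_to_tip(tree, tip):
    """Sum branch lengths from `tip` up to the root.

    `tree` maps each node to (parent, branch_length); the root maps to (None, 0.0).
    """
    total = 0.0
    node = tip
    while tree[node][0] is not None:
        parent, length = tree[node]
        total += length
        node = parent
    return total


def percentile(sorted_values, q):
    """Simple nearest-rank percentile on already sorted values."""
    index = round(q * (len(sorted_values) - 1))
    return sorted_values[index]


def summarise(distances):
    """Return the mean and the 2.5% and 97.5% percentiles of the sampled distances."""
    ordered = sorted(distances)
    return statistics.mean(ordered), percentile(ordered, 0.025), percentile(ordered, 0.975)


# Three toy "posterior" trees sharing one topology but differing in branch lengths.
sampled_trees = [
    {"root": (None, 0.0), "n1": ("root", 0.10 * s),
     "tipA": ("n1", 0.20 * s), "tipB": ("root", 0.05 * s)}
    for s in (0.9, 1.0, 1.1)
]
dists_A = [root_to_tip(t, "tipA") for t in sampled_trees]
mean_A, low_A, high_A = summarise(dists_A)
```

Taxa whose intervals do not overlap under such a summary would be read as evolving at different relative rates.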

On the other hand, our topology conflicts with current echinoid classification (Fig. 1l) in two main aspects. First, it does not recover echinothurioids as sister to the remaining euechinoids, therefore contradicting the monophyly of Acroechinoidea. Instead, Echinothurioida is recovered as a member of a clade that also incorporates the lineages of aulodonts that were sampled—pedinoids and diadematoids (Fig. 1b, c). Second, and more surprisingly, it rejects the monophyly of the sea biscuits and sand dollars, proposing instead a sister relationship between Conolampas sigsbei (an echinolampadoid) and only one of the two main subdivisions of clypeasteroids, the scutellines. Both of these topologies were recovered by previous molecular analyses [8, 10, 25], but were disregarded due to the perceived strong conflict with morphological data [12, 40].

We further explored coalescent-based inference using ASTRAL-II [50], which recovered a very similar topology to the other approaches. Notably, however, it strongly supported the placement of Conolampas in an even more nested position, inside the clade formed by scutelline sand dollars, sister to Scutelliformes Haeckel, 1896 (Fig. 3a). Exploration of gene tree incongruence using a supernetwork approach revealed topological conflicts among gene trees in the resolution of the Conolampas + scutelline clade, with Conolampas, Echinocyamus and scutelliforms forming a reticulation (Fig. 3a, inset). We hypothesize this to be the consequence of high levels of gene tree error caused by the heterogeneity in rates of evolution among the included lineages, with Conolampas evolving significantly slower, and Echinocyamus significantly faster, than the scutelliforms (as shown by non-overlapping 95% confidence intervals in Fig. 2b). To test this hypothesis, we performed species tree inference with ASTRAL-II using approximately a third of the gene trees, selecting those derived from genes with the lowest levels of both saturation and rate heterogeneity across lineages (Fig. 3c). The resulting topology agrees with those obtained from concatenation approaches in every detail, with the position of Conolampas shifting to become sister to the scutellines with a relatively strong local posterior probability (localPP) of 0.91 (Fig. 3b). In contrast, most species trees derived from equal-sized subsets of randomly selected gene trees provide strong support for placing Conolampas in disagreement with the position obtained by other methods (average localPP = 0.92; Fig. 3d). The few replicates in which Conolampas is recovered as sister to the scutellines (16%) receive low support values (average localPP = 0.51; Fig. 3d).
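The logic of this comparison, a targeted subset of genes against equal-sized random subsets, can be sketched as follows. The ranking rule (sum of the ranks for saturation and rate heterogeneity), the function names and the toy values are all assumptions made for illustration; the exact selection criterion used here is not restated in this section, and the species tree inference for each subset would be run separately in ASTRAL-II.

```python
import random


def select_low_bias_genes(metrics, fraction=1 / 3):
    """Keep the fraction of genes with the lowest combined rank for both metrics.

    `metrics` maps gene -> (saturation, rate_heterogeneity); lower is assumed
    to be better for both. The combined-rank rule is a hypothetical choice.
    """
    genes = list(metrics)
    sat_rank = {g: r for r, g in enumerate(sorted(genes, key=lambda g: metrics[g][0]))}
    het_rank = {g: r for r, g in enumerate(sorted(genes, key=lambda g: metrics[g][1]))}
    ranked = sorted(genes, key=lambda g: (sat_rank[g] + het_rank[g], g))
    n_keep = max(1, round(len(genes) * fraction))
    return ranked[:n_keep]


def random_subsets(genes, size, replicates, seed=0):
    """Draw `replicates` random gene subsets of a fixed size, without replacement."""
    rng = random.Random(seed)
    pool = sorted(genes)
    return [rng.sample(pool, size) for _ in range(replicates)]


# Toy metrics for six genes: (saturation, rate heterogeneity).
toy_metrics = {
    "g1": (0.1, 0.2), "g2": (0.2, 0.1), "g3": (0.5, 0.5),
    "g4": (0.9, 0.3), "g5": (0.3, 0.9), "g6": (0.8, 0.8),
}
selected = select_low_bias_genes(toy_metrics)
replicates = random_subsets(toy_metrics, size=len(selected), replicates=100)
```

Matching the size of the random subsets to the targeted subset separates the effect of which genes are kept from the effect of simply using fewer genes.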

Fig. 3

Phylogenetic inference using the coalescent-based summary method ASTRAL-II. a Phylogeny obtained using all 1040 gene trees. The phylogeny conflicts with that obtained using all other methods by placing Conolampas sigsbei inside Scutellina, sister to Scutelliformes. The neognathostomate section of a supernetwork built from gene tree quartets is also depicted, showing a reticulation involving Conolampas, Echinocyamus and scutelliforms. b Phylogeny obtained using 354 gene trees, selected to minimize the negative effects of saturation and across-lineage rate heterogeneity. The position of Conolampas shifts to become sister to Scutellina (as in all other methods), with relatively strong support. To emphasize the shift in topology between the two, only neognathostomate clades have been colored (as in Fig. 2), and nodes have maximum local posterior probability unless shown. c Values of the two potentially confounding factors across all genes. Genes in red were excluded from the analysis leading to the topology shown in b. Histograms for both variables are shown next to the axes. d Summary of the results obtained performing inference with ASTRAL-II after deleting 66% of genes selected at random (100 replicates). Most replicates showed the same topology as in a. Only 16% placed Conolampas as sister to Scutellina (top), and even among them the support for this resolution was generally weak (bottom)

Finally, we used a series of topological tests to assess the strength of evidence for our most likely topology against the two traditional hypotheses of relationships with which it conflicts most strongly: the monophyly of Acroechinoidea and Clypeasteroida, clades that are supported by morphological data [12] and recognized in the current classification of echinoids (Fig. 1l). SOWH tests [51] strongly rejected monophyly in both cases (both P values < 0.01). We were able to trace the signal opposing the monophyly of these two clades down to the gene level, with a predominant fraction of genes showing support for the novel position of Echinothurioida united with Pedinoida and Diadematoida, as well as for the position of echinolampadoids as sister to the scutelline sand dollars (Fig. 4). Genes supporting these novel groupings showed strong preference for them, while the comparatively smaller fraction of genes favoring the traditional resolutions did so only weakly.

Fig. 4

Distribution of phylogenetic signal for novel resolutions obtained in our phylogenomic analyses. Signal is measured as the difference in gene-wise log-likelihood scores (δ values) for the unconstrained (green) and constrained topologies enforcing monophyly of Acroechinoidea (top, red) or Clypeasteroida (bottom, blue). The same results are shown on the right, except that values are expressed as absolute differences and genes are ordered following decreasing δ values to show the overall difference in support for both alternatives
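A minimal sketch of how gene-wise δ values can be derived from site-wise log-likelihoods is given below. It assumes that per-site scores are available for both topologies over the same concatenated alignment, that gene boundaries are given as half-open index ranges, and that δ is taken as unconstrained minus constrained, so that positive values favor the unconstrained tree. The names and toy numbers are hypothetical.

```python
def genewise_delta(site_lnl_unconstrained, site_lnl_constrained, partitions):
    """Return gene -> delta, the difference in summed site log-likelihoods.

    `partitions` maps gene -> (start, end), half-open site indices in the
    concatenated alignment (an assumed layout).
    """
    deltas = {}
    for gene, (start, end) in partitions.items():
        lnl_u = sum(site_lnl_unconstrained[start:end])
        lnl_c = sum(site_lnl_constrained[start:end])
        deltas[gene] = lnl_u - lnl_c
    return deltas


# Toy alignment of six sites split into two genes of three sites each.
lnl_unconstrained = [-2.0, -1.5, -3.0, -1.0, -2.5, -2.0]
lnl_constrained = [-2.5, -2.0, -3.5, -0.5, -2.5, -2.0]
toy_partitions = {"geneA": (0, 3), "geneB": (3, 6)}

deltas = genewise_delta(lnl_unconstrained, lnl_constrained, toy_partitions)

# Order genes by decreasing delta, as in the right-hand panels of the figure.
ordered = sorted(deltas, key=deltas.get, reverse=True)
```

In this toy case geneA favors the unconstrained topology and geneB weakly favors the constrained one.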

Furthermore, we were unable to detect evidence that this signal arises from non-historical sources. The set of genes supporting these novel topologies is not enriched in potentially biasing factors, including compositional heterogeneity, among-lineage rate variation, saturation, and amount of missing data (multivariate analysis of variance (MANOVA), P = 0.130 and 0.469 for clypeasteroid and acroechinoid monophyly constraints, respectively; see Fig. 5). In fact, a multiple linear regression model using these variables as predictors of gene-wise δ values (i.e., the difference in log-likelihood score for constrained and unconstrained ML topologies for each individual locus) is also non-significant (P = 0.202 and 0.160), explaining in each case less than 3% of total variance in δ values. Thus, we detect no evidence that the support for these novel hypotheses stems from anything other than phylogenetic history.
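To make the "fraction of variance explained" concrete, the sketch below fits an ordinary least squares regression in plain Python and reports R². It is an illustration under stated assumptions rather than the analysis performed here: the data are invented, the helper name is hypothetical, and the significance test behind the reported P values is not implemented.

```python
def ols_r_squared(predictors, response):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, R^2)."""
    rows = [[1.0] + list(x) for x in predictors]  # prepend intercept column
    k = len(rows[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, response))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coefs = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(response) / len(response)
    ss_res = sum((y - f) ** 2 for y, f in zip(response, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in response)
    return coefs, 1.0 - ss_res / ss_tot


# Case 1: the response is fully determined by two predictors (y = 1 + 2*x1 + 3*x2).
coefs_fit, r2_fit = ols_r_squared([(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)],
                                  [1, 3, 4, 6, 8])

# Case 2: the response has no linear relationship with the predictor.
coefs_null, r2_null = ols_r_squared([(-1,), (0,), (1,)], [1, -2, 1])
```

An R² close to zero, as in the second case, corresponds to the situation described above, in which the candidate biasing factors explain very little of the variation in δ.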

Fig. 5

Exploration of potential non-phylogenetic signals biasing inference. Gene-wise δ values obtained by constraining acroechinoid (top) and clypeasteroid (bottom) monophyly are shown using dot size and color (as in Fig. 4, see legend). Root-to-tip variance axes were truncated to show the region in which most data points lie. The relative support for these topological alternatives does not depend on the four potentially biasing factors explored, as seen by the lack of clustering of genes with similar δ values along the axes

Discussion

General comments

Since the publication of Mortensen’s seminal monographs (starting almost a century ago), echinoid classifications have largely relied on morphological data. Detailed study of the plate arrangements in the echinoid test has proved a rich source of characters for both fossil and extant taxa, integrating them in a unified classification scheme. However, the amount of time separating the main echinoid lineages, coupled with the profound morphological reorganization they have experienced, has resulted in parts of their higher-level classification remaining uncertain. Although molecular data offer an alternative source of phylogenetic information, efforts so far have largely targeted a restricted character set of limited utility for deep-time inference, resulting in issues similar to those faced by morphological attempts. Phylogenomics holds the potential to provide insights into the deep evolutionary history of echinoids, an avenue explored here for the first time.

Analysis of our phylogenomic dataset provided similar estimates of phylogeny using either concatenation or coalescent-based methods (Figs. 2a and 3a), with the exception of one node that was resolved differently by the two approaches. This node involved the order of divergences among two lineages with dissimilar rates of molecular evolution, namely Echinocyamus crispus, a scutelline sand dollar with the fastest rate of evolution among all sampled taxa, and Conolampas sigsbei, a relatively slow-evolving echinolampadoid (at least in the context of the remaining neognathostomates, Fig. 2b). Extensive rate variation among neognathostomate lineages has been reported previously, with potential consequences for phylogenetic inference and time-calibration [25, 27, 48]. Although increased taxonomic sampling is required, several lines of evidence suggest that the tree obtained by ASTRAL-II is artefactual, including the strong support for the alternative resolution found by all other methods, the implausible morphological history that this species tree implies, and its even greater departure from previous phylogenetic results [10, 25, 30]. We were able to bring ASTRAL-II into agreement with concatenation-based approaches by including only those genes expected to better handle the difference in evolutionary rate among the sampled taxa (Fig. 3b, c).

The resulting topology shows strong support for the same resolution of Neognathostomata found across all concatenation-based approaches. Although we did not formally test the reason behind this change in topology inferred with ASTRAL-II, we found that species trees obtained from randomly subsampled gene trees are generally identical to the one supported by the full set of gene trees (Fig. 3d). The widespread adoption of methods accounting for incomplete lineage sorting is one of the major innovations made possible by phylogenomics [52], but its utility for phylogenetic inference in deep time remains a topic of discussion [53, 54]. Simulations have demonstrated that genes with minimal phylogenetic information might produce unreliable gene trees, which in turn reduce the accuracy of species tree estimation using summary methods [111] for the analysis run in IQ-TREE. BI was also performed with the concatenated dataset using two different approaches. In the first, two independent chains of ExaBayes v. 1.5 [112] were run for five million generations using automatic substitution model detection. In the second, PhyloBayes-MPI v. 1.8.1 [113] was run under the site-heterogenous CAT model [114], which models molecular evolution employing site-specific substitution processes. Preliminary runs using the complex CAT+GTR model (two chains, 3000 cycles) failed to converge, as is routinely the case with large phylogenomic datasets [115]. Nonetheless, exploration of the majority rule consensus tree (available at the Dryad data repository [116]) revealed disagreement among chains regarding a few nodes within camarodonts, with the rest of the topology being identical to that of Fig. 2a. A more thorough analysis was performed under the simpler CAT-Poisson model, with two independent chains being run for 10,000 cycles. 
For both BI approaches, stationarity was confirmed using Tracer v1.6 [128], enforcing separate monophyly constraints for Acroechinoidea and Clypeasteroida and setting the model to JTT + Γ + I with empirical amino-acid frequencies (−raxml_model = PROTGAMMAIJTTF), selected as the optimal unpartitioned model by IQ-TREE for the concatenated dataset. Evaluation of the confidence interval around the resulting P-values revealed no need for more replicates (i.e., upper limits of the 95% confidence intervals surrounding both P-values were < 0.05). Subsequently, log-likelihood scores for all sites in the ML unconstrained and both constrained topologies were obtained using RAxML, allowing gene-wise δ values to be calculated (as in [71]). The relationship between these gene-specific δ values and several factors with the potential to introduce systematic biases was explored using MANOVA and multiple linear regression approaches. Potentially confounding variables explored included the amounts of saturation and branch-length heterogeneity (calculated as explained above), as well as the levels of missing data and compositional heterogeneity. This last was estimated as the relative composition frequency variability (RCFV; [129]) using BaCoCa v1.103 [130]. Only genes that showed some support for either one of the topologies were included in the analyses, enforcing an arbitrary cutoff of absolute δ values > 3. With this approach, we evaluated both the strength and distribution of signal for our alternative hypotheses, as well as the possibility that this signal is the product of processes other than phylogenetic history.
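Two of the quantities mentioned above can be sketched briefly: the relative composition frequency variability of an alignment, and the filter that retains only genes with absolute δ values above 3. The RCFV function follows the commonly cited definition (the summed absolute deviations of per-taxon character frequencies from their across-taxon mean, divided by the number of taxa); whether this matches BaCoCa's implementation in every detail, for example in how gaps and ambiguous characters are handled, is an assumption. Function names and toy alignments are hypothetical.

```python
from collections import Counter


def rcfv(alignment, ignored="-?X"):
    """Relative composition frequency variability of one alignment.

    `alignment` maps taxon -> aligned sequence. Characters in `ignored`
    are skipped, and taxa without any remaining residues are dropped.
    """
    freqs = []
    for seq in alignment.values():
        residues = [c for c in seq if c not in ignored]
        if residues:
            counts = Counter(residues)
            freqs.append({aa: n / len(residues) for aa, n in counts.items()})
    states = {aa for f in freqs for aa in f}
    n_taxa = len(freqs)
    total = 0.0
    for aa in states:
        mean_freq = sum(f.get(aa, 0.0) for f in freqs) / n_taxa
        total += sum(abs(f.get(aa, 0.0) - mean_freq) for f in freqs)
    return total / n_taxa


def informative_genes(deltas, cutoff=3.0):
    """Keep genes whose absolute delta exceeds the cutoff."""
    return {gene: d for gene, d in deltas.items() if abs(d) > cutoff}


# Identical composition across taxa gives no variability.
rcfv_same = rcfv({"t1": "AALL", "t2": "LALA"})
# Completely different composition across two taxa.
rcfv_diff = rcfv({"t1": "AAAA", "t2": "LLLL"})

kept = informative_genes({"geneA": 5.2, "geneB": -0.4, "geneC": -7.1, "geneD": 3.0})
```

Because the cutoff is applied as a strict inequality, a gene with δ exactly equal to 3 is excluded in this sketch.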