Main

Understanding how complex eukaryotic cells emerged from prokaryotic ancestors represents a major challenge in biology1,5. A main point of contention in refining eukaryogenesis scenarios revolves around the exact phylogenetic relationship between Archaea and eukaryotes. The use of phylogenomic approaches with improved models of sequence evolution combined with enhanced archaeal taxon sampling (progressively uncovered using metagenomics) has recently produced strong support for the two-domain tree of life, in which the eukaryotic clade branches from within Archaea6,7,8,9,10. The discovery of the first Lokiarchaeia genome provided additional evidence for the two-domain topology because this lineage was shown to represent, at the time, the closest relative of eukaryotes in phylogenomic analyses2. Moreover, Lokiarchaeia genomes specifically contain many genes that encode eukaryotic signature proteins (ESPs), that is, proteins involved in hallmark complex processes of the eukaryotic cell, more so than any other prokaryotic lineage. The subsequent identification and analyses of several diverse relatives of Lokiarchaeia, together forming the Asgard archaea superphylum, confirmed that Asgard archaea represent the closest archaeal relatives of eukaryotes1,2,3. Their exact evolutionary relationship to eukaryotes, however, remained unresolved. Specifically, it has been unclear whether eukaryotes evolved from within Asgard archaea or whether they represented a sister lineage3. Furthermore, two studies questioned this view of the tree of life altogether, suggesting that Asgard archaea represent a deep-branching Euryarchaea-related clade11,12. These studies suggested that, in accordance with the three-domain tree, eukaryotes represent a sister group to all Archaea; however, this view has been challenged13,14. More recently, a study that included an expanded taxonomic sampling of Asgard archaeal genome data failed to resolve the phylogenetic position of eukaryotes in the tree of life4.

Here we expand the genomic diversity of Asgard archaea by generating 63 new Asgard archaeal metagenome-assembled genomes (MAGs) from samples obtained from 11 locations around the world. By analysing the enlarged genomic sampling of Asgard archaea using state-of-the-art phylogenomics analyses, including recently developed gene tree and species tree reconciliation approaches for ancestral genome content reconstruction, we firmly place eukaryotes as a clade nested within the Asgard archaea. By revealing key features regarding the identity, nature and physiology of the last Asgard archaea and eukaryotes common ancestor (LAECA), our results represent important, thus far missing pieces of the eukaryogenesis puzzle.

Expanded Asgard archaea genome diversity

To increase the genomic diversity of Asgard archaea, we sampled aquatic sediments and hydrothermal deposits from 11 geographically distinct sites (Supplementary Table 1 and Supplementary Fig. 1). After extraction and sequencing of total environmental DNA, we assembled and binned metagenomic contigs into MAGs. Of these MAGs, 63 belonged to the Asgard archaea superphylum, with estimated median completeness and redundancy values of 83% and 4.2%, respectively (Supplementary Table 1). To assess the genomic diversity in this dataset, we reconstructed a phylogeny of ribosomal proteins encoded in a conserved 15 ribosomal protein (RP15) gene cluster from these MAGs and in all publicly available Asgard archaea assemblies (retrieved 29 June 2021; Fig. 1). These analyses showed that we expanded the genomic sampling across previously described major Asgard archaea clades (that is, Lokiarchaeia, Thorarchaeia, Heimdallarchaeia, Odinarchaeia, Hermodarchaeia, Sifarchaeia, Jordarchaeia and Baldrarchaeia2,3,4,15,16). We also recovered a previously undescribed clade of high taxonomic rank (Candidatus Asgardarchaeia; see Extended Data Fig. 1 and Supplementary Information for a proposed uniformization of Asgard archaea taxonomic classification to which we will adhere throughout the current paper). We observed that the median estimated Asgard archaeal genome size (3.8 Mb) is considerably larger than those of representative genomes from TACK archaea and Euryarchaea (median = 1.8 Mb for both) and DPANN archaea (median = 1.2 Mb) (Supplementary Table 1). Among Asgard archaea, Odinarchaeia displayed the smallest genomes (median = 1.4 Mb), whereas Lokiarchaeales and Helarchaeales contained the largest (median = 4.3 Mb for both). Unlike other major Asgard archaeal clades, Heimdallarchaeia possessed a wide range of genome sizes, spanning from 1.6 to 7.4 Mb (median = 3.5 Mb). 
This large class contained five clades with diverse features: Njordarchaeales (median genome size = 2.4 Mb); Kariarchaeaceae (median genome size = 2.7 Mb); Gerdarchaeales (median genome size = 3.4 Mb); Heimdallarchaeaceae (median genome size = 3.7 Mb); and Hodarchaeales (median genome size = 5.1 Mb). The smallest heimdallarchaeial genome corresponded to the only Asgard archaeal MAG recovered from a marine surface water metagenome (Heimdallarchaeota archaeon RS678)17. This result is in agreement with the reduced genome sizes typically observed among prokaryotic plankton of the euphotic zone18.

Fig. 1: Phylogenomic analysis of 15 concatenated ribosomal proteins expands Asgard archaea diversity.

ML tree (IQ-TREE, WAG+C60+R4+F+PMSF model) of concatenated protein sequences from at least 5 genes, encoded on a single contig, of an RP15 gene cluster retrieved from publicly available and newly reported Asgard archaeal MAGs. Bootstrap support (100 pseudo-replicates) is indicated by circles at branches, with filled and open circles representing values equal to or larger than 90% and 70% support, respectively. Leaf names indicate the geographical source and isolate name (inner and outer label, respectively) for the MAGs reported in this study. Only the in-group is shown (263 out of 542 total sequences). Scale bar denotes the average number of substitutions per site. AB, Aarhus Bay (Denmark); ABE, ABE vent field, Eastern Lau Spreading Center; ALCG, Asgard Lake Cootharaba Group; Asgard, Asgardarchaeia; Baldr, Baldrarchaeia; GB, Guaymas Basin (Mexico); Gerd, Gerdarchaeales; Hel, Helarchaeales; Heimdall, Heimdallarchaeaceae; Hermod, Hermodarchaeia; Hod, Hodarchaeales; Jord, Jordarchaeia; JZ, **ze (China); Kari, Kariarchaeaceae; Loki, Lokiarchaeales; Mar, Mariner vent field, Eastern Lau Spreading Center; Njord, Njordarchaeales; Odin, Odinarchaeia; QC, QuCai village (China); QZM, QuZhuoMu village (China); RP, Radiata Pool (New Zealand); SHR, South Hydrate Ridge; Sif, Sifarchaeia; Thor, Thorarchaeia; TNS, Taketomi Island (Japan); WOR, White Oak River (USA); Wukong, Wukongarchaeia.

Identification of phylogenetic conflict

Inferring deep evolutionary relationships in the tree of life is considered one of the hardest problems in phylogenetics. To interrogate the evolutionary relationships within the current set of Asgard archaeal phyla, and between Asgard archaea and eukaryotes, we performed an exhaustive range of phylogenomic analyses. We analysed a pre-existing marker dataset comprising 56 concatenated ribosomal protein sequences (RP56)2,3 for a phylogenetically diverse set of 331 archaeal (175 Asgard archaea, 41 DPANN archaea, 43 Euryarchaea and 72 TACK archaea representatives) and 14 eukaryotic taxa (Supplementary Table 2). Of note, the inclusion of an expanded diversity of 12 new Korarchaeota MAGs among these TACK archaea considerably affected phylogenomic analyses (see below). Initial maximum-likelihood (ML) phylogenetic inference based on this RP56 dataset confirmed the existence of 12 major Asgard archaeal clades of high taxonomic rank (Supplementary Fig. 2). These included the previously described Lokiarchaeia, Odinarchaeia, Heimdallarchaeia and Thorarchaeia2,3, for which we present 36 new genomes here. The clades also included the recently proposed Sifarchaeia16, Hermodarchaeia15, Jordarchaeia19, Wukongarchaeia4 and Baldrarchaeia4, for most of which we also identified new near-complete MAGs. Finally, we identified 15 MAGs that represented the recently described Njordarchaeales102. All 56 trimmed ribosomal protein alignments were concatenated into an RP56-A64 supermatrix (236 taxa including 64 Asgard archaea, 6,332 amino acid positions). Once this taxon set was gathered, we identified homologues of the NM57 gene set as described above, thus generating supermatrix NM57-A64 (236 taxa, 14,847 amino acid positions).

We carried out a large number of phylogenomic analyses on variations of the RP56-A64 and NM57-A64 datasets with different phylogenetic algorithms. Preparing these datasets requires great care and is therefore time-consuming, and the subsequent phylogenomic analyses generally require a very large amount of computational running time. However, the rapid expansion of available Asgard archaeal MAGs, notably in a previous publication4, prompted us to update and re-run many of the computationally demanding analyses. As some of the work based on the more restricted taxon sampling remains valuable, such as some of the Bayesian phylogenomic analyses and ancestral genome content reconstructions, we retained these in the current study.

An updated Asgard archaeal genomic sequence dataset was constructed by including all 230 Asgard archaeal MAGs and genomes available at the NCBI database as of 12 May 2021, as well as 63 new MAGs described in the current work. All 56 trimmed ribosomal protein alignments were concatenated into an RP56-A293 supermatrix (465 taxa including 293 Asgard archaea, 7,112 amino acid positions), which was used to infer a preliminary phylogeny using FastTree (v.2)103 (Supplementary Fig. 16). Given the high computational demands of the subsequent analyses, we then used this phylogeny to select a subsample of Asgard archaea representatives. For this, we first removed the most incomplete MAGs encoding fewer than 19 ribosomal proteins (that is, one-third of the markers) in the matrix. We also used the preliminary phylogeny to subselect among closely related taxa: among taxa that were separated by branch lengths of <0.1, we only kept one representative. This led to a selection of 331 genomes, including 175 Asgard archaea, 41 DPANN, 43 Euryarchaea and 72 TACK representatives (RP56-A175 dataset). Out of these 175 Asgard archaea, 41 correspond to MAGs newly reported here. Once this taxon set was gathered, we identified homologues of the NM57 gene set as described above, thus generating supermatrix NM57-A175 (15,733 amino acid positions). All datasets and their composition are summarized in Supplementary Table 2.
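The two subsampling steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `subsample_taxa`, the pairwise-distance dictionary and the greedy rule for choosing which representative to keep (the taxon with the most markers is visited first) are assumptions made here; the text states only the thresholds (fewer than 19 ribosomal proteins; branch lengths of <0.1).

```python
from typing import Dict, FrozenSet, List


def subsample_taxa(
    marker_counts: Dict[str, int],
    distances: Dict[FrozenSet[str], float],
    min_markers: int = 19,
    max_dist: float = 0.1,
) -> List[str]:
    """Select representative taxa from a preliminary phylogeny.

    marker_counts: number of ribosomal protein markers found per taxon.
    distances: tree (patristic) distance per unordered taxon pair
               (hypothetical input format; missing pairs count as distant).
    """
    # Step 1: drop the most incomplete genomes (fewer than min_markers markers).
    complete_enough = [t for t, n in marker_counts.items() if n >= min_markers]

    # Assumption: visit taxa with the most markers first, so that the most
    # complete genome of each close group becomes its representative.
    ordered = sorted(complete_enough, key=lambda t: (-marker_counts[t], t))

    # Step 2: keep a taxon only if it is not within max_dist of a kept one.
    representatives: List[str] = []
    for taxon in ordered:
        too_close = any(
            distances.get(frozenset((taxon, rep)), float("inf")) < max_dist
            for rep in representatives
        )
        if not too_close:
            representatives.append(taxon)
    return representatives
```

In this sketch, a taxon with 10 markers is removed in the first step, and of two taxa separated by a distance of 0.05 only the one with more markers is retained.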

To test for potential phylogenetic reconstruction artefacts, our datasets were subjected to several treatments. Supermatrices were recoded into four categories using the SR4 scheme25. The corresponding phylogenies were reconstructed using IQ-TREE (using a user-defined previously described model referred to as C60SR4 based on the implemented C60 model and modified to analyse the recoded data3) and Phylobayes (under the CAT+GTR model). We also used the estimated site rate output generated by IQ-TREE (-wsr) to classify sites into 10 categories, from the fastest to the slowest evolving, and we removed them in a stepwise fashion, removing from 10% to 90% of the data. Finally, we combined both approaches by applying SR4 recoding to the alignments obtained after each fast-site removal step. All phylogenetic analyses performed are summarized in Supplementary Table 2. See Supplementary Information for details and discussion.
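As an illustration of the two treatments above, the sketch below recodes amino acids into the four SR4 categories and removes the fastest-evolving sites in 10% steps. It is a simplified stand-in, not the code used in the study: the function names, the use of the letters A, C, G and T as category labels, the treatment of gaps and ambiguous residues as "-", and the ranking of sites by a plain list of per-site rates (rather than parsing IQ-TREE's -wsr output) are all assumptions.

```python
from typing import Dict, List

# SR4 groups amino acids into four categories; the category labels
# (A, C, G, T) are an assumption made here for illustration.
SR4_GROUPS = {"A": "AGNPST", "C": "CHWY", "G": "DEKQR", "T": "FILMV"}
SR4_MAP = {aa: code for code, group in SR4_GROUPS.items() for aa in group}


def recode_sr4(seq: str) -> str:
    """Recode a protein sequence into four SR4 categories.

    Gaps and any character outside the 20 standard amino acids become '-'.
    """
    return "".join(SR4_MAP.get(ch.upper(), "-") for ch in seq)


def remove_fastest_sites(
    alignment: Dict[str, str],
    rates: List[float],
    bins_removed: int,
    n_bins: int = 10,
) -> Dict[str, str]:
    """Drop the fastest-evolving sites from an alignment.

    rates: one estimated rate per alignment column.
    bins_removed: number of rate categories to discard, counted from the
                  fastest (1 removes 10% of sites when n_bins is 10).
    """
    n_sites = len(rates)
    n_remove = n_sites * bins_removed // n_bins
    # Column indices ordered from fastest to slowest.
    by_rate = sorted(range(n_sites), key=lambda i: rates[i], reverse=True)
    drop = set(by_rate[:n_remove])
    keep = [i for i in range(n_sites) if i not in drop]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}
```

The combined treatment corresponds to calling `recode_sr4` on each sequence returned by `remove_fastest_sites`, once per removal step.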

Analyses of individual proteins

For individual proteins of interest, we gathered homologues using various approaches depending on the level of conservation across taxa. To detect putative Asgard homologues of eukaryotic proteins, we used a combination of tools, including BLASTp104 and the HMMer toolkit (http://hmmer.org/) if HMM profiles were available, and queried a local database containing our 240 archaeal representatives (including all Asgard predicted proteomes). We then investigated the Asgard candidates as follows: (1) using them as seeds for BLASTp searches against the nr database; (2) 3D modelling using Phyre2 and SwissModel when sequence similarity was low; (3) annotating them using Interproscan (v.5.25-64.0)105, EggNOG mapper (v.0.12.7)106, against the NOG database106, and GhostKoala annotation server107; (4) annotating the archaeal orthologous cluster they belonged to using profile–profile annotation as described above. Eukaryotic homologues were gathered from the UniRef50 database108. Depending on the divergence between homologues, they were aligned using mafft-linsi and trimmed using TrimAl109 (--automated1) or BMGE102, or, in cases where we investigated a specific functional domain, we used the hmmalign tool from the HMMer package with the --trim flag to only keep and align the region corresponding to this domain. When divergence levels allowed, phylogenetic analyses were performed using IQ-TREE with model testing including the C-series mixture models (-mset option)110. Statistical support was evaluated using 1,000 ultrafast bootstrap replicates (for IQ-TREE)109.

Ancestral reconstruction

For the ancestral reconstruction analyses, only a subset of 181 taxa were included (64 Asgard, 74 TACK and 43 Euryarchaea; see Supplementary Table 2 for details). Protein families with more than three members were aligned and trimmed using mafft-linsi (v.7.402)101 and trimAl (v.1.4.rev15) with the --gappyout option109. Tree distributions for individual protein families were estimated using IQ-TREE (v.1.6.5) (-bb 1000 -bnni -m TESTNEW -mset LG -madd LG+C10,LG+C20 -seed 12345 -wbtl -keep-ident)111. The species phylogeny together with the gene tree distributions were subsequently used to compute 100 gene tree–species tree reconciliations using ALEobserve (v.0.4) and ALEml_undated112,113, including the fraction_missing option that accounts for incomplete genomes. The genome copy number was corrected to account for the extinction probability per cluster (https://github.com/maxemil/ALE/commit/136b78e). The missing fraction of the genome was calculated as 1 minus the completeness values (in fraction) as estimated by CheckM (v.1.0.5) for each of the 181 taxa67. Protein families containing only one protein (singletons) were considered as originations at the corresponding leaf. The ancestral reconstruction of 5 protein families that included more than 2,000 proteins raised errors and could not be computed. The minimum threshold of the raw reconciliation frequencies for an event to be considered was set to 0.3, as commonly done114,115,116,117 and recommended by the authors of ALE (G. Szölősi, personal communication).
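The two numerical rules in this paragraph (missing fraction = 1 minus completeness, and the 0.3 frequency threshold) can be written out as a short sketch. The function names and the dictionary of event frequencies are hypothetical and do not correspond to ALE's file formats; the sketch assumes completeness is supplied as a percentage and that the threshold is inclusive.

```python
from typing import Dict


def missing_fraction(completeness_percent: float) -> float:
    """Return the missing fraction of a genome: 1 minus completeness.

    Assumes completeness is given as a percentage (0 to 100) and converts
    it to a fraction before subtracting.
    """
    if not 0.0 <= completeness_percent <= 100.0:
        raise ValueError("completeness must be between 0 and 100")
    return 1.0 - completeness_percent / 100.0


def supported_events(
    event_frequencies: Dict[str, float], threshold: float = 0.3
) -> Dict[str, float]:
    """Keep events whose raw reconciliation frequency reaches the threshold.

    Assumption: the minimum threshold is inclusive (frequency >= threshold).
    """
    return {
        event: freq
        for event, freq in event_frequencies.items()
        if freq >= threshold
    }
```

For example, a MAG with an estimated completeness of 83% would be assigned a missing fraction of 0.17 under this rule.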

Ancestral metabolic inferences

Metabolic reconstruction of the Asgard ancestors was based on the inference, annotation and copy number of genes in ancestral nodes. The presence of a given gene was scored if its copy number in the ancestral nodes was above 0.3. A protein family was scored as ‘maybe present’ if the inferred copy number was between 0.1 and 0.3. The protein annotation of each of the clusters containing the ancestral nodes was manually verified for each of the enzymatic steps involved in the pathways, as detailed in Supplementary Table 4.
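The scoring rule above can be expressed as a small function. This is only a sketch of the stated thresholds: the function name, the "absent" label for copy numbers below 0.1, and the handling of the exact boundary values (0.1 and 0.3 are both treated as 'maybe present') are assumptions, as the text does not specify them.

```python
def score_presence(copy_number: float) -> str:
    """Score a gene at an ancestral node from its inferred copy number."""
    if copy_number > 0.3:
        return "present"
    if copy_number >= 0.1:
        # Boundary values 0.1 and 0.3 fall here by assumption.
        return "maybe present"
    # Label for values below 0.1 is an assumption of this sketch.
    return "absent"
```

Applied to each enzymatic step of a pathway, such a function yields a per-step call that can then be checked against the manually verified annotations.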

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.