Introduction

How the coronavirus SARS-CoV-2 evolved to infect humans continues to be an active area of research. To understand the origin of the zoonosis, it is crucial to identify the closest wild viral population from which SARS-CoV-2 originated. Very early in the pandemic, the bat coronavirus RaTG13, from China’s Yunnan province, was identified as the most closely related to SARS-CoV-2, showing an average genome-wide nucleotide identity of 96.1% (Zhou et al. 2020).

This was followed by the proposal that the Receptor-Binding Domain (RBD) of the spike protein from SARS-CoV-2 was acquired by recombination with pangolin-infecting coronaviruses (Li et al. 3). In both cases, the Bayes factor (BCS) is larger than 100, which is interpreted as decisive evidence against the SEPARATE hypothesis (Kass and Raftery 1995). Therefore, the evolution of the gene coding for the spike protein in these 14 coronaviruses is better described by 4 recombination breakpoints.

Next, we focused on the phylogenetic history of the concatenated segments 2 and 4 versus segment 3. Segments 2 and 4 contain a small fraction of the NTD, the RBD (minus the RBM), SD1, SD2, FP and a fragment of the IFP motif (Fig. 3), while segment 3 corresponds to the RBM. The topologies of the phylogenetic trees inferred from these segments are almost identical, with three exceptions (Fig. 4). First, the coronavirus Guangxi_P4L falls in a different bipartition in the tree inferred from segment 3 (the tree on the right of Fig. 4), although the posterior probability of the internal node supporting this bipartition is low (0.61), which casts doubt on this placement. Second, the coronaviruses Guangdong 1 and bat_RShSTT182 exchange positions between the two trees, but again the posterior probability supporting the position of Guangdong 1 is low (0.58) in the tree inferred from segment 3. Third, and most importantly, in the segment 3 tree (Fig. 4, right) the coronavirus RaTG13 branches outside the well-supported bipartition (0.91) defined by the coronaviruses bat_RShSTT182, Guangdong 1, Wuhan-Hu-1/2019, BANAL-20-103 and BANAL-20-52, thus supporting the hypothesis that the RBM in RaTG13 was acquired via recombination with a yet unknown coronavirus (Boni et al. 2020).

We further evaluated the phylogenetic dissonance (D) between the two trees in Fig. 4 using the GALAX software. Dissonance is a measure of phylogenetic conflict between segments/partitions of data; it is estimated as the average information content in the Bayesian posterior tree samples from individual segments minus the information contained in the merged set of Bayesian tree samples from all segments (Lewis et al. 2016). D takes values from 0 to 1 (or 0 to 100%), where 0 indicates no phylogenetic conflict between segments. Dissonance between two trees can be further partitioned by clades; that is, it is possible to identify which clades contribute most to the dissonance between trees.
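The definition above can be illustrated with a minimal Python sketch. It assumes that information is measured as the reduction in entropy relative to a prior shared by all segments, in which case "average information minus merged information" reduces to "entropy of the merged sample minus the average per-segment entropy". It also simplifies heavily: each sampled tree is represented by a hypothetical topology label, entropies are computed from raw topology frequencies, and all segments are assumed to have the same number of samples. It does not reproduce GALAX's estimator, its 0 to 1 scaling, or the per-clade partitioning; the function names are illustrative only.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(sample: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the topology frequencies in a tree sample."""
    counts = Counter(sample)
    n = len(sample)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def raw_dissonance(samples: Sequence[Sequence[Hashable]]) -> float:
    """Entropy of the pooled sample minus the mean per-segment entropy.

    Assumption: every segment contributes the same number of sampled trees,
    so simple pooling weights the segments equally.
    """
    mean_entropy = sum(entropy(s) for s in samples) / len(samples)
    merged = [tree for s in samples for tree in s]
    return entropy(merged) - mean_entropy


if __name__ == "__main__":
    # Hypothetical topology labels, not trees from this study.
    agreeing = [["T1", "T1", "T2", "T2"], ["T1", "T1", "T2", "T2"]]
    conflicting = [["T1"] * 4, ["T2"] * 4]
    print(raw_dissonance(agreeing), raw_dissonance(conflicting))
```

In this toy setting, two segments that sample the same topologies in the same proportions give zero dissonance, whereas two segments that each strongly support a different topology give a positive value.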

In Fig. 4 we show which clades contribute most to dissonance (D) between trees. The largest contribution to dissonance (43%) comes from the partition that separates the external group (Rco319) from the rest of the OTUs. This is expected because the two trees are different as a whole. The second-largest contribution comes from the partition containing the coronaviruses most closely related to Wuhan-Hu-1/2019, including RaTG13 (69%–43% = 26%); these are depicted with red dotted lines connecting the two trees. The third-largest contribution comes from the partition that includes the above species plus Guangxi_P4L and RSYN04 (56%–43% = 13%); these are depicted with orange lines. This result further reinforces that segment 3 has a different phylogenetic history than segments 2 and 4. The coverage of the dissonance analysis is 0.78 (see supplementary material for a complete description of the statistics associated with the dissonance analysis).

The origin of the RBM in RaTG13 by recombination was further confirmed by analysis with the Recombination Detection Program (RDP) (Martin et al. 2020). This software applies several different methods to a set of sequences and calculates an overall consensus score to assess the reliability of detected recombination events. A full exploratory recombination scan identified, with high confidence (consensus score > 60), a recombination event between RaTG13 and an unknown coronavirus at positions 1308 to 1514 of the multiple sequence alignment. These coordinates correspond to the RBM and coincide with the region detected by GARD (see supplementary material).

The next question is whether the immediate co-descendant of the clade formed by Wuhan-Hu-1/2019, BANAL-20-52 and BANAL-20-103 is the bat (bat_RShST188) or the pangolin (Guangdong 1) coronavirus. This is important because it would indicate whether the RBM from Wuhan-Hu-1/2019 descends from a coronavirus that infects bats or pangolins.

Given that the posterior probability of the node supporting the close relationship of Guangdong 1 to the clade containing Wuhan-Hu-1/2019, BANAL-20-52 and BANAL-20-103 in the tree inferred from segment 3 is low (0.58; Fig. 4, right), one possibility is that the closest coronavirus to the clade containing Wuhan-Hu-1/2019 is bat_RShST188, as shown in the tree inferred from segments 2 and 4 in Fig. 4 (left). In fact, an alternative phylogeny to that shown in Fig. 4 (right), in which Guangdong 1 exchanges position with bat_RShST188, is not significantly worse than the original tree according to a Kishino-Hasegawa test (p value = 0.341) (see supplementary material). Therefore, the hypothesis that the RBM from SARS-CoV-2 evolved from a bat-infecting coronavirus cannot be rejected.
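To make the logic of this comparison concrete, the following Python sketch computes a one-sided Kishino-Hasegawa p value from per-site log-likelihoods of two trees using a normal approximation. This is an illustrative simplification, not the procedure used to obtain the p value reported above: the choice of a one-sided test and of the normal approximation are assumptions, the function name and input values are hypothetical, and software implementations may instead estimate the null distribution by resampling, so their p values would differ.

```python
import math
from typing import Sequence


def kh_test_one_sided(site_lnl_best: Sequence[float],
                      site_lnl_alt: Sequence[float]) -> float:
    """One-sided KH p value (normal approximation) for 'alt is worse than best'.

    Inputs are per-site log-likelihoods of the same alignment under two trees.
    """
    if len(site_lnl_best) != len(site_lnl_alt) or len(site_lnl_best) < 2:
        raise ValueError("need two equally long vectors with at least 2 sites")
    diffs = [b - a for b, a in zip(site_lnl_best, site_lnl_alt)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Variance of the summed log-likelihood difference across sites.
    var_total = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_total == 0.0:
        raise ValueError("site-wise differences have zero variance")
    z = sum(diffs) / math.sqrt(var_total)
    # Upper-tail probability of a standard normal deviate.
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical per-site log-likelihoods, not values from this study.
    best = [-1.0, -2.0, -1.0, -2.0]
    alt = [-2.0, -1.0, -2.0, -1.0]
    print(kh_test_one_sided(best, alt))
```

A large p value means that the difference in total log-likelihood between the two trees is small relative to its site-to-site variability, so the alternative topology cannot be rejected.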

If we use this alternative topology, in which Guangdong 1 exchanges position with bat_RShST188, to reconstruct the ancestral sequences, we find that the RBM of the common ancestor of Wuhan-Hu-1/2019 and Guangdong 1 (the sequence named “ancestral 4” in Fig. 5) was identical to that of Wuhan-Hu-1/2019, with the exception of residue Q498H. This shows that natural selection did not favor changes in the RBM of these coronaviruses to adapt to new hosts since they last shared a common ancestor. The same result is obtained if the original tree (the one shown in Fig. 4, right) is used for the ancestral sequence reconstruction (see supplementary material).
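The comparison between a reconstructed ancestral RBM and an extant RBM amounts to listing the aligned positions at which the two sequences differ. A minimal Python sketch of that step is given below; the function name, the short sequence fragments and the starting coordinate are hypothetical and serve only to show how a label such as Q498H is formed from an aligned pair.

```python
from typing import List


def list_differences(reference: str, other: str, first_position: int = 1) -> List[str]:
    """Return labels such as 'Q498H' for aligned positions where two sequences differ.

    `first_position` is the residue number of the first aligned column.
    Columns with a gap ('-') in either sequence are skipped.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    labels = []
    for offset, (ref_res, other_res) in enumerate(zip(reference, other)):
        if ref_res != other_res and "-" not in (ref_res, other_res):
            labels.append(f"{ref_res}{first_position + offset}{other_res}")
    return labels


if __name__ == "__main__":
    # Toy six-residue fragments chosen for illustration only.
    print(list_differences("YGFQPT", "YGFHPT", first_position=495))
```

An empty list would indicate that the two aligned sequences are identical over the compared region.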

Fig. 5. Ancestral sequence reconstruction shows that the RBM of the common ancestor of Wuhan-Hu-1/2019, BANAL-20-52, BANAL-20-103, bat_RShST188 and Guangdong 1 (here named “ancestral_4”) was identical to the RBM of Wuhan-Hu-1/2019, except for the residue Q498H (red arrow). Amino acids involved in human ACE2 recognition are indicated with arrows.

Discussion

The analyses provided here show that the RBM from RaTG13 is not closely related to the RBM from SARS-CoV-2 and was likely acquired by recombination with a yet unknown coronavirus (Boni et al. 2020). Consequently, the RBMs from the coronaviruses from Laos (BANAL-20-52 and BANAL-20-103) are the most closely related to the RBM from SARS-CoV-2 (Temmam et al. 2022).

Our results also show that SARS-CoV-2 did not acquire its RBM by recombining with a pangolin-infecting coronavirus. Instead, our analyses indicate that the coronaviruses Wuhan-Hu-1/2019, BANAL-20-52 and BANAL-20-103 most likely inherited their RBM from a bat-infecting coronavirus. Parsimony favors this interpretation given that bat_RShST188, Wuhan-Hu-1/2019, BANAL-20-52 and BANAL-20-103 are all bat-infecting coronaviruses. If recombination between bat- and pangolin-infecting coronaviruses played a role in the evolution of the RBM (or the whole RBD), this may have occurred prior to the divergence of bat_RShST188 and Wuhan-Hu-1/2019.

Our results are in agreement with the interpretation of Temmam et al. (2022) regarding the evolution of the RBM in SARS-CoV-2. Accordingly, natural selection did not incidentally improve the affinity of the RBM for human ACE2 in an intermediate host before spillover (Makarenkov et al. 2021), nor did selection optimize the RBM in humans soon after spillover (Andersen et al. 2020). This follows from the fact that the RBM from SARS-CoV-2 is identical to the ancestral sequence it shared with Guangdong 1, with the exception of a single amino acid change, Q498H. The conservation of the RBM among coronaviruses that infect pangolins, bats and humans is consistent with recent research showing that SARS-CoV-2 is a generalist virus that is not specifically adapted to humans (Li et al. 2023). However, the origin(s) of other peculiarities of SARS-CoV-2, such as the furin-cleavage site, remain to be elucidated. Such features may have evolved by different mechanisms that may have included the passage of the coronavirus through an intermediate host.

Material and Methods

Gene sequences of the spike protein were retrieved from the GenBank and GISAID databases (https://gisaid.org/; Khare et al. 2021). Author acknowledgments for sequences downloaded from GISAID are provided in supplementary material. The spike protein coding genes were extracted from genome sequences following their annotation. When annotation was not available, we identified the spike coding gene by aligning the gene from Wuhan-Hu-1/2019 to the query genome using BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Codon-based multiple sequence alignment was performed in MEGA version 11 (Tamura et al. 2021).

Recombination analysis was done with GARD as implemented in http://datamonkey.org/ (Weaver et al. 2018) with the following parameters: normal run mode, universal genetic code, without site-to-site rate variation and 2 rate classes.

Domains in the spike protein follow those defined by Lan et al. (2020) and [...] stepping-stone algorithm as implemented in MrBayes (Ronquist et al. 2012). Next, the Bayes factor (BCS) is:

$${B}_{CS} = \frac{p(y|{M}_{C})}{p(y|{M}_{S})}$$

Kass and Raftery (1995) suggest the following interpretation of the Bayes factor:

| log10(BCS) | BCS | Evidence against MS |
| --- | --- | --- |
| 0 to ½ | 1 to 3.2 | Not worth more than a bare mention |
| ½ to 1 | 3.2 to 10 | Substantial |
| 1 to 2 | 10 to 100 | Strong |
| > 2 | > 100 | Decisive |
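As a worked illustration of how the Bayes factor and the interpretation scale fit together, the Python sketch below converts two marginal log-likelihoods into log10(BCS) and maps the result onto the categories listed above. It assumes that the marginal likelihoods are available as natural logarithms; the function names and the numerical values are hypothetical and are not results from this study.

```python
import math


def log10_bayes_factor(ln_ml_c: float, ln_ml_s: float) -> float:
    """log10(B_CS) from the natural-log marginal likelihoods of M_C and M_S."""
    return (ln_ml_c - ln_ml_s) / math.log(10)


def evidence_against_ms(log10_b: float) -> str:
    """Map log10(B_CS) onto the Kass and Raftery (1995) categories."""
    if log10_b < 0:
        return "Favors MS"  # outside the scale: the data favor M_S instead
    if log10_b <= 0.5:
        return "Not worth more than a bare mention"
    if log10_b <= 1:
        return "Substantial"
    if log10_b <= 2:
        return "Strong"
    return "Decisive"


if __name__ == "__main__":
    # Hypothetical marginal log-likelihoods (natural log), for illustration only.
    ln_ml_c, ln_ml_s = -10250.0, -10262.0
    log10_b = log10_bayes_factor(ln_ml_c, ln_ml_s)
    print(f"log10(BCS) = {log10_b:.2f}: {evidence_against_ms(log10_b)}")
```

Working on the log scale avoids numerical underflow, since the marginal likelihoods themselves are far too small to be represented directly.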

Phylogenetic dissonance, D, was calculated with the GALAX software (https://github.com/plewis/galax). To generate the tree samples required by GALAX, we ran the mcmc algorithm in MrBayes for 1,000,000 generations with a sampling frequency of 500. The model was set to GTR + G + I. The phylogenetic trees in Fig. 4 were inferred with MrBayes using the same parameters, and a burn-in of 25% was discarded from the sample. For the stepping-stone analysis, the ss algorithm was set to run for 1,000,000 generations, sampling every 1000 generations. Example files to run the mcmc and ss algorithms in MrBayes are provided in supplementary material.
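For orientation, the settings described above could be written as a MrBayes command block along the following lines. This Python sketch only assembles such a block as text; it is an assumption about how the settings might be expressed and is not the example files provided in the supplementary material. The helper name is hypothetical, and the block would still need to be appended to a NEXUS file containing the alignment.

```python
def mrbayes_block(ngen: int = 1_000_000,
                  samplefreq: int = 500,
                  burninfrac: float = 0.25,
                  stepping_stone: bool = False) -> str:
    """Assemble a MrBayes command block for a GTR + G + I analysis."""
    lines = [
        "begin mrbayes;",
        # nst=6 selects GTR; invgamma adds gamma rates plus invariable sites.
        "  lset nst=6 rates=invgamma;",
    ]
    if stepping_stone:
        # Set the chain parameters, then run stepping-stone sampling.
        lines.append(f"  mcmcp ngen={ngen} samplefreq={samplefreq};")
        lines.append("  ss;")
    else:
        lines.append(f"  mcmc ngen={ngen} samplefreq={samplefreq};")
        # Summarize trees after discarding a relative burn-in fraction.
        lines.append(f"  sumt relburnin=yes burninfrac={burninfrac};")
    lines.append("end;")
    return "\n".join(lines)


if __name__ == "__main__":
    print(mrbayes_block())
    print(mrbayes_block(samplefreq=1000, stepping_stone=True))
```

Keeping the mcmc and ss runs in separate blocks mirrors the separate sampling frequencies given in the text (500 and 1000, respectively).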

A full exploratory recombination scan was applied to the multiple sequence alignment with the program RDP (Martin et al. 2020). The methods used within RDP were RDP, GENECONV, BootScan, MaxChi, Chimera, SiScan and 3Seq. Default parameters were used and sequences were assumed to be linear. We further asked RDP to save a distributed alignment with recombinant regions separated. Based on this distributed alignment, we inferred a Maximum-Likelihood tree with MEGA11 (100 bootstrap replicates and the GTR + G model of sequence evolution). For clarity, we included in this tree only the recombinant sequence corresponding to the RBM from RaTG13.

Figure 3 was generated with Circos (Krzywinski et al. 2009). The Kishino-Hasegawa test was performed with IQ-TREE (Kishino and Hasegawa 1989; Minh et al. 2020). Ancestral sequence reconstruction (ASR) was performed in MEGA version 11 by Maximum-Likelihood under the Tamura 3-parameter model and including all sites (Tamura et al. 2021). The multiple sequence alignment was visualized with Jalview (Waterhouse et al. 2009).

Supplementary material.