Background

Reconstructing phylogenetic relationships for rapidly radiating groups has proven particularly difficult [1,2,3,4]. Rapid radiations are prone to extensive incomplete lineage sorting (ILS) and the resulting high gene-tree discordance, which can leave nodes in species trees unresolved or poorly resolved [5,48]. We found that the most common topology occurred in 16% of the windows but was not recovered by any of the coalescent or concatenated analyses (Fig. 4c). Conversely, the topology recovered by the intron-set-based phylogenies was the fourth most common topology and occurred in 12% of the windows, while the exon-set-based MP-EST and ASTRAL topologies were the thirteenth and fourteenth most common topologies and appeared in only 2.5% and 2.1% of the windows, respectively (Fig. 4c). Altogether, the distribution of gene tree frequencies, in combination with the short internal branches in the species tree, is consistent with the existence of an anomaly zone in Prunellidae.

Effect of recombination rate variation on topology distribution

If introgression were the predominant process generating topological discordance and the anomaly zone, we would expect gene tree topologies in genomic regions with low recombination rates to be more resistant to introgression. We subsequently investigated tree topology and variation in introgression and recombination rates across the chromosomes for the species falling within the anomaly zone. We used population sequencing data from P. modularis (n = 9) to estimate recombination rates using ReLERNN [49] and PyRho v0.1.6 [50]. As the comparisons based on recombination rates estimated by ReLERNN and PyRho (see “Methods”) showed similar results, we present only the ReLERNN-based results in the main text; those based on PyRho are placed in the supplementary material (Additional file 1: Fig. S3). We averaged recombination rate (cM/Mb) in 50-kb non-overlapping windows, selected the windows falling in the upper and lower 10% of the recombination rate distribution, and estimated the topology distribution across these windows. We found that topology 4 ((P. montanella, P. rubida), ((P. koslowi, P. fulvescens), (P. o. fagani, P. o. ocularis, P. atrogularis))) was more frequent within the high-recombination regions of the autosomes (Fig. 5a and Additional file 1: Fig. S3). This topology is congruent with the phylogeny inferred from the intron-set. In contrast, topology 1 had the highest frequency in the low-recombination regions of the autosomes. The analysis of the Z chromosome found topology 3 to be the dominant topology, especially in the low-recombination regions of that chromosome (Fig. 5a and Additional file 1: Fig. S3).
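The window selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `tail_windows` and `topology_frequencies`, the tuple layout, and the toy rates are hypothetical, and each window is assumed to already carry a mean recombination rate and the label of its best-supported topology.

```python
from collections import Counter

def tail_windows(windows, fraction=0.10):
    """Split off the lowest and highest `fraction` of windows by rate.

    `windows` is a list of (rate_cM_per_Mb, topology_label) tuples
    (a hypothetical layout chosen for this sketch).
    """
    ranked = sorted(windows, key=lambda w: w[0])
    k = max(1, int(len(ranked) * fraction))  # at least one window per tail
    return ranked[:k], ranked[-k:]

def topology_frequencies(windows):
    """Return the relative frequency of each topology label."""
    counts = Counter(label for _, label in windows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Toy data: 20 windows with made-up rates and labels.
windows = [(0.25 * i, "topology 1") for i in range(3, 19)]
windows += [(0.25, "topology 3"), (0.50, "topology 3")]
windows += [(4.75, "topology 4"), (5.00, "topology 4")]

low, high = tail_windows(windows, fraction=0.10)
low_freq = topology_frequencies(low)
high_freq = topology_frequencies(high)
```

With real data the two tails would be tallied separately for the autosomes and the Z chromosome before comparing the frequencies.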

Fig. 5

Tree topology changes with variation in recombination rate and introgression. a The frequency distribution of the four most common topologies in the high- and low-recombination regions of the autosomes and the Z chromosome, respectively. b, c Interplay between the topological distribution and recombination rate variation (left) as well as between the topological distribution and genetic introgression (right) in the Z chromosome (b) and autosomes (c). Topology 4 (blue), which is congruent with the phylogeny inferred from the intron-set, is enriched in genomic regions with high recombination rates and high levels of gene flow, while topology 3 (reddish) is more common in genomic regions with low recombination rates and weaker signatures of gene flow. d ASTRAL species trees reconstructed for the low-recombination regions within the Z chromosome (left) and for the high-recombination regions within the autosomes (right), respectively. The two phylogenies differ in the position of P. montanella/P. rubida, P. fulvescens/P. koslowi, and P. modularis (indicated by reddish branches). The phylogeny of the high-recombination regions within the autosomes is similar to that inferred from the intron-set.

We then investigated the interplay between the topology distribution and variation in introgression and recombination rate. We specifically focused on gene flow between P. modularis, P. ocularis/P. atrogularis, P. montanella/P. rubida, and P. fulvescens/P. koslowi with Passer montanus as outgroup (see “Methods”). We found that the genomic regions supporting topology 4 have high rates of recombination and gene flow, while the genomic regions supporting topology 3 have low rates of recombination and introgression (Wilcoxon test, P < 0.001; Fig. 5b and Fig. 5c, Additional file 1: Fig. S3 and Fig. S4). This pattern is more pronounced in the Z chromosome than in the autosomes.

We further reconstructed ASTRAL trees separately from the 50-kb genomic windows in the upper and lower 10% of the recombination rate distribution, and found that the topology from the autosomal regions with the highest recombination rates was identical to the intron-set-based phylogeny (Fig. 5d, Additional file 1: Fig. S4). However, the phylogenetic relationships reconstructed from the low-recombination regions of the Z chromosome placed P. montanella + P. rubida as a separate lineage, instead of clustering it with P. koslowi + P. fulvescens as exhibited by the phylogeny based on the high-recombination regions (Fig. 5d, Additional file 1: Fig. S4). Taken together, these results suggest that the low-recombination regions within the Z chromosome tend to contain few introgressed segments and therefore likely reflect the speciation-driven branching relationships of the accentors.

Discussion

Phylogenomic relationship of accentors

Lineages that have experienced a rapid radiation are prone to ILS and interspecific hybridization, a situation that poses a great challenge for phylogenetic reconstruction [74] and 3D-DNA v190716 [75] was used to anchor contigs to scaffolds. Possible assembly errors such as misjoins, translocations, and inversions were manually examined and corrected using the Assembly Tools module within JUICEBOX v1.11.08 [74] (Additional file 1: Fig. S5). We aligned the P. strophiata genome with the Zebra finch (Taeniopygia guttata) genome using MUMmer v3.23 [76] and checked the collinearity of the two genomes.

Taxon sampling

We included all currently recognized species of Prunellidae (Supplementary Table 1), which consists of a single genus (Prunella) with twelve species [40, 42]. Prunella ocularis fagani was previously treated as a distinct species [77] but is now treated as a subspecies of P. ocularis [40, 42]. As P. o. fagani is geographically widely separated from P. o. ocularis, we herein treat P. o. fagani and P. o. ocularis as two taxonomic units. We included two to nine individuals for each species except for P. koslowi and P. atrogularis, for which only a single individual was available for each species. We used cryo-frozen or 96% ethanol-preserved tissue for all taxa except for P. o. fagani for which DNA was extracted from the toepad of a museum study skin.

DNA extraction, library preparation, and resequencing

DNA was extracted from the tissue and museum toepad samples of 34 accentors and two Tree Sparrows (Passer montanus) using the Qiagen QIAamp DNA Mini Kit according to the manufacturer’s protocol. Sequencing libraries for fresh tissues were prepared using the Illumina TruSeq PCR-free (190/350 bp) kit and were sequenced on an Illumina NovaSeq platform at Annoroad Gene Technology and Berry Genomic Institute. The library from the museum specimen was prepared using the protocol published by Irestedt et al. [78] and sequenced by SciLifeLab (Stockholm). The samples were sequenced to a mean coverage of 21× (Supplementary Table S2).

Filtering raw reads and reference mapping

Raw sequencing data were cleaned using the fastx toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) with the following steps: (1) removal of adapters and (2) removal of low-quality reads, defined as reads with a proportion of “N” > 3% or with > 50% low-quality bases (< 3). Raw sequencing data from the museum specimen were cleaned by the same procedure, except that 5 bp were additionally trimmed from both ends of the reads to avoid erroneous bases caused by DNA degradation. We mapped the clean reads of 34 accentors, two tree sparrows, and one Red-banded flowerpecker (Dicaeum eximium, GCA013396995) against the de novo genome of P. strophiata using BWA mem v0.7.12 [79], and then sorted the alignments and removed duplicates using Picard (http://broadinstitute.github.io/picard/). We called variants using bcftools mpileup v1.9 [80]. We removed indels and filtered the variant call format (VCF) file using the following criteria: (1) minQ > 30, (2) min-DP > 10 and max-DP < 2500, (3) max-missing rate ≤ 0.1, and (4) SNPs at least 5 bp away from indels. The filtered VCF was used for downstream analyses.
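The four filtering criteria can be expressed as a small Python function. This is a minimal sketch under assumptions, not the commands used in the study: the `Variant` record and the `filter_snps` function are hypothetical, depth is treated as a single per-site value, and each indel is reduced to a single position when measuring the 5-bp distance.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    # Hypothetical, simplified stand-in for one VCF record.
    chrom: str
    pos: int
    qual: float
    depth: int
    missing_rate: float
    is_indel: bool = False

def filter_snps(variants, min_qual=30, min_dp=10, max_dp=2500,
                max_missing=0.1, indel_gap=5):
    """Keep SNPs that pass the quality, depth, missingness and indel-distance rules."""
    # Collect indel positions per chromosome before dropping the indels.
    indel_positions = {}
    for v in variants:
        if v.is_indel:
            indel_positions.setdefault(v.chrom, []).append(v.pos)

    kept = []
    for v in variants:
        if v.is_indel:
            continue  # indels themselves are removed
        if v.qual <= min_qual:
            continue  # criterion 1: quality must exceed 30
        if not (min_dp < v.depth < max_dp):
            continue  # criterion 2: depth strictly between 10 and 2500
        if v.missing_rate > max_missing:
            continue  # criterion 3: at most 10% missing genotypes
        nearby = indel_positions.get(v.chrom, [])
        if any(abs(v.pos - p) < indel_gap for p in nearby):
            continue  # criterion 4: at least 5 bp away from any indel
        kept.append(v)
    return kept

variants = [
    Variant("chr1", 100, 50, 400, 0.0),                 # passes everything
    Variant("chr1", 200, 20, 400, 0.0),                 # quality too low
    Variant("chr1", 300, 50, 5, 0.0),                   # depth too low
    Variant("chr1", 400, 50, 400, 0.3),                 # too much missing data
    Variant("chr1", 500, 50, 400, 0.0, is_indel=True),  # indel, removed
    Variant("chr1", 503, 50, 400, 0.0),                 # 3 bp from the indel
    Variant("chr1", 505, 50, 400, 0.0),                 # exactly 5 bp away
]
kept_positions = [v.pos for v in filter_snps(variants)]
```

In practice these thresholds would be applied with a dedicated VCF tool; the sketch only makes the logic of each criterion explicit.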

Extracting and aligning homologous exonic and intronic loci

To investigate the potential influence of different genetic markers on phylogenetic inference, we assembled intronic and exonic datasets. We carried out these steps using the custom-designed BirdScanner pipeline [81] (github.com/Naturhistoriska/birdscanner). Specifically, we performed searches using profile hidden Markov models (HMM) [82] to obtain a large number of sequence homologs of nuclear exonic and intronic loci across the whole genome. Profile HMMs use information from variation in multiple sequence alignments to seek similarities in databases, or as here, genome assemblies [83]. The HMM profiles were based on the alignments of exonic and intronic loci generated by Jarvis et al. [1] for four passerine species, Acanthisitta chloris, Corvus brachyrhynchos, Geospiza fortis, and Manacus vitellinus. For each HMM query and taxon, the genomic location of the highest-scoring hit was identified, and the sequence was parsed out using the genomic coordinates. The parsed-out gene sequences were then aligned gene by gene using MAFFT v7.310 [84]. Poorly aligned sequences were identified from a calculated distance matrix using OD-Seq (github.com/PeterJehl/OD-Seq) and excluded from further analyses. We also checked the alignments manually and removed those that included non-homologous sequences for some taxa (indicated by an extreme proportion of variable positions in the alignment) and those that contained no phylogenetic information (no parsimony-informative sites). Finally, we retained only the alignments that contained all samples. A total of 2373 exonic and 6879 intronic loci were kept for the subsequent analyses. All separate alignments were combined into a single concatenated alignment for the concatenation analyses, or kept separate for the coalescent analyses based on gene trees.
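The two automatic criteria at the end of this filtering (no parsimony-informative sites, incomplete sampling) can be illustrated as follows. The sketch assumes the standard definition of a parsimony-informative site, namely a column in which at least two different bases each occur in at least two sequences; the function names and the dictionary layout are hypothetical and gaps and ambiguity codes are simply ignored.

```python
from collections import Counter

def count_parsimony_informative(sequences):
    """Count columns in which at least two bases each occur in >= 2 sequences."""
    informative = 0
    for column in zip(*sequences):
        # Only unambiguous bases are counted; gaps and N are ignored.
        counts = Counter(base for base in (c.upper() for c in column)
                         if base in "ACGT")
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative

def keep_alignment(alignment, required_samples):
    """Keep an alignment only if it is informative and contains every sample.

    `alignment` maps sample name -> aligned sequence (hypothetical layout).
    """
    if set(alignment) != set(required_samples):
        return False
    return count_parsimony_informative(list(alignment.values())) > 0

alignment = {
    "s1": "ACGTA",
    "s2": "ACGTA",
    "s3": "ATGCA",
    "s4": "ATGAA",
}
n_informative = count_parsimony_informative(list(alignment.values()))
complete = keep_alignment(alignment, ["s1", "s2", "s3", "s4"])
incomplete = keep_alignment(alignment, ["s1", "s2", "s3", "s4", "s5"])
```

Only the second column of the toy alignment is informative (C and T each occur twice); the fourth column is variable but has just one base shared by two sequences.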

Phylogenomic analyses

We used both concatenated and coalescent approaches to estimate phylogenomic relationships of the accentors for the intron-set and exon-set, respectively. For the concatenated approach, trees were constructed for the exon-set and intron-set separately using IQ-TREE [43], applying the “-m TEST” option to find the best substitution model for each alignment. We inferred the maximum-likelihood trees from the two concatenated datasets with 1000 ultrafast bootstraps to obtain branch supports, as implemented in IQ-TREE [85].

For the coalescent analyses, we first used IQ-TREE to estimate the best maximum-likelihood tree for each intronic or exonic dataset. Statistical confidence of each gene tree was assessed by performing 100 bootstrap replicates using the best substitution model for each alignment. We used ASTRAL-III v5.6.3 [44, 45] to construct coalescent trees from the best maximum-likelihood gene trees estimated for the exon-set and intron-set separately. We also ran MP-EST coalescent analyses (MP-EST v2.1) [46] with 100 runs beginning with different random seed numbers and ten independent tree searches within each run. The MP-EST species tree topology was inferred using the best maximum-likelihood gene trees as input. Confidence of each node was evaluated by performing the same species tree inference analysis on 100 maximum-likelihood bootstrap gene trees. The resulting 100 species trees estimated from bootstrapped samples were summarized onto the ASTRAL and MP-EST species trees using the option “-f b” in RAxML.

Test topological difference between estimated gene trees and species trees

We next considered whether topological differences between the estimated gene trees and the species trees are well supported. For each locus, we tested the estimated gene tree topology against each of the four candidate species trees that were inferred for the intron-set and exon-set, respectively (see “Results”). We used approximately unbiased (AU) tests in IQ-TREE to test whether individual gene trees fit each of the four candidate species trees. For each gene tree, a species tree topology was considered rejected at a significance level of 0.05 after Bonferroni correction for multiple comparisons.
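As a worked illustration of the rejection rule, the sketch below applies a Bonferroni correction to the AU-test P values of one locus. It assumes that the correction is made over the four candidate species trees tested per locus, which the text does not state explicitly; the tree labels, P values, and function name are invented for the example.

```python
def rejected_species_trees(p_values, alpha=0.05):
    """Return the candidate trees rejected after Bonferroni correction.

    `p_values` maps a candidate tree label to its AU-test P value for one locus.
    The per-test threshold is alpha divided by the number of comparisons.
    """
    threshold = alpha / len(p_values)
    return [tree for tree, p in p_values.items() if p < threshold]

# Made-up AU-test P values for a single locus and four hypothetical trees.
locus_p_values = {
    "tree_A": 0.20,
    "tree_B": 0.03,
    "tree_C": 0.01,
    "tree_D": 0.0004,
}
rejected = rejected_species_trees(locus_p_values)
```

With four comparisons the per-test threshold is 0.05 / 4 = 0.0125, so a P value of 0.03 no longer counts as a rejection even though it is below 0.05.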

Coalescent simulations

To investigate how much gene tree heterogeneity can be explained by ILS and gene tree estimation error, we carried out coalescent simulations as described in Cai et al. [

Integrating signals of topology distribution and variation in introgression and recombination rate

We estimated chromosome-wide introgression using the fd statistic in 50-kb non-overlapping windows using ABBABABAwindows.py [97]. We specifically focused on comparisons between P. montanella/P. rubida and P. fulvescens/P. koslowi, as these two lineages account for the major topological conflicts observed (see “Results”). We estimated fd values for each of the four trios using P. modularis as outgroup, and then calculated their average fd values for subsequent comparisons. To assess how topology frequency changes with variation in introgression and recombination rate, we compared average fd and r values between the windows supporting the different topologies for the autosomes and the Z chromosome, respectively. We used the Wilcoxon test to assess statistical significance.
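For readers unfamiliar with the fd statistic, the sketch below computes it for one window from derived-allele frequencies, following the commonly used definition in which the denominator replaces both P2 and P3 by whichever of the two has the higher derived-allele frequency at each site. It is an illustration under that assumption and not a reimplementation of ABBABABAwindows.py; the function names and the input layout are hypothetical.

```python
import math

def abba_minus_baba(p1, p2, p3, p_out):
    """Per-site ABBA minus BABA from derived-allele frequencies."""
    abba = (1 - p1) * p2 * p3 * (1 - p_out)
    baba = p1 * (1 - p2) * p3 * (1 - p_out)
    return abba - baba

def fd_statistic(sites):
    """Compute fd for one window.

    `sites` is an iterable of (p1, p2, p3, p_out) derived-allele frequencies.
    Returns NaN when the denominator is not positive.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2, p3, p_out in sites:
        numerator += abba_minus_baba(p1, p2, p3, p_out)
        p_donor = max(p2, p3)  # donor proxy: the higher of P2 and P3
        denominator += abba_minus_baba(p1, p_donor, p_donor, p_out)
    if denominator <= 0:
        return math.nan
    return numerator / denominator

# Toy windows: complete sharing between P2 and P3, and no excess sharing.
fd_full = fd_statistic([(0.0, 1.0, 1.0, 0.0)])
fd_none = fd_statistic([(0.5, 0.5, 1.0, 0.0)])
fd_undefined = fd_statistic([(0.0, 0.0, 0.0, 0.0)])
```

The per-window values from the four trios would then be averaged, and the averages compared between windows supporting different topologies.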