Background

Tumors contain multiple, genetically diverse subclonal populations of cells that have evolved from a single progenitor population through successive waves of expansion and selection [1-3]. Reconstructing their evolutionary histories can help identify characteristic driver mutations associated with cancer development and progression [4,5], and can provide insight into how tumors might respond to treatment [6,7]. In some cases, it is possible to genotype the subpopulations present in a tumor, while reconstructing its history, using the population frequencies of mutations that distinguish these subclonal populations [2,8-21]. Increasingly, tumors are being characterized using whole-genome sequencing (WGS) of bulk tumor samples [22], but few automated methods exist that can reliably perform this reconstruction from such data.

Subclonal reconstruction algorithms attempt to infer the population structure of heterogeneous tumors based on the measured variant allelic frequency (VAF) of their somatic mutations. Some methods perform this reconstruction based solely on single nucleotide variants or small indels (collectively known as simple somatic mutations or SSMs) [16-19,21,23]. Others use changes in read coverage to identify genomic regions with an average ploidy that differs from normal, which they explain using inferred copy number variations (CNVs) that affect some of the cells in the sample [15,20,24,25].

The low read depth of current WGS complicates subclonal reconstruction. Until recently, subclonal populations (i.e., subpopulations) were defined based on accurate estimates of the proportion of cells with each mutation (i.e., their population frequency), which, for individual SSMs, are only available through targeted resequencing where the read depths are orders of magnitude higher than typical WGS depths [17,18,23]. However, preliminary evidence suggests that the much larger number of mutations detected by WGS can compensate for their decreased read depth [26]. In contrast, CNVs affect large, multi-kilobase-sized or megabase-sized regions of the genome, which allows the average copy number of these regions to be accurately estimated with WGS. Unfortunately, CNV-based subclonal reconstruction is more difficult than SSM-based reconstruction because the population frequency and the new copy number of each CNV must be estimated simultaneously. Most CNV-based methods only attempt to infer the copy number status of the clonal cancerous population [24,25], which contains the mutations shared by all of the cancerous cells. The few CNV-based methods [15,20] that attempt to resolve more than one cancerous subpopulation are practically limited to a small number (often two) of subpopulations. In contrast, SSM-based methods applied to targeted resequencing data can reliably resolve many more cancerous subpopulations [16-18,23]. However, it remains unclear what the limits of WGS-based automated subclonal reconstruction are.

Another open question is how to combine CNVs and SSMs when doing reconstruction. CNVs overlapping SSMs can interfere with SSM-based reconstruction because they complicate the relationship between VAF and population frequency. Although some methods attempt to model the impact of CNVs on the allele frequency of overlapping SSMs [17-19,27], these methods have significant restrictions. For example, several of these methods [17,18] make the unrealistic assumption that every cell either contains the structural variation and the mutation or neither. Also, no method places structural variations in a phylogenetic tree, which is important for studying the evolution of cancerous genomes.

We describe PhyloWGS, the first method designed for complete subclonal phylogenic reconstruction of both CNVs and SSMs from WGS of bulk tumor samples. Unlike all previous methods, PhyloWGS appropriately corrects SSM population frequencies in regions overlapping CNVs and is fast enough to perform reconstruction of at least five cancerous subpopulations based on thousands of mutations. We present results on subclonal reconstruction problems that cannot be correctly reconstructed using previous methods. We also probe the relationship between WGS read depth and the number of subpopulations that PhyloWGS can recover. Finally, we demonstrate that even in the absence of reliable CNV estimates, it is still feasible to perform automated subclonal composition reconstruction based on SSM frequency data at typical WGS read depths (30 to 50 ×), even for highly rearranged genomes where less than 2% of the SSMs lie in regions of normal copy number. Open-source, free software implementing PhyloWGS is available under the GNU General Public License v3 [28].

Previous work

Figure 1 provides an overview of an evolving tumor, the measurement of somatic VAFs and the resulting subclonal reconstruction process. Panel (i) of this figure shows a visualization of the evolution of a tumor over time as non-cancerous cells (subpopulation A, shown in grey) are replaced by, at first, one clonal cancerous population (subpopulation B, shown in green), which then further develops into multiple cancerous subpopulations (C and D, shown in blue and yellow, respectively). Tumor cells define new subpopulations by acquiring new oncogenic mutations that allow their descendants to expand relative to the other tumor subpopulations. Each circle in panel (i) refers to a subpopulation. We associate each subpopulation with the set of shared somatic mutations that distinguishes it from its parent subpopulation (or, in the case of A, from the germ line or reference genome); this mutation set is indicated by the corresponding lower case letter (e.g., mutation set b first appears in subpopulation B). However, each subpopulation also inherits all of its parent’s mutations; the subclonal lineage of a mutation is the set of all subpopulations that contain it (e.g., the subclonal lineage of a is A, B, C and D).

Figure 1

The development of intratumor heterogeneity and subclonal reconstruction. Tumor composition over time (i), the resulting distribution of variant allele frequencies (VAFs) (ii), the result of successful inference of the VAF clusters (iii), and the desired output of subclonal inference (iv). SSM, simple somatic mutation; VAF, variant allelic frequency.

In general, the subpopulation-defining mutation sets include more than one mutation. Cancerous cells often have increased mutation rates, and even non-cancerous cells accumulate somatic mutations at a rate of 1.1 per cell division [29]. As such, subpopulations are defined not only by the small number of oncogenic ‘driver’ mutations that support rapid expansion but also by a larger number of ‘passenger’ mutations acquired before the driver mutation(s). The selective sweeps that cause subpopulation expansion increase the population frequency of both driver and passenger SSMs, driving them to have indistinguishable population frequencies [30,31]. However, sampling and technical noise in sequencing mean that the observed VAFs are distributed around the true value for a subpopulation. Panel (ii) shows an example histogram of measured VAFs for SSMs found in a heterogeneous tumor sample.
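The effect of read depth on this spread can be illustrated with a minimal sketch. It assumes, purely for illustration, that variant read counts follow a binomial distribution with success probability equal to the true VAF; technical noise beyond sampling is ignored, and the function name and the example depths (40 × for WGS, 4,000 × for targeted resequencing) are hypothetical choices rather than values taken from any particular data set.

```python
import math


def vaf_standard_error(true_vaf: float, read_depth: int) -> float:
    """Standard deviation of the observed VAF under binomial read sampling.

    Assumption: each read independently carries the variant allele with
    probability `true_vaf`, so the variant count is Binomial(read_depth, true_vaf).
    """
    return math.sqrt(true_vaf * (1.0 - true_vaf) / read_depth)


if __name__ == "__main__":
    # Compare a WGS-like depth with a targeted-resequencing-like depth.
    for depth in (40, 4000):
        print(depth, vaf_standard_error(0.25, depth))
```

Under this model the spread shrinks with the square root of the read depth, so a 100-fold increase in depth narrows each VAF cluster by a factor of 10.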

Subclonal reconstruction algorithms define mutation sets, and their associated subpopulations, by analyzing the population frequencies of somatic mutations detected in a tumor sample. In Figure 1, all mutations are SSMs, and all SSMs occur on one copy in diploid regions of the genome. In this case, the estimated population frequency of an SSM is simply twice its VAF. Figure 2, discussed in the next section, shows how CNVs overlapping SSM loci change this relationship. Note that although each VAF cluster corresponds to a subclonal lineage, and a subpopulation that was present at some point during the tumor’s evolution, this subpopulation need not be present when the tumor is sampled. In Figure 1, subpopulation B is no longer present in the tumor, although its two descendant subpopulations are. These vestigial VAF clusters, if they exist, always correspond to subpopulations at branchpoints in the phylogeny; however, not every branchpoint generates a vestigial cluster.
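The diploid, single-copy case described above reduces to a one-line calculation. The sketch below is illustrative only: the function name is hypothetical, and clipping the estimate at 1 is an added assumption to keep noisy estimates within the valid range, not something prescribed by the text.

```python
def population_frequency_diploid(variant_reads: int, total_reads: int) -> float:
    """Estimate the population frequency of a heterozygous SSM in a diploid region.

    Assumptions: the SSM sits on exactly one of two copies in every cell that
    carries it, and no CNV overlaps the locus. Clipping at 1.0 is an
    illustrative choice for handling sampling noise.
    """
    vaf = variant_reads / total_reads
    return min(1.0, 2.0 * vaf)


if __name__ == "__main__":
    # 10 variant reads out of 40 gives a VAF of 0.25.
    print(population_frequency_diploid(10, 40))
```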

Figure 2

Example of copy number variations affecting the distribution of variant allele frequencies.

Simple-somatic-mutation-based approaches

SSM-based subclonal reconstruction algorithms attempt to reconstruct the subpopulation genotypes based on VAF clusters (and their associated mutation sets) identified by fitting statistical mixture models to the VAF data either without phylogenic reconstruction [18,19,21,32], before phylogenic reconstruction [33] or concurrently with it [16,17]. Often, as in Figure 1, the clusters overlap, which introduces uncertainty in the exact number of mutation sets represented in the tumor (as well as in the assignment of SSMs to clusters). Adding more clusters to the model always provides a better data fit, so to prevent overfitting, the cluster number is selected by balancing data fit versus a complexity penalty (e.g., the Bayesian information criterion) or by Bayesian inference in a non-parametric model [17,18,32]. In panel (iii) in Figure 1, the correct number of clusters has been recovered along with appropriate central VAFs.

Assuming that the correct VAF clusters can be recovered, the subclonal lineages corresponding to each mutation set must still be defined. Defining the subclonal lineages is equivalent to defining the tumor phylogeny, and often multiple phylogenies are consistent with the recovered VAF clusters (e.g., panel (iv) in Figure 1). Complete and correct reconstruction of subpopulation genotypes requires resolving this ambiguity. To do so, reconstruction methods make one of a handful of assumptions about the process of tumor evolution.

A common, and powerful, assumption is the infinite sites assumption (ISA) [17,34,35], which posits that each SSM occurs only once in the evolutionary history of the tumor. The ISA implies that the tumor evolution is consistent with a ‘perfect and persistent phylogeny’ [18]: each subpopulation has all of the SSMs that its ancestors had, each SSM appears in only one subclonal lineage and each subclonal lineage corresponds to a subtree in the phylogeny of tumor subpopulations. Because SSMs are relatively rare (compared to the genome size), the ISA is nearly always valid for all SSMs, so there is little danger of incorrect reconstructions due to violations of the ISA. In many cases, the ISA alone permits the recovery of multiple, complete subpopulation genotypes from a single or small number of tumor samples using either the sum rule [17] (also called the pigeonhole principle [26]) or the crossing rule [17], respectively. Methods that do not use the ISA require, in the case of no measurement noise, at least as many tumor samples as there are subpopulations [16,36]; in actual application when there is noise, even more samples are required.
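The sum rule can be sketched as a simple consistency check on a candidate tree: under the ISA, the population frequencies of the mutation sets introduced by sibling subpopulations cannot add up to more than the population frequency of their parent's mutation set. The code below is a minimal illustration, not the procedure used by any of the cited methods; the function name, the data layout and the example frequencies for mutation sets a to d are all hypothetical.

```python
from typing import Dict, List


def tree_is_consistent(
    freq: Dict[str, float],
    children: Dict[str, List[str]],
    tol: float = 1e-9,
) -> bool:
    """Check the sum rule for every parent in a candidate phylogeny.

    `freq` maps each mutation set to its population frequency (the fraction of
    cells in its subclonal lineage); `children` maps each mutation set to the
    mutation sets of its child subpopulations. `tol` absorbs rounding error.
    """
    for parent, kids in children.items():
        if sum(freq[k] for k in kids) > freq[parent] + tol:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical population frequencies for mutation sets a, b, c and d.
    freq = {"a": 1.0, "b": 0.8, "c": 0.6, "d": 0.5}
    branching = {"a": ["b"], "b": ["c", "d"]}  # C and D are siblings
    linear = {"a": ["b"], "b": ["c"], "c": ["d"]}  # D descends from C
    print(tree_is_consistent(freq, branching))
    print(tree_is_consistent(freq, linear))
```

With these illustrative frequencies, c and d together exceed b, so they cannot sit on separate branches below B; the sum rule leaves only the linear arrangement.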

Unfortunately, the ISA alone is often unable to resolve reconstruction ambiguity fully. As such, some methods [16,33] also make a sparsity assumption to select among ISA-respecting phylogenies consistent with the VAF data. This assumption, which we call strong parsimony, posits that due to expansion dynamics, there are a small number of subpopulations still present in the tumor [16,33], and that many of the VAF clusters are vestigial. These methods therefore select the phylogeny (or phylogenies) that maximizes the number of vestigial VAF clusters [16], or equivalently, the number of branchpoints where the parental subpopulation has a zero frequency in the current tumor [16,33]. The strong parsimony assumption does resolve some ambiguity, and leads to the correct reconstruction in Figure 1, but it is risky as its empirical validity is not yet established. For example, under some conditions, a linear (i.e. non-branching) phylogeny can be mistaken for a branching one; the risk of this occurring increases as the VAF measurement noise or the number of subpopulations in the tumor increases. This background distribution of false positive vestigiality is not yet considered by either of the methods that assume strong parsimony.

By assigning all SSMs within a VAF cluster to the same mutation set, reconstruction methods make another implicit assumption, which we call weak parsimony. This assumption does not hold if two mutation sets have the same population frequency. Note that if the ISA is valid, by the pigeonhole principle, weak parsimony is guaranteed to be valid whenever the population frequency of the mutation set is >50%.

Table 1 classifies reconstruction methods based on these assumptions, whether they recover complete subpopulation genotypes (or simply identify subclonal lineages), and whether they can handle single tumor samples, multiple tumor samples or both.

Table 1 Subclonal reconstruction methods, their properties and assumptions

PhyloWGS, like its predecessor PhyloSub [17], does not make the strong parsimony assumption nor does it report only a single tree. Instead it reports samples from the posterior distribution over phylogenies. Because the clustering of the VAF is performed concurrently with phylogenic reconstruction, PhyloWGS is able to perform accurate reconstruction even when the weak parsimony assumption is violated in a strict subset of the samples available, for example, if the VAF clusters overlap in one sample but not another. Our Markov chain Monte Carlo (MCMC) procedure samples phylogenies from the model posterior that are consistent with the mutation frequencies and does not rule out phylogenies that are equally consistent with the data. From this collection of samples, areas of certainty and uncertainty in the reconstruction can be determined.

Copy-number-variation-based approaches

There are three major differences between CNV-based reconstruction and SSM-based reconstruction. First, because CNVs affect large regions of the genome, reads mapping across these regions can be used to estimate average ploidy, and changes in average copy number can be accurately quantified at much smaller read depths (as low as 5 to 7 ×) [15,

$$x = \phi C + (1-\phi) 2, $$

always has at least two different solutions for x>1.
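This ambiguity can be made concrete by enumerating solutions. The sketch below reads x as the measured average copy number of a region, φ as the population frequency of the CNV and C as its new integer copy number; this interpretation of the symbols is an assumption drawn from the surrounding text. The function name and the upper bound on C are hypothetical.

```python
from typing import List, Tuple


def copy_number_solutions(x: float, max_copy_number: int = 6) -> List[Tuple[int, float]]:
    """List (C, phi) pairs satisfying x = phi * C + (1 - phi) * 2.

    Only integer C in [0, max_copy_number] and 0 < phi <= 1 are considered.
    C = 2 is skipped because it leaves x = 2 for every phi.
    """
    solutions = []
    for c in range(0, max_copy_number + 1):
        if c == 2:
            continue
        # Rearranging the equation gives phi = (x - 2) / (C - 2).
        phi = (x - 2.0) / (c - 2.0)
        if 0.0 < phi <= 1.0:
            solutions.append((c, phi))
    return solutions


if __name__ == "__main__":
    print(copy_number_solutions(1.5))  # a region with an average loss
    print(copy_number_solutions(2.5))  # a region with an average gain
```

For an average copy number of 1.5, both a homozygous deletion in a quarter of the cells and a single-copy loss in half of the cells fit equally well, which is why additional information or assumptions are needed.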

In the absence of other information, like B-allele frequencies [26], parsimony assumptions are relied upon to resolve reconstruction ambiguities. One strategy only attempts to reconstruct the cancerous, subclonal lineage [24,25] with the highest population frequency (also known as the clonal population). From this reconstruction, the proportion of cells in the tumor sample that are cancerous (i.e. the cellularity), as well as the CNVs that are shared by all cancerous cells in the tumor, can be inferred. However, this approach can fail when there are multiple subclonal populations, especially if they share few CNVs [15,20]. Methods that attempt to detect >1 cancerous subpopulation do so by balancing data fit with a complexity term that penalizes additional subpopulations [15,20]. So far, these methods seem to be practically limited to a small number of cancerous subpopulations (i.e., two), and cannot be applied to tumors with substantial rearrangements.

Combining simple somatic mutations and copy number variations

In loci affected by CNVs, computing the population frequency of an SSM from its VAF requires knowing whether the SSM occurred before, after or independently of the CNV. If the SSM occurred before the CNV, and the CNV affects the copy number of the SSM, then computing its VAF also requires knowing the new number of maternal and paternal copies of the locus. Figure 2 illustrates a situation where incorporating CNV information is critical for subclonal reconstruction. Without CNV information, the two VAF peaks would be interpreted as two separate subclonal lineages. With CNVs, it becomes clear that the second peak is caused by the amplification of part of the genome that increases the VAF of all SSMs found in the region.
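A small calculation shows how a single subclonal lineage can produce two VAF peaks. The sketch below is a simplified illustration under stated assumptions, not the PhyloWGS model: cells without the SSM carry two reference copies, cells with the SSM but no CNV carry one variant and one reference copy, and in cells with both, the SSM preceded the CNV so its variant copies are set by the amplification. The function name, its parameters and the example numbers are hypothetical.

```python
def expected_vaf(
    phi_ssm_only: float,
    phi_ssm_and_cnv: float,
    variant_copies_after_cnv: int,
    total_copies_after_cnv: int,
) -> float:
    """Expected VAF at a locus as the ratio of variant copies to all copies.

    `phi_ssm_only` is the fraction of cells with the SSM in a diploid state;
    `phi_ssm_and_cnv` is the fraction of cells with both the SSM and the CNV.
    All remaining cells are assumed diploid without the SSM.
    """
    phi_without_ssm = 1.0 - phi_ssm_only - phi_ssm_and_cnv
    variant = phi_ssm_only * 1 + phi_ssm_and_cnv * variant_copies_after_cnv
    total = (
        phi_without_ssm * 2
        + phi_ssm_only * 2
        + phi_ssm_and_cnv * total_copies_after_cnv
    )
    return variant / total


if __name__ == "__main__":
    # One lineage present in half of the cells (hypothetical value).
    outside_cnv = expected_vaf(0.5, 0.0, 2, 3)  # locus not amplified
    inside_cnv = expected_vaf(0.0, 0.5, 2, 3)   # variant allele duplicated: 2 of 3 copies
    print(outside_cnv, inside_cnv)
```

In this example, naively doubling the VAF inside the amplified region would suggest a population frequency of 0.8 instead of 0.5, and so would wrongly imply a second lineage.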

Some subclonal reconstruction methods simply ignore the impact that CNVs have on the relationship between SSM population and allele frequency [16,21]. Other methods that do account for the effect of copy number changes on SSM frequencies [17-19] do so by integrating over all the possible relationships between allele frequency and population frequency. In doing so, they do not exploit the fact that the ISA, which was necessary to associate SSMs uniquely to subclonal lineages in the first place, constrains this relationship [26].

For the first time, we describe an automated method, PhyloWGS, which performs subclonal reconstruction using both CNVs and SSMs. By combining information from both CNVs and SSMs, and properly accounting for their interaction, we provide a more comprehensive and accurate description of a subclonal genotype.