 Method
 Open Access
 Published:
Associating somatic mutations to clinical outcomes: a pancancer study of survival time
Genome Medicine volume 11, Article number: 37 (2019)
Abstract
We developed subclone multiplicity allocation and somatic heterogeneity (SMASH), a new statistical method for intratumor heterogeneity (ITH) inference. SMASH is tailored to the purpose of largescale association studies with one tumor sample per patient. In a pancancer study of 14 cancer types, we studied the associations between survival time and ITH quantified by SMASH, together with other features of somatic mutations. Our results show that ITH is associated with survival time in several cancer types and its effect can be modified by other covariates, such as mutation burden. SMASH is available at https://github.com/Sunlab/SMASH.
Background
Somatic mutations, including somatic point mutations (SPMs; e.g., single nucleotide variants or indels) and somatic copy number alterations (SCNAs), are the underlying driving force for tumor growth. In this sense, cancer is a genetic disease. Therefore, association studies between somatic mutations and clinical outcomes may provide insights into tumor biology or personalized treatment selection. However, few efforts have been reported toward this end, partly because most somatic mutations or even genelevel mutations are too rare to conduct meaningful association studies. An alternative to a mutationbymutation or genebygene association study is to summarize mutation information by certain features and then associate such features with clinical outcomes. In this paper, we consider three such features: tumor mutation burden (TMB, i.e., the total number of SPMs), SCNA burden, and the degree of (genetic) intratumor heterogeneity (ITH), which refers to the fact that tumor cells can be grouped in subclones such that the cells within one subclone share similar sets of somatic mutations. ITH is a fundamental characteristic of somatic mutations and has been associated with clinical outcomes such as survival time or immunotherapy treatment response [1, 2].
We estimate TMB by counting the number of nonsynonymous point mutations [3, 4] and estimate the burden of SCNAs using allelespecific copy number estimates derived from ASCAT [5]. While measuring TMB and SCNA burden is relatively straightforward, quantifying ITH is much more challenging. Computational methods have been developed to characterize ITH, e.g., to identify the phylogenetic tree of subclones and the mutations belonging to each subclone [6–11]. However, consensus on the optimal approach for ITH inference and the appropriate approach for quantifying ITH in association studies does not exist. The estimation uncertainty of ITH is often unavoidable because the observed data may be compatible with more than one subclone configuration. Therefore, such uncertainty should be incorporated in association studies.
Counting the number of subclones is a straightforward approach to quantify ITH. Andor et al. [1] assessed the association between the number of subclones and survival time in 12 cancer types using data derived from The Cancer Genome Atlas (TCGA). These investigators did not find any significant associations, except for gliomas. Morris et al. [12] assessed the association between ITH and survival time in nine cancer types and found significant associations for several cancer types. They treated ITH as a binary variable based on whether or not the number of subclones was larger than a threshold. An apparent drawback of the aforementioned two approaches is that the subclone proportion information is lost. For example, if a tumor sample has two subclones with cellular proportions being 99% and 1%. Intuitively, this tumor sample is fairly homogenous and may be better classified as one subclone instead of two subclones. A second drawback of the thresholding approach of Morris et al. [12] was that only a small number of patients (3 to 11 patients across nine cancer types, median of six patients) were classified as having both high ITH and noncensored survival time. As a result, the association results can be highly unstable with respect to ITH inference.
An alternative metric to quantify ITH is mutantallele tumor heterogeneity (MATH) [13], which is defined as 100×MAD/median, where median is the median of the variant allele frequencies (VAFs) of all somatic point mutations within a sample, and MAD is the median absolute deviation of the VAFs. MATH pertains to the ratio between the center and spread of the VAF distribution. This approach ignores the fact that VAF can be affected by SCNAs (see Fig. 1 for an illustration).
Although many methods have been developed for ITH inference, none of them are ideal for largescale association studies. In most solid tumors, a significant proportion of the genome is affected by SCNA, and so, those methods that cannot account for SCNA [8, 14–17] are not appropriate for our purpose. Several methods either explicitly or implicitly require multiple samples per patient [6, 8, 10, 14, 15, 17] and thus cannot be used for our association analysis of TCGA data, where each patient only has one sample.
PyClone [11] is arguably the most popular method for ITH study and has been used in two pancancer studies [1, 12]. However, PyClone is designed for targeted sequencing studies, where a small number of loci are sequenced with ultrahigh coverage (e.g., >1,000 × coverage). Its Bayesian Markov Chain Monte Carlo (MCMC) implementation requires an extended runtime. In addition, PyClone performs clustering of somatic mutations, but does not infer phylogeny.
Many other existing methods for ITH study [7, 9–11] also use Bayesian MCMC implementation and their computational burden makes them undesirable for largescale association studies. Another class of methods use combinatorial approaches [6, 16, 17]. Several approaches do not account for SCNA [16, 17]. SPRUCE [6], a more recent algorithm, jointly models SPMs and SCNAs by multistate perfect phylogeny mixtures. It is designed for multisample study with a small number of mutations or mutation clusters. For example, as shown in their simulations, even with only 5 mutations or mutation clusters, at 500 × coverage, the median number of solutions is between 1000 and 10,000 when there are two samples, and around 100 when there are 5 samples. Other ITH quantification methods either do not provide an easytouse uncertainty measurement [18] or require additional (hardtoget) information such as phasing between sparsely distributed somatic mutations [19].
Given these considerations, we developed a new method for ITH study, called subclone multiplicity allocation and somatic heterogeneity or SMASH. To overcome the limitations of the aforementioned approaches for quantifying ITH, we quantify ITH, as previous studies have done [12, 20], using entropy \(\sum _{s=1}^{S} \vartheta _{s} \log {(\vartheta _{s})}\), where 𝜗_{s} is the proportion of tumor cells that belong to the sth subclone and S is the total number of subclones. We assessed the performance of SMASH and a few other methods in largescale simulated association analysis. Then we used these methods to study the association between survival time and TMB, SCNA burden, frequently mutated genes, and ITH using data on 5898 TCGA tumor samples from 14 cancer types [21].
The major contributions of our work are threefold. First, we propose a new computational method that is designed for largescale studies of ITH with higher computational efficiency. Second, we evaluated the benefit to incorporate uncertainty of ITH estimates in association studies and conclude that there is positive but relatively minor benefit. Third, in the largescale real data analysis, we found several interesting patterns such as the interaction between mutation burden and ITH.
Methods
Assumptions
SMASH is a frequentist approach to identify tumor subclones through clustering somatic mutation read counts, while accounting for copy number alterations. We enumerate all possible phylogenetic trees that are compatible with the observed data and quantify the probability of each phylogenetic tree. We make the following assumptions when enumerating phylogenic trees.

(1)
Primary tumors arise from a founder clone or have unicellular origin.

(2)
Loci harboring SPMs associated with ITH have homozygous reference alleles in normal cells and a mixture of reference and alternate alleles in tumor.

(3)
Each SPM event occurs only once on a single allele and a locus will not undergo more than one point mutation or revert back to its original base.

(4)
At most two descendant subclones can evolve from an ancestral subclone.

(5)
SCNAs are clonal events.
Assumption (1) follows from the clonal evolution theory of tumor growth [22]. Assumption (2) is automatically satisfied because genetic loci with germline mutations are filtered out during somatic mutation calling. Assumption (3) is referred to as the infinite site assumption [23, 24], which is reasonable because the number of mutated loci is very small relative to the size of the genome. This assumption implies that tumor evolution is consistent with a “perfect and persistent phylogeny” [9, 11] such that each subclone has only one parental subclone and all mutations of the parental subclone. Assumption (4) is reasonable when we consider tumor evolution in a refined time scale, and it is helpful to reduce the number of enumerated phylogenies. Assumption (5) is the only restrictive one, and it is a crucial assumption made by ASCAT [5], which is the method we use to infer copy numbers. Assumption (5) is also adopted by PyClone [11] and EXPANDS [18], the two methods that have been used in previous pancancer studies [1, 12]. To the best of our knowledge, Canopy [10] is the only method that can infer both subclonal SCNA and subclonal point mutations. However, Canopy carries a high computational cost and emphasizes multiple sample design, which makes it unsuitable for our study. By assuming clonal SCNA, all subclonal SPMs occur after the SCNA event and thus have a multiplicity of one. On the other hand, clonal SPMs can occur before or after SCNA and thus can have varying multiplicities, depending on the copy number state. We obtain SCNArelated information, including tumor purity, ploidy, and allelespecific copy numbers per SPM through ASCAT [5].
Notation and framework
Let T and \(\tilde {T}\) denote the failure time and the corresponding censoring time, respectively. Define \(X = \min (T,\tilde {T})\) and \(\Delta = I(T \leq \tilde {T})\). Let Z=(Z_{1},…,Z_{p})^{T} represent a pvector of baseline covariates. Let l=1,…,L index each locus harboring a SPM after mutation calling and filtering. The lth SPM is characterized by a pair of alternate and reference read counts derived from the tumor sample denoted by A_{l} and R_{l}, respectively. The summation T_{l}=A_{l}+R_{l} is referred to as the total read depth. The corresponding clonal copy number state is denoted by (C_{l1},C_{l2}), where C_{l1}≤C_{l2}. For a given subject, the observed clinical data consist of (X,Δ,Z), and genomic data are represented by (A_{l},R_{l},C_{l1},C_{l2}) for l=1,…,L.
Assume that the tumor sample of interest has S subclones. These S subclones relate to each other through a phylogenic tree describing the order in which subclones emerged. In Fig. 2, we enumerated all phylogenic trees for one to five subclones that capture the possible linear and branching evolutions between subclones. A possible allocation of somatic mutations across the S subclones can be described by a vector of length S: \(\boldsymbol {q}_{u}^{T} = (q_{u1}, \ldots,q_{uS})\) such that q_{us} is an indicator of whether this mutation occurs in the sth subclone. Each phylogenic tree that we enumerate in this paper is compatible with a set of allocations. Let k index each enumerated phylogenic tree, and let Q_{k} denote a set of allocations of the kth phylogenic tree. For both simulation and real data analysis, we enumerated all phylogenic trees with one to five subclones. In simulation, given a phylogenic tree, each SPM was randomly assigned an allocation with equal probability.
To illustrate, a clonal sample (S=1) would have Q_{1}=(q_{11}), where q_{11}=1 for all SPMs because each SPM is present in all cancer cells. For a sample with two subclones (S=2), only one possible tree A→B exists with a founding subclone A and a new subclone B. Then, the set of allocations are Q_{2}=(q_{21},q_{22}), where \(\boldsymbol {q}_{21}^{\mathrm {T}} = (1,1)\) and \(\boldsymbol {q}_{22}^{\mathrm {T}} = (0,1)\). The SPMs with allocation q_{21} arise in the founding subclone A, and the SPMs with allocation q_{22} arise in the new subclone B. For S=3, we need to distinguish between linear and branching trees. Let \(\boldsymbol {q}_{31}^{\mathrm {T}} = (1,1,1)\), \(\boldsymbol {q}_{32}^{\mathrm {T}} = (0,1,1)\), \(\boldsymbol {q}_{33}^{\mathrm {T}} = (0,0,1)\), and \(\boldsymbol {q}_{34}^{\mathrm {T}} = (0,1,0)\). The linear tree is characterized by Q_{3}=(q_{31},q_{32},q_{33}), whereas a branching tree is characterized by Q_{4}=(q_{31},q_{33},q_{34}). (See Additional file 1: Section C.2 for all enumerated configurations based on the list of subclonal assumptions.)
For a clonal SPM located in a region of SCNA, we need to infer its multiplicity, or the number of mutant alleles. If the SPM occurs before the SCNA, its multiplicity is one of the two allelespecific copy numbers of the SCNA; otherwise, its multiplicity is 1. In contrast, based on our assumption that SCNAs are clonal, the multiplicity of a subclonal SPM is always 1. Let M_{l} be the set of possible multiplicities given the copy number states. Then, M_{l}={mm>0 and m∈unique(1,C_{l1},C_{l2})}, where unique (Z) denotes the unique elements of Z.
With S subclones, let η_{s} denote the proportion of cells in a tumor sample that belong to subclone s, and let η^{T}=(η_{1},…,η_{S}). Tumor samples derived from bulk tissues are practically never 100% pure, and hence, a proportion of normal cells will contaminate the sample. Let \(\phi = \sum _{s=1}^{S} \eta _{s}\) denote a tumor sample’s purity. In addition, write 𝜗_{s}=η_{s}/ϕ and 𝜗^{T}=(𝜗_{1},…,𝜗_{S}). The vector 𝜗 can be interpreted as the set of subclone proportions in the cancer cell population. To characterize ITH within a tumor sample, we utilize the notion of “entropy” or Shannon Index characterized by the expression
which corrects for the normal contamination (ϕ) because normal cells in the tumor do not contribute to subclonal heterogeneity. This characterization states that more subclones generally lead to a greater degree of ITH and allows for two samples composed of an equal number of subclones to have different degrees of ITH. In addition, the largest possible entropy given S subclones is bounded above by log(S), corresponding to equal proportions of each subclone (𝜗_{s}=1/S).
Example: allocation, multiplicity, and cellular prevalence
Here, we give a concrete example to explain the notation: allocation, multiplicity, and cellular prevalence. Suppose that a tumor sample is composed of three subclones forming a branching tree: B←A→C. The respective subclone proportions are denoted by η_{A}, η_{B}, and η_{C}. Thus, the sample purity is ϕ≡η_{A}+η_{B}+η_{C}, and possible cellular prevalences are (η_{A}+η_{B}+η_{C})/ϕ=1, η_{B}/ϕ, and η_{C}/ϕ. Q_{4}=(q_{31},q_{33},q_{34}) characterizes three allocations to consider: q_{31} for clonal mutations; and q_{33} and q_{34} for subclonal mutations that only occur in subclones B and C, respectively. Suppose that each SPM has one of three copy number states with allelespecific copy numbers being (0,2), (1,1), or (1,3). For SPMs with copy number state (0,2), clonal mutations have multiplicity of 2 if they occur before SCNA and multiplicity of 1 if they occur after SCNA. For SPMs with state (1,1), all mutations (clonal or subclonal) have multiplicity of 1. For SPMs with state (1,3), clonal mutations have multiplicity of 1 or 3 if they occur before the SCNA and multiplicity of 1 if they occur after the SCNA. All combinations of allocation and multiplicity are listed in Table 1.
Modeling SPM read counts
Recall that A_{l} and T_{l} denote the alternative read depth and total read depth of the lth SPM. For a prespecified tree structure and copy number estimates, we model A_{l} given T_{l} by a mixture of binomial distributions across possible allocations and multiplicities. Next, we provide details to specify such mixture distributions.
We assume that copy number states and tumor purity were estimated by another algorithm, e.g., ASCAT. For the lth SPM, denote its copy number state (i.e., allelespecific copy numbers) by C_{l}=(C_{l1},C_{l2}). Suppose that there are altogether W unique copy number states: c_{1},..., c_{W}. Given the wth copy number state, assume that there are D_{w} possible combinations of allocation and multiplicity, and denote the dth combination by e_{wd}=(q_{d},m_{wd}), where q_{d} denotes the allocation that depends on the tree structure but not copy number states, and m_{wd} denote the multiplicity that depends on copy number states. We also allow the estimation of proportion of variants unexplained by combinations of U_{l} and M_{l} following a discrete uniform distribution with proportion parameter denoted ε. The mixture proportions of the D_{w} combinations is denoted by \(\phantom {\dot {i}\!}\boldsymbol {\pi }_{w} = (\pi _{w1},\ldots,\pi _{wD_{w}})^{\mathrm {T}}\). Let Θ=(ε,𝜗,{π_{w}}).
Let U_{l} and M_{l} be the random variables for the latent allocation and multiplicity for the lth SPM, respectively, and let E_{l}=(U_{l},M_{l}). Write G_{l}=(T_{l},C_{l},ϕ,Θ). For a single SPM,
where
and \(p_{wd} = \frac {m_{wd} \phi \boldsymbol {\vartheta }^{\mathrm {T}} \boldsymbol {q}_{d}}{(C_{l1}+C_{l2}) \phi + 2(1\phi)}\). In the notation above, \(\boldsymbol {\vartheta }^{\mathrm {T}} \boldsymbol {q}_{d} = \sum _{s=1}^{S} \vartheta _{s} q_{ds}\) is the cellular prevalence of a SPM among the tumor’s cancer cells.
Given tumor purity and copy number states, in addition to a particular phylogenetic tree, the likelihood for L SPMs is proportional to
Maximization of this likelihood is accomplished by introducing the pair of latent variables (U_{l},M_{l}), writing the completedata likelihood, and using an expectationmaximization algorithm, where each iteration of the Mstep for π_{w} has closed form updating equations, while 𝜗 is updated with the quasiNewton Raphson method BroydenFletcherGoldfarbShanno on the expected completedata loglikelihood conditional on the observed data. In the presence of local optima for this observed mixture likelihood, multiple random initializations of 𝜗 are used, while we initialize π_{w} by uniform distribution and ε=10^{−3}.
Inferring the optimal configuration is accomplished using the optimal BIC. Suppose that after running SMASH on L SPMs with every enumerated phylogenetic configuration and applying multiple runs of parameter initialization, we arrive at B models. For model b=1,…,B, let L_{b}, m_{b}, BIC_{b}, S_{b}, and E_{b} denote the log likelihood, model size, BIC, number of subclones, and estimated entropy, respectively, evaluated at the maximum likelihood estimate \(\boldsymbol {\widehat {\Theta }}_{b} = (\widehat {{\epsilon }}_{b},\boldsymbol {{\widehat {\vartheta }}}_{b},\boldsymbol {{\widehat {\pi }}}_{b})\). Define BIC_{b}=2L_{b}−m_{b} log(L); models with larger BIC are preferable to models with smaller BIC. We define the posterior probability of model b by
because BIC provides a largesample approximation to the log posterior probability associated with the approximating model [25, 26]. Let p^{∗}= maxb=1,…,B(p_{b}).
It is possible for two or more configurations to have the same BIC. Therefore, we explore two possible definitions of entropy. The first one is a simple average of entropies across all “optimal BICdecided” models, referred to as “optimally inferred” entropy. The second one is a weighted average of entropies across all models, referred to as “weighted” entropy. These two entropy estimates are
and
The summation incorporated into E_{o} accounts for the situation when various configurations or subclone proportions equally fit the observed data.
SMASH is available as an R package integrating Rcpp [27] and RcppArmadillo [28]. The software and source code can be downloaded at https://github.com/Sunlab/SMASH.
Results
Brief overview of SMASH, PyClone, and PhyloWGS
We compared the performance of SMASH versus two popular and representative methods: PyClone [11] and PhyloWGS [9]. PyClone clusters somatic mutations based on their VAFs. From PyClone output (see Additional file 1: Table S1 for an example), one can estimate the number of subclones by the number of mutation clusters. However, to estimate subclone proportions from VAF clusters, we need to know the phylogenetic tree structure (see Additional file 1: Section C.1 for more details). Since PyClone does not estimate a phylogenetic tree, we cannot use PyClone to estimate subclone proportions and thus cannot estimate entropy that is a function of subclone proportions. Unlike PyClone, PhyloWGS was designed to estimate the underlying phylogenetic tree.
SMASH is a frequentist method to infer ITH using a likelihoodbased framework. SMASH and PyClone assume each subclone shares the same SCNA profile and that SCNAs and tumor purity have been estimated from an existing algorithm, e.g., ASCAT [5] or ABSOLUTE [29]. Unlike PyClone and PhyloWGS, SMASH explicitly enumerates all possible phylogenetic trees (up to k subclones, with default value of k=5) and quantifies the likelihood of each tree configuration (refer to the Additional file 1: Section C.2). For each tree configuration, the model parameters are estimated by an EM algorithm that accounts for unobserved somatic mutation allocation across subclones and multiplicity (i.e., copy number of the mutated allele). We can select the optimal phylogenic tree configurations based on the Bayesian information criterion (BIC) and then calculate entropy based on the optimal configuration. Alternatively, to account for the uncertainty of ITH estimation, we can take a weighted summation of ITH entropies, where the weights are the probabilities of different configurations.
Simulation
To directly compare PyClone and SMASH, we constructed an indicator of high ITH, as done in Morris et al. [12], denoted by H, such that H = 1 when the number of subclones is greater than κ, a predefined integer threshold, and H = 0 otherwise. For SMASH, the number of subclones is estimated using the tree configuration with the best BIC. Because PhyloWGS provides estimates of subclone proportions, we can compare the performance of SMASH and PhyloWGS using both entropy and H.
Setup
To simulate ITH variables, first enumerate the list of tree configurations from one to five subclones, sample the number of subclones denoted S. Then, sample among trees with S subclones with equal probability.

1
Generate subclone proportions for S subclones, denoted as η=(η_{1},…,η_{S})^{T}. Simulate U=(U_{1},…,U_{S})^{T}, where U_{s} is simulated from a uniform distribution defined on interval (−3,1). Then calculate \(\eta _{s} = \exp ({U_{s}})/[1 + \sum _{s'=1}^{S} \exp ({U_{s'}})]\). Tumor purity is \(\phi = \sum _{s=1}^{S} \eta _{s}\), and the subclone proportion for the sth subclone is 𝜗_{s}=η_{s}/ϕ.

2
Calculate entropy \(E = \sum _{s=1}^{S} \vartheta _{s} \log (\vartheta _{s})\), as well as H=I(S>κ), where I is an indicator function and κ=3.
These steps are repeated until the minimum underlying subclone proportion is greater than 0.05, and the minimum difference between the cellular prevalences of two subclones is greater than 0.05 to ensure clusters are separable.
To simulate sequence read counts for the lth SPM given a phylogenic tree configuration, we simulated read depth T_{l} from a negative binomial distribution, sampled copy number state, and then sampled SPM multiplicity and allocation with equal probability. Finally, we generated the number of alternative reads from a binomial distribution. We randomly simulated 5 covariates Z=(Z_{1},…,Z_{5})^{T} to resemble sex, age, and tumor stage indicators. see Additional file 1: Section A.1 for details.
We simulated the first set of survival times conditional on linear terms Z and E (entropy) and the second set of survival times conditional on linear terms Z and H, both from the Cox proportional hazards model with a constant baseline hazard:
where λ_{0}(t)=λ_{0}= exp(−7.0), β_{H}=β_{E}=0.5, and \(\boldsymbol {\gamma }_{Z}^{\mathrm {T}} = (0.55, 0.15, 0.8, 1.7, 2.7)\). Censoring times were simulated from the continuous uniform distribution \(\tilde {T} \sim U(0,\tau)\), and the value of τ was tuned to generate the desired proportion of censored subjects.
We considered 18 simulation setups, with three censoring rates (20%, 50%, and 70%), three sequencing depths from the negative binomial (parameter values μ = 100, 500, and 1000 and δ = 2), and two samples sizes (N= 400 and 800).
Benchmarking
For each ITH method, we applied an extra filtering criterion that each subclone includes at least two mutations that are not part of its parental subclone. PyClone output contains the cellular prevalence for all SPMs, and the SPMs assigned to the same cluster have the same cellular prevalence. Following Morris et al. [12], we removed clusters with only one SPM. Additional file 1: Table S1 provides an example of prefiltered PyClone output with multiple clusters composed of one SPM. Output of SMASH includes the ITH estimates for each tree configuration (i.e., number of subclones, subclone proportions, and mutations belonging to each subclone) (refer to Additional file 1: Table S2 for a prefiltered example). We removed configurations where at least one subclone has only one SPM. Similarly, in PhyloWGS, sampled trees with at least one subclone with only one SPM were excluded.
We used the simulated data to compare the results from five methods: PyClone, PhyloWGS using the optimal tree configuration, SMASH using the configuration with best BIC or weighted summation of entropy/number of subclones, and the ideal situation where true values of entropy or number of subclones are given. Each of the methods was run in two model setups, with the ITH variable being entropy E or indicator of high number of subclones H. In other words, when the true model contains E, we compared the models using E or H, as shown in Fig. 3. Results for when the true model contains H as well as the results for the standard errors of the parameter estimates and coverage probabilities under both models are presented in Additional file 1: Section A.3.
Regardless of the ITH variable used, the bias of parameter estimates remains similar for sample sizes of 400 or 800, and as expected, power increases with sample size (Fig. 3). Given the sample size, bias decreases and power increases as sequence depth increases or censoring rate decreases. Comparing the two ITH metrics, E or H, the entropy metric has lower bias and higher power. The difference in performance between these two ITH metrics decreases as sequencing depth increases.
As mentioned, PyClone’s result does not allow us to calculate entropy. Therefore, we compared the performance of PyClone, PhyloWGS, and SMASH using the indicator metric H. At an average sequencing depth of 100 ×, SMASH has similar or slightly better performance than PyClone or PhyloWGS, in terms of bias and power. At average depths of 500 × or 1000 ×, SMASH shows much better performance than both PyClone and PhyloWGS (Fig. 3). SMASH demonstrates better performance than PyClone or PhyloWGS when inferring the number of subclones (Fig. 4 and Additional file 1: Figure S3). We calculated the Spearman correlation between the estimated number of subclones and the true number of subclones across 800 samples for each of 250 replicates. The median Spearman correlations from SMASH are consistently higher than those from PyClone and PhyloWGS, except for the comparison with PhyloWGS at read depth 100, in which case PhyloWGS performs slightly better. As read depth increases, the advantage of SMASH against other methods becomes more apparent, which is consistent with their relative performance in association studies (Fig. 3). Comparing PhyloWGS and PyClone, PhyloWGS performs better in terms of capturing the relative order of subclone number, reflected by the Spearman correlation comparison (Fig. 4), but PyClone performs better in terms of estimating the number of subclones (Additional file 1: Figure S3).
When we simulated data using entropy as the ITH metric, as expected, models fit using entropy had higher power and lower bias (Fig. 3). However, even when we simulated the data using H, the results using entropy were still better when read depth is low. When read depth is high (e.g., 500 × or 1000 ×), using the estimate of H as the ITH variable gives better results, although the difference between using entropy and H is often not large (Additional file 1: Figure S2).
Another important comparison is whether weighted entropy, which incorporates uncertainty across all fitted configurations, has better performance than entropy from optimal configurations. Weighted entropy does provide more accurate estimation of true entropy than the optimal entropy (Fig. 4). However, in terms of association estimation, the two approaches have similar performance (Fig. 3). Optimal entropy tends to underestimate the association, while weighted entropy tends to overestimate the association, although the biases are small. In terms of power, both entropies appear to perform equally well. Both weighted and optimal entropies from SMASH are more accurate estimates of the true entropy than the estimate from PhyloWGS’s optimal tree.
In our simulation studies, the vast majority of computational time was spent on ITH inference. On average with 100 mutations, SMASH ran in less than 5 min for ITH inference. In contrast, PyClone and PhyloWGS had runtimes ranging from just under 10 min to over 90 min. Additional file 1: Figure S4 presents a summary of computational runtime. Among the three methods with default settings, the order of computational time is SMASH < PyClone < PhyloWGS.
Subclonal SCNA simulation
The previous simulation setup assumed SCNAs are clonal. In Additional file 1: Section A.6, we describe the simulation details to allow for subclonal SCNAs. In this analysis, we treated SCNAs as clonal and calculated the copy number by rounding the weighted average of copy numbers across subclones to the nearest integer. As described in Additional file 1: Section A.5, we simulated copy number scenarios 1 and 2 to mimic two patterns of SCNA abundance in real data.
When the true model contains E, we compared the results of 6 methods, dichotomized indicator H estimated from Pyclone, PhyloWGS, and SMASH, and entropy estimated from PhyloWGS, SMASH with optimal configuration or weighted average (Additional file 1: Figure S8). All three methods using entropy E have similar performances and perform much better than the three methods using the dichotomized indicator H. Coverage probability was maintained at 95% for E estimates but not for H estimates. There were no clear differences in performance between both copy number scenarios. When the true model contains H, magnitudes of association bias using E estimates are generally less than those of H estimates (Additional file 1: Figure S9). Therefore, the overall results were consistent with the earlier simulation setup without subclonal SCNAs: using entropy is preferred even if the true model is based on H, and entropy from SMASH and PhyloWGS have similar performance at 100 × read depth.
Application
Preprocessing pipeline
We downloaded SPM calls by MuTect2 from NCI’s Genomic Database Commons (GDC) [21, 30]. To derive SCNA data, we processed controlledaccess SNP Array 6.0 CEL files corresponding to primary tumors, along with their paired bloodderived normal or solid tissue normal. Specifically, we applied a pipeline involving Birdseed, PennCNV [31], and ASCAT v2.4 [5] to obtain estimates of tumor purity, ploidy, and inferred copy number states. The complete data workflow is shown in Additional file 1: Figure S10. We downloaded SPM and SCNA data on 5898 tumor samples from 14 TCGA cancer types (Additional file 1: Table S3).
Before running PyClone, PhyloWGS, and SMASH, we applied a set of filters to the SPM data by retaining the base substitution SPMs that are located along autosomes and have at least seven reads supporting the alternative allele. Also, those SPMS with inferred total copy number of zero were excluded. Then, we passed the formatted SPM and SCNA data to PyClone, PhyloWGS, and SMASH for ITH inference. After running all three ITH methods, we applied the “at least two mutations per subclone/cluster” criterion that was used in the simulation.
Somatic mutation landscape varies across cancer types
We first summarized tumor purity, ploidy, and somatic mutation rate for each tumor type (Fig. 5). The relative ordering of tumor types by mutation rate is consistent with the results reported in an earlier study [32]. Those cancer types with lower mutation rate (e.g., PRAD, LGG, BRCA, KIRC, GBM, and OV) tend to have more subclonal mutations (top panel of Fig. 5). In all cancer types except OV, more than 50% of somatic mutations are clonal (with cellular prevalence larger than 99%) (Additional file 1: Figure S19). Ovarian cancer appeared to be an outlier with the larger number of subclones. This may be partly due to batch effects. The ovarian cancer samples used whole genome amplification (WGA) before DNA sequencing that may have reduced the quality of DNA samples [33, 34]. On the other hand, some previous work did show a high level of ITH in ovarian cancers [35–37]. Blagden [36] mentioned that the phylogenetic tree of ovarian cancer “has a short trunk and many branches, representing early clonal expansion and high genomic instability.” This was consistent with our finding that ovarian cancer has higher levels of ITH. The ploidy values of most cancer types tended to cluster around 2 and 4 (genomewide duplication). This clustering pattern was less clear for BRCA, suggesting a greater degree of SCNA in BRCA.
We examined the cellular prevalence of 49 genes that are among the top 10 mutated genes for at least one of the 14 cancer types (Additional file 1: Figure S21). Similar to our approach to calculate weighted entropy (refer to the “Methods” section), each mutation’s cellular prevalence was calculated as the weighted average across the sample’s ITH configurations. A gene’s cellular prevalence was calculated as the average cellular prevalence of all mutations on that gene across all samples. TP53 mutations have average cellular prevalences near 1.0 for all cancer types except KIRC, which was the same observation made by Morris et al. [12]. IDH1 mutations were subclonal in GBM and clonal in LGG and SKCM. VHL was uniquely called in KIRC, with a cellular prevalence of 1.0. Except for TP53, the remaining 48 genes have relatively low cellular frequency in OV. This was consistent with the results of an earlier study of 31 ovarian tumor samples from six patients, and they found TP53 was the only gene mutated in all samples, and other known tumor driver genes may be mutated in some but not all samples of a patient [38]. Hierarchical clustering was performed on the 49 genes and 14 cancer types. At least two clusters of cancer types and at least two clusters of genes were apparent. LGG, KIRC, and PRAD form one cluster of cancer types without many mutations on these 49 genes.
The number of subclones by tumor type and ITH method are summarized in Additional file 1: Figure S14. Across all cancers, SMASH consistently identified more subclones than PyClone. Between SMASH and PhyloWGS, the resulting number of subclones was very similar for all tumor types except for OV. PyClone was run on two independent Markov chains on each tumor sample using its default setup with 20,000 MCMC samples drawn, 1000 burnin and retaining every tenth sample with all default prior hyperparameters. PhyloWGS also was run twice but with default arguments. There were slight inconsistencies from the results of the two runs (Additional file 1: Tables S4 and S5). In the next section on association analysis, we used the first run of results from PyClone and PhyloWGS.
Baseline covariates and variable selection
The common set of baseline covariates included age at diagnosis, gender, pathological tumor stage, tumor mutation burden (total number of point mutations, TMB), and genomewide SCNA burden. Specifically, we define genomewide SCNA burden as
where k indexes genome segments, L_{k} is the length of the kth segment, and \((C^{A}_{k},C^{B}_{k})\) are the segmental clonal copy numbers of the minor and major alleles, respectively. The SCNA burden can be interpreted as the distance between the normal and cancer genomes, in terms of copy number. Both TMB and SCNA burden were binned into three equal groups using the 33rd and 66th quantiles as cutoffs.
We investigated possible nonlinear forms of entropy (e.g., dichotomized entropy, polynomial transformation, or log transformation) and the validity of the proportional hazard assumption using R functions fcov() and prop() from R package goftte [39, 40]. Our analysis suggested that the simple linear form of entropy is appropriate. Since our simulation studies showed that the weighted entropy provides better estimates of the true entropy than the optimal entropy (Fig. 4), we chose to conduct the following analysis using weighted entropy.
In addition to baseline covariates, additional covariates to include in each tumor type’s full model were carefully selected. The top four frequently mutated genes were included. Other tumor typespecific covariates were histological subtype for BLCA (papillary vs. nonpapillary), PAM50 subtype for BRCA [41] (Basal, Her2, LumA, or LumB, and the normallike subtype was removed due to its small sample size), tumor grade for KIRC, IDH/CNA status for LGG (IDH wildtype, IDH mutant without chr1p and 19q codeletion, IDH mutant with chr1p and 19q codeletion), and Gleason score and PSA level for PRAD.
We also considered the pairwise interactions of all baseline covariates with weighted entropy. The final model for each tumor type was selected based on stepwise model fitting and assessed with Akaike information criterion (AIC). When the final model contained pairwise interactions involving entropy, then the interactions were retained if their minimum p value was less than or equal to 0.02. Otherwise, the interaction was removed, and our variable selection was rerun without the interaction term. When the final model excluded entropy, it was added back in the final step.
TMB and ITH are associated with survival time in multiple cancer types
In the PRAD cohort, because very few deaths were observed, we only analyzed progressionfree survival (PFS). For all other cancer types, we studied both overall survival (OS) and PFS. We used a p value cutoff of 0.05 to define statistical significance.
For OS, entropy or its interaction with other variables were statistically significant in the final model for 6 of 14 cancer types: BRCA, COAD, HNSC, KIRC, LIHC, LUSC (Fig. 6). Total mutation burden (TMB) was statistically significant for 7 cancer types: BLCA, COAD, GBM, LGG, LUAD, OV, and STAD (Additional file 1: Figure S17). SCNA burden (SCNAB) was statistically significant for LGG and SKCM (Additional file 1: Table S6–S19). Significant associations between genelevel mutation status and OS include TP53 for BLCA, GBM, HNSC, LIHC, LUSC and STAD, TTN for GBM and LUSC, and MUC16 for SKCM (Additional file 1: Table S6–S19).
In addition to these somatic mutationbased predictors, age at diagnosis was statistically significant for all tumor types except LIHC and LUAD. Sex was statistically significant for GBM, HNSC, and LIHC. All GBM tumors are stage IV. Among all other cancer types, tumor stage was associated with overall survival except for LGG and OV. Other tumor typespecific covariates associated with OS include PAM50 for BRCA, tumor grade for KIRC, and IDH/CNV status for LGG (Additional file 1: Table S6–S19).
The model fits for PFS were similar to the ones for OS for most cancer types. For GBM, KIRC, LUSC, OV, SKCM, and STAD, the final model for PFS was the same as the final model for OS survival. Covariates present in one model but not in the other model were highlighted in Additional file 1: Table S6–S19.
We also reported the results when replacing SMASH’s weighted entropy (E(S)) with PhyloWGS’s entropy (E(W)), the dichotomized number of subclones from SMASH (H(S)), PyClone (H(P)), and PhyloWGS (H(W)) (Fig. 6 and Additional file 1: Table S6–S19). H(S), H(P), and H(W) were constructed as indicators of 3 or more subclones. This cutoff was chosen so that there were enough samples with noncensored survival time in the high ITH group. Overall, the associations we detected by H(S), H(P), or H(W) were consistent with the results by E(S) and E(W), and the p values by E(S) tended to be smaller. An exception was in STAD, where H(S) identified significant associations for both OS and PFS that were missed by H(P), H(W), E(W), and E(S).
Our results bring new insights that have not been reported by previous studies [1, 12]. Andor et al. [1] studied 1165 samples of 12 cancer types. They found significant association between ITH (the number of subclones) and survival time in only one cancer type: gliomas (combining two types of cancer from LGG and GBM). Morris et al. [12] studied 3300 tumor samples in 9 cancer types. They used dichotomized number of subclones as ITH measurement (# of subclone >4 for most cancer types), which is very unstable because few samples had more than 4 subclones. They found significant associations between ITH and survival time in 5 out of 9 cancer types: BRCA, HNSC, KIRC, LGG, and PRAD. They also added mutation burden into the Cox model for these five cancer types and found mutation burden was not significant in all five cancer types. We have 5898 TCGA tumor samples from 14 cancer types. We considered both dichotomized number of subclones and entropy as measurements of ITH. While Morris et al. did not find mutation burden to be informative for prognosis, we found it is significantly associated with survival time (or marginally significant) in 7 of the 14 cancer types. What is truly new in our findings is that we consider both ITH measurement and its interaction with other covariates, such as mutation burden, tumor stage, and mutation status of a particular gene. We found a considerable amount of heterogeneity for the results across cancer types.
Discussion
Quantification of ITH
We considered two ITH metrics: entropy and indicator for high number of subclones. When we simulated survival time given entropy, as expected, using entropy instead of the indicator as the ITH metric led to better performance in association analysis (Fig. 3). Interestingly, when we simulated survival time given the indicator, the model with entropy has either higher power (when read depth is 100) or comparable power (when read depth is 500 or 1000) (Additional file 1: Figure S2). In real data analysis, using entropy as the ITH metric also led to more discoveries. Therefore, we recommend using entropy as an ITH metric in association studies. One reason for entropy delivering better results is that, as a continuous variable, entropy is more robust to noise in ITH inference. Specifically, the addition or deletion of a subclone with small cellular proportion may change entropy slightly but may change the indicator variable from 0 to 1. In addition, some information about the degree of ITH is lost when dichotomizing the number of subclones. Of course, an intermediate choice is to use the number of subclones. As shown in Additional file 1: Figure S13, entropy was highly associated with the number of subclones and provides a more refined quantification for samples with the same number of subclones.
Another question that we sought to answer was whether it was beneficial to incorporate the uncertainty of ITH inference in association analysis. Towards this end, we studied two versions of entropy from SMASH, the optimal entropy derived from the mean entropy of the tree configurations with optimal BIC versus the weighted entropy across all estimated tree configurations. The weighted entropy has slightly higher correlation with the true entropy than with the optimal entropy, although these two quantities have similar power to detect associations.
Study design for future ITH studies
Our simulation results suggested that when using entropy as the ITH metric, more power was gained by increasing the sample size from 400 to 800 than by increasing the read depth from 100 to 500 or even 1000 (Fig. 3, Additional file 1: Figure S2). In contrast, when using the indicator H as an ITH metric, increasing read depth can also bring some relatively large power gains (Fig. 3, Additional file 1: Figure S2). One issue that warrants future study is the benefit of having multiple tumor samples per patient.
ITH measurement may be affected by somatic mutation calling accuracy. A previous study [42] showed that the sensitivity of somatic mutation calling is around 0.8–0.9, and the number of false positive mutation calls is around 30 mutations for the whole exome using mutation callers such as Strelka or Mutect. We can further reduce the number of false positives by taking the intersection of mutation calls from multiple callers, with the trade off to reduce sensitivity of mutation calls. Our method is robust to low sensitivity of mutation calls because we use cellular frequency of subclones to estimate entropy, and if, for example, 6 of 10 mutations of a subclone are called, we can still use these 6 mutations to estimate subclone cellular frequency. Therefore, if one suspects a high proportion of false positive mutation calls, one strategy is to restrict the analysis to the mutations called by more than one caller.
Association between survival time and ITH or TMB
In most cancer types, when TMB is included in the final model, it is negatively associated with hazard, and thus higher mutation burden leads to longer survival time (Additional file 1: Figure S18). This may be explained by the observation that tumors with higher TMB are more likely recognized and attacked by the immune system [43]. However, higher TMB is associated with worse survival time in LGG.
TMB is positively associated with entropy measurement of ITH, although the correlation is not strong enough to create any concerns with colinearity when using both variables in a model (Additional file 1: Figure S16). We also observed interactions between TMB and ITH for both OS and PFS in COAD and LUSC. In both cases, association between survival time and entropy is not significant when TMB is low. However, higher entropy is associated with worse survival time when TMB is high. In LUSC, we also observed interaction between entropy and TP53 mutation. When TP53 is mutated, higher entropy is associated with longer survival time for both OS and PFS (Additional file 1: Figure S20). These results suggest that the effect of ITH on survival time may depend on other factors.
Limitations
Our analyses have some limitations. One limitation is the assumption of clonal SCNA. Employing this assumption allows us to use copy number calls from mature and widely used methods such as ASCAT or ABSOLUTE and to maintain high computational efficiency. However, this assumption also risks classifying SPMs in subclonal SCNA regions as SPMs from a new subclone. This risk may not bias the entropy estimate because a new subclone with subclonal SCNA is captured by SPMs. As shown in two simulation settings with subclonal copy number, SMASH has similar performance as PhyloWGS’s when there are high levels of subclonal SCNAs.
Another limitation, shared by all methods for inferring ITH from SPMs, is that we cannot distinguish two subclones whose somatic mutations have very similar cellular prevalence. For example, in Fig. 1, the mutations from subclones A and B have very similar cellular prevalence and hence cannot be distinguished. However, this is a limitation of the input data rather than the methodology. This limitation can be overcome if multiple samples per patient are available.
The infinite site assumption may be considered too strong an assumption. One study demonstrated possible evidence of recurrent mutations in their singlecell sequencing data [44]. Conceptually, if mutations were recurrent, somatic mutations from bulk sequencing could not be utilized for modeling multiplicity and somatic inheritance among subclones. Therefore, ITH inference and association analyses could only be conducted with singlecell sequencing to better infer cellular multiplicities. Though, if only a handful of mutations were recurrent and at a small fraction of cells, their inferred cellular prevalence may slightly decrease relative to an identical nonrecurrent mutation, leading to a biased subclone proportion estimate. This entropy estimate could be treated as being an extra “noisy” estimate. But as long as this biased estimate correlates with the underlying entropy, there may still be power to detect the association between entropy and clinical outcomes.
Conclusions
We have conducted a pancancer analysis to study the associations between somatic mutations and survival time in 14 cancer types. Several types of somatic mutation features are included in our analysis, including mutation burden, copy number alteration burden, mutation status of a few frequently mutated genes, and intratumor heterogeneity (ITH) inferred by our method SMASH. We conclude that using entropy instead of high ITH indicator as the ITH metric leads to higher power in association analysis. The effect of ITH may depend on other somatic mutation features such as mutation burden. Accounting for the uncertainty of ITH inference has some but limited benefit. To improve the power for association analysis, it is much more effective to increase the sample size than generating more reads per sample.
Abbreviations
 ASCAT:

Allelespecific copy number analysis of tumors
 ITH:

Intratumor heterogeneity
 MATH:

Mutantallele tumor heterogeneity
 SCNA:

Somatic copy number alterations
 SMASH:

Subclone multiplicity allocation and somatic heterogeneity
 SPMs:

Somatic point mutations
 TCGA:

The Cancer Genome Atlas
 TMB:

Tumor mutation burden
 VAF:

Variant allele frequencies
References
 1
Andor N, Graham TA, Jansen M, Xia LC, Aktipis CA, Petritsch C, Ji HP, Maley CC. Pancancer analysis of the extent and consequences of intratumor heterogeneity. Nat Med. 2016; 22(1):105–13.
 2
McGranahan N, Furness AJ, Rosenthal R, Ramskov S, Lyngaa R, Saini SK, JamalHanjani M, Wilson GA, Birkbak NJ, Hiley CT, et a.l. Clonal neoantigens elicit t cell immunoreactivity and sensitivity to immune checkpoint blockade. Science. 2016; 351(6280):1463–9.
 3
Goodman AM, Kato S, Bazhenova L, Patel SP, Frampton GM, Miller V, Stephens PJ, Daniels GA, Kurzrock R. Tumor mutational burden as an independent predictor of response to immunotherapy in diverse cancers. Mol Cancer Ther. 2017; 16(11):2598–2608.
 4
Campbell BB, Light N, Fabrizio D, Zatzman M, Fuligni F, de Borja R, Davidson S, Edwards M, Elvin JA, Hodel KP, et al.Comprehensive analysis of hypermutation in human cancer. Cell. 2017; 171(5):1042–56.
 5
Van Loo P, Nordgard SH, Lingjærde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B, et al.Allelespecific copy number analysis of tumors. Proc Natl Acad Sci. 2010; 107(39):16910–5.
 6
ElKebir M, Satas G, Oesper L, Raphael BJ. Inferring the mutational history of a tumor using multistate perfect phylogeny mixtures. Cell Sys. 2016; 3(1):43–53.
 7
Jiao W, Vembu S, Deshwar AG, Stein L, Morris Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics. 2014; 15(1):35.
 8
Zare H, Wang J, Hu A, Weber K, Smith J, Nickerson D, Song C, Witten D, Blau CA, Noble WS. Inferring clonal composition from multiple sections of a breast cancer. PLoS Comput Biol. 2014; 10(7):1003703.
 9
Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. Phylowgs: reconstructing subclonal composition and evolution from wholegenome sequencing of tumors. Genome Biol. 2015; 16(1):35.
 10
Jiang Y, Qiu Y, Minn AJ, Zhang NR. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by nextgeneration sequencing. Proc Natl Acad Sci. 2016; 113(37):5528–37.
 11
Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, Ha G, Aparicio S, BouchardCôté A, Shah SP. Pyclone: statistical inference of clonal population structure in cancer. Nat Methods. 2014; 11(4):396–8.
 12
Morris LG, Riaz N, Desrichard A, Şenbabaoğlu Y, Hakimi AA, Makarov V, ReisFilho JS, Chan TA. Pancancer analysis of intratumor heterogeneity as a prognostic determinant of survival. Oncotarget. 2016; 7(9):10051.
 13
Mroz EA, Rocco JW. Math, a novel measure of intratumor genetic heterogeneity, is high in pooroutcome classes of head and neck squamous cell carcinoma. Oral Oncol. 2013; 49(3):211–5.
 14
Miller CA, White BS, Dees ND, Griffith M, Welch JS, Griffith OL, Vij R, Tomasson MH, Graubert TA, Walter MJ, et al.Sciclone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput Biol. 2014; 10(8):1003665.
 15
Popic V, Salari R, Hajirasouliha I, KashefHaghighi D, West RB, Batzoglou S. Fast and scalable inference of multisample cancer lineages. Genome Biol. 2015; 16(1):91.
 16
Hajirasouliha I, Mahmoody A, Raphael BJ. A combinatorial approach for analyzing intratumor heterogeneity from highthroughput sequencing data. Bioinformatics. 2014; 30(12):78–86.
 17
ElKebir M, Oesper L, AchesonField H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multisample sequencing data. Bioinformatics. 2015; 31(12):62–70.
 18
Andor N, Harness JV, Mueller S, Mewes HW, Petritsch C. Expands: expanding ploidy and allele frequency on nested subpopulations. Bioinformatics. 2013; 30(1):50–60.
 19
Yuan K, Sakoparnig T, Markowetz F, Beerenwinkel N. Bitphylogeny: a probabilistic framework for reconstructing intratumor phylogenies. Genome Biol. 2015; 16(1):36.
 20
Park SY, Gönen M, Kim HJ, Michor F, Polyak K. Cellular and genetic diversity in the progression of in situ human breast carcinomas to an invasive phenotype. J Clin investig. 2010; 120(2):636–44.
 21
GDC Team. TCGA pancancer data. NCI Genomic Data Commons (GDC) Data Portal. https://portal.gdc.cancer.gov/. Accessed May 2018.
 22
Nowell PC. The clonal evolution of tumor cell populations. Science. 1976; 194(4260):23–8.
 23
Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969; 61(4):893.
 24
Hudson RR. Properties of a neutral allele model with intragenic recombination. Theor Popul Biol. 1983; 23(2):183–201.
 25
Schwarz G, et al.Estimating the dimension of a model. Ann Stat. 1978; 6(2):461–4.
 26
Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Stat Sci. 1999; 14(4):382–417.
 27
Eddelbuettel D, François R, Allaire J, Ushey K, Kou Q, Russel N, Chambers J, Bates D. Rcpp: Seamless R and C++ integration. J Stat Softw. 2011; 40(8):1–18.
 28
Eddelbuettel D, Sanderson C. RcppArmadillo: Accelerating R with highperformance C++ linear algebra. Comput Stat Data Anal. 2014; 71:1054–63.
 29
Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al.Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012; 30(5):413–21.
 30
Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, Staudt LM. Toward a shared vision for cancer genomic data. N Engl J Med. 2016; 375(12):1109–12.
 31
Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M. Penncnv: an integrated hidden Markov model designed for highresolution copy number variation detection in wholegenome snp genotyping data. Genome Res. 2007; 17(11):1665–74.
 32
Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al.Mutational heterogeneity in cancer and the search for new cancerassociated genes. Nature. 2013; 499(7457):214–8.
 33
Buckley AR, Standish KA, Bhutani K, Ideker T, Lasken RS, Carter H, Harismendy O, Schork NJ. Pancancer analysis reveals technical artifacts in TCGA germline variant calls. BMC Genomics. 2017; 18(1):458.
 34
Ellrott K, Bailey MH, Saksena G, Covington KR, Kandoth C, Stewart C, Hess J, Ma S, Chiotti KE, McLellan M, et al.Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 2018; 6(3):271–81.
 35
McGranahan N, Swanton C. Clonal heterogeneity and tumor evolution: past, present, and the future. Cell. 2017; 168(4):613–28.
 36
Blagden SP. Harnessing pandemonium: the clinical implications of tumor heterogeneity in ovarian cancer. Front Oncol. 2015; 5:149.
 37
Schwarz RF, Ng CK, Cooke SL, Newman S, Temple J, Piskorz AM, Gale D, Sayal K, Murtaza M, Baldwin PJ, et al.Spatial and temporal heterogeneity in highgrade serous ovarian cancer: a phylogenetic analysis. PLoS Med. 2015; 12(2):1001789.
 38
Bashashati A, Ha G, Tone A, Ding J, Prentice LM, Roth A, Rosner J, Shumansky K, Kalloger S, Senz J, et al.Distinct evolutionary trajectories of primary highgrade serous ovarian cancers revealed through spatial mutational profiling. J Pathol. 2013; 231(1):21–34.
 39
Sfumato P, Boher JM. Goftte: GoodnessofFit for TimetoEvent Data. 2017. https://CRAN.Rproject.org/package=goftte R package version 1.0.5. Accessed December 2017.
 40
Lin DY, Wei LJ, Ying Z. Checking the cox model with cumulative sums of martingalebased residuals. Biometrika. 1993; 80(3):557–72.
 41
Ciriello G, Gatza ML, Beck AH, Wilkerson MD, Rhie SK, Pastore A, Zhang H, McLellan M, Yau C, Kandoth C, et al.Comprehensive molecular portraits of invasive lobular breast cancer. Cell. 2015; 163(2):506–19.
 42
Xu H, DiCarlo J, Satya RV, Peng Q, Wang Y. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics. 2014; 15(1):244.
 43
Schumacher TN, Schreiber RD. Neoantigens in cancer immunotherapy. Science. 2015; 348(6230):69–74.
 44
Kuipers J, Jahn K, Raphael BJ, Beerenwinkel N. Singlecell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 2017; 27(11):1885–94.
Acknowledgements
We appreciate the constructive comments and suggestions from three anonymous reviewers.
Funding
This work is supported in part by NIH grants P01 CA142538, R01 GM105785, R21CA224026, R01 GM126550, R01 GM07335, and R01HG009974.
Availability of data and materials
The datasets analyzed during the current study are available in the NCI GDC repository, https://portal.gdc.cancer.gov/ [21].
Author information
Affiliations
Contributions
WS and DYL conceived the study. PLL performed the data analysis. WS, DYL and PLL wrote the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Correspondence to DanYu Lin or Wei Sun.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Supplementary results and methods, including Additional file 1: TableS1S19 and Additional file 1: FigS1S21. (PDF 3184 KB)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Copy number alteration
 Intratumor heterogeneity
 Somatic mutations
 Subclone
 Tumor mutation burden