Sensitivity to sequencing depth in single-cell cancer genomics

Background Querying cancer genomes at single-cell resolution is expected to provide a powerful framework to understand in detail the dynamics of cancer evolution. However, given the high costs currently associated with single-cell sequencing, together with the inevitable technical noise arising from single-cell genome amplification, cost-effective strategies that maximize the quality of single-cell data are critically needed. Taking advantage of previously published single-cell whole-genome and whole-exome cancer datasets, we studied the impact of sequencing depth and sampling effort towards single-cell variant detection. Methods Five single-cell whole-genome and whole-exome cancer datasets were independently downscaled to 25, 10, 5, and 1× sequencing depth. For each depth level, ten technical replicates were generated, resulting in a total of 6280 single-cell BAM files. The sensitivity of variant detection, including structural and driver mutations, genotyping, clonal inference, and phylogenetic reconstruction to sequencing depth was evaluated using recent tools specifically designed for single-cell data. Results Altogether, our results suggest that for relatively large sample sizes (25 or more cells) sequencing single tumor cells at depths > 5× does not drastically improve somatic variant discovery, characterization of clonal genotypes, or estimation of single-cell phylogenies. Conclusions We suggest that sequencing multiple individual tumor cells at a modest depth represents an effective alternative to explore the mutational landscape and clonal evolutionary patterns of cancer genomes. Electronic supplementary material The online version of this article (10.1186/s13073-018-0537-2) contains supplementary material, which is available to authorized users.


Introductory note
In the following supporting document, we will go through all steps required to generate the results presented in the main manuscript. For reproducibility purposes, this guideline contains all information needed (e.g., software tools, command line arguments, etc.) to generate the obtained results. Please note that all code chunks that begin with '$' are to be run in the shell while the ones without it should be run in R. Importantly, while we restrict this step-by-step tutorial to the data from the Ni et al. (2013) study, as this corresponds to the smallest dataset available, the very same pipeline was applied to the data from the remaining studies.

Data collection and overview
The following publicly available NGS datasets from four distinct single-cell sequencing (SC-Seq) studies were retrieved from the Sequence Read Archive (SRA):

Downloading through the SRA
Load the R packages needed and use SRAdb to retrieve the deposited data. Please note that we will be using $HOME as our default storage and working directory throughout this entire guideline.  Picard can be used to downsample the original data to different depths. For statistical validation, 10 replicates per single-cell (e.g., 10 P01C01E cells downsampled to 25x, 10x, 5x and 1x) will be generated using the following arguments: To visualize the effects of downsampling on sequencing coverage, bedtools can be used to estimate the breadth and depth of coverage of the downsampled SC-Seq data ( Figure 1 of Main Text).

Sensitivity of single-cell callsets towards (bulk-level) variants detection
VCFtools can be used to estimate the amount of gold-standard SNVs present in single-cell calls: # Example for the 1x set:

Overlap with COSMIC database
In order to quantify the proportion of COSMIC variants preserved across sequencing depths, COSMIC coding mutations need to be retrieved from the COSMIC FTP website in VCF format.

Copy-number variant conservation across sequencing depths in SC-Seq data
Aside from SNVs, copy-number variants (i.e., CNVs) are also a prevalent source of genetic variation in cancer cells. On this basis, the GINGKO CNV-calling algorithm can be used to explore the extent to which sequencing depth affects copy-number detection from SC-Seq data. Here, a stand-alone version of GINKGO was applied to all BAMs following the software recommendations: While GINKGO outputs multiple results, we will focus on the SegBreaks file. For each genomic segment, GINKGO reports the presence/absence profiles of CN events for each single-cell. Using this information, we can compare the results for the reference callset against the downsampling experiments to estimate the proportion of CNVs preserved for increasing degrees of downsampling ( Figure 5.A of Main Text).

Clonal genotypes and cluster assignment
Througout this section, the somatic SNVs that have been identified above (section 3) will be used to infer clonal populations from SC-Seq data. For that, the previously published Single-Cell Genotyper tool (SCG ) will be applied to infer clonal populations (i.e., clusters). Since we're analysing "real" cancer data (and for real data we rarely -if ever -know the truth) it should be highlighted that we are not measuring accuracy towards clonal structure prediction. Rather, we are interested in exploring whether the inferences differ with respect to sequencing depth (i.e., consistency). # Here s an example for the 1x set: # You ll need to convert the VCF into a binary/ternary matrix ( The "adjusted Rand-Index", which measures the similarity between two data clusterings, can then be used to compare the clustering consistency across datasets. This can be done in R and illustrated using simple barplots. As highlighted in the main text, a single clonal population was inferred for the Ni et al. dataset at all sequencing depths. Consequently, the adjusted Rand-Index estimated was 1 for all comparisons (Figure 6 of Main Text -only displays the variable datasets).

Supplementary Tables
(A) Barplots illustrating the proportion of germline and somatic SNVs detected in bulk and SC-Seq datasets (Bulk +SC) versus variants exclusively found in single-cell datasets (sSC). (B) Barplots displaying the variant quality scores for different variant "types": gBulk+SC refers to germline variants observed in both bulk and single-cell datasets; sBulk+SC refers to somatic variants observed in both bulk and single-cell datasets; sSC corresponds to variants called solely in single-cell datasets. Barplots showing the proportion of clonal variants identified in the original datasets that were recalled as subclonal in the down-sampled datasets. Numbers above bars indicate the absolute number of clonal variants ascertained in original dataset that were also present in the down-sampled datasets. Error bars indicate 95% confidence intervals. Barplots displaying the Homoplasy Index (HI) across the different sequencing depths. Error bars indicate 95% confidence intervals.