Reducing Indel Errors in Whole-genome and Exome Sequencing

Background


Background
With the increasing use of next-generation sequencing (NGS), there is growing interest from physicians, patients and consumers to better understand the underlying genetic contributions to various conditions. For rare diseases and cancer studies, there has been increasing success with exome/genome sequencing in identifying mutations that have a large effect size for particular phenotypes [1,2]. Some groups have been trying to implement genomic approaches to interpret disease status and inform preventive medicine [3][4][5][6]. However, we are still facing practical challenges for both analytic validity and clinical utility of genomic medicine [7][8][9]. In addition, the genetic architecture behind most human disease remains unresolved [10][11][12][13][14][15]. Some have argued that we should bring higher standards to human-genetics research in order to return results and/or reduce false-positive reports of "causality" without rigorous standards [16,17]. Others have reported that analytic validity for WES and WGS is still a major issue, because the accuracy and reliability of sequencing and bioinformatics analysis is too low for a clinical setting [8,9,[18][19][20].
There is also debate whether we should use whole genome sequencing (WGS) or whole exome sequencing (WES) for personal genomes. Some have argued that a first-tier costeffective WES is a powerful way to dissect the genetic basis of diseases and to facilitate the accurate diagnosis of individuals with Mendelian disorders [21,22]. Others argue that WGS could reveal structural variants (SVs), maintain a more uniform coverage, and is free of exome capture efficiency issues [23,24]. One group directly compared WGS with WES, but investigation of INDEL errors was not the focus of this comparison [19,20].
Substantial genetic variation involving INDELs in the human genome has been previously reported but accurate INDEL calling is still difficult [25][26][27]. There has been a dramatic decrease of sequencing cost over the past few years, and this cost might decrease further with the release of the Illumina HiSeq X Ten sequencers which have capacity for nearly twenty thousand whole human genomes per instrument per year.
However, it is still unclear whether we can achieve a high-accuracy personal genome certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint with a mean coverage of 30X from the Illumina HiSeq X Ten sequencers. In addition, there has been controversy on the use of PCR amplification in the library preparations for NGS, although very few have characterized the PCR errors that might be complicating the detection of insertions and deletions (INDELs).
Previously our group reported low concordance rates of multiple variant calling pipelines, and of all INDELs detected by GATK Unified Genotyper (v1.5), SOAPindel (v1.0) and SAMtools (v0. 1.18), only 26.8% were in the intersection [8]. Some groups also reported low concordance rates for INDELs between different sequencing platforms, further showing the difficulties of accurate INDEL calling [17]. Other efforts have been made to understand the sources of variant calling errors [28]. Common INDEL issues, such as realignment errors, errors near perfect repeat regions, and an incomplete reference genome have caused problems for alignment-based callers [29]. De novo assembly using de Brujin graphs has been reported to tackle some of these limitations [30]. Fortunately, with the optimizations of micro-assembly, these errors have been reduced with a novel algorithm, Scalpel, which outperformed any other INDEL callers available. Based on validation data, positive prediction rate (PPV) of algorithm specific INDELs was high for Scalpel (~80%), but much lower for GATK HaplotypeCaller (v3.0) (~45%) and SOAPindel (v2.01) (~50%) [31].
Thus, we set out to investigate the complexities of INDEL detection on Illumina reads using this most accurate INDEL-calling algorithm. Firstly, we used simulation data to understand the limits of how coverage affects INDEL calling with perfect Illumina-like reads. Secondly, we analyzed a dataset including high coverage WGS and WES data from two quad families (mother, father and two children), in addition to extensive highdepth validation data on an in-house sample, K8101-49685s. In order to further understand the effects of PCR amplification on INDEL calling, we also downloaded and analyzed two WGS datasets prepared with and without PCR from the well-known HapMap sample NA12878. We characterized the data in terms of read depth, coverage uniformity, base-pair composition pattern, GC contents and other sequencing features, in certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint order to partition and quantify the INDEL errors. We were able to simultaneously identify both the false-positives and false-negatives of INDEL calling, which should be useful for population-scale experiments. We notice that homopolymer A/T INDELs are a major source of low quality INDELs and multiple signatures. As more and more groups start to use these new micro-assembly based algorithms, practical considerations for experimental design should be introduced to the community. Lastly, we explicitly address the question concerning the necessary depth of coverage for accurate INDEL calling for WGS. This provides important insights and guidelines to achieve a highly accurate INDEL call set and to improve the sequencing quality of personal genomes. certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. to 28X, and remained basically the same again from 33X to 93X (Supplemental Figure   S1B). This indicates that for large INDELs, insufficient coverage results in more assembly errors, which results in a higher error rate for micro-assembly variant calling.
Based on the simulation data, one would need at least a mean coverage of 30X to maintain a reasonable FDR for micro-assembly based INDEL callers. However, since these results were based on simulation data, which does not include the effects of any sequencing artifacts on INDEL calling, these values establish the upper bound of accuracy and performance compared to genuine sequence data. certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

WGS vs. WES: Low concordance on INDEL calling
We analyzed a dataset including high coverage WGS and WES data from eight samples.
Position-match means two INDELs have the same genomic coordinate, while exactmatch additionally requires that two INDELs also have the same base-pair change(s) (Methods). When we excluded regions with less than one read in either dataset, the mean concordance rates based on exact match and position-match increased to 62%±1.1% and 66%±1.0%, respectively. If we excluded regions with base coverage in either dataset with less than 20, 40, 60, or 80 reads, the mean concordance rate based on exact-match and position-match both continued to increase until reaching a base coverage of 80 reads (Table 1). This showed that some INDELs were missing in either dataset because of low sequencing efficiency in those regions. Although WES data had higher mean coverage than WGS data, we were surprised to see that in regions requiring at least 80 reads, there was only 4%±1.5% of INDEL specific to WES but 21%±3.2% specific to WGS. Regions with excessive coverage might indicate problems of sequencing or library preparation and this highlights the importance of coverage uniformity in WGS. Based on exactmatch, the proportion of the WGS-specific INDELs was 34%±1.4%, which was ~2.5-fold higher than the proportion of WES-specific INDELs (14%±1.2%). This ratio was even higher based on position-match (~3-fold). Reasons for this could be either high sensitivity of INDEL detection with WGS data or high specificity of INDEL detection with WES data. We will further illustrate this in the downstream analysis.

Coverage distributions of different regions in WGS and WES data
We define the proportion of a region covered with at least X reads to be the coverage fraction at X reads for this region.  In the exonic targeted regions, the mean coverages across eight samples were 71X±3.3X and 337X±18.2X for WGS and WES data, respectively ( Figure 2). The coverage fractions at 20X were 98%±0.2% and 75%±0.1% in WGS and WES data, respectively.
We noticed that there was a capture efficiency issue with WES in some regions, as the coverage fraction at 1X was 99.9%±0.08% in WGS data but only 84%±1.1% in WES data, indicating that ~16% of the exonic targeted regions were not captured for sequencing. In the high confidence INDEL regions, the mean coverage across eight samples were 58X±3.4X and 252X±7.0X for WGS and WES data, respectively ( Figure   3). The coverage fractions at 20X were 96%±1.1% and 97%±0.3% in WGS and WES data, respectively. We noticed that there was a significant increase of coverage uniformity for WES in the high confidence INDEL regions, relative to the exonic targeted regions. This increase was even more significant when we looked at the coverage fractions at 50X, which were 58%±6.0% and 86%±0.7% in WGS and WES data, respectively. This suggested that WGS was able to reveal high confidence INDELs at a much lower coverage and a better uniformity across the genome, relative to WES.
Depth coverage distributions were skewed in the WES data, with some regions poorly covered and other regions over saturated with redudant reads.
However, in WGS-specific INDEL regions, the mean coverages across eight samples were 61X±2.9X and 137X±12.1X for WGS and WES data, respectively ( Figure 4).
Compared to the targeted regions, the mean coverage for WES data was significantly reduced in these regions. The coverage fractions at 1X were 99.9%±0.03% and 56%±0.3% in WGS and WES data, respectively. The difference became even larger for coverage fractions at 20X: 94%±1.4% for WGS and 31%±2.1% for WES data. The reason why WES data missed these INDELs could be: ~44% of these regions in WES data were not covered at all and ~69% were covered with fewer than 20 reads. Furthermore, in WES-specific INDELs regions, the mean coverages across eight samples were 41X±5.2X and 172X±10.0X for WGS and WES data, respectively ( Figure 5). The coverage fractions at 20X were 87%±6.1% and 96%±1.0% in WGS and WES data, certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint respectively. We noticed in these regions, the depth coverage of WGS data was significantly lower than WES data because the coverage fractions at 50X decreased to 29%±9.4%, while it was 79%±3.3% for WES data. It was difficult to directly understand the issues with these regions, so we used the high confidence INDEL set as a positive control and proceeded to assess each call set with newly developed quality criteria.

Assessment of the INDEL calls sets from WGS and WES
Based on previous validation data, we selected three combinations of thresholds to define the calling quality of an INDEL call as either high, moderate or low quality based on the following two metrics: the coverage of the alternative allele and the k-mer Chi-Square score of an INDEL (Methods). Using these criteria, 89% of the high confidence INDELs were considered as high quality, 9% as moderate quality, and only 2% as low quality ( Figure 6, Supplemental Table S1). We noticed a similar pattern for WGS-specific INDELs: 78% were of high quality, 15% as moderate quality, and only 7% as low quality. However, for WES-specific INDELs, there was a striking enrichment of low quality events (41%), additionally with a ~4.1-fold decrease of the high quality events In order to understand what was driving the error rates in different data sets, we partitioned the INDELs by the following six regions: homopolymer A (poly-A), homopolymer C (poly-C), homopolymer G (poly-G), homopolymer T (poly-T), short tandem repeats (STRs) except homopolymers (other STRs), and non-STRs. We noticed that for the high quality events, the majority of the high confidence INDELs (70%±1.2%) and WGS-specific INDELs (67%±1.1%) were within non-STRs regions ( Figure 7). certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Further, a majority of the high quality INDELs specific to WES were within poly-A (24%±3.0%) and poly-T regions (30%±3.5%). When we compared the low quality INDELs to the high quality INDELs, there were consistent enrichment of homopolymer A or T (poly-A/T) INDELs in all three call sets, ~2.3-fold for high confidence events, ~2.1-fold for WGS-specific events, and ~1.5-fold for WES-specific events. The WESspecific call set contained a much higher proportion (83%±4.8%) of Poly-A/T INDELs from the low-quality INDELs, relative to the high confidence call set (44%±16.8%), and the WGS-specific call set (45%±9.0%). This suggested that poly-A/T is a major contributor to the low quality INDELs, which gives rise to much more INDEL errors. We explored this further by a comparison of PCR-free and standard PCR-involved WGS data below.

Sources of multiple signatures in WGS and WES data
Another way of understanding INDEL errors is to look at multiple signatures at the same genomic location. Multiple signatures means that for the same genomic location, there are more than one INDELs called. If we assume only one signature can be the true INDEL in the genome, any additional signatures would represent false-positive calls. So if we have a higher number of multiple signatures, it means that these reads contained more INDEL errors or the algorithm tends to make more mistakes in these regions. We combined the call sets from both datasets and identified multiple signatures in the union for each sample. In order to understand the error behaviors in the above assessment, we also partitioned the signatures by the same regional criteria. We noticed that the poly-A/T INDELs are the major source of multiple signatures, which are enriched in WES data (72%±7.4% for WES vs. 54%±9.3% for WGS) ( We tried to understand the source of multiple signatures by the numbers of reads containing homopolymer INDELs inferred by the CIGAR code ( Figure 9). Figure 9 showed that there is a much higher proportion of poly-A/T INDELs in the WES-specific regions from both WGS (56%±7.5%) and WES data (64%±5.8%), relative to other certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint regions. In addition, WES data has ~6.3-fold more reads than WGS data in the regions with INDELs specific to WES data (11251 vs. 1775, Supplemental Table S2). According to Qualimap, a large number of homopolymer indels might indicate a problem in sequencing for that region. Here we particularly identified the effects of these problematic sequencing reads on INDEL calling, which revealed more multiple signatures of poly-A/T INDELs.

Applications of using filtering criteria to reduce false positive de novo INDELs
Supplemental Table S3 shows the probabilities of seeing more than K INDELs from one of the 343 families reported in Iossifov et al. 2012 [32]. Scalpel has a de novo analysis mode; it could re-assemble each region associated with the candidate INDELs across the family members using a more sensitive parameter setting. This setting was indeed more sensitive for detecting de novo INDELs than single-sample calling. Due to this, we used the following more rigorous filtering criteria than the above assessment to exclude any spurious false-positive de novo INDELs: coverage of the alternative allele >10 and Chi-Square score <4. Supplemental Table S4 showed the number of putative de novo INDELs in two families before and after applying this filtering criteria. All of the putative events in the two families were successfully excluded, which was consistent with the validated results in the variant database reported by Iossifov et al. 2012 [32]. We noticed that, in both families, the majority of these false-positive de novo INDELs were poly-A/T relevant (91% for WGS, 78% for WES), which was consistent with the above assessment. This suggested that if we used very sensitive callers, we should control for poly-A/T false-positive de novo INDELs by applying a more rigorous filtering criteria, especially in population-scale sequencing projects, where there is substantial expense with experimental validation.

Validation of INDELs in WGS and WES data on the sample K8101-49685s
We previously selected ~1400 INDELs called with multiple variant callers from the WES data on the sample K8101-49685s. 453 of those were called by Scalpel with this WES data and 372 of them were validated [31]. In this study, we called the INDELs with Scalpel again but from the WGS data on this sample. The WGS data revealed 324 out of these 453 INDELs (72%). This sensitivity at 30X is consistent with what we predict in certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint the later discussion (75% at 30X). Among those, 286 INDELs were validated and 38 INDELs were invalid according to the validation experiment. This showed that 43 (53%) of these false-positive INDELs were not called from the WGS data and thus specific to the WES data. The error rate of INDELS from the WGS data based on the validation subset was 12%, which was the same as the high quality INDELs discussed above.
Furthermore, 126 of those 453 INDELs were specific to the WES data and 43 failed to pass the validation. The error rate of INDELs specific to the WES data is 34%, which was almost the same as the low quality INDELs discussed above. This validation data showed that the high confidence INDELs called by both data-types were indeed of high quality and low error rate, even though this WGS data was with a mean coverage of 30X.
There is ~2.8-fold higher error rate for INDELs specific to the WES data, suggesting that INDELs specific to the WES data in fact are of low quality.

PCR-involved vs. PCR-free: assessment of INDELs calling quality
The concordance rate between PCR-involved and PCR-free data on NA12878 using exact-match and position-match were 71% and 76%, respectively ( Figure 10). These concordance rates were higher than those between WGS and WES, even for regions having at least one read in both datasets. Based on exact-match, the proportion of INDELs specific to PCR-involved data was 18%, which is ~1.6-fold higher than the proportion of INDELs specific to PCR-free data (11%). This ratio was similar based on position-match (~1.7-fold). Like previous assessments, we classified the three call sets with respects to calling quality. We again used the high confidence INDEL call set as a positive control. Figure 11 shows that 89% of the high confidence INDELs are considered as high quality, 9% as moderate quality, and only 2% as low quality.
However, for INDELs specific to PCR-involved data, there is a large proportion of low quality events (61%), and a very limited proportion are of high quality (7%). There were on average 310 INDELs specific to PCR-free data and 538 INDELs specific to PCRinvolved data. Notably, 177 of the PCR-free-specific INDELs and 40 of the PCRinvolved-specific INDELs were of high quality, suggesting that in these specific regions, PCR-free data yielded ~4.4-fold more high quality INDELs than PCR-involved data.
Furthermore, 326 of the PCR-involved-specific INDELs were of low quality, while in the certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint PCR-free-specific call set, 52 INDELs were of low quality. That being said, in regions specific to data types, PCR-involved data yielded ~6.3-fold more low quality INDELs.
Consistent with the comparisons between WGS and WES data, this suggested PCR amplification induced a large number of error-prone INDELs to the library, and we could effectively increase INDEL calling quality by reducing the rate of PCR amplification.
To understand the behaviors of errors in the poly-A/T regions, we partitioned the INDEL call set by the same six regions again. We noticed that for the high quality events, a majority of the high confidence INDELs (68%) were within non-STRs regions ( Figure   12). The proportion of poly-A/T INDELs was small for the high confidence call set (20%), larger for PCR-free-specific call set (35%), and even larger for PCR-involvedspecific call set (51%). This was similar to the WGS and WES comparisons because there would be more poly-A/T INDELs when a higher rate of PCR amplification was performed. A majority of the high quality INDELs specific to PCR-involved data were within poly A (24%) and poly T regions (38%). When we compared the low quality   [28], we could certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint only discover 85% of the high confidence INDELs. At the level of a person's whole genome, in order to achieve a sensitivity of 95% for INDEL detection, we recommend that the sequencing requirement, at least on the Illumina HiSeq2000 platform, should be WGS at a mean coverage of 60X, after removing PCR duplicates ( Figure 13). Of course, due to the above results, a WGS at 60X mean coverage with PCR-free library preparation is even more ideal.
Some groups previously reported that determining heterozygous SNPs requires higher coverage than homozygous ones [33]. The sensitivity of heterozygous SNP detection was limited by depth of coverage, which requires at least one read from each allele at any one site and in practice much more than one read to account for sequencing errors [34].
However, the read depth requirement of INDEL detection in terms of zygosity has not been well understood. To answer this question, we took the high confidence INDELs and partitioned them by zygosities. We first plotted the pair-wise coverage relationship between WGS and WES for each high confidence INDEL. Supplemental Figure S2 shows that the detection of homozygous INDELs starts with a lower coverage, which is consistent in both WGS and WES datasets, although the rest of the homozygotes and heterozygotes were highly overlapping. To further quantitatively understand this phenomenon, we measured the sensitivity again for heterozygous INDELs and homozygous INDELs separately. At a mean coverage of ~20X, the false negative rates of high confidence INDELs was ~45% for heterozygous INDELs and ~30% for homozygous INDELs (Figure 14). Figure 14 also shows that detection of heterozygous INDELs indeed requires higher coverage than homozygous ones (sensitivity of 95% at 60X vs. 50X). Notably, the number of heterozygous INDELs was ~1.6-fold higher than homozygous ones (~1600 vs. ~635 per sample). This re-affirms our recommendation of sequencing personal genome at 60X mean coverage to achieve a high accuracy INDEL call set. certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. As more and more groups are moving on to these new micro-assembly based algorithms, practical considerations for experimental design should be introduced to the community.
Here we present a novel classification scheme utilizing the validation data, and we encourage researchers to use this guideline for evaluating their call sets. The combination of alternative allele coverage and the k-mer Chi-Square score is an effective filter criterion for reducing INDEL calling errors. One could easily apply this criterion to identify both high-quality and low-quality INDELs, which could help to quickly understand the characteristics of an INDEL call. If we use INDEL callers with a very sensitive setting, we should control for homopolymer false-positive INDELs by applying a more rigorous filtering criteria. This is essential for population-scale sequencing projects, because the expense of experimental validation scales with the sample size. For consumer genome sequencing purposes, we recommend sequencing human genomes at 60X mean coverage with PCR-free protocols, which can substantially improve the quality of personal genomes. Although this recommendation might initially cost more than the current standard protocol of genome sequencing used by some facilities, we argue that certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint the significantly higher accuracy and decreased costs for validation with WGS at 60X with PCR-free protocols would ultimately be cost-effective as the sequencing costs continue to decrease, relative to either WES or WGS at a lower coverage. certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Analysis of Simulated Data
We simulated Illumina-like 2*101 paired-end reads with randomly distributed INDELs, which ranged from 1 bp to 100 bp. The simulated reads were mapped to human reference genome hg19 using BWA-mem v0.7-6a using default parameters [35]. The alignment was sorted with SAMtools (v0.1.19-44428cd) [36] and the duplicates were marked with Picard using default parameters (v1.106), resulting in a mean coverage of 93X. We down-sampled the reads with Picard to generate 19 sub-alignments. The minimum mean coverage of the sub-alignments was 4.7X and increased by 4.7X each time, before it got to the original coverage (93X). Scalpel (v0.1.1) was used to assemble the reads and call INDELs from each alignment separately, resulting in 20 INDEL call-sets from these 20 alignments, using the following parameter setting: "--single --lowcov 1 --mincov 3outratio 0.1 --numprocs 10 --intarget". We then computed both the sensitivity and specificity of all and large (greater than 5bp) INDEL detection, respectively. The same versions and the same sets of parameter settings for BWA-mem, Picard, and Scalpel, were also used in the rest of the study, including the analysis of WGS/WES data, PCRinvolved and PCR-free data.

Generation of WGS and WES data
Blood samples were collected from eight humans of two quartets from the Simons Simplex Collection. Both WGS and WES were performed on the same genomic DNA isolated from these eight blood samples. The exome capture kit used was NimbleGen SeqCap EZ Exome v2.0, which was designed to pull down 36Mb (approximately 300,000 exons) of the human genome hg19. The actual probe regions were much wider than these targeted regions, because probes also covered some flanking regions of genes, yielding a total size of 44.1Mb. All of the libraries were constructed with PCR amplification. We sequenced both sets of libraries on Illumina HiSeq2000 with average read length of 100 bp at the sequencing center of Cold Spring Harbor Laboratory (CSHL).  [37,38].

Generating summary statistics of alignment from WGS and WES
We used Qulimap (v0.8.1) to generate summary statistics of the alignment files of interest [39]. The reference genome used here is hg19. There were four region files that we used for this part of the analysis. The first one is the exon region bed file from NimbleGen. We generated the other three region files by expanding 25bp upstream and downstream around loci of high confidence INDELs, WGS-specific INDELs, and WESspecific INDELs, respectively. We followed all of the default settings except for requiring the homopolymer size to be at least five (-hm 5). Finally, we used Matplotlib to certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint generate the figures with the raw data from Qualimap under Python environment 2.7.2 [40].

Generation of MiSeq validation data
In this study, we used the MiSeq validation data on the sample K8101-49685s that we previously reported [31]. For more technical details, please refer to the original paper. In

Classifications of INDEL with calling quality based on the validation data of sample K8101
We previously benchmarked Scalpel with respect to the coverage of the alternative allele (C ! !"# ) and the k-mer Chi-Square scores (χ ! ). Scalpel applied the standard formula for the Chi-Square statistics and applied to the K-mer coverage of both alleles of an INDEL.
where C ! !"# and C ! !"# are the observed k-mer coverage for the reference and alternative alleles, C ! !"# and C ! !"# are the expected k-mer coverage, i.e. C !
Among ~1400 INDELs that we selected for validation, 453 of them were called by Scalpel. Here we used all 453 INDELs to understand the relationship between the false discovery rate FDR and these two metrics (Supplemental Figure S3). Our validation data showed that with the same χ ! , INDELs with a lower C ! !"# tend to have a higher FDR, especially for INDELs with C ! !"# not greater than 10 (Supplemental Figure S3). For certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint INDELs with relatively the same C ! !"# , a higher χ ! also made them less likely to pass the validation. We noticed that the calling quality could be determined by the error rate inferred by these two metrics. To achieve a consistent accuracy for INDELs with different C ! !"# , we classified INDELs from the WGS and WES data and determined the calling quality with the below criteria: High quality INDELs: low error-rate (12%) INDELs meeting any of the three cutoffs: C ! !"# >10 and χ ! < 10.8, or 5 < C ! !"# ≤ 10 and χ ! ≤ 4.5, or C ! !"# ≤ 5 and χ ! ≤ 2; Low quality INDELs: high error-rate (32%) INDELs meeting the following cutoff: C ! !"# ≤ 10 and χ ! >10.8; Moderate quality: The remaining INDELs that do not fall into the above two categories.

Analysis of the effect of new filtering criteria on de novo INDEL calls
The two families in this study were previously reported in a population-scale autism study, with Sanger validation of de novo calls [32]. We used the de novo mode of Scalpel to identify de novo INDELs in these two families again, resulting in one de novo call set for WGS data and another de novo call set for WES data per family. We de novo exonic INDEL per child [32]. If we assume a binomial distribution of the de novo exonic INDELs with an equal chance (p=1/343), the probability of seeing at least X de novo exonic INDELs in a given family in this study can be computed as below: where P( X ≥ k ) is the probability of a given family having k or more de novo INDELs; N is the total number of exonic de novo INDELs reported, i.e. N=85; p is the probability of a hit on a given trial, i.e p=1/343; q=1-p.

Analysis of PCR-free and PCR-involved data of NA12878
We downloaded PCR-free WGS data of NA12878 (access Code: ERR194147), which is publicly available in the Illumina Platinum Genomes project. We also download a PCRinvolved WGS data of NA12878 from the Sequence Read Archive (SRA) (access Code: certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint SRR533281, SRR533965, SRR539965, SRR539956, SRR539947, SRR539374, SRR539357). Both data were generated on Illumina HiSeq 2000 platform. Although the PCR-free data was not supposed to have any PCR duplicates, we observed a duplication rate of 2% as reported by Picard, and we excluded these reads, yielding ~50X mean coverage for both datasets after removing PCR duplicates. We used the same methods for alignment, INDEL calling, and downstream analysis as described above.

Analysis of INDEL detection sensitivity in WGS data
We were interested to know how depth of coverage affects the sensitivity of INDEL detection in WGS data. To accurately measure this sensitivity, one needs a robust call set as a truth set. Fortunately, we had exact-match INDELs concordant between high coverage WGS and high coverage WES data. We therefore measured sensitivity based on these high confidence INDELs, rather than on the whole set of INDELs, which might contain much more false positives. We down-sampled each WGS dataset to mean coverages of ~20X, ~32X, ~45X, and ~57X. We then used Scalpel to call INDELs from the resulting 4 sub-alignment files for each sample and computed the sensitivity at a certain mean coverage (X) for each sample by the equation: Sensitivity at X coverage     Figure 1 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 2 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 3 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 4 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 5 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 6 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 7 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 8 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 9 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 10 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 11 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 12 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 13 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Figure 14 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint                certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figures legends
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Table legends   Table 1. Mean concordance and discordance rates of INDEL detection between WGS and WES data in different regions. The data is shown in the following order: 1) regions without filtering, and regions filtered by requiring base coverage to be at least 2) one read, 3) 20 reads, 4) 40 reads, 5) 60 reads, or 6) 80 reads in both data. The mean discordance rate is calculated based on position-match, which is the percentage of INDELs specific to either dataset. The standard deviation is shown in parenthesis. certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Additional files Additional file 1 -Supplemental figures and tables
This file includes supplemental figure S1-S3, supplemental table S1-S4, and their corresponding figure/table legends. certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Supplemental Figures
Supplemental Figure S1 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
The copyright holder for this preprint (which was not this version posted June 10, 2014. ; https://doi.org/10.1101/006148 doi: bioRxiv preprint Supplemental Table legends   Supplemental Table S1 certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.