Reducing INDEL calling errors in whole genome and exome sequencing data
© Fang et al.; licensee BioMed Central Ltd. 2014
Received: 14 June 2014
Accepted: 16 October 2014
Published: 28 October 2014
INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, there are still many errors with INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts.
We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on the coverage and composition to rank high and low quality INDEL calls. We performed a large-scale validation experiment on 600 loci, and find high-quality INDELs to have a substantially lower error rate than low-quality INDELs (7% vs. 51%).
Simulation and experimental data show that assembly based callers are significantly more sensitive and robust for detecting large INDELs (>5 bp) than alignment based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (53%), and WGS data uniquely identifies 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (84% vs. 57%), and WES misses many large INDELs. In addition, the concordance for INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identifies 6.3-fold more low-quality INDELs. Furthermore, accurate detection with Scalpel of heterozygous INDELs requires 1.2-fold higher coverage than that for homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data.
Overall, we show that accuracy of INDEL detection with WGS is much greater than WES even in the targeted region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing practice, the deeper coverage may save total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (for example, capture deficiency, PCR amplification, homopolymers) with various data that will serve as a guideline to effectively reduce INDEL errors in genome sequencing.
With the increasing use of next-generation sequencing (NGS), there is growing interest from researchers, physicians, patients, and consumers to better understand the underlying genetic contributions to various conditions. For rare diseases and cancer studies, there has been increasing success with exome/genome sequencing in identifying mutations that have a large effect size for particular phenotypes  . Some groups have been trying to implement genomic and/or electronic health record approaches to interpret disease status and inform preventive medicine  . However, we are still facing practical challenges for both analytic validity and clinical utility of genomic medicine  . In addition, the genetic architecture behind most human disease remains unresolved  . Some have argued that we should bring higher standards to human genetics research in order to return results and/or reduce false-positive reports of ‘causality’ without rigorous standards ,. Others have reported that analytic validity for WES and WGS is still a major issue, pointing out that the accuracy and reliability of sequencing and bioinformatics analysis can and should be improved for a clinical setting ,, .
There is also debate whether we should primarily in the year 2014 use whole genome sequencing (WGS) or whole exome sequencing (WES) for personal genomes. Some have suggested that a first-tier cost-effective WES might be a powerful way to dissect the genetic basis of diseases and to facilitate the accurate diagnosis of individuals with ‘Mendelian disorders’ ,. Others have shown that targeted sequencing misses many things  and that WGS could reveal structural variants (SVs), maintains a more uniform coverage, is free of exome capture efficiency issues, and actually includes the non-coding genome, which likely has substantial importance  . Some groups directly compared WGS with WES, but thorough investigation of INDEL errors was not the focus of these comparisons ,,,. Substantial genetic variation involving INDELs in the human genome has been previously reported but accurate INDEL calling is still difficult  . There has been a dramatic decrease of sequencing cost over the past few years, and this cost is decreasing further with the release of the Illumina HiSeq X Ten sequencers which have capacity for nearly 18,000 whole human genomes per instrument per year. However, it is still unclear whether we can achieve a high-accuracy personal genome with a mean coverage of 30X from the Illumina HiSeq X Ten sequencers. In addition, there have been questions on the use of PCR amplification in the library preparations for NGS, although very few have characterized the PCR errors that might be complicating the detection of insertions and deletions (INDELs).
Concordance rates among INDELs detected by the GATK Unified Genotyper (v1.5), SOAPindel (v1.0) and SAMtools (v0.1.18) are reportedly low, with only 26.8% agreeing across all three pipelines . Another group also reported low concordance rates for INDELs between different sequencing platforms, further showing the difficulties of accurate INDEL calling . Other efforts have been made to understand the sources of variant calling errors . Common INDEL issues, such as realignment errors, errors near perfect repeat regions, and an incomplete reference genome have caused problems for approaches working directly from the alignments of the reads to reference ,. De novo assembly using de Brujin graphs has been reported to tackle some of these limitations . Fortunately, with the optimization of micro-assembly, these errors have been reduced with a novel algorithm, Scalpel, with substantially improved accuracy over GATK-HaplotypeCaller (v3.0), SOAP-indel (v2.01), and six other algorithms . Based on validation data, the positive prediction rate (PPV) of algorithm specific INDELs was high for Scalpel (77%), but much lower for GATK HaplotypeCaller (v3.0) (45%) and SOAP-indel (v2.01) (50%) .
Thus, we set out to investigate the complexities of INDEL detection on Illumina reads using this highly accurate INDEL-calling algorithm. First, we used simulation data to understand the limits of how coverage affects INDEL calling with Illumina-like reads using GATK-UnifiedGenotyper and Scalpel. Second, we analyzed a dataset including high coverage WGS and WES data from two quad families (mother, father and two children), in addition to extensive high-depth validation data on an in-house sample, K8101-49685s. In order to further understand the effects of PCR amplification on INDEL calling, we also downloaded and analyzed two WGS datasets prepared with and without PCR from the well-known HapMap sample NA12878. We characterized the data in terms of read depth, coverage uniformity, base-pair composition pattern, GC contents, and other sequencing features, in order to partition and quantify the INDEL errors. We were able to simultaneously identify both the false-positives and false-negatives of INDEL calling, which will be useful for population-scale experiments. We observe that homopolymer A/T INDELs are a major source of low quality INDELs and multiple signatures. As more and more groups start to use these new micro-assembly-based algorithms, practical considerations for experimental design should be introduced to the community. Lastly, we explicitly address the question concerning the necessary depth of coverage for accurate INDEL calling using Scalpel for WGS on HiSeq sequencing platforms. This work provides important insights and guidelines to achieve a highly accurate INDEL call set and to improve the sequencing quality of personal genomes.
Analysis of simulated data
We simulated Illumina-like 2*101 paired-end reads with randomly distributed INDELs, which were in the range of 1 bp to 100 bp. The simulated reads were mapped to human reference genome hg19 using BWA-mem (v0.7-6a) using default parameters . The alignment was sorted with SAMtools (v0.1.19-44428cd)  and the duplicates were marked with Picard using default parameters (v1.106), resulting in a mean coverage of 93X. We down-sampled the reads with Picard to generate 19 sub-alignments. The minimum mean coverage of the sub-alignments was 4.7X and increased by 4.7X each time, before it reached the original coverage (93X). Scalpel (v0.1.1) was used as a representative of assembly-based callers to assemble the reads and call INDELs from each alignment separately, resulting in 20 INDEL call sets from these 20 alignments, using the following parameter settings: `--single --lowcov 1 --mincov 3 outratio 0.1 --numprocs 10 intarget. We also used GATK-UnifiedGenotyper (v3.2-2) as a representative of alignment based callers to call INDELs from each set of alignments . We followed the best practices on the GATK website, including all the pre-processing procedures, such as INDEL realignment and base recalibration. Scalpel internally left-normalized all the INDELs so we only used GATK-LeftAlignAndTrimVariants on the INDEL calls from UnifiedGenotyper. We then computed both the sensitivity and false discovery rate (FDR) for both INDEL callers, with respects to all and large (>5 bp) INDELs. The same versions and the same sets of parameter settings for BWA-mem, Picard, and Scalpel, were also used in the rest of the study, including the analysis of WGS/WES data, standard WGS, and PCR-free data.
Generation of WGS and WES data
Blood samples were collected from eight humans of two quartets from the Simons Simplex Collection (SSC) . Both WGS and WES were performed on the same genomic DNA isolated from these eight blood samples. The exome capture kit used was NimbleGen SeqCap EZ Exome v2.0, which was designed to pull down 36 Mb (approximately 300,000 exons) of the human genome hg19. The actual probe regions were much wider than these targeted regions, because probes also covered some flanking regions of genes, yielding a total size of 44.1 Mb. All of the libraries were constructed with PCR amplification. We sequenced both sets of libraries on Illumina HiSeq2000 with average read length of 100 bp at the sequencing center of Cold Spring Harbor Laboratory (CSHL). We also generated WGS (mean coverage =30X) and WES (mean coverage =110X) data from an in-house sample K8101-49685s (not from SSC), which was extensively investigated in the later validation experiment. Exome capture for this sample was performed using the Agilent 44 Mb SureSelect protocol and the resulting library was sequenced on Illumina HiSeq2000 with average read length of 100 bp. All of the HiSeq data from K8101-49685s have been submitted to the Sequence Read Archive (SRA)  under project accession number SRX265476 (WES data) and SRX701020 (WGS data). All of the HiSeq data from eight SSC samples have been submitted to the National Database for Autism Research (NDAR)  under collection ‘Wigler SSC autism exome families’ (project number: 1936).
Institutional review board approval
The Simons Simplex Collection (SSC) is a permanent repository of genetic samples from 2,700 families operated by SFARI  in collaboration with 12 university-affiliated research clinics. SFARI maintains the consent of all individuals in the SSC and the analysis of those samples in this project was supervised under the CSHL IRB review committee. This study of the internal sample K8101-49685s was approved by the CSHL Institutional Review Board, and all participants provided informed written consent.
Analysis of the INDELs from WGS and WES data
We excluded all of the low quality raw reads, aligned the remaining high quality ones with BWA-mem, and mark-duplicated with Picard. We used Scalpel to assemble the reads and identify INDELs under both single mode and quad mode. The single mode outputs all of the putative INDELs per person, and the quad mode outputs only the putative de novo INDELs in the children in a family. We expanded each of the exons by 20 bp upstream and 20 bp downstream in order to cover the splicing sites and we called this set of expanded regions the ‘exonic targeted regions’. The exonic targeted regions are fully covered by the exome capture probe regions. We excluded INDELs that were outside the exonic targeted regions in the downstream analysis.
We left-normalized the INDELs and compared the two call sets for the same person using two criteria: exact-match and position-match. Position-match means two INDELs have the same genomic coordinate, while exact-match additionally requires that two INDELs also have the same base-pair change(s). We called the INDELs in the intersection based on exact-match as WGS-WES intersection INDELs. Further, we named the INDELs only called from one dataset as ‘WGS-specific’ and ‘WES-specific’ INDELs, respectively. Regions of the above three categories of INDELs were partitioned and investigated separately. In particular, we focused on regions containing short tandem repeats (STR) and homopolymers. We used BedTools (v2.18.1) with the region file from lobSTR (v2.04) to identify homopolymeric regions and other STR (dual repeats, triplets and etc.) in the human genome  
Generating summary statistics of alignment from WGS and WES
We used Qualimap (0.8.1) to generate summary statistics of the alignment files of interest . For a certain region, we define the proportion of a region covered with at least X reads to be the coverage fraction at X reads. In addition to the coverage histograms, we also computed the coefficient of variation CV to better understand the coverage uniformity of the sequencing reads. An unbiased estimator of CV can be computed by , where s represents the sample standard deviation and represents the sample mean. In our case, asymptotically approaches to as the sample size (n) of the data is usually greater than 10,000. The reference genome used here is hg19. There were four region files that we used for this part of the analysis. The first one is the exon region bed file from NimbleGen. We generated the other three region files by expanding 25bp upstream and downstream around loci of WGS-WES intersection INDELs, WGS-specific INDELs, and WES-specific INDELs, respectively. We followed all of the default settings in Qualimap except for requiring the homopolymer size to be at least five (-hm 5). Finally, we used Matplotlib to generate the figures with the raw data from Qualimap under the Python environment 2.7.2 .
Generation of MiSeq validation data of sample K8101-49685s
We randomly selected 200 INDELs for validation on an in-house sample K8101-49685s from each of the following categories: (1) INDELs called from both WGS and WES data (WGS-WES intersection), (2) WGS-specific INDELs, (3) WES-specific INDELs. Out of these 600 INDELs, 97 were covered with more than 1,000 reads in the previous MiSeq data set reported by Narzisi et al. Hence, we only performed additional Miseq validation on the remaining 503 loci . PCR primers were designed using Primer 3 to produce amplicons ranging in size from 200 to 350 bp, with INDELs of interest located approximately in the center. Primers were obtained from Sigma-Aldrich in 96-well mixed-plate format, 10 mol/L dilution in Tris per oligonucleotide. 25 L PCR reactions were set up to amplify each INDEL of interest using K8101-49685s genomic DNA as template and LongAmp Taq DNA polymerase (New England Biolabs). PCR products were visually inspected for amplification efficiency using 1.5% agarose gel electrophoresis, and then pooled for ExoSAP-IT (Affymetrix) cleanup. The cleanup product was purified using QIAquick PCR Purification Kit (Qiagen) and quantified by Qubit dsDNA BR Assay Kit (Invitrogen). Subsequently, a library construction was performed following the TruSeq Nano DNA Sample Preparation Guide for the MiSeq Personal Sequencer platform (Illumina). Before loading onto the MiSeq machine, the quality and quantity of the sample was reevaluated using the Agilent DNA 1000 Kit on the Agilent Bioanalyzer and with quantitative PCR (Kapa Biosystems).
We generated high quality 250 bp paired-end reads with an average coverage of 55,000X over the selected INDELs. We aligned the reads with BWA-MEM (v0.7.5a) to hg19, sorted the alignment with SAMtools (v0.1.18) and marked PCR duplicates with Picard (v1.91). The alignment quality control showed that 371 out of the 503 loci were covered with at least 1,000 reads in the data and we only considered these loci in the downstream analysis. Therefore, we have validation data on 160, 145, and 161 loci from the WGS-WES intersection, WGS-specific, and WES-specific INDELs, respectively. As reported by Narzisi et al., mapping the reads containing a large INDEL (near or greater than half the size of the read length) is problematic. This was particularly difficult when the INDEL is located toward either end of a read . To avoid this, we used very sensitive settings with Bowtie2 (--end-to-end --very-sensitive --score-min L,-0.6,-0.6 --rdg 8,1 --rfg 8,1 --mp 20,20) to align the reads because it can perform end-to-end alignment and search for alignments with all of the read characters . We generated the true INDEL call set by two steps: (1) used GATK UnifiedGenotyper to call INDELs from the BWA-MEM alignment, (2) performed manual inspection on the large INDELs from the Bowtie2 alignment (require at least 25% of the reads supporting an INDEL) . The alignments were realigned with the GATK (v2.6-4) IndelRealigner and base quality scores were recalibrated before variants were called with UnifiedGenotyper. Left-normalization was performed to avoid different representations of a variant. An INDEL was considered valid if a mutation with the same genomic coordinate and the same type of variation exists in the validation data. For example, an insertion call would not be considered valid if the variant with the same coordinate in the validation data was instead a deletion. All of the MiSeq data can be downloaded from the Sequence Read Archive under project accession number SRX386284 (Accession number: SRR1575211, SRR1575206, SRR1042010).
Classifications of INDEL with calling quality based on the validation data
where and are the observed k-mer coverage for the reference and alternative alleles, and are the expected k-mer coverage, that is, .
High quality INDELs: low error-rate (7%) INDELs meeting any of the three cutoffs: >10 and X2 < 10.8, or 5 < ≤10 and X2 ≤ 4.5, or ≤5 and X2 ≤ 2;
Low quality INDELs: high error-rate (51%) INDELs meeting the following cutoff: ≤10 and X2 > 10.8;
Moderate quality: The remaining INDELs that do not fall into the above two categories.
Analysis of PCR-free and standard WGS data of NA12878
We downloaded PCR-free WGS data of NA12878 (access Code: ERR194147), which is publicly available in the Illumina Platinum Genomes project. We also downloaded another WGS dataset of NA12878 with PCR amplification during library preparation, and we called it standard WGS data (SRA access Code: SRR533281, SRR533965, SRR539965, SRR539956, SRR539947, SRR539374, SRR539357). Both data were generated on the Illumina HiSeq 2000 platform. Although the PCR-free data was not supposed to have any PCR duplicates, we observed a duplication rate of 2% as reported by Picard, and we excluded these reads, yielding 50X mean coverage for both data sets after removing PCR duplicates. We used the same methods for alignment, INDEL calling, and downstream analysis as described above. INDELs outside the exonic targeted regions were not considered in the downstream analysis.
Analysis of INDEL detection sensitivity in WGS data
This equation measures how many of the WGS-WES intersection INDELs can be discovered as a function of read depth. We also analyzed the WGS-WES intersection INDEL call set in terms of zygosity: WGS-WES intersection heterozygous and homozygous INDEL, subsequently measuring the sensitivity with respect to different zygosities.
Results and discussion
Simulated data: characterizing alignment and assembly based callers at different coverage
The FDRs of Scalpel were robust to the changes in coverage while GATK-UnifiedGenotyper’s FDRs were affected by coverage. For the detection of large INDELs with Scalpel, the FDRs marginally decreased as the mean coverage increased from 5X to 28X, and remained basically the same again from 33X to 93X (Figure 1B). This indicates that for large INDELs, insufficient coverage results in more assembly errors, which results in a higher error rate for micro-assembly variant calling. Based on the simulation data, a mean coverage of at least 30X is needed to maintain a reasonable FDR for Scalpel. In contrast, FDRs of GATK-UnifiedGenotyper are much higher and more unstable at different coverages, especially for large INDELs. Nonetheless, since these results were based on simulation data, which does not include the effects of any sequencing artifacts on INDEL calling, these values establish the upper bound of accuracy and performance compared to genuine sequence data. Previous studies reported that local assembly allows to call INDELs much larger than those that can be identified by the alignment ,,. Consistent with previous reports, our simulated data suggested that assembly based callers can reveal a much larger spectrum of INDELs than alignment based callers, in terms of their size. Furthermore, Narzisi et al. recently reported that Scalpel is more accurate than GATK-HaplotypeCaller and SOAPindel, especially within regions containing near-perfect repeats . Thus, in order to control for artifacts from callers, we chose to use Scalpel as the only INDEL caller in our downstream analysis on the experimental data, which could help to better clarify differences between data types.
WGS vs. WES: Low concordance on INDEL calling
Mean concordance and discordance rates of INDEL detection between WGS and WES data in different regions
Mean coefficients of variation of coverage with respects to the different regions
Exonic targeted regions
WGS-WES intersection INDEL regions
WGS-specific INDEL regions
WES-specific INDEL regions
Coverage distributions of different regions in WGS and WES data
An ideal sequencing experiment should result in a high number of reads covering a region of interest uniformly. Using the eight SSC samples, we investigated the coverage behaviors of the WGS and WES data by the following: distribution of the read depth, mean coverage, coverage fraction at X reads, coefficient of variation (C v ) (See Methods). Hence, ideally one should expect to see a normal distribution of read depth with a high mean coverage and a small C v . Comparisons of the coverage distributions are shown in the following order: (1) Exonic targeted regions, that is, the exons that the exome capture kit was designed to pull down and enrich; (2) WGS-WES intersection INDEL regions, that is, the regions where WGS and WES revealed the identical INDELs based on exact-match; (3) WGS-specific INDEL regions, that is, the regions where only WGS revealed INDELs based on position-match; (4) WES-specific INDEL regions, that is, the regions where only WES revealed INDELs based on position-match.
First, in the exonic targeted regions, the mean coverages across eight samples were 71X and 337X for WGS and WES data, respectively (Figure 3A and B, Additional file 1: Table S1). We noticed that there was a recovery issue with WES in some regions, as the coverage fraction at 1X was 99.9% in WGS data but only 84% in WES data, meaning that 16% of the exonic targeted regions were not recovered, which could be due to capture inefficiency or other issues involving DNA handling during the exome library preparation and sequencing protocols (Figure 3C and D, Additional file 1: Table S2). The coverage was much more uniform in the WGS data than that in the WES data because C v of the WGS data was much lower (39% vs. 109%, Figure 3A and B, Table2). Second, in the WGS-WES intersection INDEL regions, the mean coverage across eight samples were 58X and 252X for WGS and WES data, respectively (Additional file 1: Figure S1A and B, Additional file 1: Table S1). We noticed that there was an increase of coverage uniformity for WES in the WGS-WES intersection INDEL regions, relative to the exonic targeted regions, because C v was lower (109% vs. 97%) (Table 2, Figure 3B, Additional file 1: Figure S1B). We noticed WGS was able to reveal WGS-WES intersection INDELs at a much lower coverage relative to WES, which we attribute to a better uniformity of reads across the genome (C v : 47% vs. 97%, Table 2, Additional file 1: Figure S1A and B). The coverage distributions were skewed in the WES data, with some regions poorly covered and other regions over saturated with redundant reads.
MiSeq validation of INDELs in WGS and WES data on the sample K8101-49685s
Validation rates of WGS-WES intersection, WGS-specific, and WES-specific INDELs
INDELs (>5 bp)
Valid (>5 bp)
PPV (>5 bp)
Number and fraction of large INDELs in the following INDEL categories: (1) WGS-WES intersection INDELs, (2) WGS-specific, and (3) WES-specific
Large INDELs (>5 bp)
Fraction of large INDELs (>5 bp)
Assessment of the INDEL call sets from WGS and WES
Sources of multiple signatures in WGS and WES data
Standard WGS vs. PCR-free: assessment of INDELs calling quality
What coverage is required for accurate INDEL calling?
Some groups previously reported that determining heterozygous SNPs requires higher coverage than homozygous ones . The sensitivity of heterozygous SNP detection was limited by depth of coverage, which requires at least one read from each allele at any one site and in practice much more than one read to account for sequencing errors . However, the read depth requirement of INDEL detection in terms of zygosity has not been well understood. To answer this question, we took the WGS-WES intersection INDELs and partitioned them by zygosities. We first plotted the pair-wise coverage relationship between WGS and WES for each WGS-WES intersection INDEL. Additional file 1: Figure S3 shows that the detection of homozygous INDELs starts with a lower coverage, which is consistent in both WGS and WES data sets, although the rest of the homozygotes and heterozygotes were highly overlapping. To further understand this phenomenon, we measured the sensitivity again for heterozygous INDELs and homozygous INDELs separately. At a mean coverage of 20X, the false negative rates of WGS-WES intersection INDELs was 45% for heterozygous INDELs and 30% for homozygous INDELs, which is consistent with the fact that homozygous INDELs are more likely to be detected at a lower coverage shown above (Figure 12B). This shows that one should be cautious about the issue of false-negative heterozygous INDELs in any sequencing experiment with a low coverage (less than 30X). Figure 12B also shows that detection of heterozygous INDELs indeed requires higher coverage than homozygous ones (sensitivity of 95% at 60X vs. 50X). Notably, the number of heterozygous INDELs was 1.6-fold higher than homozygous ones (1,000 vs. 635 per sample). This re-affirms the need for 60X mean coverage to achieve a very high accuracy INDEL call set.
Despite the fact that both WES and WGS have been widely used in biological studies and rare disease diagnosis, limitations of these techniques on INDEL calling are still not well characterized. One reason is that accurate INDEL calling is in general much more difficult than SNP calling. Another reason is that many groups tend to use WES, which we have determined is not ideal for INDEL calling for several reasons. We report here our characterization of calling errors for INDEL detection using Scalpel. As expected, higher coverage improves sensitivity of INDEL calling, and large INDEL detection is uniformly more difficult than detecting smaller INDELs. We also showed that assembly-based callers are more capable of revealing a larger spectrum of INDELs, relative to alignment-based callers. There are several reasons for the low concordance for WGS and WES on INDEL detection. First, due to the low capture efficiency, WES failed to capture 16% of candidate exons, but even at sites that were successfully captured, there were more coverage biases in the WES data, relative to the WGS data. Second, PCR amplification introduces reads with higher INDEL error rate, especially in regions near homopolymer A/Ts. Lastly, STR regions, especially homopolymer A/T regions were more likely to result in multiple candidates at the same locus. We recommend controlling for homopolymer false INDEL calls with a more stringent filtering criteria. This is essential for population-scale sequencing projects, because the expense of experimental validation scales with the sample size.
Our validation data showed that INDELs called by both WGS and WES data were indeed of high quality and with a low error rate. Even though the WGS data have much lower depth coverage in general, the accuracy of INDEL detection with WGS data is much higher than that with WES data. We also showed that the WES data are missing many large INDELs, which we speculate might be related to the technical challenges of pulling down the molecules containing large INDELs during the exon capture process. Homopolymer A/T INDELs are a major source of low-quality INDELs and multiple signature events, and these are highly enriched in the WES data. This was confirmed by the comparison of PCR-free and standard WGS data. In terms of sensitivity, we calculated that WGS at 60X mean coverage from the HiSeq platform is needed to recover 95% of INDELs with Scalpel.
As more and more groups are moving to use new micro-assembly-based algorithms such as Scalpel, practical considerations for experimental design should be introduced to the community. Here we present a novel classification scheme utilizing the validation data, and we encourage researchers to use this guideline for evaluating their call sets. The combination of alternative allele coverage and the k-mer Chi-Square score is an effective filter criterion for reducing INDEL calling errors without sacrificing much sensitivity. This classification scheme can be easily applied to screen INDEL calls from all variant callers. Since alternative allele coverage is generally reported in the VCF files, the Chi-Square scores can also be computed directly. For consumer genome sequencing purposes, we recommend sequencing human genomes at a higher coverage with a PCR-free protocol, which can substantially improve the quality of personal genomes. Although this recommendation might initially cost more than the current standard protocol of genome sequencing used by some facilities, we argue that the significantly higher accuracy and decreased costs for validation would ultimately be cost-effective as the sequencing costs continue to decrease, relative to either WES or WGS at a lower coverage. However, it is important to point out that with the release of Illumina HiSeq X-Ten and other newer sequencers, the coverage requirement to accurately detect INDELs may decrease because reads with longer read length can span repetitive regions more easily. Besides, bioinformatics algorithms are another important consideration, and we expect the further enhancements of Scalpel and other algorithms will help reduce the coverage requirement while maintaining a high accuracy.
HF analyzed the data and wrote the manuscript. YW optimized the validation experiments and designed the primers. GN assisted in characterizing the simulation and validation data. JAO acted as a consultant for the MiSeq validation analyses. YW and LJB performed the Miseq validation experiments. JR generated the WGS and WES data. MR supervised the generation of the WGS and WES data. II developed the tool for the simulated data. HF, MCS, and GJL designed and analyzed the experiments. GJL developed experimental design for INDEL validation, suggested, reviewed, and supervised the data analysis, and wrote the manuscript. All of the authors have read and approved the final manuscript.
GJL,MCS, MR, and II are faculty members at Cold Spring Harbor Laboratory (CSHL). GN was a post-doctoral fellow at CSHL and is currently employed at the New York Genome Center. JR is a laboratory technician at CSHL. HF, JAO, and YW are graduate students at CSHL and Stony Brook University. LJB is a visiting undergraduate student at CSHL and a undergraduate student at Universidad Nacional Autonoma de Mexico.
Insertions and Deletions
- other STR:
short tandem repeats except homopolymers
polymerase chain reaction
homopolymer A or T
short tandem repeats
whole genome sequencing
whole exome sequencing
The laboratory of GJL is supported in part by funds from the Stanley Institute for Cognitive Genomics at Cold Spring Harbor Laboratory (CSHL). The laboratory of MCS is supported in part by National Institutes of Health award (R01-HG006677) and by National Science Foundation award (DBI-1350041). The CSHL genome center is supported in part by a Cancer Center Support Grant (CA045508) from the NCI. This work was partially supported by a grant from the Simons Foundation (SF235988) to Michael Wigler. We are grateful to all of the families at the participating SFARI Simplex Collection (SSC) sites, as well as the principal investigators (A. Beaudet, R. Bernier, J. Constantino, E. Cook, E. Fombonne, D. Geschwind, D. Grice, A. Klin, R. Kochel, D. Ledbetter, C. Lord, C. Martin, D. Martin, R. Maxim, J. Miles, O. Ousley, B. Pelphrey, B. Peterson, J. Piggot, C. Saulnier, M. State, W. Stone, J. Sutcliffe, C. Walsh, and E. Wijsman). The DNA samples used in this work are included within SSC release 15. Approved researchers can obtain the SSC population data set described in this study by applying at https://base.sfari.org. We thank Sara Ballouz, Wim Verleyen, Jesse Gillis, Ruibang Luo, Shane McCarthy, Zamin Iqbal, David Mittelman, Martin Reese, and the anonymous reviewers for helpful discussions and comments on the paper.
- Gudmundsson J, Sulem P, Gudbjartsson DF, Masson G, Agnarsson BA, Benediktsdottir KR, Sigurdsson A, Magnusson OT, Gudjonsson SA, Magnusdottir DN, Johannsdottir H, Helgadottir HT, Stacey SN, Jonasdottir N, Olafsdottir SB, Thorleifsson G, Jonasson JG, Tryggvadottir L, Navarrete S, Fuertes F, Helfand BT, Hu Q, Csiki IE, Mates IN, Jinga V, Aben KKH, van Oort IM, Vermeulen SH, Donovan JL, Hamdy FC: A study based on whole-genome sequencing yields a rare variant at 8q24 associated with prostate cancer. Nat Genet. 2012, 44: 1326-1329.View ArticlePubMedPubMed CentralGoogle Scholar
- Rope AF, Wang K, Evjenth R, Xing J, Johnston JJ, Swensen JJ, Johnson WE, Moore B, Huff CD, Bird LM, Carey JC, Opitz JM, Stevens CA, Jiang T, Schank C, Fain HD, Robison R, Dalley B, Chin S, South ST, Pysher TJ, Jorde LB, Hakonarson H, Lillehaug JR, Biesecker LG, Yandell M, Arnesen T, Lyon GJ: Using VAAST to identify an X-linked disorder resulting in lethality in male infants due to N-terminal acetyltransferase deficiency. Am J Hum Genet. 2011, 89: 28-43.View ArticlePubMedPubMed CentralGoogle Scholar
- Biesecker LG, Green RC: Diagnostic clinical genome and exome sequencing. N Engl J Med. 2014, 370: 2418-2425.View ArticlePubMedGoogle Scholar
- Patel CJ, Sivadas A, Tabassum R, Preeprem T, Zhao J, Arafat D, Chen R, Morgan AA, Martin GS, Brigham KL, Butte AJ, Gibson G: Whole genome sequencing in support of wellness and health maintenance. Genome Med. 2013, 5: 58-View ArticlePubMedPubMed CentralGoogle Scholar
- O'Rawe JA, Fang H, Rynearson S, Robison R, Kiruluta ES, Higgins G, Eilbeck K, Reese MG, Lyon GJ: Integrating precision medicine in the study and clinical treatment of a severely mentally ill person. Peer J. 2013, 1: e177-View ArticlePubMedPubMed CentralGoogle Scholar
- Chen R, Mias GI, Li-Pook-Than J, Jiang L, Lam HYK, Chen R, Miriami E, Karczewski KJ, Hariharan M, Dewey FE, Cheng Y, Clark MJ, Im H, Habegger L, Balasubramanian S, O'Huallachain M, Dudley JT, Hillenmeyer S, Haraksingh R, Sharon D, Euskirchen G, Lacroute P, Bettinger K, Boyle AP, Kasowski M, Grubert F, Seki S, Garcia M, Whirl-Carrillo M: Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell. 2012, 148: 1293-1307.View ArticlePubMedPubMed CentralGoogle Scholar
- Hood L, Rowen L: The human genome project: big science transforms biology and medicine. Genome Med. 2013, 5: 79-View ArticlePubMedPubMed CentralGoogle Scholar
- Tarczy-Hornoch P, Amendola L, Aronson SJ, Garraway L, Gray S, Grundmeier RW, Hindorff LA, Jarvik G, Karavite D, Lebo M, Plon SE, Van Allen E, Weck KE, White PS, Yang Y: A survey of informatics approaches to whole-exome and whole-genome clinical reporting in the electronic health record. Genet Med. 2013, 15: 824-832.View ArticlePubMedPubMed CentralGoogle Scholar
- Lyon GJ, Wang K: Identifying disease mutations in genomic medicine settings: current challenges and how to accelerate progress. Genome Med. 2012, 4: 58-View ArticlePubMedPubMed CentralGoogle Scholar
- O'Rawe J, Jiang T, Sun G, Wu Y, Wang W, Hu J, Bodily P, Tian L, Hakonarson H, Johnson WE, Wei Z, Wang K, Lyon GJ: Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med. 2013, 5: 28-View ArticlePubMedPubMed CentralGoogle Scholar
- Dewey FE, Grove ME, Pan C, Goldstein BA, Bernstein JA, Chaib H, Merker JD, Goldfeder RL, Enns GM, David SP, Pakdaman N, Ormond KE, Caleshu C, Kingham K, Klein TE, Whirl-Carrillo M, Sakamoto K, Wheeler MT, Butte AJ, Ford JM, Boxer L, Ioannidis JP, Yeung AC, Altman RB, Assimes TL, Snyder M, Ashley EA, Quertermous T: Clinical interpretation and implications of whole-genome sequencing. JAMA. 2014, 311: 1035-1045.View ArticlePubMedPubMed CentralGoogle Scholar
- Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M: Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014, 32: 246-251.View ArticlePubMedGoogle Scholar
- Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, Consortium WGS, Wilkie AOM, McVean G, Lunter G: Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet. 2014, 46: 912-918.View ArticlePubMedPubMed CentralGoogle Scholar
- Lupski JR, Belmont JW, Boerwinkle E, Gibbs RA: Clan genomics and the complex architecture of human disease. Cell. 2011, 147: 32-43.View ArticlePubMedPubMed CentralGoogle Scholar
- Lyon GJ, O’Rawe J: Human genetics and clinical aspects of neurodevelopmental disorders. The Genetics of Neurodevelopmental Disorders. Edited by: Mitchell K. 2014, 978-1-118-52488-6-Wiley-Blackwell, OxfordGoogle Scholar
- McClellan J, King M-C: Genetic heterogeneity in human disease. Cell. 2010, 141: 210-217.View ArticlePubMedGoogle Scholar
- Ober C, Vercelli D: Gene-environment interactions in human disease: nuisance or opportunity?. Trends Genet. 2011, 27: 107-115.View ArticlePubMedPubMed CentralGoogle Scholar
- Clerget-Darpoux F, Elston RC: Will formal genetics become dispensable?. Hum Hered. 2013, 76: 47-52.View ArticlePubMedGoogle Scholar
- Weiss KM, Terwilliger JD: How many diseases does it take to map a gene with SNPs?. Nat Genet. 2000, 26: 151-157.View ArticlePubMedGoogle Scholar
- Lyon GJ: Personalized medicine: bring clinical standards to human-genetics research. Nature. 2012, 482: 300-301.View ArticlePubMedGoogle Scholar
- MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, Abecasis GR, Adams DR, Altman RB, Antonarakis SE, Ashley EA, Barrett JC, Biesecker LG, Conrad DF, Cooper GM, Cox NJ, Daly MJ, Gerstein MB, Goldstein DB, Hirschhorn JN, Leal SM, Pennacchio LA, Stamatoyannopoulos JA, Sunyaev SR, Valle D, Voight BF, Winckler W, Gunter C: Guidelines for investigating causality of sequence variants in human disease. Nature. 2014, 508: 469-476.View ArticlePubMedPubMed CentralGoogle Scholar
- Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB: Characterizing and measuring bias in sequence data. Genome Biol. 2013, 14: R51-View ArticlePubMedPubMed CentralGoogle Scholar
- Clark MJ, Chen R, Lam HYK, Karczewski KJ, Chen R, Euskirchen G, Butte AJ, Snyder M: Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011, 29: 908-914.View ArticlePubMedPubMed CentralGoogle Scholar
- Lam HY, Clark MJ, Chen R, Chen R, Natsoulis G, O'Huallachain M, Dewey FE, Habegger L, Ashley EA, Gerstein MB, Butte AJ, Ji HP, Snyder M: Performance comparison of whole-genome sequencing platforms. Nat Biotechnol. 2012, 30: 78-82. 10.1038/nbt.2065.View ArticlePubMed CentralGoogle Scholar
- Linderman M, Brandt T, Edelmann L, Jabado O, Kasai Y, Kornreich R, Mahajan M, Shah H, Kasarskis A, Schadt E: Analytical validation of whole exome and whole genome sequencing for clinical applications. BMC Med Genomics. 2014, 7: 20-View ArticlePubMedPubMed CentralGoogle Scholar
- Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011, 12: 745-755.View ArticlePubMedGoogle Scholar
- Bamshad MJ, Shendure JA, Valle D, Hamosh A, Lupski JR, Gibbs RA, Boerwinkle E, Lifton RP, Gerstein M, Gunel M, Mane S, Nickerson DA: The Centers for Mendelian Genomics: a new large-scale initiative to identify the genes underlying rare Mendelian conditions. Am J Med Genet A. 2012, 158A: 1523-1525.View ArticlePubMedGoogle Scholar
- Eisenberger T, Neuhaus C, Khan AO, Decker C, Preising MN, Friedburg C, Bieg A, Gliem M, Issa PC, Holz FG, Baig SM, Hellenbroich Y, Galvez A, Platzer K, Wollnik B, Laddach N, Ghaffari SR, Rafati M, Botzenhart E, Tinschert S, Börger D, Bohring A, Schreml J, Körtge-Jung S, Schell-Apacik S, Bakur K, Al-Aama JY, Neuhann T, Herkenrath P, Nürnberg G: Increasing the yield in targeted next-generation sequencing by implicating CNV analysis, non-coding exons and the overall variant load: the example of retinal dystrophies. PLoS ONE. 2013, 8: e78496-View ArticlePubMedPubMed CentralGoogle Scholar
- Cech Thomas R, Steitz Joan A: The noncoding RNA revolution trashing Old rules to forge new ones. Cell. 2014, 157: 77-94.View ArticlePubMedGoogle Scholar
- Li S, Mason CE: The pivotal regulatory landscape of RNA modifications. Annu Rev Genomics Hum Genet. 2014, 15: 127-150.View ArticlePubMedGoogle Scholar
- Metzker ML: Sequencing technologies - the next generation. Nat Rev Genet. 2010, 11: 31-46.View ArticlePubMedGoogle Scholar
- Zhu M, Need AC, Han Y, Ge D, Maia JM, Zhu Q, Heinzen EL, Cirulli ET, Pelak K, He M, Ruzzo EK, Gumbs C, Singh A, Feng S, Shianna KV, Goldstein DB: Using ERDS to infer copy-number variants in high-coverage genomes. Am J Hum Genet. 2012, 91: 408-421.View ArticlePubMedPubMed CentralGoogle Scholar
- Meynert A, Ansari M, FitzPatrick D, Taylor M: Variant detection sensitivity and biases in whole genome and exome sequencing. BMC Bioinformatics. 2014, 15: 247-View ArticlePubMedPubMed CentralGoogle Scholar
- Mullaney JM, Mills RE, Pittard WS, Devine SE: Small insertions and deletions (INDELs) in human genomes. Hum Mol Genet. 2010, 19: R131-R136.View ArticlePubMedPubMed CentralGoogle Scholar
- Mills RE, Pittard WS, Mullaney JM, Farooq U, Creasy TH, Mahurkar AA, Kemeza DM, Strassler DS, Ponting CP, Webber C, Devine SE: Natural genetic variation caused by small insertions and deletions in the human genome. Genome Res. 2011, 21: 830-839.View ArticlePubMedPubMed CentralGoogle Scholar
- Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, Pittard WS, Devine SE: An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 2006, 16: 1182-1190.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H: Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014, 30: 2843-2851.View ArticlePubMedPubMed CentralGoogle Scholar
- Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D: Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 2013, 41: e32-View ArticlePubMedPubMed CentralGoogle Scholar
- Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012, 44: 226-232.View ArticlePubMedPubMed CentralGoogle Scholar
- Narzisi G, O’Rawe JA, Iossifov I, Fang H, Lee Y-h, Wang Z, Wu Y, Lyon GJ, Wigler M, Schatz MC: Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods. 2014, 11: 1033-1036.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H: Aligning sequence reads, clone sequences and assembly contigs with BWAMEM. arXiv. 2013, 1303.3997
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079.View ArticlePubMedPubMed CentralGoogle Scholar
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011, 43: 491-498.View ArticlePubMedPubMed CentralGoogle Scholar
- Iossifov I, Ronemus M, Levy D, Wang Z, Hakker I, Rosenbaum J, Yamrom B, Lee YH, Narzisi G, Leotta A, Kendall J, Grabowska E, Ma B, Marks S, Rodgers L, Stepansky A, Troge J, Andrews P, Bekritsky M, Pradhan K, Ghiban E, Kramer M, Parla J, Demeter R, Fulton LL, Fulton RS, Magrini VJ, Ye K, Darnell JC, Darnell RB: De novo gene disruptions in children on the autistic spectrum. Neuron. 2012, 74: 285-299.View ArticlePubMedPubMed CentralGoogle Scholar
- The Sequence Read Archive. , http://www.ncbi.nlm.nih.gov/sra/
- The National Database for Autism Research., http://ndar.nih.gov/
- The Simons Foundation Autism Research Initiative., http://sfari.org/
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26: 841-842.View ArticlePubMedPubMed CentralGoogle Scholar
- Gymrek M, Golan D, Rosset S, Erlich Y: lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 2012, 22: 1154-1162.View ArticlePubMedPubMed CentralGoogle Scholar
- Willems TF, Gymrek M, Highnam G, The 1000 Genomes Project Consortium, Mittelman D, Erlich Y: The landscape of human STR variation.Genome Res 2014. doi:10.1101/gr.177774.114.,
- García-Alcalde F, Okonechnikov K, Carbonell J, Cruz LM, Götz S, Tarazona S, Dopazo J, Meyer TF, Conesa A: Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics. 2012, 28: 2678-2679.View ArticlePubMedGoogle Scholar
- Hunter JD: Matplotlib: A 2D Graphics Environment. Comput Sci Eng. 2007, 9: 90-95. 10.1109/MCSE.2007.55.View ArticleGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-359.View ArticlePubMedPubMed CentralGoogle Scholar
- Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella KV, Altshuler D, Gabriel S, DePristo MA: From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013, 11: 11-10:22.214.171.124.33PubMedPubMed CentralGoogle Scholar
- Kidd JM, Sampas N, Antonacci F, Graves T, Fulton R, Hayden HS, Alkan C, Malig M, Ventura M, Giannuzzi G, Kallicki J, Anderson P, Tsalenko A, Yamada NA, Tsang P, Kaul R, Wilson RK, Bruhn L, Eichler EE: Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods. 2010, 7: 365-371.View ArticlePubMedPubMed CentralGoogle Scholar
- Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH: Accurate and comprehensive sequencing of personal genomes. Genome Res. 2011, 21: 1498-1505.View ArticlePubMedPubMed CentralGoogle Scholar
- McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, Clouser CR, Duncan C, Ichikawa JK, Lee CC, Zhang Z, Ranade SS, Dimalanta ET, Hyland FC, Sokolsky TD, Zhang L, Sheridan A, Fu H, Hendrickson CL, Li B, Kotler L, Stuart JR, Malek JA, Manning JM, Antipova AA, Perez DS, Moore MP, Hayashibara KC, Lyons MR, Beaudoin RE: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res. 2009, 19: 1527-1541.View ArticlePubMedPubMed CentralGoogle Scholar
- Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y-J, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872-876.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.