The following additional data are available with the online version of this paper.
Additional file 1 contains one table and 10 supplementary figures:
Figure S1, Family pedigrees for the 15 sequenced exomes. Of the 15 exomes that were sequenced, 14 came from families chosen for future disease-discovery work. Each sequenced individual (numbered) is displayed in the context of his or her family pedigree.
Figure S2, Fraction of target capture region covered versus coverage depth for 15 exomes. All exomes have at least 20 reads per base pair in > 80% of the 44 Mb target region.
Figure S3, Histograms of Illumina read depth at SNV coordinates. Read depth was taken from each pipeline's independently aligned BAM file at the genomic coordinates of SNVs called by each of the 5 alignment and variant calling pipelines: A) SOAPsnp, B) SNVer, C) SAMtools, D) GNUMAP and E) GATK. Frequencies of read depths were plotted for all SNVs (A, B, C, D, and E) as well as for SNVs having depths between 0 and 50 (a, b, c, d, and e).
Figure S4, SNV concordance measured at varying Illumina read depth threshold values. SNV concordance for a single exome, "k8101-49685", between five alignment and variant detection pipelines: GATK, SOAPsnp, SNVer, SAMtools, and GNUMAP. Concordance between pipelines was determined by matching the genomic coordinate as well as the base pair change and zygosity for each detected SNV. Concordance was measured at varying Illumina read depth threshold values in each independently aligned BAM file, ranging from > 0 (no threshold) to > 30 reads.
Figure S5, Histograms of Illumina read depth at genomic coordinates of SNV calls unique to Complete Genomics. Histograms of read depth taken from each of the five Illumina pipelines' independently aligned BAM files at genomic coordinates of SNVs that were found by Complete Genomics but not by any of the 5 Illumina pipelines: A) GATK, B) GNUMAP, C) SNVer, D) SAMtools and E) SOAPsnp. All coordinates fell within the range of the Agilent SureSelect v.2 exons.
Figure S6, SNV concordance for a single exome, "k8101-49685", between two sequencing pipelines: Illumina and Complete Genomics. For the Illumina sequencing, exons were captured using the Agilent SureSelect v.2 panel of capture probes. Complete Genomics SNVs consist of a subset of all SNVs called by CG that fell within the Agilent SureSelect v.2 exons. Concordance was determined by matching the genomic coordinates, base pair composition, and zygosity status for each detected SNV. Concordance was measured between CG SNVs and A) the union of all SNVs called by 5 variant calling pipelines ("Illumina-data calls") and B) only SNVs that all 5 Illumina pipelines collectively called ("concordant Illumina-data calls").
Figure S7, Cross-validation of Illumina SNV calls using Complete Genomics SNV calls. SNVs called by each Illumina-data pipeline were cross-validated using SNVs called by Complete Genomics, an orthogonal sequencing technology, in sample "k8101-49685". The percentage of Illumina SNVs that were validated by CG sequencing was measured for variants having varying degrees of Illumina-data pipeline concordance. The same analysis was performed for variants that were considered novel (absent in dbSNP135).
Figure S8, Average indel concordance among 15 exomes between three indel-detecting pipelines: GATK, SAMtools and SOAPindel. Concordance was measured between raw, pre-standardized indel calls. Indels were considered in agreement if the genomic coordinates, length and composition of indels matched between pipelines.
Figure S9, Cross-validation of Illumina indel calls using Complete Genomics indel calls. Indels called by each Illumina-data pipeline were cross-validated using indels called by Complete Genomics for sample "k8101-49685". The percentage of Illumina indels that were validated by CG sequencing was measured across varying degrees of Illumina pipeline concordance. The same analysis was done for novel indels (indels not found in dbSNP135).
Figure S10, A comparison between recent versions of various GATK variant calling modules. The similarity between SNV and indel calls made by two versions of GATK, v1.5 and v2.3-9, was measured. SNV and indel calls were made using both the UnifiedGenotyper and HaplotypeCaller modules on the same k8101-49685 participant sample. Pairwise comparisons were made between the GATK UnifiedGenotyper v1.5 and each of the GATK v2.3-9 modules (the UnifiedGenotyper and HaplotypeCaller).
Table S1, Concordance rates with common SNPs genotyped on Illumina 610K genotyping chips.
Additional file 2 contains command-line arguments for the bioinformatics pipelines and instructions for accessing the data analyzed in this paper.
Additional file 3 contains data production statistics.