Study design
The study design is summarized in Fig. 1 and described in detail below. The pipeline validation data set (n = 10) was used to assess imputation accuracy for common genetic variants (Fig. 1a). The technical concordance cohort (n = 184) was used to assess the correlation between three previously published GPSs for CAD [3], BC [10], and AF [3] from lcWGS and genotyping arrays (Fig. 1b). The diverse ancestry data set (n = 120) was used to assess imputation accuracy for common genetic variants and performance of GPSCAD, GPSBC, and GPSAF (Fig. 1b). The clinical cohort (n = 11,502) was used to assess performance of GPSCAD, GPSBC, and GPSAF in a large European population seeking genetic testing (Fig. 1b).
Data set and cohort selection
The pipeline validation data set included seven globally representative samples from 1KGP populations (HG02155, NA12878, HG00663, HG01485, NA21144, NA20510, and NA19420; Additional file 1: Table S1) [8] and a trio of Ashkenazi samples (NA24385, NA24143, and NA24149) from the GIAB Consortium (Fig. 1a) [9].
The technical concordance cohort included DNA samples from 184 individuals whose healthcare provider had ordered a Color multi-gene panel test (Fig. 1b). All individuals (1) had 85% or greater European genetic ancestry calculated using fastNGSadmix [11] using 1KPG as the reference panel, (2) self-identified as “Caucasian,” and (3) did not have pathogenic or likely pathogenic variants in the multi-gene NGS panel test, as previously described [12] (Additional file 2: Supplementary Methods). Demographics are provided in Additional file 1: Table S2. All phenotypic information was self-reported by the individual through an online, interactive health history tool. Of the 184 individuals, 61 individuals reported having a personal history of CAD (defined here as a myocardial infarction or coronary artery bypass surgery), 62 individuals reported no personal history of CAD, and 61 individuals reported no personal history of CAD but were suspected to have a high GPSCAD based on preliminary analysis. This preliminary analysis included imputation from multi-gene panel and off-target sequencing data, which has been shown to have similar association statistics and effect sizes compared to genotyping arrays [4]. These individuals were included in the technical concordance cohort to artificially create a relatively uniform distribution of GPSCAD in the data set. Correlation coefficients between GPSCAD from lcWGS and genotyping array were calculated after removing the 61 individuals who were suspected to have a high GPSCAD based on multi-gene panel and off-target sequencing data to avoid artificial inflation of the correlation coefficient. Two individuals who reported no personal history of CAD but were suspected to have a high GPSCAD failed genotyping (quality control call rate of < 97%) and lcWGS (overall coverage of < 0.5×), leaving a total of 182 individuals for analyses.
The diverse ancestry data set included a total of 120 samples from the following populations from 1KGP: Han Chinese in Beijing, China (CHB); Yoruba in Ibadan, Nigeria (YRI); Gujarati Indian from Houston, Texas (GIH); Americans of African Ancestry in Southwest USA (ASW); Mexican Ancestry from Los Angeles, USA (MXL); and Puerto Ricans from Puerto Rico (PUR) (Additional file 1: Table S3 and Additional file 3: Figure S1) [8]. Four samples, including NA18917 and NA19147 from the YRI population and NA19729 and NA19785 from the MXL population, were below the target 0.5× coverage and removed from analyses.
The clinical cohort included DNA samples from 11,502 individuals whose healthcare provider had ordered a Color multi-gene panel test (Fig. 1b). All individuals (1) had 90% or greater European genetic ancestry calculated using fastNGSadmix [11] using 1KPG as the reference panel; (2) self-identified as “Caucasian”; (3) provided history of whether they had a clinical diagnosis of CAD, BC, or AF, via an online, interactive health history tool; and (4) did not have pathogenic or likely pathogenic variants detected in the multi-gene sequencing panel test, as previously described [12] (Additional file 2: Supplementary Methods). Demographics are provided in Additional file 1: Table S2.
Whole genome sequencing
DNA was extracted from blood or saliva samples and purified using the Perkin Elmer Chemagic DNA Extraction Kit (Perkin Elmer, Waltham, MA) automated on the Hamilton STAR (Hamilton, Reno, NV) and the Chemagic Liquid Handler (Perkin Elmer, Waltham, MA). The quality and quantity of the extracted DNA were assessed by UV spectroscopy (BioTek, Winooski, VT). High molecular weight genomic DNA was enzymatically fragmented and prepared using the Kapa HyperPlus Library Preparation Kit (Roche Sequencing, Pleasanton, CA) automated on the Hamilton Star liquid handler and uniquely tagged with 10 bp dual-unique barcodes (IDT, Coralville, IA). Libraries were pooled together and loaded onto the NovaSeq 6000 (Illumina, San Diego, CA) for 2 × 150 bp sequencing.
For the pipeline validation data set, all samples underwent WGS with mean coverage of 13.22× (range 7.82× to 17.30×); downsampling was then performed using SAMtools [13] to simulate lcWGS. For the technical concordance cohort, all samples underwent lcWGS with mean coverage of 1.24× (range 0.54× to 1.76×). Imputed genotypes were compared with published, high-confidence known genotypes from 1KGP [8] and the GIAB Consortium [9]. For the diverse ancestry data set, all samples underwent lcWGS with mean coverage of 0.89× (range 0.68× to 1.24×). For the clinical cohort, all samples underwent lcWGS with mean coverage of 0.95× (range 0.51× to 2.57×).
Downsampling
For the pipeline validation data set, aligned reads were downsampled using SAMtools [13] to 2.0×, 1.0×, 0.75×, 0.5×, 0.4×, 0.25×, and 0.1× coverage. For the technical concordance cohort, aligned reads were downsampled to 1.0×, 0.75×, 0.5×, 0.4×, 0.25×, and 0.1× coverage. In a few cases in the technical concordance cohort, the primary samples had fewer reads than the target downsample. In those situations, all of the reads were retained. For example, if the primary sample only had 0.8× coverage, when downsampled to 1.0×, all reads were retained. Downsampling was repeated using two independent seeds in SAMtools. Once the downsampled data was generated, the imputation was repeated to generate imputed genotypes using only the downsampled reads.
Imputation site selection
All data sets and cohorts were imputed to a set of autosomal SNP and insertion-deletion (indel) sites from 1KGP with greater than 1% allele frequency in any of the five 1KGP super populations (African, American, East Asian, European, and South Asian) [8], for a total of 21,770,397 sites. This is hereafter referred to as the “imputation SNP loci.” Multi-allelic SNPs and indels were represented as two biallelic markers for imputation.
Genotype likelihood calculations and imputation
Genotype likelihood calculations and imputation were performed independently for each sample. Sequence reads were aligned with the human genome reference GRCh37.p12 using the Burrows-Wheeler Aligner (BWA) [14], and duplicate and low quality reads were removed. Genotype likelihoods were then calculated at each of the biallelic SNP loci in the imputation SNP loci that were covered by one or more sequencing reads called using the mpileup command implemented in bcftools version 1.8 [15]. Indels or multi-allelic sites were not included in this first genotype likelihood calculation. Reads with a minimum mapping alignment quality of 10 or greater and bases with a minimum base quality of 10 or greater were included. Genotype likelihoods at each observed site were then calculated using the bcftools call command with allele information corresponding to the imputation SNP loci. This procedure discarded calls with indels or calls where the observed base did not match either the reference or the expected alternate allele for the SNP locus.
To convert genotype likelihoods into genotype calls at all imputation SNP loci, two distinct calculations were performed. First, genotypes at imputation SNP loci covered by at least one read were inferred. Genotype calling was performed using the genotype likelihood option implemented in BEAGLE 4.1 [14]. This step is a reference-aware genotype calling step and produces posterior probabilities of genotypes only at sites with at least one read. This algorithm is implemented only in BEAGLE 4.1 [16]. This inference used default BEAGLE 4.1 [16] parameters except with a model scale parameter of 2 and the number of phasing iterations to 0. A custom reference panel was constructed for each sample being imputed by selecting the 250 most similar samples to that sample from 1KGP Phase 3 [8] release using Identity-by-State (IBS) comparison. A reference panel size of 250 was selected to best balance imputation run time and accuracy (Additional file 3: Figure S1). To ensure that IBS values were comparable across samples, a set of regions consistently sequenced at high depth (> 20×) across all samples was utilized. Inclusion of related samples in an imputation reference panel can artificially increase imputation accuracy; therefore, when imputation was performed on samples included in 1KGP Phase 3 release, that sample and any first and second degree related samples (as inferred by the 1KGP data release using genetic data) were excluded from the custom reference panel.
To generate genotypes at all of the remaining untyped sites, a second round of imputation was performed using BEAGLE 5.0 [16]. This imputation used default settings and included the full 1KGP as the imputation reference panel [8]. To note, when performing analysis using 1KGP samples [8], any related individuals were removed. Each sample then had imputed genotype calls at each of the imputation SNP loci. Indels and multi-allelic sites were included in this second genotype likelihood calculation.
Genotyping array
DNA was extracted from blood or saliva samples and purified using the Perkin Elmer Chemagic DNA Extraction Kit (Perkin Elmer, Waltham, MA) automated on the Hamilton STAR (Hamilton, Reno, NV) and the Chemagic Liquid Handler (Perkin Elmer, Waltham, MA). The quality and quantity of the extracted DNA were assessed by UV spectroscopy (BioTek, Winooski, VT).
DNA was genotyped on the Axiom UK Biobank Array by Affymetrix (Santa Clara, CA). Genotypes were filtered according to the manufacturer’s recommendations, removing loci with greater than 5% global missingness and those that significantly deviated from the Hardy-Weinberg equilibrium. In addition, all A/T and G/C SNPs were removed due to potential strand inconsistencies. After applying the above quality filtering and filtering for ambiguous SNP sites, 748,187 SNPs out of an original 830,115 polymorphic sites remained. Each of the remaining SNP orientation was aligned with the hg19 reference sequence to correctly code the reference alleles as allele 1, matching the sequencing data.
To generate genotypes at all of the remaining untyped sites, imputation was performed using BEAGLE 5.0 [16]. This imputation used default settings and included the full 1KGP as the imputation reference panel [8]. To note, when performing analysis using 1KGP samples, any related individuals were removed. Each sample then had imputed genotype calls at each of the imputation SNP loci.
Imputation accuracy and quality assessment
Imputation accuracy for 1KGP and GIAB samples was calculated by comparing imputation results with previously released genotypes, excluding regions marked as low confidence by GIAB.
Imputation accuracy on the genotyped samples was assessed on 470,363 sites that were included in the genotyping array and in the imputation SNP loci at different allele frequency buckets: 257,362 sites with greater than 5% allele frequency, 119,978 sites between 1 and 5% allele frequency, and 93,022 sites with less than 1% allele frequency. Imputation quality was assessed through site-specific dosage r2 comparing with genotype values from the genotyping array.
GPS selection
The GPSs for CAD [3], BC [10], and AF [3] were previously published and selected based on their demonstrated ability to accurately predict and stratify disease risk as well as identify individuals at risk comparable to monogenic disease. GPSCAD contained 6,630,150 polymorphisms, GPSBC contained 3820 polymorphisms, and GPSAF contained 6,730,541 polymorphisms. All loci included in these scores were included in the imputation SNP loci.
GPS normalization
In the clinical cohort, raw GPSs were normalized by taking the standardized residual of the predicted score after correction for the first 10 principal components (PCs) of ancestry [17]. PCs were calculated by projecting lcWGS samples into 10 dimensional PC analyses (PCAs) space using the LASER program [18]. A combination of samples from 1KGP [8] and the Human Origins [19] project were used as a reference for the projection.