There are several potential pitfalls in the construction of PRS that could affect how they perform in real-world clinical populations. One of the most obvious is that they suffer from the same bias that affects most genetics research: a lack of diversity in the populations recruited for genetic studies [8]. Until recently, over 80% of participants in genetic studies have been of European descent, 14% have been Asian, and just 6% have been from other populations [8]. Disease-associated alleles can have significantly different frequencies between populations as the result of demographic events, such as migrations and population bottlenecks, and these differences can lead to discovery bias. In addition, the linkage disequilibrium-based pruning or adjustments performed as part of the construction of the PRS [3] can introduce further bias, because reference haplotype panels for diverse populations are limited. Accordingly, Martin et al. [9] reported that PRS derived from European-based GWAS show biases in different, often unpredictable, directions when tested in non-European cohorts.
A recent report from Kim et al. [10] confirms not only that PRS derived from GWAS of European-ancestry samples can misestimate risk when applied to other populations, but also that the very tools used to genotype the GWAS samples contain bias and contribute significantly to the misestimation of disease risk across populations. These researchers first showed that disease allele frequencies for loci in the National Human Genome Research Institute (NHGRI) catalog of published GWAS differ significantly between Europeans and other populations sampled in the 1000 Genomes Project. Second, they observed that Africans exhibit significantly higher risk allele frequencies, a difference that is larger for ancestral risk alleles (i.e., the allele sequence present in hominid common ancestors) than for derived risk alleles (i.e., sequences that arose in the human population more recently). When risk alleles are binned into disease categories, those diseases with a higher proportion of causal ancestral alleles show elevated average risk allele frequencies in Africa. This skew in risk allele frequencies is sometimes discordant with known differences in disease prevalence between populations (e.g., for cardiovascular disease, African-Americans have a higher incidence but a PRS showed lower risk for Africans), implying that genetic disease risks may be misestimated, most significantly for individuals with African ancestry.
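To make the link between risk allele frequencies and score-based risk estimates concrete, the following is a minimal sketch, assuming the usual additive form of a PRS (a weighted sum of risk-allele counts). The weights and allele frequencies are invented for illustration and are not taken from Kim et al. [10] or from any GWAS; the function names are likewise hypothetical.

```python
# Toy illustration only: all weights and allele frequencies are invented.

def polygenic_score(genotypes, weights):
    """Additive score: per-locus weight times risk-allele count (0, 1 or 2)."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(w * g for w, g in zip(weights, genotypes))


def expected_score(risk_allele_freqs, weights):
    """Population mean score: the expected risk-allele count at a locus
    with risk allele frequency p is 2 * p."""
    return sum(w * 2.0 * p for w, p in zip(weights, risk_allele_freqs))


# Hypothetical per-allele effect sizes (e.g., log odds ratios) for three loci.
weights = [0.10, 0.05, 0.20]

# Hypothetical risk allele frequencies at the same loci in two populations.
freqs_pop_a = [0.20, 0.40, 0.10]
freqs_pop_b = [0.35, 0.55, 0.25]

mean_a = expected_score(freqs_pop_a, weights)
mean_b = expected_score(freqs_pop_b, weights)

# Score for one hypothetical individual carrying 0, 1 and 2 risk alleles.
individual = polygenic_score([0, 1, 2], weights)

print(f"mean score, population A: {mean_a:.3f}")
print(f"mean score, population B: {mean_b:.3f}")
print(f"score, example individual: {individual:.3f}")
```

With identical weights, the population with higher risk allele frequencies receives a higher mean score. Whether that shift reflects a true difference in disease risk depends on whether the same weights and loci are appropriate for both populations, which is exactly what the discordance with observed prevalence calls into question.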
Furthermore, the commercial single nucleotide polymorphism (SNP) genotyping arrays used in GWAS have a strong ascertainment bias, as these SNPs were selected from the sequencing data of a small sample of individuals, mostly of European descent. Through simulations, Kim et al. [10] show that this ascertainment bias alone can cause disease risks to be misestimated. On the other hand, simulations using whole-genome sequencing show much smaller (although not completely eliminated) biases in allele frequency differences between Africans and non-Africans, particularly when sample sizes increase. These results suggest that performing GWAS in more diverse samples, which include participants from around the world, is not sufficient on its own to reduce discovery bias [8], because performing such studies with standard commercial SNP arrays would still result in biases. This is an important insight, as SNP arrays are inexpensive and genetic studies planned around the world are cost-constrained. Performing whole-genome sequencing in place of using SNP arrays would alleviate the ascertainment bias problem, but would increase costs by orders of magnitude. How might we resolve this dilemma?
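The ascertainment effect described above can be illustrated with a toy calculation. This is a sketch under stated assumptions, not a reproduction of the simulations in Kim et al. [10]: it assumes a SNP is placed on an array only if it is polymorphic in a small discovery panel drawn from one population, and it uses three invented SNPs whose allele frequencies are mirrored between two hypothetical populations so that the unascertained comparison is balanced by construction.

```python
# Toy model of SNP ascertainment; all frequencies and the panel size are invented.

def prob_polymorphic(p, n_chromosomes):
    """Probability that a variant with allele frequency p is seen as
    polymorphic (both alleles present) in a panel of n sampled chromosomes."""
    return 1.0 - p ** n_chromosomes - (1.0 - p) ** n_chromosomes


def heterozygosity(p):
    """Expected heterozygosity of a biallelic variant with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def mean_heterozygosity(freqs, weights):
    """Weighted mean heterozygosity across variants."""
    total = sum(weights)
    return sum(w * heterozygosity(p) for w, p in zip(weights, freqs)) / total


# Hypothetical (frequency in population A, frequency in population B) pairs.
snps = [(0.50, 0.05), (0.05, 0.50), (0.30, 0.30)]
freqs_a = [a for a, _ in snps]
freqs_b = [b for _, b in snps]

# No ascertainment (as with sequencing): every variant counts equally.
uniform = [1.0] * len(snps)
seq_a = mean_heterozygosity(freqs_a, uniform)
seq_b = mean_heterozygosity(freqs_b, uniform)

# Array-style ascertainment: each variant is weighted by its chance of being
# discovered in a panel of 4 chromosomes sampled from population A only.
panel_size = 4
array_weights = [prob_polymorphic(a, panel_size) for a in freqs_a]
array_a = mean_heterozygosity(freqs_a, array_weights)
array_b = mean_heterozygosity(freqs_b, array_weights)

print(f"no ascertainment:  A = {seq_a:.3f}, B = {seq_b:.3f}")
print(f"ascertained in A:  A = {array_a:.3f}, B = {array_b:.3f}")
```

In this toy setting the two populations look equally variable when every variant is counted, but the ascertained set over-represents variants that are common in the discovery population, so the same two populations appear to differ. The allele frequency spectrum seen through the array is therefore distorted before any association testing or risk scoring takes place.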