To some extent, this comes down to the question of whether it is prediction or classification that matters. For the millions of individuals who are assembling health action plans, and indulging in the self-knowledge provided by software applications that aim to promote personal fitness, it is more about skewing the odds than predicting the future. Given a prediction, whether or not someone ever gets diabetes or coronary disease is less important than whether the knowledge they have gained influences their behavior in ways that make these diseases less likely. In this light, knowing the aspects of your health in which you are genetically at higher risk than the rest of the population might well be just the incentive you need to keep jogging, to strive for more alcohol-free days, or to screen for cancer more regularly.
What if the millions of sequences came with detailed phenotype data? For example, this could comprise disease status for a range of common diseases and measures of quantitative traits that are risk factors for disease. This is not a far-fetched scenario. The Kaiser Permanente and University of California, San Francisco (UCSF) collaboration [8] has obtained detailed phenotype and genotype data for over 100,000 people, and earlier this year, it was announced that the UK Biobank sample of 500,000 people will be genotyped using a single-nucleotide polymorphism (SNP) array [9]. Phenotype and sequence data for a million people will allow the discovery of more risk and trait variants and the creation of multiple-variant profiles that can be used for prediction. But is a million sequences enough? For diseases with a prevalence of about 1%, there will be 10,000 cases among the million. Larger GWAS samples already exist for some diseases, such as Crohn’s disease and schizophrenia. These have identified tens to hundreds of risk variants, yet polygenic profiles explain only a modest proportion of risk in the population, even though they can do better than self-reported ‘family history’ [10]. Using sequence data, rather than relying solely on common variants from GWAS, will improve the prediction of disease by capturing variation resulting from lower-frequency risk variants. With millions of genomes sequenced, the limiting factor in disease prediction for many traits is likely to be imperfect information on environmental effects. The question is therefore not ‘how well can we predict disease’ but ‘how can we incorporate probabilistic predictions of disease in personal or clinical decision making’. There are plenty of challenges on the way, including the generation of accurate sequence data, getting all these data together for analysis, and the statistical analysis of millions of genome sequences. Then, there will be the practical challenges of disseminating the results, not to mention encouraging people to act on them for the benefit of their health.
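The case-count arithmetic above (a prevalence of about 1% yielding 10,000 cases among a million genomes) can be made explicit with a small sketch. It assumes that the sequenced cohort behaves like a simple random sample of the population and that disease status is independent across individuals, which real cohorts will only approximate. The function names are hypothetical, and the prevalences other than 1% are illustrative values rather than figures from the text.

```python
import math


def expected_cases(cohort_size: int, prevalence: float) -> float:
    """Expected number of affected individuals in a cohort.

    Assumes the cohort is a simple random sample, so the expected
    count is simply cohort_size * prevalence.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    return cohort_size * prevalence


def cases_std_dev(cohort_size: int, prevalence: float) -> float:
    """Binomial standard deviation of the case count, sqrt(n * p * (1 - p)).

    Assumes disease status is independent across individuals.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    return math.sqrt(cohort_size * prevalence * (1.0 - prevalence))


if __name__ == "__main__":
    cohort_size = 1_000_000  # the million genomes discussed in the text
    # 0.01 is the 1% example from the text; the other values are illustrative.
    for prevalence in (0.001, 0.01, 0.1):
        mean = expected_cases(cohort_size, prevalence)
        spread = cases_std_dev(cohort_size, prevalence)
        print(f"prevalence {prevalence:.1%}: about {mean:,.0f} cases (sd {spread:,.0f})")
```

Under these assumptions the sampling variation in the number of cases is small relative to the count itself, so for rarer diseases the constraint is the modest absolute number of cases rather than uncertainty about how many there will be.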
For quantitative traits and disease, we can expect major advances in our ability to explain the genetic component of disease risk and thus to predict disease. What we do with that information is a sociological concern with major public health implications, and now is the time to contemplate them.