Exploring the unknown: assumptions about allelic architecture and strategies for susceptibility variant discovery

Identification of common-variant associations for many common disorders has been highly effective, but the loci detected so far typically explain only a small proportion of the genetic predisposition to disease. Extending explained genetic variance is one of the major near-term goals of human genetic research. Next-generation sequencing technologies offer great promise, but optimal strategies for their deployment remain uncertain, not least because we lack a clear view of the characteristics of the variants being sought. Here, I discuss what can and cannot be inferred about complex trait disease architecture from the information currently available and review the implications for future research strategies.

However, for most common traits the proportion of the overall phenotypic variance explained remains small, limiting the extent to which prediction of individual disease risk is possible. There is growing speculation about the mechanisms that might account for the substantial proportion of trait heritability that remains to be characterized [7]. This speculation has repercussions well beyond recondite theoretical discussion about the genetic architecture of complex traits. With advances in technology (particularly next-generation sequencing) and growing enthusiasm for funding large-scale gene discovery efforts, hypotheses about the nature of this so-called 'genetic dark matter' [7] have a direct bearing on research strategies. Recently, this debate has seemed increasingly polarized between those who feel a continued search for common susceptibility variants is of limited value, because all that remains to be found are variants of vanishingly small effect [8], and those who feel that, pending reductions in costs that will allow high-quality, whole-genome sequence data to be generated in adequately powered sample sizes, there is virtue in persisting with an approach of proven worth [9].
There is good reason to assume that this 'dark matter' is neither an illusion created by inflated estimates of heritability nor the consequence of marked non-additivity of effects [10,11]. If so, then the sum total of genetic variance should largely be explicable in terms of the main effects of all the risk alleles of various types (single nucleotide polymorphisms, indels, copy number variants (CNVs) and inversions), allele frequencies (rare, low-frequency and common) and effect sizes. So far, the only parts of this 'space' explored systematically are those occupied by rare, penetrant alleles (principally through linkage analysis of monogenic phenotypes) and common, mostly low-effect alleles (accessible through GWA analysis). As we seek to make sensible decisions about the direction of future discovery efforts -in terms of the characteristics of the variants we are seeking and the technologies we should use to find them -we need to understand what the exploration of the 'known' genetic landscape can tell us about the parts that remain largely uncharted.

Contrasting views of the genetic landscape
One long-standing view is that complex trait susceptibility is predominantly a matter of common variants [12]. Common variants collectively account for most individual variation in DNA sequence, and the same might be expected to be the case for phenotypic variation. If true, the results of GWA studies so far indicate that most of the as-yet-undiscovered variants must (in Europeans at least) have very small effects, because the high coverage and large sample sizes used will have left few, if any, large common-variant effects undiscovered. Evidence (for example, from large-scale meta-analyses [13] traits, consistent with the notion of a long 'polygenic tail' of small effects, but it remains unclear how much of overall heritability can be explained under this model. The idea that complex-trait susceptibility involves a very large number of variants of modest effect has led some to suggest that the value of all such discoveries is diminished, on the basis that one learns little about the biology of disease if too many genes are implicated [8]. However, for many phenotypes, the overall salience of the loci of greatest effect emerging from GWA studies (the pathways implicated and the relationships to monogenic forms of the same traits) argues forcefully against such a nihilistic interpretation [9,13,14].
The contrasting viewpoint holds that common-trait susceptibility derives mostly from the action of rare or lowfrequency variants [15,16]. Although such variants account for less individual sequence variation than common variants, there may be a disproportionate effect on disease susceptibility. The more recent origin of low-frequency variants may allow alleles with more dramatic phenotypic effects to be represented in the population. Also, largeeffect alleles may cause phenotypic disturbances that are not as easily buffered by compensatory changes during development as are well tolerated, small-effect, commonvariant alleles. Recent evidence that large, rare CNVs are associated with behavioral and psychiatric disease phenotypes [5,17,18] supports this view. Some argue that such a rare variant architecture is precisely what one would expect for diseases causing low reproductive fitness, though this rationalization fails to explain the high yield of common-variant signals reported for other diseases, such as type 1 diabetes, that were, until recently, fatal during early life [19]. It has even been suggested that many of the common-variant associations discovered by recent GWA studies may turn out to be due to the concerted action of multiple low-frequency and rare causal variants. The NOD2 (CARD15) signal for Crohn's disease indicates that this is certainly possible [20]. For many diseases, however, evidence that common-variant associations are consistent across multiple ethnic groups [21] represents a strong counter to such a model: one would expect the linkage disequilibrium patterns around recent rare and lowfrequency causal variants to result in far more inter-ethnic heterogeneity than is actually observed.

The best of both worlds
Although both extreme positions have merit, the likelihood is that, for most diseases, the architecture of predisposition features causal variants that have a wide range of allele frequencies and effect sizes. For most complex traits, the absence of compelling signals from linkage studies conducted in families segregating multifactorial diseases imposes an upper bound to feasible effect sizes; even so, it is easy to show that a limited number of low-frequency susceptibility alleles of medium effect could go a long way to explaining missing heritability. For example, the effect of a low-frequency variant with a population minor allele frequency of 1% and a per-allele odds ratio of 3, when measured in terms of sibling relative risk (a commonly used measure of familial aggregation), exceeds that of the largest common-variant effect known for type 2 diabetes (around TCF7L2). Twenty such variants across the genome would account for most of the unexplained heritability for this condition. Such a constellation of variants could provide a respectable tool for individual disease prediction, and the variants discovered would (because of their relatively large effect size) be valuable resources for detailed molecular and physiological study. The extent to which variants with these characteristics are segregating in the population remains unknown, but this is an area in which the combination of next-generation sequencing techno logies and large-scale association analysis provides a powerful stimulus to discovery. Early results of this approach (such as the identification of low-frequency variants within the IFIH1 gene that have a marked effect on type 1 diabetes susceptibility) are encouraging [22].

Strategy and the 'lumpiness' of the genome
Ultimately, we can expect large-scale, high-depth, genomewide sequencing to enable the systematic exploration of the entire allele-frequency, effect-size space and provide empirical resolution of many of these issues. However, there remain serious financial, logistical and analytical barriers to the implementation of this technology, and the number of such experiments that could be supported by the major funders is, for the time being, limited.
All this means that, for the next few years, the power of nextgeneration sequencing will need to be used carefully if a profusion of underpowered discovery efforts is to be avoided. Efforts targeted to specific genomic regions (around particular candidate genes or pathways or exons across the genome, for example) are attractive because high coverage of the selected areas in large sample sizes can be generated at reasonable cost. Whole-genome se quen cing will, for now, be restricted to low-pass coverage across respectable sample sizes, or high-depth coverage in smaller, highly selected, phenotypically extreme sample sets.
The genomic distribution of disease-effect loci will have a major impact on the success of these alternative approaches (Figure 1). If the low-frequency and rare variants influencing a given trait are disproportionately located in the same loci as the common variants that have been found to date, then targeted follow-up of regions revealed by GWA studies will be a powerful approach, and extending the range and scope of GWA analysis (to other ethnic groups, for example) should be a particularly efficient strategy. If, on the other hand, the 'dark matter' variants have little positional (or biological) overlap with those already known, then genome-wide resequencing is likely to be the only practical way to find them. The evidence so far (overlap between monogenic and multifactorial loci; growing numbers of loci with multiple independent association signals; extensive pleiotropy, and so on [23,24]) provides some support for the former view. Effort in tracking down common susceptibility variants, as well as being valuable in its own right, should therefore guide researchers towards other types of causal variants.

Letting several well designed flowers bloom
With only limited empirical data to guide future locusdiscovery efforts, extrapolation from the modest proportion of genetic variance so far explained is fraught with danger. The menu of possible research strategies is large, but each choice makes some implicit assumption about the characteristics of the variants being sought and the genomic architecture of the disease under consideration. Given uncertainties over the true state of nature, it is difficult to say which approaches will be most productive. This argues for open minds, a healthy disdain for orthodoxy, and careful exploration of the technological and methodological options. At the same time, it is important that the next wave of large-scale discovery efforts is designed so as to test assumptions about trait architecture and technological performance so that lessons of generic value to the field can be learned.

Competing interests
The author declares that he has no competing interests.

Figure 1
Causal variant signals and their genomic distribution. Two possible versions of the state of nature are presented (see text). In one ('even'), causal variants differing in terms of allele frequency (color scale) and effect size (height of bar) are distributed randomly across the genome: the location of common-variant (red/orange) associations of modest effect provides no guide to the location of lower-frequency variants (yellow/green), some of which have quite large effects. In the other ('lumpy'), causal variants congregate around certain genomic positions ('genes'): GWA studies that reveal the location of the common-variant associations will also reveal the positions of lower-frequency variants, and the proportion of disease biology explained by the loci discovered through GWA studies will be far greater than the proportion of variance explained would suggest.