Exploring the unknown: assumptions about allelic architecture and strategies for susceptibility variant discovery
© BioMed Central Ltd 2009
Published: 3 July 2009
Identification of common-variant associations for many common disorders has been highly effective, but the loci detected so far typically explain only a small proportion of the genetic predisposition to disease. Extending explained genetic variance is one of the major near-term goals of human genetic research. Next-generation sequencing technologies offer great promise, but optimal strategies for their deployment remain uncertain, not least because we lack a clear view of the characteristics of the variants being sought. Here, I discuss what can and cannot be inferred about complex trait disease architecture from the information currently available and review the implications for future research strategies.
Genome-wide association (GWA) analysis has provided the first effective strategy to allow a systematic dissection of the genetic basis of common, complex, multifactorial traits [1, 2]. Several hundred loci have been identified to stringent levels of significance. Although for many of these we remain some distance from a complete enumeration of causal mechanisms, there have already been substantial advances in understanding of disease: the role of autophagy in inflammatory bowel disease and of cell adhesion in autism [5, 6], for instance.
However, for most common traits the proportion of the overall phenotypic variance explained remains small, limiting the extent to which prediction of individual disease risk is possible. There is growing speculation about the mechanisms that might account for the substantial proportion of trait heritability that remains to be characterized.
This speculation has repercussions well beyond recondite theoretical discussion about the genetic architecture of complex traits. With advances in technology (particularly next-generation sequencing) and growing enthusiasm for funding large-scale gene discovery efforts, hypotheses about the nature of this so-called 'genetic dark matter' have a direct bearing on research strategies. Recently, this debate has seemed increasingly polarized between those who feel a continued search for common susceptibility variants is of limited value, because all that remains to be found are variants of vanishingly small effect, and those who feel that, pending reductions in costs that will allow high-quality, whole-genome sequence data to be generated in adequately powered sample sizes, there is virtue in persisting with an approach of proven worth.
There is good reason to assume that this 'dark matter' is neither an illusion created by inflated estimates of heritability nor the consequence of marked non-additivity of effects [10, 11]. If so, then the sum total of genetic variance should largely be explicable in terms of the main effects of all the risk alleles of various types (single nucleotide polymorphisms, indels, copy number variants (CNVs) and inversions), allele frequencies (rare, low-frequency and common) and effect sizes. So far, the only parts of this 'space' explored systematically are those occupied by rare, penetrant alleles (principally through linkage analysis of monogenic phenotypes) and common, mostly low-effect alleles (accessible through GWA analysis). As we seek to make sensible decisions about the direction of future discovery efforts - in terms of the characteristics of the variants we are seeking and the technologies we should use to find them - we need to understand what the exploration of the 'known' genetic landscape can tell us about the parts that remain largely uncharted.
One long-standing view is that complex trait susceptibility is predominantly a matter of common variants. Common variants collectively account for most individual variation in DNA sequence, and the same might be expected to be the case for phenotypic variation. If true, the results of GWA studies so far indicate that most of the as-yet-undiscovered variants must (in Europeans at least) have very small effects, because the high coverage and large sample sizes used will have left few, if any, large common-variant effects undiscovered. Evidence (for example, from large-scale meta-analyses) is, for many traits, consistent with the notion of a long 'polygenic tail' of small effects, but it remains unclear how much of overall heritability can be explained under this model. The idea that complex-trait susceptibility involves a very large number of variants of modest effect has led some to suggest that the value of all such discoveries is diminished, on the basis that one learns little about the biology of disease if too many genes are implicated. However, for many phenotypes, the overall salience of the loci of greatest effect emerging from GWA studies (the pathways implicated and the relationships to monogenic forms of the same traits) argues forcefully against such a nihilistic interpretation [9, 13, 14].
The contrasting viewpoint holds that common-trait susceptibility derives mostly from the action of rare or low-frequency variants [15, 16]. Although such variants account for less individual sequence variation than common variants, they may have a disproportionate effect on disease susceptibility. The more recent origin of low-frequency variants may allow alleles with more dramatic phenotypic effects to be represented in the population. Also, large-effect alleles may cause phenotypic disturbances that are not as easily buffered by compensatory changes during development as are those caused by well-tolerated, small-effect, common-variant alleles. Recent evidence that large, rare CNVs are associated with behavioral and psychiatric disease phenotypes [5, 17, 18] supports this view. Some argue that such a rare-variant architecture is precisely what one would expect for diseases causing low reproductive fitness, though this rationalization fails to explain the high yield of common-variant signals reported for other diseases, such as type 1 diabetes, that were, until recently, fatal during early life. It has even been suggested that many of the common-variant associations discovered by recent GWA studies may turn out to be due to the concerted action of multiple low-frequency and rare causal variants. The NOD2 (CARD15) signal for Crohn's disease indicates that this is certainly possible. For many diseases, however, evidence that common-variant associations are consistent across multiple ethnic groups represents a strong counter to such a model: one would expect the linkage disequilibrium patterns around recent rare and low-frequency causal variants to result in far more inter-ethnic heterogeneity than is actually observed.
Although both extreme positions have merit, the likelihood is that, for most diseases, the architecture of predisposition features causal variants that have a wide range of allele frequencies and effect sizes. For most complex traits, the absence of compelling signals from linkage studies conducted in families segregating multifactorial diseases imposes an upper bound to feasible effect sizes; even so, it is easy to show that a limited number of low-frequency susceptibility alleles of medium effect could go a long way to explaining missing heritability. For example, the effect of a low-frequency variant with a population minor allele frequency of 1% and a per-allele odds ratio of 3, when measured in terms of sibling relative risk (a commonly used measure of familial aggregation), exceeds that of the largest common-variant effect known for type 2 diabetes (around TCF7L2). Twenty such variants across the genome would account for most of the unexplained heritability for this condition. Such a constellation of variants could provide a respectable tool for individual disease prediction, and the variants discovered would (because of their relatively large effect size) be valuable resources for detailed molecular and physiological study. The extent to which variants with these characteristics are segregating in the population remains unknown, but this is an area in which the combination of next-generation sequencing technologies and large-scale association analysis provides a powerful stimulus to discovery. Early results of this approach (such as the identification of low-frequency variants within the IFIH1 gene that have a marked effect on type 1 diabetes susceptibility) are encouraging.
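The sibling relative risk comparison in the preceding paragraph can be illustrated with a short calculation. The sketch below is not taken from the article. It uses a standard single-locus formula for a multiplicative (per-allele) model, assumes Hardy-Weinberg proportions, and treats the odds ratio as a relative risk, which is only a fair approximation for an uncommon disease. The function name and the TCF7L2-like inputs (risk-allele frequency 0.30, odds ratio 1.37) are illustrative assumptions, as is the choice to combine twenty loci multiplicatively.

```python
"""Illustrative sketch: locus-specific sibling relative risk under a
multiplicative model. All inputs below are assumptions for illustration."""


def sibling_lambda(risk_allele_freq: float, relative_risk: float) -> float:
    """Sibling relative risk contributed by a single biallelic locus.

    Assumes Hardy-Weinberg proportions and a multiplicative per-allele
    effect, and treats the odds ratio as a relative risk (an approximation
    that is reasonable only when the disease is uncommon).
    """
    p = risk_allele_freq
    q = 1.0 - p
    # w is the variance of the per-allele relative risk, scaled by its
    # squared mean.
    w = p * q * (relative_risk - 1.0) ** 2 / (p * relative_risk + q) ** 2
    return (1.0 + w / 2.0) ** 2


# Low-frequency variant described in the text: MAF 1%, per-allele odds ratio 3.
low_freq = sibling_lambda(0.01, 3.0)

# Hypothetical common variant of TCF7L2-like magnitude. The frequency (0.30)
# and odds ratio (1.37) are illustrative assumptions, not values from the text.
common = sibling_lambda(0.30, 1.37)

# Twenty independent low-frequency loci, combined multiplicatively (assumption).
twenty_loci = low_freq ** 20

print(f"low-frequency variant: {low_freq:.3f}")
print(f"common variant:        {common:.3f}")
print(f"20 low-frequency loci: {twenty_loci:.2f}")
```

With these inputs the low-frequency variant gives a locus-specific sibling relative risk of roughly 1.04, against roughly 1.02 for the common variant, and twenty such loci together give about 2.1. The exact figures depend on the assumed frequencies and effect sizes, so they are best read as an order-of-magnitude check on the argument rather than as estimates.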
Ultimately, we can expect large-scale, high-depth, genome-wide sequencing to enable the systematic exploration of the entire allele-frequency, effect-size space and provide empirical resolution of many of these issues. However, there remain serious financial, logistical and analytical barriers to the implementation of this technology, and the number of such experiments that could be supported by the major funders is, for the time being, limited.
All this means that, for the next few years, the power of next-generation sequencing will need to be used carefully if a profusion of underpowered discovery efforts is to be avoided. Efforts targeted to specific genomic regions (around particular candidate genes or pathways or exons across the genome, for example) are attractive because high coverage of the selected areas in large sample sizes can be generated at reasonable cost. Whole-genome sequencing will, for now, be restricted to low-pass coverage across respectable sample sizes, or high-depth coverage in smaller, highly selected, phenotypically extreme sample sets.
With only limited empirical data to guide future locus-discovery efforts, extrapolation from the modest proportion of genetic variance so far explained is fraught with danger. The menu of possible research strategies is large, but each choice makes some implicit assumption about the characteristics of the variants being sought and the genomic architecture of the disease under consideration. Given uncertainties over the true state of nature, it is difficult to say which approaches will be most productive. This argues for open minds, a healthy disdain for orthodoxy, and careful exploration of the technological and methodological options. At the same time, it is important that the next wave of large-scale discovery efforts is designed so as to test assumptions about trait architecture and technological performance so that lessons of generic value to the field can be learned.
CNV: copy number variant
I thank the many colleagues around the world who contributed to the discussions that informed this article.