The quest for genetic risk factors for Crohn's disease in the post-GWAS era

Multiple genome-wide association studies (GWASs) and two large-scale meta-analyses have been performed for Crohn's disease and have identified 71 susceptibility loci. These findings have contributed greatly to our current understanding of the disease pathogenesis. Yet, these loci only explain approximately 23% of the disease heritability. One of the future challenges in this post-GWAS era is to identify potential sources of the remaining heritability. Such sources may include common variants with limited effect size, rare variants with higher effect sizes, structural variations, or even more complicated mechanisms such as epistatic, gene-environment and epigenetic interactions. Here, we outline potential sources of this hidden heritability, focusing on Crohn's disease and the currently available data. We also discuss future strategies to determine more about the heritability; these strategies include expanding current GWASs, fine-mapping, whole-genome sequencing or exome sequencing, and using family-based approaches. Despite the current limitations, such strategies may help to transfer research achievements into clinical practice and guide the improvement of preventive and therapeutic measures.

indicating that adaptive immunity also plays a role in CD pathogenesis (Figure 1) [8]. Another interesting association mapped to the FUT2 gene, which encodes secretor type fucosyltransferase and regulates secretion of A and B blood group antigens in intestinal mucosa [9]. Recent functional studies have suggested that fucosylation of mucin proteins is involved in interception and exclusion of bacteria; thus, association of FUT2 with CD might imply a role for the functional state of mucin in CD pathogenesis [10]. Although 5 years of GWASs have identified a substantial number of CD susceptibility loci, as much as 77% of the estimated heritability for CD is still considered to be unexplained [5]. Thus, one of the current challenges in the study of CD, like other complex diseases, is to identify potential sources of this hidden heritability. These might be additional common variants with very limited effect size, or rare variants with a higher effect size. Part of the hidden heritability may lie in structural variations such as copy number variations (CNVs; a type of structural DNA sequence alteration, including deletions, duplications, insertions and inversions, that results in varying numbers of copies of a particular gene or DNA sequence from one person to the next) or even more complicated mechanisms, such as epistatic, gene-environment and epigenetic interactions.
Table 1 (fragment; only the rows below survive here, and the column headings are not preserved). Each entry lists gene; reported value; reference; function.
(gene name not preserved); [5]; involved in pattern recognition
VAMP3 (vesicle-associated membrane protein 3); 1.05 (1.01-1.10); [5]; involved in autophagy and TNF-α metabolism
REL (reticuloendotheliosis viral oncogene homolog); 1.14 (1.09-1.19); [5]; transcriptional activator of NF-κB
ERAP2 (endoplasmic reticulum aminopeptidase 2); 1.05 (1.02-1.09); [5]; involved in peptide trimming upon NF-κB stimulation; required for the generation of HLA binding peptides
UBE2L3 (ubiquitin-conjugating enzyme E2L 3); 0.70; [15]; ubiquitinates, among others, the NF-κB precursor
In this review, we discuss the known genetic risk factors for CD, the potential sources of the hidden heritability, and strategies to investigate these.

Further exploration of GWAS results
Thus far, the GWASs performed for CD have implicated many genes, and have thereby provided valuable insights into the etiology of CD. However, there are several ways to explore GWAS results in more depth that might lead to solving a part of the hidden heritability puzzle. The design of GWASs has several limitations, the first being the extensive correction needed for multiple testing. Hence, many true-positive findings are discarded because of the stringent significance thresholds, and large amounts of data are therefore ignored. Several methods have been applied successfully to overcome this statistical power issue. A major step towards overcoming this problem has been taken by the International IBD Genetics Consortium (IIBDGC) [11], which performed a novel meta-analysis of six index GWASs and a follow-up study in independent cohorts. This study increased the number of confirmed CD loci to 71, although the explained heritability only increased from 20% to 23% [5].
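The cost of correcting for multiple testing can be illustrated with a Bonferroni threshold. The following is a minimal sketch, not a description of any cited study: the family-wise error rate of 0.05, the figure of one million independent tests, and the function names are illustrative assumptions.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def passes(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if a single test survives the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Assumed values for illustration: alpha = 0.05, one million tests.
threshold = bonferroni_threshold(0.05, 1_000_000)
```

Under these assumed values, a SNP with a nominal P value of 1e-6 would be discarded even if the association is real, which is the loss of true-positive findings described above.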
Another way to overcome the lack of power inherent in GWASs is to follow up specific SNPs (variation in a single base in the DNA sequence; the most common type of variation in the human genome) identified by them. Following up the top 1,000 less strongly associated loci, for example, could yield new true associations. Meta-analysis of these results with the results from the index GWASs leads to a gain of power, as shown by a study of celiac disease [12]. Another approach is to prioritize genes from the top associated loci based on interaction or functional analyses. This has proven to be a successful strategy in rheumatoid arthritis, where genes were prioritized based on network analysis or interaction analysis [13]. For CD, Wang et al. [14] used a different prioritizing criterion based on pathway analysis and they uncovered a significant association between susceptibility to CD and the IL-12/IL-23 pathway, harboring 20 genes. Prioritizing SNPs based on their effect on gene expression (for example, expression quantitative trait locus, a locus at which genetic allelic variation(s) correlates with variation in gene expression) led to identification of potentially novel associations of CD with UBE2L3, encoding ubiquitin-conjugating enzyme E2L 3 (involved in ubiquitinating the NF-κB precursor), and BCL3, encoding B-cell lymphoma 3-encoded protein (involved in downregulation of the NF-κB pathway) [15].
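The gain of power from combining follow-up results with index GWAS results can be sketched with a standard inverse-variance, fixed-effect meta-analysis of log odds ratios. This is a generic textbook scheme and not necessarily the procedure used in the studies cited above; the function name and the example effect sizes are hypothetical.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance fixed-effect pooling of per-study log odds ratios.

    Returns the pooled estimate, its standard error and the z statistic.
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se, pooled / pooled_se


# Hypothetical example: two studies reporting the same log odds ratio.
pooled_beta, pooled_se, pooled_z = fixed_effect_meta([0.1, 0.1], [0.05, 0.05])
```

In this hypothetical case the pooled estimate is unchanged but its standard error shrinks by a factor of the square root of two, so the z statistic is larger than in either study alone.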
Results of GWASs and their meta-analyses have revealed that multiple autoimmune diseases have a common genetic architecture [16]. Several studies have been successful in identifying new CD risk variants by testing previously established loci for other immune-related diseases [17,18]. Festen et al. [19] developed a new method to identify shared risk loci of two immune-mediated diseases with a partially shared genetic background, namely celiac disease and CD. To increase the statistical power, they performed a combined analysis of GWAS results from celiac disease and CD, and identified TAGAP, which encodes T-cell activation GTPase-activating protein, and PUS10, which encodes tRNA pseudouridylate synthase, as new shared loci [19].
Figure 1. The ongoing inflammatory response in the gastrointestinal tract in patients with Crohn's disease (CD) is thought to be caused by an aberrant immune response to commensal microflora in the gut. In patients with CD, defects in first defense mechanisms (that is, disrupted epithelial and mucosal barrier) contribute to increased bacterial penetration (MUC1 and MUC19). Genes involved in pattern recognition (NOD2, TLR4 and CARD9) suggest an increased response of antigen-presenting cells to commensal microbes. Consequently, the NF-κB cascade is activated (TNFSF15), leading to production of pro-inflammatory cytokines. Association of REL and UBE2L3 suggests an impaired NF-κB negative feedback. Antigen-presenting cells migrate to Peyer's patches (intestinal mesenteric lymph nodes) (TNFSF11) to present antigens and stimulate T-cell proliferation (IL2RA and TAGAP) and differentiation. T cells of patients with CD, in turn, respond more intensely. Th0 cells are stimulated to differentiate into T-cell subtypes regulated by a variety of the produced cytokines and their receptors. Th17 cells are involved in many immune-related diseases, and they are activated through IL-23R, which, in turn, activates the JAK-STAT-TYK (Janus kinase-signal transducer and activator of transcription-tyrosine kinase) pathway that enhances pro-inflammatory cytokine production (JAK2, STAT3 and TYK2). Th1 and Th17 cells are pro-inflammatory, whereas Treg cells downregulate the immune response. Another major contribution to CD pathogenesis comes from autophagy. In autophagosomes, intracellular components, including phagocytosed microbes, are degraded, after which their antigens are presented to CD4+ cells. Autophagy is at least partly regulated by the CD risk genes ATG16L1, IRGM and VAMP3. The activation of CD4+ cells leads to the production of pro-inflammatory cytokines and the maintenance of the inflammation. All the displayed processes could finally lead to homing of leukocytes to inflammation sites (ICAM1,3, CCR cluster), and neutrophil recruitment. Consequently, chronic inflammation, ulceration and deeper microbial penetrance occur. The known associated genes are shown in red. Table 1 summarizes the associated loci shown here. CCL20, chemokine (C-C motif) ligand 20; ICOS, inducible T-cell co-stimulator; MDP, muramyl dipeptide; NF, nuclear factor; TCR, T-cell receptor; TGF, transforming growth factor; TGFBR, TGF β receptor; Th, T helper cell; TNF, tumor necrosis factor; Treg, regulatory T cell.
The second limitation of the GWAS design is that it does not lead to the identification of causal variants, since the tested SNPs are merely tagging SNPs in linkage disequilibrium (LD; a nonrandom association of alleles at two or more loci as a result of a recent mutation, genetic drift, selection, or nonrandom mating) with the causal variants. Therefore, the effect sizes of known CD loci may be an underestimation of their actual relative risk. To further investigate the known risk loci and identify new SNPs, either as causal or close-to-causal variants, extensive fine-mapping is currently being performed by the IIBDGC using a custom-made GWA chip. In addition, cross-ethnicity fine-mapping has proven successful in exploring conserved haplotype structures (that is, LD blocks) [20]. The most common LD blocks occur in all populations; however, their frequencies vary among different ethnicities [20]. For example, common NOD2 and IL23R variants that are well established in Caucasians could not be replicated in an Indian population, implying that additional variants in these or other candidate genes may play a role in the pathogenesis of CD in Indians [21]. This principle was also successfully applied in analyzing the IL2/IL21 LD block, which is strongly conserved in Caucasians as opposed to Han Chinese, in whom the IL2 and IL21 genes reside on two distinct LD blocks. Both IL2 and IL21 could be identified as separate ulcerative colitis (UC) risk loci in Han Chinese [22].
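How closely a tagging SNP tracks a causal variant is commonly summarized by the LD coefficient D and the squared correlation r². The sketch below computes both from allele and haplotype frequencies using the standard definitions; the function name and the example frequencies are hypothetical and are not taken from any cited dataset.

```python
def ld_statistics(freq_ab: float, freq_a: float, freq_b: float):
    """Return (D, r_squared) for two biallelic loci.

    freq_ab is the frequency of the haplotype carrying allele A at the
    first locus and allele B at the second; freq_a and freq_b are the
    frequencies of those alleles.
    """
    d = freq_ab - freq_a * freq_b
    denominator = freq_a * (1 - freq_a) * freq_b * (1 - freq_b)
    if denominator <= 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denominator


# Hypothetical frequencies: a tag SNP that always travels with the causal allele.
_, r2_perfect = ld_statistics(0.3, 0.3, 0.3)
# Hypothetical frequencies: two alleles that co-occur only by chance.
_, r2_none = ld_statistics(0.09, 0.3, 0.3)
```

When r² between the tag SNP and the causal variant is below 1, the association signal observed at the tag SNP is attenuated, which is consistent with the underestimation of effect sizes noted above.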
Park et al. [23] proposed a method to evaluate statistical power and risk prediction of future GWASs. They estimated that there are, in total, 142 CD susceptibility loci with effect sizes similar to the loci reported in the current GWASs, and that a sample size of approximately 50,000 would be needed to uncover them. However, even if a GWAS with hundreds of thousands of cases were to provide new CD susceptibility loci and explain more of the genetic variance, it seems unlikely that it would capture even half of the estimated heritability since 142 loci only explain 20% of the sibling relative risk for CD. We can speculate that identification of the true causal variants could amplify the effect size for some of the known loci and could consequently increase the discriminatory power of risk models.
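The dependence of power on sample size can be made concrete with a rough calculation. The sketch below is not the method of Park et al. [23]; it is a simple normal approximation for a two-sided allelic case-control test, using the Woolf variance of the log odds ratio. The function name, the odds ratio of 1.1, the control allele frequency of 0.3 and the sample sizes are all illustrative assumptions.

```python
import math
from statistics import NormalDist


def allelic_test_power(odds_ratio, control_freq, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allelic test at significance alpha."""
    p0 = control_freq
    # Risk-allele frequency in cases implied by the allelic odds ratio.
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)
    # Woolf variance of the log odds ratio from expected allele counts
    # (two alleles per individual).
    variance = (
        1 / (2 * n_cases * p1) + 1 / (2 * n_cases * (1 - p1))
        + 1 / (2 * n_controls * p0) + 1 / (2 * n_controls * (1 - p0))
    )
    z_effect = abs(math.log(odds_ratio)) / math.sqrt(variance)
    normal = NormalDist()
    z_critical = normal.inv_cdf(1 - alpha / 2)
    return normal.cdf(z_effect - z_critical) + normal.cdf(-z_effect - z_critical)


# Assumed scenario: the same modest effect tested in a small and a large study.
power_small = allelic_test_power(1.1, 0.3, 5_000, 5_000)
power_large = allelic_test_power(1.1, 0.3, 50_000, 50_000)
```

Under these assumptions a variant with a modest odds ratio is very unlikely to reach genome-wide significance in the smaller study but is detected almost always in the larger one, which illustrates why loci with small effects only emerge as sample sizes grow.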
Another potential source of hidden heritability could lie in sample mix-ups that occur accidentally during sample collection, genotyping or data management. Some genetic variants influence gene expression phenotypes (expression quantitative trait loci); this allows checking for concordance between phenotypic measurements and genetic variants that affect these phenotypes. Westra et al. (personal communication) found that a sample mix-up rate of 3% decreases the number of loci normally discovered by 23% for a trait with a heritability of 50% and 500 loci explaining the total heritability. Thus, sample mix-ups may explain part of the hidden heritability and it will be possible to detect them as long as databases encompass sufficient numbers of phenotypes that are strongly determined by known genetic variants.
GWASs are most likely to remain an important approach for investigating the hidden heritability, since the potential of their results can be enhanced by: performing meta-analyses (for example, between multiple GWASs or between similar disease phenotypes); following up prioritized SNPs based on pathway, functional or interaction analyses; studying SNPs that have been associated with other immune-related diseases; and expanding the design of GWASs to include samples from non-Caucasians.

Low frequency and rare variants
Common variants identified by GWASs account for only a small fraction of the phenotypic variation. Thus, much speculation about the hidden heritability has focused on the contribution of variants with low allele frequencies, defined as 0.5% < minor allele frequency (MAF; proportion of the less common of two alleles in a population) < 5%, or of rare variants with MAF <0.5%, which are neither sufficiently frequent to be captured by current GWA arrays nor sufficiently penetrant to be captured by traditional, family-based linkage studies [24]. Detecting such variants will be facilitated by advances in high-throughput sequencing technologies and by the wide-ranging catalog of variants with MAF >1% generated by the 1000 Genomes Project [25]. Current efforts to identify rare variants by sequencing are likely to focus on the regions of most significant GWAS SNPs and around genes already implicated in CD pathogenesis or treatment. Resequencing of selected susceptibility loci has led recently to the discovery of three IL23R (the gene encoding IL-23 receptor) coding variants that offer protection against CD [26]. The results of this particular study confirmed an increase in effect size with decreasing variant frequency, although rare variants explained less of the heritability than common variants.
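The frequency classes defined above can be written as a small helper. This is a minimal sketch: the function names are hypothetical, and the treatment of values falling exactly on the 0.5% and 5% boundaries is an assumption, since the definitions above use strict inequalities.

```python
def minor_allele_frequency(count_first: int, count_second: int) -> float:
    """MAF at a biallelic site, from the observed counts of the two alleles."""
    total = count_first + count_second
    if total == 0:
        raise ValueError("no alleles observed")
    return min(count_first, count_second) / total


def frequency_class(maf: float) -> str:
    """Classify a variant using the 0.5% and 5% cut-offs given in the text."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf < 0.005:
        return "rare"
    if maf < 0.05:  # assumption: exactly 0.5% is counted as low-frequency
        return "low-frequency"
    return "common"  # assumption: exactly 5% is counted as common


# Hypothetical site: 10 copies of the minor allele among 1,000 chromosomes.
example_maf = minor_allele_frequency(990, 10)
example_class = frequency_class(example_maf)
```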
In addition to resequencing efforts, whole-genome/exome sequencing will be needed to detect rare high-risk variants beyond the LD reach of tag SNPs. Although the costs of next-generation sequencing remain high, they are dropping fairly rapidly as the technologies improve and the process time per sample is becoming shorter; so this method is becoming more and more feasible and accessible for researchers. Evaluating such signals and determining the real causal variant will, however, be a difficult task. Feng and Zhu [27] developed an alternative method for searching for rare variants in previously published GWAS datasets. Their method relies on haplotype analysis across the genome and the hypothesis that multiple rare variants can be captured by many haplotypes. Using this method, they confirmed nine previously established loci and also discovered four new CD susceptibility loci [27].
Another approach that may prove to be important is resequencing individuals with extreme phenotypes; studies of individuals with extreme lipid levels have shown that such individuals seem more likely to be carriers of rare, yet nonsynonymous, variants [28]. A large number of rare variants may have distinct effects on the phenotype. Therefore, pooling variants of similar effect and locus-specific matching of cases with specific CD subphenotypes and controls throughout the genome may help to reveal some of the hidden heritability [29].
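Pooling rare variants can be illustrated with a simple collapsing scheme, in which each individual is scored as a carrier if any rare variant in a region is present, and carrier status is then compared between cases and controls. This is a generic sketch and not the method of reference [29]; the function names, the continuity correction of 0.5 and the toy genotypes are assumptions made for illustration.

```python
def carrier_flags(genotypes):
    """Collapse per-site minor-allele counts into one carrier flag per person."""
    return [any(count > 0 for count in person) for person in genotypes]


def carrier_odds_ratio(case_genotypes, control_genotypes):
    """Odds ratio for carrying at least one rare variant, cases versus controls.

    A continuity correction of 0.5 is added to every cell so that empty
    cells do not cause a division by zero (an assumption of this sketch).
    """
    case_flags = carrier_flags(case_genotypes)
    control_flags = carrier_flags(control_genotypes)
    case_carriers = sum(case_flags)
    case_noncarriers = len(case_flags) - case_carriers
    control_carriers = sum(control_flags)
    control_noncarriers = len(control_flags) - control_carriers
    return ((case_carriers + 0.5) * (control_noncarriers + 0.5)) / (
        (case_noncarriers + 0.5) * (control_carriers + 0.5)
    )


# Toy data: rows are individuals, columns are three rare sites in one region.
toy_cases = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1]]
toy_controls = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
toy_odds_ratio = carrier_odds_ratio(toy_cases, toy_controls)
```

In the toy data no single site is seen more than once among the cases, yet the collapsed carrier status differs clearly between groups, which is the motivation for pooling.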

Structural variation
It has been estimated that chromosomal rearrangements (that is, duplications, deletions, insertions and inversions), collectively named CNVs, comprise 12% of the human genome [30]. Currently, more than 15,000 CNV loci are catalogued in the Database of Genomic Variants [31]. Some CNVs have been linked to complex disorders, such as autism, neuroblastoma and systemic lupus erythematosus [32-34]. A recent study suggested that CNVs are enriched in genomic regions containing genes that influence immunity [35]. In particular, low and high copy numbers of the β-defensin gene (HBD2), which acts as an antimicrobial peptide and as a cytokine, have been found to predispose to colonic CD [36,37]. Yet, in a recent study, Aldhous et al. [38] failed to replicate both of the previously published associations. Moreover, they argued that these two associations could be due to measurement error because of a general deficiency of real-time PCR to distinguish multiple CNV clusters. In addition to the β-defensins, a fine-mapping study of the IRGM susceptibility locus revealed a 20-kb deletion polymorphism immediately upstream of IRGM that was associated with CD risk and IRGM expression [39]. Furthermore, a recent GWAS of CNVs from the Wellcome Trust Case Control Consortium has confirmed these CNVs for CD, and also discovered new CNVs in the IRGM and human leukocyte antigen (5.1 kb) regions [40]. The Wellcome Trust Case Control Consortium study also showed that the most common CNVs are well tagged by SNPs in current GWAS chips, and that they are unlikely to make much contribution to the hidden heritability in common diseases. More work is needed to elucidate the functional consequences and impact of high copy-number repeats (for example, long interspersed nuclear elements), and of rare CNVs on clinical phenotypes, such as CD.

Family-based approaches
Since chip-based GWASs became available, linkage analysis and family-based approaches have been largely discarded. However, now that the opportunities for gene detection by conventional GWASs have been almost exhausted, researchers are shifting back towards family-based approaches. These approaches can be helpful when GWASs fail to detect signals from rare variants and are biased by population stratification, which is defined as the presence of subpopulations in a supposedly homogeneous population. Subpopulations arise from differences in allele frequencies between individuals as a consequence of distinct ancestral and/or demographic origin. Family-based studies may also be advantageous since the low-frequency risk alleles (SNPs with MAF <5%) are likely to be more prevalent in large families with several affected members and should therefore be easier to detect. By assessing GWAS data in such families, large regions of identity-by-descent may be identified and found to include genes associated with CD; this approach has already proved to be a powerful tool in classical linkage analysis. However, the shared environment of family members is an alternative explanation for familial clustering that should be taken into account. Glocker et al. [41] identified loss-of-function mutations in two loci by considering early-onset colitis as a monogenic trait in two consanguineous families. They performed a genetic linkage analysis followed by candidate gene sequencing and identified the IL10RA (the gene encoding IL-10 receptor α) and IL10RB (the gene encoding IL-10 receptor β) loci as being associated with early-onset enterocolitis. However, it is most likely that in this particular case a private variant, not present in the general population, is responsible for the disease.
Akolkar et al. [42] found that CD is subject to a parent-of-origin effect, indicating that loci affected by genomic imprinting play a role in CD pathogenesis. In genomic imprinting, the expression of an inherited variant is determined by the parent from whom that variant is inherited. If the maternal allele, for instance, is inactivated by genomic imprinting, then expression of the locus is determined by the paternal allele only. If this effect is not taken into account, a significant loss in the statistical power of the study might result [43].
Family-based approaches may be useful in the search for the hidden heritability since low-frequency variants accumulate in families with multiple affected individuals; moreover, such approaches are not affected by population stratification and they can also capture parent-of-origin effects. However, the causal variants identified in such families may prove to be private variants, or the shared environment may play a major role.

GWAS aftermath: epistatic, gene-environment and epigenetic interactions
Given that a large proportion of the heritability of CD and its complex architecture is as yet unexplained, one might speculate that other aspects of inheritance, such as epistasis, gene-environment interactions or epigenetic effects, might be involved. GWASs may be missing higher-order genetic effects that arise from the interaction of two or more SNPs [44]. The underlying idea for such epistatic effects is that a significant proportion of the hidden heritability is not due to single common variants, nor to single rare variants, but rather to rare combinations of common variants. Since typical GWASs examine the association of single SNPs with a phenotype, SNPs that contribute epistatically will not be revealed by such an analysis. A recent pairwise analysis of variants related to the IL17-IL23 pathway showed an increasing odds ratio for CD when the 'risk' haplotypes for these genes were combined [45]. Analysis of epistatic interactions in better-powered datasets, and the use of more efficient computational approaches that can account for the complex nature of biomolecular networks, may yield new genetic risk factors for CD [46,47].
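One simple way to express a pairwise epistatic effect is to compare the odds ratio for carrying both risk variants with the product of the two single-variant odds ratios; a ratio above 1 indicates a joint effect larger than expected under a multiplicative model. The sketch below illustrates this idea only and is not the analysis of reference [45]; the function names and all counts are hypothetical.

```python
def odds_ratio(exposed, reference):
    """Odds ratio from (cases, controls) pairs for an exposed and a reference group."""
    exposed_cases, exposed_controls = exposed
    reference_cases, reference_controls = reference
    return (exposed_cases * reference_controls) / (exposed_controls * reference_cases)


def interaction_ratio(counts):
    """Departure from multiplicative odds for two binary risk factors.

    counts maps (carries_a, carries_b), each 0 or 1, to (cases, controls).
    """
    reference = counts[(0, 0)]
    or_a_only = odds_ratio(counts[(1, 0)], reference)
    or_b_only = odds_ratio(counts[(0, 1)], reference)
    or_both = odds_ratio(counts[(1, 1)], reference)
    return or_both / (or_a_only * or_b_only)


# Hypothetical (cases, controls) counts for each carrier combination.
toy_counts = {
    (0, 0): (100, 200),
    (1, 0): (60, 100),
    (0, 1): (75, 100),
    (1, 1): (90, 50),
}
toy_interaction = interaction_ratio(toy_counts)
```

With these made-up counts the single-variant odds ratios are 1.2 and 1.5, whereas carriers of both have an odds ratio of 3.6, twice the value expected if the two effects simply multiplied. A single-SNP scan would see only the marginal effects and would miss this joint contribution.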
An even more complex source for the hidden heritability might lie in gene-environment interactions, which are defined as the joint effect of one or more genes with one or more environmental factors that cannot be readily explained by their separate marginal effects [48]. The strongest and best replicated environmental risk factor for CD is smoking, which increases both the risk and severity of CD. However, a recent, moderately sized study found remarkable differences in associated loci between smoking and non-smoking CD patients, thereby implying that a complex gene-environment interaction must be at work [49]. Another example of the complex interaction between genetic and environmental factors was shown in a study by Cadwell et al. [50] where Atg16L1-deficient mice infected with a specific strain of norovirus developed CD-like phenotypes in a model of intestinal injury induced by dextran sodium sulfate. In particular, structural Paneth cell abnormalities and decreased production of antimicrobial granules in the mice resembled those found in CD patients who are homozygous carriers of the ATG16L1 risk alleles. Remarkably, the severity of intestinal injury induced by dextran sodium sulfate was not only dependent on aberrant Atg16L1 function and norovirus infection, but also on the timing of infection, secretion of the pro-inflammatory cytokines TNF-α and IFN-γ, and the presence of commensal bacteria in the mouse intestine.
Other environmental factors, such as appendectomy, diet and domestic hygiene habits, may also play a role in CD, but the evidence for each of these factors is much weaker. Studying gene-environment interactions will require careful consideration of the epidemiologic study design, exposure assessment, and methods of analysis, paying particular attention to ways of harmonizing these features across consortia.
An additional source of the hidden heritability might not lie in the genome sequence itself, but in subtle mechanisms interfering with genome functions, such as gene expression. These mechanisms include histone modification, methylation and gene inactivation, and are covered by the study of epigenetics. However, there is much controversy on this topic. Its role in CD is unknown, but there are some hints that methylation plays a role in other complex diseases: type 2 diabetes, rheumatoid arthritis and neurodegenerative diseases [51-53]. Epigenetics is also correlated with age, gender and nutrition, and it is likely that there are other environmental factors to be discovered [54,55]. It has been shown that changes in DNA methylation in mice can be provoked by dietary alterations and subsequently transmitted across generations [56]. Thus, sequence-independent epigenetic effects (beyond imprinting) that might be environmentally induced and transmitted across several generations [57] could represent a revolutionary glimpse into the enigmatic world of the heritability of complex diseases.

Conclusions
CD is a complex genetic disorder with an estimated heritability of 50% and it is characterized by a recurring inflammation of the gastrointestinal tract. Two decades of research have led to the discovery of 71 risk loci, which have improved our understanding of the disease pathogenesis. At the moment, approximately 23% of the heritability can be explained. To fully understand the disease pathogenesis and link current insights to clinically relevant knowledge, it is important to continue our quest to identify more genetic risk factors in CD. In this review, we have presented various potential sources for the hidden heritability of complex diseases given the current knowledge on CD.
It is unlikely that conventional GWASs alone can solve the puzzle of the hidden heritability. They are not powerful enough to detect signals from common variants with low impact, nor extensive enough to capture rarer variants with high impact. The resources of GWASs are expected to be exhausted fairly soon, although new loci have recently been identified by replicating prioritized SNPs and meta-analysis of GWAS results.
Identification of causal variants may elucidate a substantial part of the hidden heritability; however, current GWASs are insufficient for this purpose, since the identified SNPs are merely surrogates for causal variants. Fine-mapping, by contrast, can uncover SNPs closer to the causal variants, since SNPs can then be tested beyond the scope of GWASs. The true causal variants might be identified by whole-genome sequencing or exome sequencing. More sources than the linear DNA sequence have to be investigated to unravel the total heritability. Epigenetics and gene-environment studies have been shown to be worthwhile, but the study of epistatic effects in CD is still needed, and results from other complex genetic diseases seem to be promising.
To fully unravel the hidden heritability of CD, collaborations between genome research centers are crucial, since the solutions to identify the hidden heritability are either costly or require a huge number of cases and controls. The IIBDGC is a good example of what can be achieved by performing large meta-analyses, and it is currently performing dense fine-mapping and replication studies to identify causal variants and additional risk loci in CD.