Genome-wide association studies with metabolomics

Genome-wide association studies (GWAS) analyze the genetic component of a phenotype or the etiology of a disease. Despite the success of many GWAS, little progress has been made in uncovering the underlying mechanisms for many diseases. The use of metabolomics as a readout of molecular phenotypes has enabled the discovery of previously undetected associations between diseases and signaling and metabolic pathways. In addition, combining GWAS and metabolomic information allows the simultaneous analysis of the genetic and environmental impacts on homeostasis. Most success has been seen in metabolic diseases such as diabetes, obesity and dyslipidemia. Recently, associations between loci such as FADS1, ELOVL2 or SLC16A9 and lipid concentrations have been explained by GWAS with metabolomics. Combining GWAS with metabolomics (mGWAS) provides the robust and quantitative information required for the development of specific diagnostics and targeted drugs. This review discusses the limitations of GWAS and presents examples of how metabolomics can overcome these limitations, with a focus on metabolic diseases.

To uncover the causal events leading to these diseases, information on the factors that challenge human health and the immediate responses to these challenges is needed. Yet, unfortunately, the dataset is never complete. In most cases, studies of humans are restricted to observations after a disease has occurred, except in clinical cases when individuals with particular diseases are treated or take part in randomized controlled intervention trials. Outside clinical trials, longitudinal studies (observational studies tracking the same individuals) that analyze phenotypes can also be undertaken. Both of these types of studies are hampered by unknown and uncontrolled exposure to the environment (such as differences in nutrition, medication, environmental endocrine disruptors and lifestyle), even in well-phenotyped cohorts (where weight, height and health status, for example, are known).
Cohorts can be analyzed for specific features such as genomic variance (variants in the DNA sequence) or metric parameters (concentrations or comparative levels) of RNA, proteins or metabolites. If the features analyzed and disease phenotypes coincide (and the frequency of coincidence is biostatistically valid), then it would be possible to identify the pathways involved. Therefore, a current approach to unveiling the etiology and mechanism of complex diseases is to employ sophisticated analysis methodologies (omics) that allow for the integration of multiple layers of molecular and organismal data. Data acquired with omics have already contributed considerably to the understanding of homeostasis in health and disease. Genome-wide association studies (GWAS), in particular, have contributed substantially to the field in the past 6 years [1]. This approach has identified numerous genetic loci that are associated with complex diseases. However, the number of genetic mechanisms that have been identified to explain complex diseases has not increased significantly [2].
In this review, I will highlight the current limitations of GWAS and how issues such as the large sample size required can be overcome by adding metabolomics information to these studies. I will explain the principles behind the combination of metabolomics and GWAS (mGWAS) and how together they can provide a more powerful analysis. I conclude by exploring how mGWAS has been used to identify the metabolic pathways involved in metabolic diseases.

Aims and limitations of GWAS
GWAS analyze the association between common genetic variants and specific traits (phenotypes). The phenotypes originally included weight (or body mass index), height, blood pressure or frequency of a disease. More recently, specific traits in the transcriptome, proteome or metabolome have been included, and these are usually quantitative (for example, concentration). GWAS can also be used to explore whether common DNA variants are associated with complex diseases (for example, cancer or type 2 diabetes mellitus). The common variants might be single nucleotide polymorphisms (SNPs), copy number polymorphisms (CNPs), insertions/deletions (indels) or copy number variations (CNVs), but most GWAS employ SNPs [3]. At present, SNPs are used most frequently because of their coverage of a large fraction of the genome, assay throughput, quality assurance and cost-effectiveness. Because the concept of GWAS is hypothesis-free, the analyses of GWAS are generally genetically unbiased, but they assume a genetic cause that might not be the most significant contributor.
In the past, candidate gene and pedigree analyses were very successful in the study of diseases of monogenic origin: heritable dysregulations of certain metabolomic traits (inborn errors of metabolism) were among the first to be associated with specific genes [4]. However, these approaches are not useful in complex diseases because candidate regions contain too many genes or there are no groups of related individuals with a clear inheritance pattern of the disease phenotype. Inspired by the success of the Mendelian inheritance (genetic characteristics passed from the parent organism to offspring) approach, a great effort was undertaken to generate a human reference database of common genetic variant patterns based on a haplotype survey: the haplotype map (HapMap) [5]. This resource indeed improved, through linkage disequilibrium (LD) analyses, both the quality and the speed of GWAS, but it has not solved the major issue of study outcome. The common limitation of GWAS is that they do not provide mechanisms for disease; in other words, GWAS are unable to detect causal variants. Specifically, a GWAS provides information about an association between a variant (for example, a SNP) and a disease, but the connection between a SNP and a gene is sometimes unclear. This is because annotated genes in the vicinity of a SNP are used in an attempt to explain the association functionally. However, proximity to a gene (without any functional analyses) should not be taken as the only sign that the identified gene contributes to a disease.
It should be further noted that the current analysis tools for SNPs do not include all possible variants, but rather only common ones with a minor allele frequency greater than 0.01. SNPs with frequencies of less than 1% are not visible (or hardly discernible) in GWAS at present [3], and therefore some genetic contributions might remain undiscovered. So far, associations discovered by GWAS have had almost no relevance to clinical prognosis or treatment [6], although they might have contributed to risk stratification in the human population. However, common risk factors fail to explain the heritability of human disease [7]. For example, a heritability of 40% had been estimated for type 2 diabetes mellitus [8,9], but only 5 to 10% of the type 2 diabetes mellitus heritability can be explained by the more than 40 confirmed diabetes loci identified by GWAS [9,10].
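The frequency cut-off described above can be illustrated with a minimal sketch. It assumes genotypes are coded as allele counts (0, 1 or 2) per individual and SNP; the function names and the coding are illustrative assumptions, not part of any specific GWAS tool.

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """Return the minor allele frequency of each SNP.

    genotypes: array of shape (n_individuals, n_snps) holding the count
    (0, 1 or 2) of one arbitrarily chosen allele; np.nan marks a missing call.
    """
    g = np.asarray(genotypes, dtype=float)
    # Frequency of the counted allele: mean count divided by 2 (diploid genome).
    freq = np.nanmean(g, axis=0) / 2.0
    # The minor allele is whichever of the two alleles is rarer.
    return np.minimum(freq, 1.0 - freq)

def common_snp_mask(genotypes, threshold=0.01):
    """Boolean mask of SNPs whose minor allele frequency exceeds the threshold."""
    return minor_allele_frequency(genotypes) > threshold

# Hypothetical toy data: 4 individuals, 3 SNPs (the third SNP shows no variation).
toy = np.array([[0, 2, 0],
                [1, 2, 0],
                [0, 2, 0],
                [1, 1, 0]])
print(minor_allele_frequency(toy))
print(common_snp_mask(toy))
```

Variants that fail such a filter are simply absent from the association analysis, which is one reason why rare contributions can remain undiscovered.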

Overcoming the limitations
There are several ways to improve GWAS performance. Instead of searching for a single locus, multiple independent DNA variants are being selected to identify those responsible for the occurrence of a disease [2]. Odds ratios could be more useful than P-values for the associations [6] in the interpretation of mechanisms and the design of replication or functional studies. This is especially true if highly significant (but spurious) associations are observed in a small number of samples, which might originate from a stratified population. The design of GWAS is also moving from tagging a single gene as a cause of disease to illuminating the pathway involved. This pathway might then be considered as a therapeutic target. In this way, GWAS comes back to its roots. The term 'post-GWAS' is used to describe GWAS-inspired experiments designed to study disease mechanisms. This usually involves exploration of expression levels of genes close to the associated variants, or knockout experiments in cells or animals [11]. In other words, post-GWAS analyses bring functional validation to associations [12].
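As a hedged illustration of why an odds ratio with its confidence interval can be more informative than a P-value alone, the sketch below computes an allelic odds ratio from a 2 × 2 table of carriers and non-carriers in cases and controls. The function name, the counts and the choice of a log-based (Woolf) 95% interval are assumptions made for this example only.

```python
import math

def odds_ratio_with_ci(case_carriers, case_noncarriers,
                       control_carriers, control_noncarriers, z=1.96):
    """Odds ratio and approximate 95% confidence interval (log/Woolf method).

    All four counts must be greater than zero.
    """
    a, b = case_carriers, case_noncarriers
    c, d = control_carriers, control_noncarriers
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical small study: 100 cases and 100 controls.
or_small, lo_small, hi_small = odds_ratio_with_ci(30, 70, 20, 80)
# The same proportions in a hypothetical study ten times larger.
or_large, lo_large, hi_large = odds_ratio_with_ci(300, 700, 200, 800)
print(or_small, lo_small, hi_small)
print(or_large, lo_large, hi_large)
```

The point estimate is identical in both toy studies, but the interval narrows with sample size, which is the kind of information a bare P-value does not convey when replication or functional studies are being planned.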
Although omics approaches are powerful, they do not provide a complete dataset. Each omic technology provides a number of specific features (for example, transcript level fold change, protein identity or metabolite concentration, concentration ratios). At present, experimental datasets consisting of thousands of features unfortunately do not encompass all the features present in vivo. With incomplete data, only imperfect conclusions can be expected. However, the coverage of different omics features is expanding rapidly to overcome both genetic and phenotypic limitations of GWAS. As for the genetic aspects, progress in whole genome sequencing (for example, the 1000 Genomes Project [13,14]) is beginning to provide more in-depth analyses for less frequent (but still significant), and multiple, co-existing disease loci. In addition, epigenetic features (for example, methylation, histone deacetylation) will soon be expanded in GWAS [15-17].
Improvements in the interpretation of phenotypes are likely to come from causal DNA variants showing significant and multiple associations with different omics data [11]. GWAS can be applied to intermediate phenotypes (including traits measured in the transcriptome, proteome or metabolome). The resulting associations can identify SNPs related to molecular traits and provide candidate loci for disease phenotypes related to such traits. Disease-associated alleles might modulate distinct traits such as transcript levels and splicing, thus acting on protein function, which can be monitored directly (for example, by proteomics) or by metabolite assays. This leads to the conclusion that another way to improve the outcomes of GWAS is the application of versatile and unbiased molecular phenotyping. The choice of molecular phenotyping approach will be driven by its quality regarding feature identification, coverage, throughput and robustness.

Metabolomic phenotyping for GWAS
Metabolomics deals with metabolites with molecular masses below 1,500 Da that reflect functional activities and transient effects, as well as endpoints of biological processes, which are determined by the sum of a person's or tissue's genetic features, regulation of gene expression, protein abundance and environmental influences. Ideally all metabolites will be detected by metabolomics. Metabolomics is a very useful tool that complements classical GWAS for several reasons. These include quantification of metabolites, unequivocal identification of metabolites, provision of longitudinal (time-resolved) dynamic datasets, high throughput (for example, 500 samples a week, with 200 metabolites for each sample), implementation of quality measures [18-21] and standardized reporting [22].
Enhancing classical GWAS for disease phenotypes with metabolomics is better than metabolomics alone for unequivocal description of individuals, stratification of test persons, and provision of multiparametric datasets with independent metabolites or identification of whole pathways affected (including co-dependent metabolites). It is also instrumental in quantitative trait locus (QTL) or metabolite quantitative trait locus (mQTL) analyses. In these studies quantitative traits (for example, weight or concentrations of specific metabolites) are linked to DNA stretches or genes. This information is important for assessing the extent of the genetic contribution to the observed changes in phenotypes.
A part of the metabolome could be computed from the genome [23], but the information would be static and hardly usable in biological systems except for annotation purposes. The time dynamics of the metabolome provides a means to identify the relative contributions of genes and environmental impact in complex diseases. Therefore, mGWAS expands the window of phenotypes that can be analyzed to multiple quantitative features, namely total metabolite concentrations.
Non-targeted metabolomics provides information on the simultaneous presence of many metabolites or features (for example, peaks or ion traces). Sample throughput may reach 100 samples a week on a single NMR spectrometer, gas chromatography-mass spectrometer (GC-MS) or liquid chromatography-tandem mass spectrometer (LC-MS/MS) [20,25]. The number of metabolites identified varies depending on the tissue and is usually between 300 (blood plasma) and 1,200 (urine) [26]. The major advantage of non-targeted metabolomics is its unbiased approach to the metabolome. Quantification is a limiting issue in non-targeted metabolomics, as it provides the differences in the abundance of metabolites rather than absolute concentrations. In silico analyses (requiring access to public [27-30] or proprietary [31,32] reference databanks) are required to annotate the NMR peaks, LC peaks or ion traces to specific metabolites. Therefore, if a metabolite mass spectrum is not available in the databases, the annotation is not automatic but requires further steps. These may include analyses under different LC conditions, additional mass fragmentation or high-resolution (but slow) NMR experiments.
Targeted metabolomics works with a defined set of metabolites and can reach a very high throughput (for example, 1,000 samples per week on a single LC-MS/MS). The set might range from 10 to 200 metabolites in a specific (for example, only for lipids, prostaglandins, steroids or nucleotides) GC-MS or LC-MS/MS assay [33-37]. To cover more metabolites, samples are divided into aliquots and parallel assays are run under different conditions for GC-MS or LC-MS/MS. In each of the assays the analyzing apparatus is tuned for one or more specific chemical classes, and stable isotope-labeled standards are used to facilitate concentration determination. The major advantages of targeted metabolomics are the throughput and absolute quantification of metabolites.
Both approaches (that is, targeted and non-targeted) reveal a large degree of common metabolite coverage [38] or allow for quantitative comparisons of the same metabolites [21,39]. Metabolomics generates large-scale datasets, in the order of thousands of metabolites, which are easily included in bioinformatics processing [40,41].

GWAS with metabolomics traits
The outcome of GWAS depends very much on the sample size and the power of the study, which increases with the sample size. Some criticisms of GWAS have addressed this issue by questioning whether GWAS are theoretically big enough to overcome the threshold of P-values and associated odds ratios. Initial GWAS for a single metabolic trait (that is, plasma high-density lipoprotein (HDL) concentration [42]) were unable to detect the genetic component even with 100,000 samples. This indicates low genetic penetrance for this trait and suggests that another approach should be used to delineate the underlying mechanism. More recently, metabolomics was found to reveal valuable information when combined with GWAS. Studies with a much smaller sample size (284 individuals) but with a larger metabolic set (364 featured concentrations) demonstrated the advantage of GWAS combined with targeted metabolomics [34]. In this study the genetic variants were able to explain up to 28% of the metabolic ratio variance (that is, the presence or absence of a genetic variant coincided with up to 28% of changes in concentration ratios of metabolites from the same pathway). Moreover, the SNPs in metabolic genes were indeed functionally linked to specific metabolites converted by the enzymes, which are gene products of the associated genes.
In another study on the impact of genetics in human metabolism [35], involving 1,809 individuals but only 163 metabolic traits, followed by targeted metabolomics (LC-MS/MS), it was shown that in loci with previously known clinical relevance in dyslipidemia, obesity or diabetes (FADS1, ELOVL2, ACADS, ACADM, ACADL, SPTLC3, ETFDH and SLC16A9) the genetic variant is located in or near genes encoding enzymes or solute carriers whose functions match the associating metabolic traits. For example, variants in the promoter of FADS1, a gene that encodes a fatty acid desaturase, coincided with changes in the conversion rate of arachidonic acid. In this study, the metabolite concentration ratios were used as proxies for enzymatic reaction rates, and this yielded very robust statistical associations, with a very small P-value of 6.5 × 10^-179 for FADS1. The loci explained up to 36% of the observed variance in metabolite concentrations [35]. In a recent fascinating study on the genetic impact on the human metabolome and its pharmaceutical implications with GWAS and non-targeted metabolomics (GC-MS or LC-MS/MS), 25 genetic loci showed unusually high penetrance in a population of 1,768 individuals (replicated in another cohort of 1,052 individuals) and accounted for up to 60% of the difference in metabolite levels per allele copy. The study generated many new hypotheses for biomedical and pharmaceutical research [21] for indications such as cardiovascular and kidney disorders, type 2 diabetes, cancer, gout, venous thromboembolism and Crohn's disease.
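The use of concentration ratios as proxies for enzymatic reaction rates can be sketched with simulated data. Everything in the following example is hypothetical: the allele frequency, the effect size and the noise levels are invented for illustration, and the simple regression of a trait on allele dosage is an assumption rather than the statistical model used in the cited studies.

```python
import numpy as np

def explained_variance(dosage, trait):
    """R^2 of a simple linear regression of a trait on allele dosage (0, 1, 2)."""
    x = np.asarray(dosage, dtype=float)
    y = np.asarray(trait, dtype=float)
    # For a single predictor, R^2 equals the squared Pearson correlation.
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

rng = np.random.default_rng(0)
n = 500  # hypothetical cohort size

# Allele dosage per individual at one SNP (hypothetical allele frequency 0.3).
dosage = rng.binomial(2, 0.3, size=n)

# Substrate concentration varies widely for non-genetic reasons (diet and so on).
substrate = rng.lognormal(mean=1.0, sigma=0.5, size=n)

# Hypothetical enzyme: each allele copy lowers conversion efficiency by 30%.
conversion = 0.8 * 0.7 ** dosage * rng.lognormal(mean=0.0, sigma=0.1, size=n)
product = substrate * conversion

# Trait 1: product concentration alone. Trait 2: product-to-substrate ratio.
r2_product = explained_variance(dosage, np.log(product))
r2_ratio = explained_variance(dosage, np.log(product / substrate))

print(f"variance explained, product alone: {r2_product:.2f}")
print(f"variance explained, product/substrate ratio: {r2_ratio:.2f}")
```

In this toy setting the ratio cancels the shared, non-genetic variation in substrate supply, so the genotype explains a larger fraction of the variance of the ratio than of the product concentration alone.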
Lipidomics, which deals with the lipid subset of the metabolome, has provided important insights into how genetics contributes to modulated lipid levels. This area is of particular interest for cardiovascular disease research, as about 100 genetic loci (without causal explanation as yet) are associated with serum lipid concentrations [42]. Lipidomics increases the resolution of mGWAS over that with complex endpoints such as total serum lipids (for example, HDL only). For example, an NMR study showed that eight loci (LIPC, CETP, PLTP, FADS1, -2, and -3, SORT1, GCKR, APOB, APOA1) were associated with specific lipid subfractions (for example, chylomicrons, low-density lipoprotein (LDL), HDL), whereas only four loci (CETP, SORT1, GCKR, APOA1) were associated with serum total lipids [43]. GWAS have already enabled tracing of the impact of human ancestry on n-3 polyunsaturated fatty acid (PUFA) levels. These fatty acids are an important topic in nutritional science in trying to explain the impact of PUFA levels on immunological responses, cholesterol biosynthesis and cardiovascular disease [44-47]. It has been shown that the common variation in n-3 metabolic pathway genes and in the GCKR locus, which encodes the glucokinase regulatory protein, influences the levels of plasma phospholipid n-3 PUFAs in populations of European ancestry, whereas in other ancestries (for example, African or Chinese) there is an impact on the influence of the FADS1 locus [48]. This explains the mechanisms of different responses to diet in these populations. GWAS with NMR-based metabolomics can also be applied to large cohorts. An example is the analysis of 8,330 individuals in whom significant associations (P < 2.31 × 10^-10) were identified at 31 loci, including 11 new loci for cardiometabolic disorders (among these most were allocated to the following genes: SLC1A4, PPM1K, F12, DHDPSL, TAT, SLC2A4, SLC25A1, FCGR2B, FCGR2A) [49].
A comparison of 95 known loci with 216 metabolite concentrations uncovered 30 new genetic or metabolic associations (P < 5 × 10^-8) and provides insights into the underlying processes involved in the modulation of lipid levels [50].
mGWAS can also be used in the assignment of new functions to genes. In metabolite quantitative trait locus (mQTL) analyses with non-targeted NMR-based metabolomics, a previously uncharacterized familial component of variation in metabolite levels, in addition to the heritability contribution from the corresponding mQTL effects, was discovered [38]. This study demonstrated that the so-far functionally unannotated genes NAT8 and PYROXD2 are new candidates for the mediation of changes in the metabolite levels of triethylamine and dimethylamine. Serum-based GWAS with LC-MS targeted metabolomics has also contributed to the field of functional annotation: SLC16A9, PLEKHH1 and SYNE2 have been assigned to transport of acylcarnitine C5 and metabolism of phosphatidylcholine PCae36:5 and PCaa28:1, respectively [34,35].
mGWAS has recently contributed to knowledge on how to implement personalized medicine by analysis of the background of sexual dimorphism [51]. In 3,300 independent individuals 131 metabolite traits were quantified, and this revealed profound sex-specific associations in lipid and amino acid metabolism, for example in the CPS1 locus (carbamoyl-phosphate synthase 1; P = 3.8 × 10^-10) for glycine. This study has important implications for strategies concerning the development of drugs for the treatment of dyslipidemia and their monitoring; an example would be statins, for which different predispositions should now be taken into account for women and men.

GWAS and metabolic pathway identification
By integrating genomics, metabolomics and complex disease data, we may be able to gain important information about the pathways that are involved in the development of complex diseases. These data are combined in systems biology [52] and systems epidemiology evaluations [53,54]. For example, SNP rs1260326 in GCKR lowers fasting glucose and triglyceride levels and reduces the risk of type 2 diabetes [55]. In a recent mGWAS [35], this locus was found to be associated with different ratios between phosphatidylcholines, thus providing new insights into the functional background of the original association. The polymorphism rs10830963 in the melatonin receptor gene MTNR1B has been found to be associated with fasting glucose [56], and the same SNP associates with tryptophan:phenylalanine ratios in mGWAS [35]: this is noteworthy because tryptophan is a precursor of melatonin. This may indicate a functional relationship between the tryptophan-melatonin pathway and the regulation of glucose homeostasis. The third example is SNP rs964184 in the apolipoprotein cluster APOA1-APOC3-APOA4-APOA5, which associates strongly with blood triglyceride levels [57]. The same SNP associates with ratios between different phosphatidylcholines in mGWAS [35]: these are biochemically connected to triglycerides by only a few enzymatic reaction steps.

Conclusions
By combining metabolomics as a phenotyping tool with GWAS, the studies gain more precision, standardization, robustness and sensitivity. Published records worldwide illustrate the power of mGWAS. They provide new insights into the genetic mechanisms of diseases that are required for personalized medicine.

Competing interests
The author has no competing interests to declare.