Genetic determinants of metabolism in health and disease: from biochemical genetics to genome-wide associations

Increasingly sophisticated measurement technologies have allowed the fields of metabolomics and genomics to identify, in parallel, risk factors of disease; predict drug metabolism; and study metabolic and genetic diversity in large human populations. Yet the complementarity of these fields and the utility of studying genes and metabolites together are belied by the frequent separate, parallel applications of genomic and metabolomic analysis. Early attempts at identifying co-variation and interaction between genetic variants and downstream metabolic changes, including metabolic profiling of human Mendelian diseases and quantitative trait locus mapping of individual metabolite concentrations, have recently been extended by new experimental designs that search for a large number of gene-metabolite associations. These approaches, including metabolomic quantitative trait locus mapping and metabolomic genome-wide association studies, involve the concurrent collection of both genomic and metabolomic data and a subsequent search for statistical associations between genetic polymorphisms and metabolite concentrations across a broad range of genes and metabolites. These new data-fusion techniques will have important consequences in functional genomics, microbial metagenomics and disease modeling, the early results and implications of which are reviewed.


Introduction
The last few decades have witnessed a radical change in the biological sciences, as the advent of the 'omics' era has brought large-scale measurement of genes [1,2], transcripts [3], proteins [4,5] and metabolites [6][7][8]. Genetics has undergone a high-throughput revolution, with genomic sciences enabling the rapid acquisition of genome-wide gene-expression profiles, polymorphisms and, more recently, whole genome sequences [9]. These advances have been matched by advances in the measurement of small-molecule metabolites in the associated fields of metabonomics [10,11] and metabolomics [8,12]. Like genomics, these fields aim for comprehensive measurement and analysis of variations, but in the complement of low-molecular-weight compounds within a cell, tissue or biofluid.
For the past 20 years, developments in genomics and metabonomics have progressed on parallel tracks, exchanging experimental designs, data-analysis techniques and applications in basic biology and medicine [13][14][15][16]. Yet genes and metabolites are intrinsically co-informational, each shedding light on complementary biological processes. The genome encodes the metabolic capacities of the cell (the microbiome also influences mammalian metabolism), and changes in the activity and function of enzymes, transporters and transcription factors resulting from genetic variations have a direct impact on the identities and quantities of both intracellular and extracellular metabolites. Metabolite concentrations are ultimately quantitative, phenotypic traits, the genetics of which are described by the quantitative trait locus (QTL), a DNA sequence controlling the phenotypic outcome of the quantitative trait, such as a metabolite concentration. Since the origins of biochemical genetics over a century ago, the integrated study of genetics and metabolism has produced significant advances in the understanding of basic biological processes and in the diagnosis and treatment of human disease [17].
Metabolic profiling of single gene mutations [18] and QTL mapping of single metabolic traits [19] both represent early attempts at identifying gene-metabolite associations through omic sciences, by regressing one gene against many metabolites or one metabolite against many genes. In more recent studies, metabolome-wide profiling of biofluids by nuclear magnetic resonance (NMR) spectroscopy or mass spectrometry (MS) is combined with whole-genome profiling of single nucleotide polymorphisms (SNPs) to identify many gene-metabolite associations simultaneously, by regressing many metabolite levels against many polymorphisms [20].
While these studies, termed metabolomic quantitative trait locus (mQTL) mapping, or metabolomic genome-wide association studies (mGWAS), are often applied to human genes and human biofluid profiles, these techniques hold the additional promise of investigating interactions between gut microflora genes and biofluid metabolites. By regressing metagenomic sequences against metabolic profiles, metagenome-wide metabolome-wide associations provide insight into the metabolic cross-talk of bacterial and human genomes in the larger human superorganism [21,22]. This review discusses the highly analogous methods and applications of genomics and metabolomics, as well as recent attempts at integrating the two fields towards a more comprehensive and holistic understanding of gene function and the control of metabolic processes.

Developments in instrumentation and experimental design in metabonomics parallel those in genomics
Increases in the speed, accuracy and coverage of genomic analysis have been mirrored by technological developments in the large-scale measurement of low-molecular-weight metabolites, using the two major analytical platforms of NMR spectroscopy and MS [23]. While these techniques feature varying strengths in coverage, sensitivity, selectivity for various chemical classes, reproducibility, provision of structural information, and sample-preparation requirements, both stand out in their capacity to measure a large number of small-molecule analytes in an untargeted fashion from complex biological mixtures, such as human biofluids [24].
One of the most popular analytical chemistry techniques, NMR spectroscopy has a long history of application in organic chemistry for structural identification and is used extensively in metabonomics. NMR spectroscopy is characterized by the following key properties, which make it fit for this purpose: (i) high dynamic range, with several biological nuclei, such as ¹H, ¹³C, ¹⁵N or ³¹P, being accurately measured over a large range of concentrations; (ii) high linearity of signal intensity with concentration; and (iii) high reproducibility. In particular, ¹H NMR spectroscopy is robust, provides a high degree of structural information for both one-dimensional ¹H and two-dimensional ¹H-¹H NMR, and is flexibly applied to extracts, biofluids and solid tissues using high-resolution magic angle spinning, and in vivo using magnetic resonance spectroscopy. Technological advancements, including higher magnetic field strengths (600, 800 and, recently, 1,000 MHz spectrometers), new pulse sequence experiments and cryogenically cooled probes, have increased the sensitivity and coverage for low-molecular-weight metabolites and lipid components from urine, plasma, serum and tissue samples [11].
MS, coupled to either liquid (LC) or gas (GC) chromatography, is also frequently applied to profile the metabolome [25][26][27]. LC- and GC-MS both boast high sensitivity to low-concentration analytes, as well as high-resolution chromatographic separation. While these techniques have faced challenges in compound identification, reproducibility and bias towards certain compound functional groups, rapid technological development in MS has addressed many of these challenges. Advances in chromatography, including ultrahigh-performance liquid chromatography (UPLC) and multidimensional gas chromatography (GC×GC, 3D-GC), have increased the speed and reproducibility of chromatography-coupled MS [28]. Additionally, advances in MS, including high-resolution time-of-flight (ToF) and quadrupole time-of-flight (qToF) instruments, along with serial ion fragmentation (MS-MS, MSⁿ), have improved resolution, coverage and identification of low-molecular-weight species [29].
While many comparisons have been made between the utility of these techniques for metabolite analysis [30], advances in both have paralleled advances in speed, accuracy and coverage in genomics. As a result, broad-coverage metabolite profiling using NMR spectroscopy and MS has been increasingly used to answer the same questions as genomics, especially in the search for risk factors for disease at the population level and predictors of drug metabolism in personalized medicine [13][14][15][16]. These are mirrored in parallel developments in experimental design and data analysis.
Just as the introduction of the genome-wide association study (GWAS) began the search for associations between genome-wide polymorphisms and disease phenotypes in large population cohorts [15,31], the metabolome-wide association study (MWAS) has used large numbers of biofluid spectra and statistical regression to search for associations between metabolites present in human biofluids and both quantitative and binary physiological and pathological traits [16]. Two-class experimental designs, in which metabolite associations with disease are identified by statistical regression of metabolic profiles against binary variables for affected and control individuals, are common, and associations between metabolites and disease have been reported for obesity and insulin resistance [32], prostate cancer [33], autism [34], ulcerative colitis [35] and more [36]. However, the recent application of metabolic profiling to large population cohorts, quantitative traits and population differences (termed MWAS) has revealed metabolite associations with diet and blood pressure [37], region and cardiovascular risk [38] and ethnicity [39].
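The two-class design described above can be sketched in a few lines. This is only an illustration under stated assumptions: the metabolite names and intensities are invented, and Welch's t statistic is swapped in as a simple stand-in for the regression against a binary affected/control variable mentioned in the text (for a single binary predictor the two approaches are closely related, but they are not identical).

```python
import math
import statistics


def welch_t(cases, controls):
    """Welch's t statistic for a difference in mean metabolite level."""
    mean_diff = statistics.mean(cases) - statistics.mean(controls)
    # Standard error allowing unequal variances in the two groups.
    se = math.sqrt(
        statistics.variance(cases) / len(cases)
        + statistics.variance(controls) / len(controls)
    )
    return mean_diff / se


# Hypothetical spectral intensities (arbitrary units) for two metabolites.
profiles = {
    "metabolite_1": {
        "cases": [5.1, 5.6, 4.9, 5.8, 5.3],
        "controls": [3.9, 4.2, 4.0, 4.4, 3.8],
    },
    "metabolite_2": {
        "cases": [2.0, 2.3, 1.9, 2.1, 2.2],
        "controls": [2.1, 2.0, 2.2, 1.9, 2.3],
    },
}

# One test per metabolite; rank metabolites by strength of association.
t_stats = {
    name: welch_t(groups["cases"], groups["controls"])
    for name, groups in profiles.items()
}
ranked = sorted(t_stats, key=lambda name: abs(t_stats[name]), reverse=True)
print(ranked)
```

In a real MWAS the same test would be repeated across thousands of spectral features, so the resulting statistics would need correction for multiple testing before any metabolite is reported as associated.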
In addition to studies of disease risk, both metabolites and genes have been queried to predict drug metabolism. Paralleling previous work in pharmacogenomics, the introduction of pharmacometabonomics has demonstrated that drug metabolism can be predicted from the metabolite composition of urinary biofluids before drug administration [13,40]. Recent applications of pharmacometabonomics have highlighted metabolic predictors of acetaminophen toxicity in animals [13] and humans [41,42], capecitabine toxicity [43] and microbial influences on drug detoxification [41].

Early integration: metabolic profiling of Mendelian traits and QTL mapping of single metabolites
Since the discovery of alkaptonuria by Archibald Garrod in the early 20th century, the measurement of metabolites has been used as a proxy to identify human genetic diseases, especially inborn errors of metabolism. Uniform newborn screening for multiple inborn errors of metabolism, including urea-cycle disorders and amino- and organic-acidurias, using heel-prick testing with GC-MS-MS exemplifies the power of targeted metabolomic analysis to diagnose these diseases and enable the interruption of pathological processes resulting from genetic mutations [44]. More recently, untargeted NMR spectroscopy and MS have been used to diagnose known inborn errors [45][46][47][48], identify novel inborn errors using biofluid profiling [49,50] and identify complex downstream metabolic consequences [51][52][53] and biomarkers of organ pathology resulting from genetic mutations [54][55][56].
Untargeted metabolic profiling of biofluids, especially urine and serum, is a powerful technique for diagnosing inborn errors of metabolism with often non-specific clinical presentation, as metabolic intermediates accumulated in biofluid compartments can be easily identified [18,57,58]. As a result, diagnosis of suspected inborn errors has been reported for many Mendelian diseases [45][46][47][48][49][50], especially using NMR spectroscopy. In some cases, metabolic profiling of biofluids from patients with suspected inborn errors has led to the discovery of previously undescribed diseases, with the identification of causal genes following the description of metabolic perturbations, as occurred with aminoacylase 1 deficiency and beta-ureidopropionase deficiency [49,50].
While mutations in enzymes and transporters can often be readily diagnosed by biofluid profiling, and a strong mechanistic link is easily inferred between the disruption of a metabolic pathway and resulting accumulation or depletion of metabolic intermediates, many Mendelian diseases result in more complex, progressive organ-specific or multi-organ pathology [59]. In these cases, metabolic profiling has been applied to identify sites of lesions, describe progression and attempt to identify proxy small-molecule biomarkers of the disease. Examples of this include the identification of markers of autosomal dominant polycystic kidney disease [55], comparison of the urinary profiles of several genetic forms of renal Fanconi syndrome [56] and description of abnormal brain metabolism in Smith-Lemli-Opitz syndrome [54]. Additionally, metabolite flux analysis using isotopically labeled metabolites offers a further way to apply metabolic profiling techniques to study the impact of genetic mutations [59].
The genomic corollary of using metabolic profiling to study Mendelian genes is QTL mapping of single metabolic traits. The advent of whole-genome SNP analysis and the use of QTL mapping for quantitative traits, such as height [60], led to interest in identifying genetic loci associated with quantitative variation in individual metabolite levels [19]. Examples of this include mapping of serum leptin levels to genes on human chromosome 2 in multiple human populations [61,62], associations with plasma triglyceride levels [63,64] and identification of genetic variants associated with high-density lipoprotein (HDL) levels [65,66]. Recent studies have investigated associations between serum lipid fractions and polymorphisms, constituting an intermediate between traditional metabolite QTL mapping and mQTL/mGWAS [67]. Like many QTL mapping studies, attempts to identify loci significantly associated with biofluid levels of single metabolites frequently indicate a large number of genetic associations. This is almost certainly an indication of complex, multigenic control processes regulating energy metabolism and homeostasis, and the identification of many loci provides important documentation of the genes involved in complex metabolic pathways. However, when a large number of loci each contribute a potentially small percentage of the observed variance in metabolite levels, direct interpretation of genotype-phenotype relationships is complicated. Figure 1 shows a schematic illustration of gene-metabolite correlations in biochemical genetics, traditional QTL mapping, and mQTL/mGWAS.

Identifying the genetic determinants of the metabolome: mQTL and mGWAS
GWAS currently requires increasingly large cohorts to ensure discovery of new genes associated with disease phenotypes [68]. Although this approach is very efficient, the biological relevance of these associations can be difficult to assess. The identification of phenotypes related to disease mechanism, onset and progression represents a promising research avenue.
The systematic search for molecular endophenotypes (that is, internal phenotypes) that can be mapped onto the genome began with the quantitative genetic analysis of gene-expression profiles, referred to as genetical genomics [69] or expression QTL (eQTL) mapping [70]. Treating genome-wide gene-expression profiles as quantitative traits was originally developed in model organisms and applied to humans [70,71]. In eQTL mapping, cis-regulatory associations between genomic variations and gene-expression levels are discovered by integrated analysis of quantitative gene-expression profiles and SNPs. The identification of a SNP at a gene locus affecting its own expression represents a powerful self-validation. However, eQTL mapping presents a series of drawbacks: (i) frequently analyzed cell lines often have altered gene expression, and access to biopsy samples from organs directly relevant to pathology is often impossible; and (ii) due to the gene-centric nature of eQTL mapping, this approach bypasses the biological consequences of the endophenotypes generating the association.
Immediately following the success of the eQTL mapping approach [70], metabolic profiles were included as endophenotypic quantitative traits. This led to mapping of multiple quantitative metabolic traits directly onto the genome to identify mQTLs in plants [72,73], then in animal models [74,75]. In mQTL mapping, individuals are genotyped and phenotyped in parallel and the resulting genome-wide and metabolome-wide profiles are then quantitatively correlated (Box 1). mQTL mapping presents a significant advantage over gene-expression products such as transcripts [70] or proteins [76]: the ever-increasing coverage of the metabolome allows a glimpse at the real molecular endpoints, which are closer to the disease phenotypes of interest. The success of mQTL mapping in plants [72,73] and then in mammalian models [75] was quickly followed by the development of mGWAS in human cohorts ([77][78][79][80][81][82][83]; see also the review by J Adamski [84]).
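The core of an mQTL scan, correlating every genotyped marker with every measured metabolite, can be sketched as follows. This is a minimal illustration and not the method of any cited study: the cohort is simulated, the SNP and metabolite names are hypothetical, genotypes are coded as minor-allele counts (0, 1 or 2), and the P value uses a normal approximation based on Fisher's z-transform of the Pearson correlation rather than the regression models used in dedicated QTL software.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP or constant metabolite: no signal
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided P value from Fisher's z-transform (normal approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def mqtl_scan(genotypes, metabolites):
    """Test every SNP against every metabolite; return -log10(P) per pair."""
    scores = {}
    for snp, dosage in genotypes.items():
        for met, conc in metabolites.items():
            p = approx_p_value(pearson_r(dosage, conc), len(dosage))
            scores[(snp, met)] = -math.log10(max(p, 1e-300))
    return scores


# Simulated cohort: 200 individuals, 3 SNPs, 2 metabolites (all hypothetical).
rng = random.Random(0)
n = 200
genotypes = {
    f"snp{i}": [rng.choice([0, 1, 2]) for _ in range(n)] for i in range(3)
}
metabolites = {
    # metA carries a planted additive effect of snp0; metB is pure noise.
    "metA": [0.8 * g + rng.gauss(0, 1) for g in genotypes["snp0"]],
    "metB": [rng.gauss(0, 1) for _ in range(n)],
}

scores = mqtl_scan(genotypes, metabolites)
top_pair = max(scores, key=scores.get)
print(top_pair, round(scores[top_pair], 1))
```

The nested loop makes the scale of the problem explicit: the number of tests is the product of the number of markers and the number of metabolic features, which is why multiple-testing correction dominates the analysis of real data sets.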
One of the distinctive features of mGWAS is the intrinsically parallel identification of associations between monogenetically determined metabolic traits and their causative gene variants (see Table 1 for a list of human mQTL-metabolite associations).
The mechanistic explanation of gene/metabolite associations identified by mQTL mapping can be difficult. The simplest case corresponds to associations between genes encoding enzymes and metabolites, which are either substrates or products of the enzyme they are associated with [74,75] (Figure 2). This corresponds to a direct cis-acting mechanism. Also, one of the interesting discoveries from results obtained by Suhre et al. is that a number of gene variants causing metabolic variation correspond to solute transporter genes, as the majority of the genes in this category belong to the solute carrier (SLC) family [78,81,82]. Again, this corresponds to a direct mechanistic link. In other cases, the link between gene variants and their associated metabolites can demonstrate pathway, rather than direct, connectivity, such as polymorphisms in enzymes associated with  More opaque associations may be trans-acting in a broader sense: the causative gene variant can be a molecular switch, and the metabolite it is associated with is in fact regulated indirectly by this molecular switch (further down in the regulation events). This is particularly the case when the causative gene variant encodes a transcription factor, inducing the medium- to long-term expression of entire gene networks, or when the gene variant encodes a kinase or a phosphatase regulating entire pathways on much shorter time-scales. Unlike cis-acting mQTL/metabolite associations, which can be seen as self-validation of the causative gene at the locus, trans-acting mQTL associations present the challenge of identifying the most relevant causative gene at the locus. If a SNP is associated with a metabolite, the closest gene at the locus is not necessarily the most relevant candidate, and further investigation of a larger biological network, such as protein-protein interactions [85], may be necessary to identify mechanistic relationships between genetic variants and downstream metabolism.
Despite these challenges, which are familiar to practitioners of biochemical genetics, statistical identification of gene-metabolite associations by mQTL and mGWAS promises to significantly advance the current understanding of gene function, metabolic regulation and mechanisms of pathology.

A glimpse of our extended genome with microbiome-metabolome associations
The functional genomic association studies and data-integration approaches described above rely predominantly on mammalian genome sequences and their annotation (excluding MWAS, which makes use only of metabolite profiling data and does not, as such, require genomic data). However, human phenotypes result from the interaction of several sets of genes: the karyome, the chondriome and the microbiome, respectively corresponding to eukaryotic chromosomes, mitochondrial chromosomes and, finally, gut bacterial prokaryotic chromosomes. The latest human gut microbiome gene catalogue, which was dubbed 'our other genome', identified 3.3 million non-redundant genes [86]. The bacterial species composition of the gut microbiome varies from one individual to another, but this variation is stratified, not continuous, and suggests the existence of stable bacterial communities, or 'enterotypes' [87].
The classical identification of associations between gut bacteria and metabolites has been performed on a case-by-case basis for decades. However, the correlation of metabolic profiles with multiple gut bacterial abundance profiles was initiated a few years ago with the introduction of bacteria/metabolite association networks [21]. Semi-quantitative characterizations of microbial populations using denaturing gradient gel electrophoresis (DGGE) and fluorescent in situ hybridization (FISH) have yielded associations with obesity and related metabolites [88]. Recently, the introduction of high-throughput sequencing of bacterial 16S rDNA profiles and correlation with metabolic profiles has greatly increased the coverage and quantification of microbial species [89]. The correlation of metabolic profiles with 16S rDNA microbiome profiles provides a strategy for the identification of co-variation between metabolites and bacterial taxa, and such associations point to the production or regulation of metabolic biosynthesis by these microbes.
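Co-variation between bacterial taxa and metabolites can be illustrated with a rank correlation across samples. The sketch below is an assumption-laden toy example: the taxon names, relative abundances and metabolite concentrations are invented, and Spearman's correlation is chosen here only because abundance data are often skewed; the cited studies may have used other measures.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


# Hypothetical data for six samples: relative abundances of two taxa
# (e.g. from 16S rDNA profiling) and one metabolite concentration.
taxa = {
    "taxon_A": [0.10, 0.25, 0.05, 0.40, 0.30, 0.15],
    "taxon_B": [0.40, 0.20, 0.50, 0.05, 0.10, 0.30],
}
metabolite_1 = [1.2, 2.9, 0.8, 5.1, 3.3, 1.9]

associations = {name: spearman(ab, metabolite_1) for name, ab in taxa.items()}
print(associations)
```

A correlation of this kind indicates co-variation only; as the text notes, it points to possible production or regulation of a metabolite by a taxon rather than demonstrating it.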
Given these early successes, the integration of metabolome-wide experimental profiles with metagenome-wide metabolic reconstruction models obtained from full microbiome sequencing should provide a clear insight into the functional role of the gut microbiome, especially the synthesis of metabolites and resultant impacts on human metabolism. This critical need for a marriage between metabolomics/metabonomics and metagenomics has been clearly identified for several years [90]. How new experimental data change our understanding of our commensal microflora remains to be seen.

Box 1. Mathematical modeling for mQTL identification
The statistical analysis involved in mQTL mapping and mGWAS does not currently differ substantially from the statistical methods used to identify genetic loci associated with single quantitative traits. mQTL and mGWAS involve independent QTL mapping of each metabolite identified by metabolic profiling, though accurate analysis is dependent upon proper preprocessing of both genomic and metabonomic data. Associations are identified using techniques such as Haley-Knott regression implemented in the R/qtl package, which uses local information about surrounding markers [103], or typical univariate association tests such as χ² or Cochran-Armitage trend tests implemented in PLINK [104]. The results of mQTL and association mapping are typically displayed using a logarithm of odds (LOD, -log10(P value)) score, which allows establishment of genome/metabolome LOD score maps [74,75], or more classical Manhattan plots [77,78,81,82] (Figure 2).
The main challenge in mQTL data modeling is multiple testing. Assuming the use of high-resolution metabolic profiles (1,000 to 10,000 features) and genome-wide SNP coverage (600,000 SNPs), a typical metabolome-wide GWAS can apply between 600,000,000 and 6,000,000,000 univariate tests. Given the number of tests involved, there are numerous opportunities for false discoveries, and multiple testing corrections are required to account for this. Genome-wide significance levels can be estimated using Bonferroni correction [77], but also using Benjamini and Hochberg or Benjamini and Yekutieli corrections [105]. Finally, permutation and resampling methods also provide empirical estimates for false discovery thresholds [74,79].
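The test counts and corrections described in Box 1 can be made concrete with a short calculation. The sketch below assumes a nominal significance level of 0.05 and uses an invented list of ten P values; it shows the Bonferroni threshold for the two test counts given in the text and a plain implementation of the Benjamini and Hochberg step-up procedure, not the code of any cited study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling family-wise error."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return sorted indices of tests rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Step-up rule: find the largest rank k with p_(k) <= q * k / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Test counts from the text: 600,000 SNPs x 1,000 to 10,000 features.
n_snps = 600_000
thresholds = {}
for n_features in (1_000, 10_000):
    n_tests = n_snps * n_features
    thresholds[n_tests] = bonferroni_threshold(0.05, n_tests)
    print(n_tests, thresholds[n_tests])

# Invented P values from ten tests, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonferroni_hits = [
    i for i, p in enumerate(p_values)
    if p <= bonferroni_threshold(0.05, len(p_values))
]
bh_hits = benjamini_hochberg(p_values, q=0.05)
print(bonferroni_hits, bh_hits)
```

In this toy example the Benjamini and Hochberg procedure retains one more association than Bonferroni correction, reflecting the usual trade-off between strict control of any false positive and control of the expected proportion of false discoveries.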

Future directions: the rise of sequencing and consequences for genome-metabolome data fusion
Genomics is currently undergoing yet another revolution, as next-generation sequencing technologies increase the accuracy, coverage and read-length, and drastically decrease the cost of whole-exome sequencing (WES) and whole-genome sequencing (WGS). The introduction of third-generation sequencing technologies in the near future promises to continue this trend [91]. Consequently, the near term promises a dramatic expansion in the availability of sequence data both in the laboratory and in the clinic. The relevance of the explosion of sequence data to the continued integration of metabonomic and genomic data is twofold: first, an opportunity for metabonomics to contribute to the increased clinical presence of omics sciences led by genome sequencing; and second, a challenge to develop methods of integrating metabolic profiles with sequences rather than polymorphisms.
The introduction of WES and WGS into the clinic is already well underway, with success stories including discoveries of new Mendelian disorders [92,93] and successful therapy designed on the basis of mutation discovery [94]. Of known and suspected human Mendelian diseases, molecular bases have been identified for over 3,000, with another approximately 3,700 phenotypes suspected of having a Mendelian basis [95,96]. As sequencing identifies an increasing number of variants with associations to disease, the rate-limiting step in genomic medicine will move from discovery to functional annotation of sequence variants. Metabolite profiling, along with other high-throughput measurement and data-analysis technologies, may find increasing acceptance in medicine, as investigators rush to keep up with a deluge of sequence data. Increasingly, routine genome sequencing will create a significant resource for large-scale population studies like those currently used to identify gene-metabolite associations, and will critically include rare variants not captured by polymorphism data.
Recent results demonstrate the potential power of integrating WES/WGS with metabolic profiling and mQTL/mGWAS. In 2011, a publication in the New England Journal of Medicine reported the discovery by WES of a novel human Mendelian disease in which rare mutations in the NT5E gene result in arterial calcifications due to loss of function of CD73 (encoded by NT5E), which converts AMP to adenosine in the vasculature [97]. Within the year, an independent metabolome-wide GWAS published in Nature reported a statistically significant association between a SNP near the NT5E locus (rs494562) and inosine concentration in human serum, as part of a much larger set of gene-metabolite associations (Table 1) [81]. While in this case the publication of the statistical association followed the description of the human phenotype, future genetic studies will be greatly aided by gene-metabolite co-variation discovered by association studies.
Despite the significant opportunity represented by low-cost sequencing for the integration of genomic and metabonomic data and the identification of gene-metabolite associations, several challenges stand in the way of routine paired analysis of sequences and metabolic profiles. The first of these is the challenge of discovering significant associations with low sample numbers. Many of the successes reported in clinical sequencing have made use of data from a small number of patients, and sometimes only a sole patient. In these cases, potentially disease-causative variants are often identified using filtering strategies rather than statistical analysis [98,99]. While the diagnosis of human inborn errors of metabolism from single-patient biofluid NMR spectra demonstrates the potential of metabolic profiling to work with low sample numbers, lack of statistical validation means that the 'biological signal' in these cases must be quite marked. A second challenge is a dearth of tools for statistical analysis of sequence data. While QTL mapping using SNPs is well established, statistical techniques for QTL mapping with both rare and common variants are just beginning to be introduced [100]. It is likely that increased availability of large-scale population sequence data from initiatives such as the 1000 Genomes Project [101,102] and ClinSeq [103] will spur the development of statistical methods that can be deployed to identify gene-metabolite associations.
Of the omics sciences, genomics and metabolomics are uniquely complementary, the strengths of each addressing weaknesses of the other. Genes are (mostly) static, an 'upstream' blueprint controlling dynamic biological processes. The identities and quantities of 'downstream' metabolites capture both genetic and environmental influences, and can be measured serially to assess variation through time. Genomic studies often struggle to establish a firm link between genetic variants and phenotypic observations, and while metabonomics provides a closer proxy to phenotype, it is often difficult to infer underlying causality from variations in metabolism. Together, the integrated application of genomics and metabonomics promises a bridging of the gap between genotype and phenotype through intermediate metabolism, to help annotate genes of unknown function, genetic controls of metabolism, and mechanisms of disease.

Table 1. Shown here are the SNP-metabolite associations with the highest statistical significance, as in [77,79,81-83]. Associations with metabolite concentration were reported for a total of 28 unique SNPs, as shown above. Associations with ratios of multiple metabolites were reported for an additional 30 unique SNPs, but are not included in this table.