Integrating post-genomic approaches as a strategy to advance our understanding of health and disease

Following the publication of the complete human genomic sequence, the post-genomic era is driven by the need to extract useful information from genomic data. Genomics, transcriptomics, proteomics, metabolomics, epidemiological data and microbial data provide different angles to our understanding of gene-environment interactions and the determinants of disease and health. Our goal and our challenge are to integrate these very different types of data and perspectives of disease into a global model suitable for dissecting the mechanisms of disease and for predicting novel therapeutic strategies. This review aims to highlight the need for and problems with complex data integration, and proposes a framework for data integration. While there are many obstacles to overcome, biological models based upon multiple datasets will probably become the basis that drives future biomedical research.

Genetic analysis in the post-genomic era
In 1990, the human genome project was established to sequence the human genome [1], with the aim of applying the acquired genomic data to improve disease diagnosis and determine genetic susceptibility [2]. The publication of the first draft sequence of the human genome in 2001 [3] was thus followed by a rapid growth of different approaches to extract useful information from the genomic sequence. These approaches included, but were not limited to, the analysis of genetic variation (genomics), gene expression (transcriptomics), and gene products (proteomics) and their metabolic effects (metabolomics).
Each of these post-genomic approaches has already contributed to our understanding of specific aspects of the disease process and the development of diagnostic/prognostic clinical applications. Cardiovascular disease [4,5], obesity [6-8], diabetes [9-11], autoimmune disease [12,13] and neurodegenerative disorders [14,15] are some of the disease areas that have benefited from these types of data. Taking the metabolic syndrome as an example, our knowledge of all aspects of the disease has grown. The metabolic syndrome is the result of a complex bioenergetic problem characterized by disturbances in lipid, carbohydrate and energy metabolism and blood pressure. In combination, these metabolic factors contribute to an increased susceptibility to cardiovascular disease, morbidity and mortality [16]. Genome-wide association (GWA) studies have identified possible genes involved in each aspect of the syndrome, namely type 2 diabetes [11], obesity [17] and hyperlipidaemia [18]. The findings have confirmed the role of certain candidate genes as well as the polygenic nature of the syndrome. Not surprisingly, replicate GWA studies of type 2 diabetes revealed that the genes associated with disease, among others, are involved in beta-cell function and adipocyte biology [11,17,19]. In contrast, genes found to be associated with obesity appear to be those that are predominantly involved in central appetite regulation [20-22] as key contributors to positive energy balance.
Genetic association studies in epidemiology have highlighted a number of issues. Firstly, many common disease states are related to either many genetic polymorphisms of small effect or, in selected cases, to a few of large effect. The involvement of multiple genes with unequal contributions to disease hints at complex gene-gene and gene-environment interactions. Understanding such interactions becomes a daunting task when other modulating factors remain unknown. Secondly, some common diseases such as type 2 diabetes [12] appear to be less genetically determined than diseases such as rheumatoid arthritis [12] and obesity [23]. In these situations, our understanding of pathophysiology requires additional data beyond genomic information. Thirdly, the initial failures to find robust, replicable associations between most of the identified genetic variants and common complex diseases suggest that genomic analysis alone will not account for all of the heritability and phenotypic variation [9,24]. For this reason, there is a growing need to incorporate information derived from environmental studies and post-genomic data into genetic analysis.
Advantages of combining multiple types of data
It is clear that the genetic approach captures only one layer of the complexity inherent within human biology. There is thus a need to integrate multiple 'omics' datasets when aiming to unravel the molecular networks underlying common human disease traits [25]. Attempts have been made to combine two datasets in relation to the clinical phenotype, and this is reflected in the combination of terms found in the literature, for example metagenomics, pharmacogenomics and epigenetics. Many of the post-genomic approaches linking the genetic association data with other 'omics' layers focus on the use of 'omics'-derived phenotypic data as quantitative traits. Such approaches have previously been applied, by combining genetics and metabolomics, in plant functional genomics [26]. More recently, they have also been applied to human datasets. For example, Papassotiropoulos and colleagues [15] identified clusters of cholesterol-associated susceptibility genes for Alzheimer's disease by combining genetics with sterol profiling, while Gieger and colleagues [27] used ratios of metabolites to identify the function of putative genes. In another study, proteomics was linked to quantitative trait loci (QTL) in an attempt to identify changes in function rather than quantity of the protein [28].
By combining multiple types of techniques, including genetics, transcriptomics, proteomics and metabolomics, we expect a shift toward 'environmentome' research, where all available information from periconception to disease onset, using both longitudinal and cross-sectional experimental designs, can be obtained [9]. The measurement of traits that are modulated but not encoded by the DNA sequence, commonly referred to as intermediate phenotypes, is of particular interest. These intermediate phenotypes include not only biochemical (metabolites) and genomic (gene expression) traits, but also an individual's microbial (gut microflora) [29,30] and social traits. It is conceivable that by comprehensively examining an individual's 'environmentome', we would be able not only to understand both the genetic and environmental determinants of disease, but also to develop 'feasible' personalized medicine, that is, to tailor specific personalized interventions to the individual's own environmental profile. As a pioneering example of this kind, Orešič and colleagues [10] investigated metabolic profiles of children between birth and type 1 diabetes onset in a large birth cohort, and established that specific metabolic phenotypes, not dependent on human leukocyte antigen (HLA)-associated genetic risk, precede the first autoimmune response. The excitement of this research lies in the expectation that these early metabolic phenotypes may be validated as specific diagnostic and prognostic markers of disease, with therapeutic implications.
Establishing disease causality as a framework for data integration
The goal of inferring disease causality and disease mechanisms from integrated data is complicated by the fact that measuring more variables may provide a better characterization of the process but still does not contribute directly to our understanding of cause and effect. In fact, given the progressively increasing number of variables that we can measure, the odds of finding spurious associations that do not reflect true causality are much higher. Confounding and reverse causality are among the main sources of bias behind failures to replicate apparently robust associations between risk factors and diseases [31]. Confounding specifically refers to a spurious causal effect inferred from the association between a risk factor and a disease due to the existence of common causes of both, that is, confounding factors. This type of spurious causal effect can be removed if we have enough knowledge about the most likely confounding factor candidates. However, the truth is that for most epidemiological studies confounding factors are unknown and difficult to measure, especially in case-control studies. Reverse causality, the second source of bias, refers to an alternative explanation for the observed association between a risk factor and disease, which states that the 'risk factor' is a result of the disease, rather than vice versa. The problem of reverse causality is particularly prevalent in retrospective case-control studies.
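The growth in spurious associations as more variables are measured can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that all tests are independent and that every null hypothesis is true, which rarely holds for correlated 'omics' variables. The function names and the 5% significance level are our own assumptions, not taken from the studies cited here.

```python
def prob_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n independent tests
    when no true association exists (all null hypotheses hold)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate
    at or below alpha (Bonferroni correction)."""
    return alpha / n_tests


# Hypothetical numbers of measured variables, each tested once.
for n in (1, 10, 100, 1000):
    print(n, prob_any_false_positive(n), bonferroni_threshold(n))
```

With a fixed per-test threshold, the chance of at least one spurious finding approaches certainty as the number of tested variables grows, which is why stricter thresholds or model-based approaches are needed for high-dimensional data.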
One example of a potentially confounded association is the established epidemiological evidence of a strong link between obesity and insulin resistance. This association has recently been called into question by the identification of specific clinical settings where fat mass dissociates from insulin resistance [32,33]. This implies that the adipose tissue expansion typically associated with obesity may not, per se, be the cause of metabolic complications. A potential alternative explanation may be related to an individual's ability to optimally store fat. In the presence of caloric excess, a person is likely to remain metabolically healthy despite obesity, provided their adipose tissue can continue to expand and safely store fat [34]. Therefore, while the epidemiological evidence associates the risk of metabolic complications with increased body weight, this relationship may not be direct and may not necessarily reflect a truly biologically relevant process.
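How a common cause can create an association between two variables that have no direct causal link can be illustrated with simulated data. The following is a minimal sketch under stated assumptions: all variables are hypothetical, normally distributed and linearly related, and the confounder is fully measured, which the text above notes is rarely true in practice.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """Residuals from a simple least-squares regression of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)
n = 5000
confounder = [rng.gauss(0, 1) for _ in range(n)]
# Hypothetical data: both variables depend on the confounder, but the
# risk factor has NO direct effect on the disease score.
risk_factor = [c + rng.gauss(0, 1) for c in confounder]
disease = [c + rng.gauss(0, 1) for c in confounder]

# Crude association, ignoring the confounder.
crude = pearson(risk_factor, disease)
# Association after regressing the confounder out of both variables.
adjusted = pearson(residuals(risk_factor, confounder),
                   residuals(disease, confounder))
print(f"crude r = {crude:.2f}, adjusted r = {adjusted:.2f}")
```

The crude correlation is clearly positive while the adjusted one is close to zero, but the adjustment is only possible here because the confounder is known and measured.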
A randomized controlled trial (RCT) is the gold standard for excluding the spurious associations that arise from confounding and reverse causality. An RCT involves random allocation of risk factors to subjects, such that the distribution of known and unknown confounders in the different groups is roughly equal; that is, the risk factors become dissociated from any confounders due to the randomization. Furthermore, since the initial randomization precedes the disease response, reverse causality is highly unlikely. However, the use of RCTs to determine causality is often not possible due to enormous ethical, financial or technical difficulties.
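The balancing effect of random allocation can be sketched with a short simulation. The example is hypothetical: the confounder, the self-selection rule for the observational setting and the sample size are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)
n = 10000
confounder = [rng.gauss(0, 1) for _ in range(n)]

# Observational setting (assumed): subjects with a higher confounder
# value are more likely to end up exposed to the risk factor.
observed = [c + rng.gauss(0, 1) > 0 for c in confounder]
# Randomized setting: exposure is assigned by a fair coin flip.
randomized = [rng.random() < 0.5 for _ in range(n)]


def imbalance(exposed):
    """Difference in mean confounder value, exposed minus unexposed."""
    a = [c for c, e in zip(confounder, exposed) if e]
    b = [c for c, e in zip(confounder, exposed) if not e]
    return statistics.fmean(a) - statistics.fmean(b)


print(f"observational imbalance: {imbalance(observed):.2f}")
print(f"randomized imbalance:    {imbalance(randomized):.2f}")
```

Under random allocation the two groups have nearly the same confounder distribution even though the confounder is never used in the assignment, which is the property that Mendelian randomization tries to approximate.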
An alternative to RCTs could be Mendelian randomization, which has been proposed as a practical strategy to overcome the problem of experimental bias while significantly reducing the difficulties inherent to RCTs [35,36]. The experimental design of Mendelian randomization aims at providing a potential way to discern true causality from spurious associations, provided that several basic assumptions are valid (Figure 1). The idea of Mendelian randomization originated from Katan's letter to The Lancet [37], where the main objective was to test the hypothesis that low serum cholesterol increases the risk of cancer against the alternative that cancer induces a lowering of cholesterol, that is, to test a hypothesis against reverse causality. Using the language of graphical models [38], Mendelian randomization can be formulated in a triangulation representation as shown in Figure 1. The essence of Mendelian randomization is the use of a genetic variant as a proxy for the random assignment of a risk factor to subjects, given that the inheritance of the genetic variant in a population is also random according to Mendel's second law. Mendelian randomization may provide a rational approximation to RCTs that can be used to identify real causal factors contributing to diseases.
Data integration based upon Mendelian randomization
We envisage that the potential of combining different post-genomic approaches for discovering disease causality and mechanisms could be integrated within the framework of Mendelian randomization. In order to apply this idea to distinguish between association and causation, we need to first justify the three core assumptions that underlie the applicability of Mendelian randomization (Figure 1). Two of the three assumptions (1 and 3) depend on unobserved confounding factors and, therefore, cannot be formally tested from observable data.
Therefore, the three associations that are needed in the Mendelian randomization model, that is, the genotype-phenotype association, the phenotype-disease association, and the genotype-disease association, require a certain degree of initial characterization. Clearly, these initial models will need to be continually refined as new data challenge the validity of the assumptions. The downstream impact of these assumptions is not trivial, as a failure to detect robust associations could invalidate the power of Mendelian randomization. While this may imply that Mendelian randomization requires our complete understanding of the biological system, in practice some apparent violations may not actually negate its biological implications [36,39]. Applied carefully, Mendelian randomization can become a useful framework for data integration.
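The way the genotype-phenotype and genotype-disease associations combine into a causal estimate can be sketched with simulated data. The example below uses the ratio of the two associations, a standard instrumental-variable estimator that is our choice for illustration and is not named in the text above. It assumes linear effects, a continuous disease trait and a simulated genotype that satisfies all three core assumptions by construction; every numerical value is hypothetical.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(42)
n = 20000
true_effect = 0.3  # assumed causal effect of phenotype on disease trait

genotype, phenotype, disease = [], [], []
for _ in range(n):
    # Risk-allele count (0, 1 or 2), drawn independently of the confounder.
    g = (rng.random() < 0.3) + (rng.random() < 0.3)
    u = rng.gauss(0, 1)                        # unobserved confounder
    x = 1.0 * g + u + rng.gauss(0, 1)          # intermediate phenotype
    y = true_effect * x + u + rng.gauss(0, 1)  # continuous disease trait
    genotype.append(g)
    phenotype.append(x)
    disease.append(y)

# Naive regression slope of disease on phenotype: biased by the confounder.
naive = covariance(phenotype, disease) / covariance(phenotype, phenotype)
# Ratio estimate: genotype-disease over genotype-phenotype association.
ratio = covariance(genotype, disease) / covariance(genotype, phenotype)
print(f"naive = {naive:.2f}, ratio = {ratio:.2f}, true = {true_effect}")
```

The naive slope overstates the effect because the confounder drives both variables, whereas the ratio estimate recovers a value close to the simulated effect. If the genotype also influenced the disease through another route, the ratio estimate would be biased as well.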
In determining truly positive associations in the presence of a large number of variables and relatively few samples, one needs to resort to novel statistical techniques that can handle such complexity. Bayesian statistical methods can be seen as an alternative to conventional hypothesis testing and appear better able to deal with large post-genomic datasets. In contrast to conventional P-value-centered statistics, a Bayesian approach provides a measure of the probability of a hypothesis being true by taking all evidence into account in an explicit way. This is clearly a desirable feature as it allows different forms of data to be combined into a unified hypothetical model. Competing models are then entered into a selection framework such that the hypotheses that are most supported by data are favored. For example, using the language of a causal Bayesian network [40,41], Mendelian randomization can be explicitly represented in the graphical model as (1) genotype is independent of the confounder; (2) genotype is associated with phenotype; (3) genotype is independent of disease conditioning on phenotype and confounder. If these assumptions are valid, then an observed association between genotype and disease would imply causality from phenotype to disease.

[Figure 1. Graphical model of Mendelian randomization, with nodes for genotype, phenotype, disease and confounder.]
In the graphical model shown in Figure 1, the directions of the arrows (or edges) between the nodes indicate non-reversible causal relationships and reflect the three core assumptions made. The plausibility of the graphical model can then be tested through Bayesian rules, with the evidence provided by all available 'omics' data from different studies. A pioneering example of using a Bayesian network to infer disease causality can be found in reference [42], where three possible model networks that characterize the relationships between QTLs, RNA levels and disease traits were evaluated. However, it should be noted that most of the current applications of Bayesian networks consider phenotypes and disease traits as discrete rather than continuous variables; this is due to the computational difficulties of model selection from an extremely large model space.
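The selection among competing models can be sketched as a direct application of Bayes' rule, in which the posterior probability of each model is proportional to its prior probability multiplied by the marginal likelihood of the data under that model. The sketch below is not the procedure of reference [42]: it assumes that the datasets are conditionally independent given the model, so that their log marginal likelihoods add, and all numerical values and names are hypothetical.

```python
import math


def posterior_model_probs(log_evidence_by_dataset, log_priors):
    """Posterior probability of each candidate model.

    log_evidence_by_dataset[d][k] is the log marginal likelihood of
    dataset d under model k; log_priors[k] is the log prior of model k.
    """
    n_models = len(log_priors)
    log_post = [
        log_priors[k] + sum(row[k] for row in log_evidence_by_dataset)
        for k in range(n_models)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log marginal likelihoods for three candidate networks.
log_evidence = [
    [-120.0, -123.0, -125.0],  # e.g. a transcriptomic dataset
    [-80.0, -80.5, -84.0],     # e.g. a metabolomic dataset
]
uniform_prior = [math.log(1 / 3)] * 3
probs = posterior_model_probs(log_evidence, uniform_prior)
print([round(p, 3) for p in probs])
```

In practice the hard part is computing the marginal likelihoods themselves, which is where the computational difficulties with continuous variables and large model spaces arise.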
Major methodological challenges with complex data integration
While the use of heterogeneous high-dimensional post-genomic data carries many potential benefits, several challenges exist in the areas of biological interpretation, computing and informatics, which will need to be addressed to take full advantage of the wealth of post-genomic data. See Box 1 for the key issues.

Conclusions
Over the last few years, biomolecular research has progressed from the completion of the human genome project to functional genomics and the application of this knowledge to advance our understanding of health and disease. It is clear that genomic information alone, although crucial, is not sufficient to completely explain disease states, which involve the interaction between genome and environment. Post-genomic approaches attempt to contribute to our understanding of this interaction, with each approach capturing a different angle of the global picture. Intuitively, the next step forward is to integrate these datasets, an approach that, if successful, could be much more informative and predictive than working exclusively on a single platform.
Associating and correlating variables between datasets as a means of integrating large datasets is fraught with issues, such as extracting biological meaning (biology is not always linear and is often context dependent) and distinguishing causality from spurious associations. We propose that data integration should be built upon a model, such as a Bayesian model, that takes into account the non-linearity and context-dependent nature of human biology. We further propose that a putative biological relationship between individual data points, identified through association studies, can be efficiently tested (and validated) using strategies, such as Mendelian randomization, that approximate the design strengths of an RCT. While there are clearly obstacles that need to be overcome, biological models based upon multiple datasets are likely to become the basis that drives future research.
Abbreviations
GWA, genome-wide association; HLA, human leukocyte antigen; QTL, quantitative trait loci; RCT, randomized controlled trial; SNP, single nucleotide polymorphism.

Box 1.
• No model is perfect, and inevitably assumptions have to be made. It is likely that initial models built around Mendelian randomization will not accurately account for epistasis, pleiotropy, copy number variants, gene-gene interactions or protein-protein interactions.
• Computational power is becoming a bottleneck when building complex models from heterogeneous high-dimensional data. For example, the inclusion of a single nucleotide polymorphism (SNP) in a model will require large computational power to correct for linkage disequilibrium. Similarly, the more genetic, mRNA, protein or metabolite data are included, the more permutations need to be built into the models and cross-validated.
• It would be difficult for a single center to generate the complete spectrum of data required for such complex integration. Data from different experimental paradigms and from different populations are required for cross-validation and optimal model selection. As such, datasets generated from different centers need to be standardized in terms of nomenclature and structure. Efforts along these lines can be seen, for example, in transcriptomics [43], proteomics [44] and metabolomics [45].
• A new breed of scientist with a working knowledge of different post-genomic approaches, disease pathophysiology and mathematical modeling will be needed during the initial attempts at data integration. For example, experimental design and subject selections (such as appropriate controls) will need to be tailored to utilize the strengths of each profiling platform and optimize the final dataset for modeling. This needs to be followed by appropriate model interpretation that takes into account all the assumptions and limitations of the experimental and modeling processes. It is likely that such 'integrative' researchers will identify new insights and unexpected limitations during data integration, thus providing an additional element of 'quality control' over the final model.

Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors contributed equally to this work.
Authors' information
JT is a postdoctoral researcher in MO's group, focusing on developing applications of Bayesian statistics to integration of heterogeneous genomic and post-genomic data. MO is research professor of systems biology and bioinformatics. His main research areas are metabolomics applications in biomedical research and integrative bioinformatics. CYT is a clinical research fellow in AVP's group, focusing on a systems-biology approach to studying obesity-related metabolic complications. AVP is a reader in metabolic medicine at Cambridge University.
Acknowledgements