Simulating gene-environment interactions in complex human diseases

Because little is currently known about how genes interact with environmental factors in human diseases, and because of the large number of possible interactions between and within genetic and environmental factors, it is difficult to simulate samples for a disease caused by multiple interacting genetic and environmental factors. A recent article by Amato and colleagues in BMC Bioinformatics describes a mathematical model to characterize gene-environment interactions and a computer program that simulates them using biologically meaningful inputs. Here, I evaluate the advantages and limitations of the authors' approach in terms of its usefulness for simulating genetic samples for real-world studies of gene-environment interactions in complex human diseases.


Introduction
Simulated datasets with known underlying disease mechanisms have been widely used to develop efficient statistical methods for deciphering the complex interplay between the genetic and environmental factors respon sible for complex human diseases, such as hypertension, diabetes and cancer [13]. Although genetic and environ mental risk factors have been identified for various human diseases, little is currently known about how genes interact with environmental factors in these diseases. Because the number of possible interactions between and within genetic and environmental factors is large, it is difficult to specify and simulate samples for a disease caused by multiple interacting genetic and environ mental factors. Consequently, existing studies have focused on simple models with loworder inter actions between a few genetic and environmental factors, using specialized simulation programs. Here, I discuss a recent article by Amato and colleagues in BMC Bioin formatics [4], which describes a mathematical model to characterize geneenvironment interactions (GxE) and a computer program that simulates them using biologically meaningful inputs. I evaluate the usefulness of the authors' method for simulating samples with GxE for future studies.

Specifying a GxE model for disease risk
A disease model is needed before a sample can be simulated. If the number of genetic factors that cause a disease is G, we can denote each genetic factor by g i (where i = 1,…,G), and each of these will have three diploid genotypes. Similarly, with E environmental factors, we can denote these x j , and each would have b j possible discrete values (where j = 1,…,E). A complete GxE model would then have 3 G × Π E j=1 b j possible items for each combination of genetic and environmental factors. In addition, the model would require this same number of parameters to specify the risk associated with each item. Although such models can be used to specify arbitrary genegene and geneenvironment interactions, estimat ing a large number of parameters from empirical data is challenging and usually not feasible.
Amato et al. [4] propose a statistical model, called the MultiLogistic Model (MLM), that is designed to describe disease risk in datasets that simulate case control samples. MLM, which is a natural extension of logistic models used by others [2,3], allows the specifi cation of disease risks caused by all genetic factors and by interactions between genotype and all environmental factors. It reduces the required number of parameters to 3 G × (1 + E) by making the following assumptions: that the log odds ratios of environmental factors are additive; and that the different environmental factors are independent and additive. The latter assumption means that only 1 + E parameters are required for each com bina tion of genotypes because the impact of b j levels of exposure for each environmental factor is represented by one parameter and no interaction between environmental factors is allowed. These assumptions limit the applica tion of MLM in studies with correlated environmental factors (for example, smoking and drinking [5]). The simplified model therefore cannot be used to model complex GxE structures, such as the development of lung cancer caused by smoking and genetic factors because the impact of smoking is highly correlated with age, which is a common covariate in such models. Despite this limitation, the MLM approach could be made more generally applicable by making adjustments, such as by applying principal component analysis, to the environmental factors to ensure their independence.
Even with the reduction of parameters afforded by the assumption of independence, the number of parameters in an MLM is still large if multiple genetic and environ mental factors are involved. For example, an MLM requires 18 parameters when there are two genetic factors and one environmental factor. This is why the authors [4] focused on a version of MLM with only one genetic factor and one environmental factor (giving only six parameters), which they implemented in Matlab in their program GeneEnvironment iNteraction Simulator (GENS). Furthermore, users of GENS can choose from four simpler models of GxE ( Figure 1): a genetic model (no environmental factors, three parameters), an environ mental model (no genetic factors, two parameters), a geneenvironment interaction model (genotypes do not directly affect disease risk, four parameters), and an additive model (environmental factors have the same effect in all genotypes, four parameters). These models, although incomplete, should be sufficient for most theoretical studies of GxE models with one genetic factor and one environmental factor.
Because changing an interaction item might change many properties (such as the marginal effects of a model) in an unpredictable way, it is difficult for users to adjust parameters in a GxE model to control for key epidemio logical features of a disease such as population incidence. Amato et al. [4] used an innovative system, the KnowledgeAided Parameterization System (KAPS), to translate user input in familiar epidemiological terminologies, such as model of inheritance, into the parameters used in MLM, which makes it easy for users to specify model parameters that are epidemiologically sensible. Other constraints, such as relative risk between homozygotes and heterozygotes, are added to facilitate the search for suitable parameters. KAPS works well for models with one genetic and one environmental factor because the number of epidemiological variables that users need to input is similar to the number of model parameters. For a general GxE model with multiple genetic and environmental factors, the number of epidemio logical features of the disease and individual genetic and environmental factors will be far less than the number of model parameters because of the large number of interaction terms in MLM. Because multiple models with different interaction terms could have the same epidemiological features, additional constraints are required to limit the number of plausible models, and a complex search algorithm might be needed to parameterize MLM with sensible interaction parameters. An example of fitting a more complex model was presented by Moore et al. [6], who used a genetic algorithm to discover, among a large number of plausible theoretical models, a special set of highorder genegene interaction models in which genes influence disease risk only through interactions with other genes, without any main effects.

Applicability of the simulation tool
Various different methods have been used to simulate casecontrol samples based on penetrance models. Before applying their GxE model, Amato et al. [4] simulated a population to determine the affection status of each individual (that is, whether or not that individual is affected by the disease). There are two possible ways of doing the simulation. The first method is to simulate a large population and then select casecontrol or other types of samples, such as pedigrees, from it. This approach allows maximum flexibility in the specification

Figure 1. Four GxE interaction models provided by the Knowledge-Aided Parameterization System (KAPS): (a) a genetic model, (b) an environmental model, (c) a gene-environment interaction model and (d) an additive model. Each curve represents the relationship between disease risk (y-axis) and
an environmental factor (x-axis) for individuals with a particular genotype (AA, Aa or aa) at a disease-predisposing locus (DPL). The environmental model has only one curve because the relationship between environmental exposure and disease risk is identical for all genotypes. Adapted from Amato et al. [4]. of a penetrance model and is usually used in a forward time approach in which a population is simulated by evolving from a founder population forward in time under the influence of multiple genetic and demographic forces [7]. This method can be inefficient if the disease is so rare that a large population needs to be simulated to obtain enough cases. If, alternatively, a disease model is simple enough, Pr(g i , x j | affection status) (that is, the probability that the genotype is g i and the environmental factor value is x j given the affection status) can be deter mined from Pr(affection status | g i , x j ) together with other parameters, such as frequencies of these factors and disease prevalence. If this is the case, genotype and environmental factors can be simulated directly and only the required number of cases and controls needs to be simulated. This second approach has been used by many simulation programs, such as HapSample [8], hapgen [9] and GWAsimulator [10]. As a compromise between these two approaches, a rejectionsampling algorithm can be used to simulate samples without simulating a large population (for example, genomeSIMLA [11]). This method repeatedly simulates individuals, assigns affec tion status, and collects cases and controls until enough samples have been simulated. This approach is suitable for situations in which environmental factors can be independently simulated for each individual, and could be used to improve the efficiency of GENS. Because genotypes at the diseasepredisposing loci (DPL) of a genetic disease might not be available, many statistical methods rely on linkage disequilibrium (LD) between DPL and their surrounding markers to indirectly map the DPL. GENS does not consider LD between DPL and surrounding genetic markers, so more sophisticated simulation methods are needed to simulate linked markers using genetically related individuals. Existing approaches include: resampling from existing data (for example, HapSample [8] or hapgen [9]); reconstructing from statistical properties obtained from existing sequences (GWAsimulator [10]); simulating a complete genealogy (coalescent tree) of a sample or population (for example, cosi [12], GENOME [13]); and evolving forward in time from a population [7,11]. The power, flexibility, perfor mance and quality of simulated samples vary greatly from program to program. For example, forwardtime methods are most flexible because they can follow the evolution of a real population closely, but they are inefficient because they simulate all ancestors, including those who do not have offspring in the simulated population. GWAsimulator [10] retains the shortrange LD structure of the human population (or more specifically, the HapMap sample) but discards longrange LD because the method simulates haplotypes according to shortrange LD patterns obtained from the HapMap sample [14] using a slidingwindow approach.
Although it is generally possible to simulate large populations using these methods and then apply the GxE disease model proposed by Amato et al. [4], several obstacles remain. For example, many coalescentbased simulation methods [13,15] simulate markers with varying location and allele frequency, so it is difficult to apply a fixeddisease model to replicate simulations. If a forwardtime approach is used to simulate samples with the same set of markers, sample frequencies of the DPL will vary because of the impact of random genetic drift, unless special algorithms are used to control allele frequencies [7]. Even if samples with the same allele frequencies are simulated, the individuals generated may not have enough genetic variations to allow adequate modeling with a GxE model because of insufficient combinations of genetic and environmental factors. For example, from a sample of 20,000 sequences of 40 tightly linked markers over a 100 kb region on chromosome 17 simulated using hapgen [9], there were only 74 unique haplotypes because all the haplotypes were derived from the 63 unique haplotypes in the HapMap CEU sample [14] using an imputation approach.

Conclusions
Amato et al. [4] have provided a mathematical model for the specification of interactions between genetic and environmental risk factors. Their simulation program GENS can be used to generate simple, independent case control samples with clear epidemiological interpre ta tions and can be used to validate a statistical method or compare the performance of several statistical methods under specific assumptions. However, because realworld studies usually involve a large number of linked markers, a useful statistical method should be able to identify informative variables (DPL and environmental factors) from a large number of markers and covariates [16], or be efficient enough to be used to search for GxE signals exhaustively [17]. The performance of statistical methods that detect GxE in complex human diseases, including sensitivity, specificity and ability to handle linked loci, should be tested against simulated samples of long genome sequences with realistic disease models and LD patterns. Although progress has been made in both the simulation of long genome sequences [10,15] and GxE disease models [4], the combination of these two approaches would produce realistic samples that could greatly aid the study of GxE in complex human diseases.