DNA methylation loci associated with atopy and high serum IgE: a genome-wide application of recursive Random Forest feature selection

Background The prevalence of allergic diseases are increasing worldwide, emphasizing the need to elucidate their pathogeneses. The aims of this study were to use a two-stage design to identify DNA methylation levels at cytosine–phosphate–guanine (CpG) sites across the genome associated with atopy and high serum immunoglobulin E (IgE), then to replicate our findings in an independent cohort. Methods Atopy was assessed via skin prick tests and high serum IgE. Methylation levels were measured from whole blood using the Illumina Infinium HumanMethylation450 BeadChip from 18-year-old women (n = 245) and men (n = 122) in the Isle of Wight birth cohort. After data cleaning and processing, and removing probes with possible single nucleotide polymorphisms, DNA methylation levels from 254,460 CpG sites from the 245 women were subjected to recursive Random Forest feature selection for stage 1. The sites selected from stage 1 were tested in stage 2 for associations with atopy and high IgE levels (>200 kU/L) via logistic regression adjusted for predicted cell-type proportions and sex. Sites significantly associated with atopy in stage 2 underwent replication tests in the independent Swedish birth cohort BAMSE (n = 464). Results In stage 1, 62 sites were selected, of which 22 were associated with atopy in stage 2 (P-value range 6.5E−9 to 1.4E−5) and 12 associated with high IgE levels (P-value range 1.1E−5 to 7.1E−4) at the Bonferroni adjusted alpha (0.05/62 = 0.0008). Of the 19 available sites, 13 were replicated. Conclusions We identified 13 novel epigenetic loci associated with atopy and high IgE that could serve as candidate loci for future studies; four were within genes with known roles in the immune response (cg04983687 in the body of ZFPM1, cg18219873 in the 5′UTR of PRG2, cg27469152 in the 3′UTR of EPX, and cg09332506 in the body of COPA). Electronic supplementary material The online version of this article (doi:10.1186/s13073-015-0213-8) contains supplementary material, which is available to authorized users.


Background
The prevalence of allergic disease is increasing worldwide; approximately 40 % of the population of industrially developed countries are considered to be affected [1]. Many of these allergic diseases appear to have a hereditary component but are also influenced by environmental stimuli [2], and the origin of the immune response, including allergen sensitization, is thought to start during the fetal period [3]. It is well recognized that environmental stimuli during critical prenatal and postnatal periods can permanently alter metabolism and influence the risk of allergic diseases [4], yet the specific molecular mechanisms through which this occurs are poorly understood [1,5].
Epigenetics, changes in gene activity not caused by alterations to the sequence of DNA, may clarify some of these mechanisms because much of cell lineage and tissuespecific gene expression is tightly regulated by epigenetic programming [1]. One of the most commonly studied epigenetic mechanisms is DNA methylation (DNA-M), the covalent addition of a methyl group to a cytosine followed by a guanine (cytosine-phosphate-guanine; CpG).
Changes in DNA-M affect gene transcription and have been associated with disease [6]. Some of DNA-M's roles in the development of the immune system, immune cell-fate, and allergic diseases have been unlocked, but substantial gaps in knowledge still exist [1].
Atopy is defined as a positive reaction to a skin prick test (SPT) or immunoglobulin E (IgE) production in response to allergens [7]. IgE plays an important role in many, but not all, allergic diseases, for example, asthma, rhinitis, and eczema [7,8]. High levels of IgE in the blood are associated with both the risk and severity of asthma, and cord blood IgE levels have been studied as possible predictors of asthma and other atopic allergic diseases [4]. Atopy is therefore connected to allergic disease, although many of the details of this relationship are still unknown. Epigenetic epidemiology can help to clarify the role that DNA-M plays in atopy by confirming candidate loci and revealing novel loci associated with atopy [5].
Advances in genetic biotechnology have made it feasible to measure DNA-M throughout an individual's epigenome and, consequently, epigenetic assessments are becoming feasible in larger epidemiologic studies [9]. A growing challenge with epigenetic epidemiology is that a vast amount of data is generated and new statistical techniques are necessary to make sense of it. This is because of small-n-large-p (few observations relative to the number of predictors) and because traditional methods are not optimized for identifying complex biological processes. Because of the large-scale data generated for each completed methylation assay, techniques to select a subset of informative variables are needed [10][11][12], particularly in cases of sparse data in which the vast majority of predictors are uninformative [13].
Random Forest (RF), developed by Leo Breiman, is a machine learning algorithm used for classification that can handle the data issues discussed above [14]. A forest composed of classification trees is grown using randomly selected bootstrap samples of the data to form training and testing sets of study participants. At each node within each tree, the training set is partitioned into different classes with the split determined by a subset of randomly chosen predictors. These two levels of randomness, random selection of training/testing sets and random testing of predictors, allow the RF to produce robust classification predictions. Once the forest is grown using the training sets, the observations in the testing sets are classified via the forest and misclassification rates can be used to evaluate the accuracy of the forest [14].
Utilization of RF to analyze array data has increased in recent years [15][16][17][18]; it is an ideal method for classification with methylation data for several reasons. Unlike most traditional methods, RF can be used for feature selection when the number of variables exceeds the number of observations, even when most variables are uninformative; RF can use both numeric and categorical variables; the algorithm can be optimized fairly easily by varying the RF parameters; and adding trees does not cause the model to become over-fit [14,15,19]. In addition, biological processes are probably not linear in nature; rather, they involve interactions between many different molecules. Thus it is likely that methylation changes at a combination of CpG sites could influence disease states. RF allows for the identification of multiple interacting predictors and identifies which of these are most important without imposing a structure or model on the way that it takes place.
Despite its increased presence in the analysis of genomic data, few applications of the algorithm have utilized the variable importance measures (VIM) and its potential for feature selection [18]. While RF lends itself to a variety of applications, we focused on using it for feature selection.
In this study, we implemented a two-stage discovery study within the Isle of Wight (IOW) birth cohort to first select a set of atopy-candidate CpGs from epigenome-wide data using a subsample and then to test which of these sites were significantly associated with atopy as defined by positive SPT or high total serum IgE levels in the joint sample. Then, to validate our findings, we ran replication tests in the independent Swedish cohort BAMSE.

The Isle of Wight birth cohort
The IOW birth cohort was established to study the natural history of allergic disease among children born between 1 January 1989 and 28 February 1990 on the Isle of Wight, UK. The study was approved by the local research ethics committee (now named the National Research Ethics Service, NRES Committee South Central -Southampton B; 06/Q1701/34) and written informed consent was provided by the infants' parents. After exclusion of adoptions, perinatal deaths, and refusals, 1,456 children (95 %) were enrolled. Participants were followed-up at ages 1 (n = 1,167), 2 (n = 1,174), 4 (n = 1,218), 10 (n = 1,373), and 18 years (n = 1,313); detailed questionnaires were administered at each follow-up. Details of the birth cohort have been described elsewhere [20,21]. At age 18 years, 245 women and 122 men were randomly selected from the cohort for genomewide DNA methylation screening as part of another study assessing trans-generational inheritance of atopy.

Data collection and DNA methylation
At the 18-year follow-up, most of those who were seen in-person received SPTs using a standard method [22] and with a battery of common allergens. Inhalant allergens tested were house dust mite, cat, dog, Alternaria alternata, Cladosporium herbarum, grass pollen mix, and tree pollen mix. Food allergens tested were cows' milk, soy, hens' egg, peanut, and cod. Of the 245 women included in the epigenetic analyses, SPTs were conducted on 242 of them; 120 of the men received SPTs. Blood samples for DNA extraction and serum for measurements of IgE levels were also collected at 18 years of age. Total serum IgE was measured in a subset of male and female serum samples collected at age 18 (n = 144) using Immunocap (Phadia, Uppsala, Sweden), designed to measure IgE between 2.0 to 1,000 kU/L. DNA was extracted from whole blood using a standard salting-out procedure [23]. DNA concentration was determined by the Qubit quantitation kit (Life Technologies Ltd, Paisley, Renfrewshire, UK). One microgram of DNA was bisulfite-treated for cytosine to thymine conversion using the EZ 96-DNA methylation kit (Zymo Research, Irvine, CA, USA), following the manufacturer's standard protocol. Genome-wide DNA methylation was assessed using the Illumina Infinium HumanMethylation450K BeadChip (Illumina, Inc., San Diego, CA, USA), which interrogates >484,000 CpG sites associated with approximately 24,000 genes. The BeadChips were scanned using a BeadStation, and the methylation levels (β value, described below) were calculated for each queried CpG locus using the methylation module of GenomeStudio software (Illumina, Inc.). Arrays were processed using a standard protocol as described elsewhere [24], with multiple identical control samples assigned to each batch to assess assay variability and samples randomly distributed on microarrays to control against batch effects.

Data cleaning
The program for data cleaning was written in R (R Development Core Team, 2012). Quality control (QC) measures were employed to improve the reliability of data prior to analysis. In our study, the detection P-value reported by GenomeStudio was used as a QC measure of probe performance. Probes with detection P-values > 0.01 in >10 % of the samples were removed [25]. The methylation data were then preprocessed and technical variations removed via peak-correction using the Bioconductor IMA (Illumina Methylation Analyzer) package. Excluding control probes and probes with poor detection P-values yielded 383,998 remaining probes; 9,650 CpGs on the sex chromosomes were also removed. The arrays were processed in two batches; batch number was recorded as a categorical variable, which was used in ComBat to adjust for inter-array variation [26,27]. Because the female and male samples were assessed in different batches, some sites that survived QC in the female sample did not survive QC in the male sample. A very conservative approach was utilized for addressing intraprobe single nucleotide polymorphisms (SNPs); to ensure that our findings were not biased by SNPs affecting methylation levels, we excluded all probes with potential SNPs in the binding region or at base-pair extension (119,888 probes) according to the dbSNP database (version 137), resulting in a final set of 254,460 CpGs for analysis. Removing all probes with possible SNPs was necessary with our variable selection method because the selection of any variable is conditional upon the effects of other selected variables, thus inclusion of SNP-biased probes can affect the inclusion of other unbiased probes.

Variable definitions
Participants were defined as being atopic, the primary outcome variable for this study, if they had a positive SPT to at least one of the tested allergens [7]. Positive SPTs were determined by a mean wheal diameter of 3 mm greater than the negative control; SPT results were deemed inconclusive if the positive control resulted in a diameter less than 3 mm. To internally validate our findings from the SPT analyses, we also tested the same statistical models but with dichotomous serum IgE levels (IgE ≥ 200 kU/L versus IgE < 200 kU/L), which has been shown to be predictive of allergy [28], as an alternate outcome variable.
Methylation levels for each queried CpG were calculated as β values. These represent the proportions of methylated (M) over methylated (M) and unmethylated (U) sites (β = M/[c + M + U], with constant c introduced to prevent the possibility of a zero in the denominator), and can be interpreted as percent methylation; β values close to 0 or 1 tend to suffer from severe heteroscedasticity. The β values were utilized for RF, described below, which is a non-parametric method and does not assume a normal distribution. However, for parametric statistical analyses, such as logistic regressions used for validation and replication, we utilized M-values, which address the issue of heteroscedasticity and thus perform better. M-values were calculated from the β values via log 2 [β/(1 − β)] [29]. Prior to running parametric models, boxplots and histograms were used to verify approximate normality and identify potential outliers.
Pearson's chi-squared tests were used to determine if prevalence of atopy and high IgE differed between the female and male samples, within the epigenetic sample, and between the epigenetic sample and the entire cohort. P-values were compared against an α-level of 0.05. We implemented a two-stage genome-wide approach [30]: stage 1 analyses selected a set of atopycandidate loci from genome-wide DNA-M within a subsample (n = 245), and stage 2 analyses tested those loci for associations with atopy and an alternate marker of atopy, high IgE, in the joint sample (n = 367). The specific methods within each stage are detailed below. The normalized DNA-M microarray data, as well as covariates and outcomes used in both stage 1 and stage 2 analyses, described below, are available via the University of Southampton ePrints Soton (DOI: 10.5258/SOTON/379389).

The BAMSE cohort
Sites that were significantly associated with atopy in stage 2 analyses were selected for replication in the Children, Allergy, Milieu, Stockholm, Epidemiology (BAMSE), a prospective population-based cohort study of children recruited at birth and followed during childhood. Details of the study design, inclusion criteria, enrolment, and data collection are described elsewhere [31]. In brief, 4,089 children born between 1994 and 1996 in four municipalities of Stockholm County were enrolled. Longitudinal sensitization and questionnaire data were collected through to age 8. The baseline and follow-up studies were approved by the Regional Ethical Review Board, Karolinska Institutet, Stockholm, Sweden, and the parents of all participating children provided informed consent. Blood samples collected at 8 years were screened with Phadiatop [a mixture of common inhalant allergens: birch, timothy, mugwort, cat, dog, horse, mold (Cladosporium herbarum), and house dust mite (Dermatophagoides pteronyssinus)] and fx5 (a mixture of common food allergens: cow's milk, egg white, soy bean, peanut, cod fish, and wheat) (ImmunoCAP, Phadia AB, Uppsala, Sweden). Atopy was defined as a positive Phadiatop or a positive fx5 test with specific IgE antibody levels ≥0.35 kUA/L. Furthermore, epigenome-wide DNA methylation was measured in 472 children using DNA extracted from blood samples collected at the 8 year follow-up [32]. For this, 500 ng DNA per sample underwent bisulfite conversion using the EZ-96 DNA Methylation kit (Shallow; Zymo Research Corporation, Irvine, CA, USA). Samples were processed with the Illumina Infinium HumanMethylation450 BeadChip (Illumina, Inc.). Data pre-processing (signal correction and data normalization) and QC were performed using standard criteria described elsewhere [33]. This study included those with valid DNA-M samples and that were non-missing for atopy-status or adjustment covariates (N = 464).

Statistical analysis (IOW)stage 1
The randomForest package in R was used to implement the RF algorithm [34]. The output from the RF includes the out-of-bag error rate (OOB-ER), class-specific misclassification rates, and VIMs. The OOB-ER is the overall misclassification rate of the complete forest. Class-specific misclassification rates, which are also calculated from the out-of-bag samples, are the rates at which the classes of the outcome variable are misclassified, in our case atopic classification and non-atopic classification. VIMs are measures of the amount of information a variable contributed to the classification throughout the forest. Hapfelmeier and Ulm, whose proposed feature selection method used OOB-ER or another cross-validated error measure, acknowledge that the VIM depends on the data and the underlying research question [35]. We used the mean decrease Gini (MDG) as VIM because it was shown to be more robust to small deviations to the data when compared to the mean decrease accuracy (MDA) [36].
Prior to implementing the recursive RF [15] described below, we explored how prediction accuracy of the forest was influenced by altering the parameters sampsize, mtry, and ntree, so these could be optimally set for the recursive RF implementation described below. The sampsize parameter controls whether to use balanced or imbalanced sampling to generate the training datasets; mtry specifies the number of variables to be randomly selected and tested at each node of each tree; and ntree determines the number of trees to be grown in a forest. Using the default values for mtry (√p, where p is the number of variables available) and ntree (500), we compared the OOB-ER and class-specific misclassification rates for an imbalanced RF grown without sampsize and a balanced RF grown with sampsize = (50,50). Specifying sampsize = (50,50) meant that 50 observations from those with atopy and 50 observations from those without atopy were randomly selected when creating the training set for each tree. Once we determined whether or not to utilize the sampsize parameter, we tracked the prediction accuracy of the RF at different combinations of mtry (√p, 2*√p, 0.05p, 0.1p, and 0.5p) and ntree (200, 300, 400, 500, 1,000, and 2,000). Once the optimal parameter values were selected, the recursive RF was implemented.
The general methodology of the recursive RF for feature selection has been proposed and utilized elsewhere [13,15,17,18,35,37], though not with high-throughput epigenetic data. Using this approach we aimed to reduce the data from all CpG sites retained after pre-processing and cleaning to a more manageable size by eliminating variables that contributed little predictive information for atopy. The recursive RF loop was initiated by running a RF with all CpGs included as potential predictors. Then the variables were sorted by their VIM, the bottom half of the CpGs with the lowest VIMs were removed, and the RF was run again, using this subset of CpG sites (Fig. 1). This process was repeated while tracking the RF OOB-ER and class-specific misclassification rates at each iteration. The process was stopped when the atopyspecific misclassification rate increased, because we were most concerned with correct classification of those with atopy. The variables from the iteration prior to the increase in misclassification were selected for further analyses. Each CpG site that was selected by the recursive RF was annotated with information about what gene the CpG site was within, when applicable.
The sites selected from the recursive RF were then tested for univariate associations with atopy via logistic regression. Given that methylation levels vary significantly by cell type, peripheral blood samples are composed of multiple different cell types, and allergic diseases often influence the proportions of circulating immune cell types, we considered the potential confounding effect of cell-type differential for each participant. However, logistical limitations prevented the acquisition of cell differential at the original time of blood sample collection. Thus, we utilized the methylation data to predict cell differential [38,39], then calculated the percent differences between the crude and cell-type adjusted β coefficients from the logistic regression models, to observe the impact of cell type on the association between methylation levels and atopy.

Statistical analysis (IOW)stage 2
Boxplots of β values stratified by atopy status were used to ensure that the distributions of methylations levels in the female and male samples were similar and could be combined. Two observations (in cg12819873 and in cg13168187) were identified as strong outliers, and recoded as missing. The joint sample was then used for all stage 2 analyses in which each CpG was tested for its individual association with atopy and high IgE, adjusted for important covariates. These tests were conducted with logistic regression in which high IgE and atopy were the dependent variables and M-values for CpGs were the primary independent variables, while cell type proportions and sex were included as covariates. CpG sites that were significantly associated with both atopy and high IgE at the Bonferroni corrected α were subjected to set analyses, used to show the combined effect of DNA-M on atopy [40].

Statistical analyses (BAMSE)replication
For each site that was significantly associated with atopy in stage 2, we conducted multiple logistic regression models in the BAMSE cohort. Atopy status was the dependent variable and M-values for CpGs were the independent variables, while cell type proportions, sex, asthma treatment within the last 12 months, and batch (bisulfite treatment date) were included as covariates (identified as confounders in the regression model). Successful replication was defined as having the same direction of association and a P-value < 0.05. These sites were submitted for functional annotation analyses in DAVID [41,42].

Results
All IOW participants were age 18 years at the time of epigenetic screening for DNA-M and administration of SPTs. Of all participants that underwent epigenetic screening, three females and two males did not receive SPTs and thus were not assessed for atopy status. Although there were some differences in the prevalence of atopy and high IgE between the epigenetic subsample and the full cohort, none of these differences were statistically significant. The prevalence of atopy within the Fig. 1 Recursive RF feature selection process. The feature selection process started with a large dataset: all CpGs that survived data cleaning and preprocessing, and were not potentially affected by probe SNPs. The cycle in black (conducting the Random Forest, collecting evaluation measures, assessing stop criteria, and reducing the data) repeated until the atopy-specific misclassification rate showed a marked increase, indicating that some excluded sites were important in classifying atopic participants. Thus, once an increase in atopy-specific misclassification was observed, the cycle stopped and sites from the previous iteration were selected for follow-up testing. OOB-ER out-of-bag error rate, RF Random Forest, VIM variable importance measure epigenetic subsample was not significantly different (P-value = 0.0972) between the female sample, used in stage 1, and the male sample which was pooled with the female sample in stage 2. However, prevalence of high serum IgE (≥200 kU/L) was significantly (P-value = 0.0469) lower in the female sample (23.8 %) than in the male sample (41.9 %). These differences in high IgE did not affect the analytic methods because serum IgE was only used as a secondary and alternate biomarker of atopy to provide further confidence in our top results (Table 1).
For stage 1 we conducted a recursive RF algorithm with feature selection very similar to balanced iterative RF, described in detail in the methods section [18]. Prior to implementing the full algorithm we optimized the RF parameters by testing multiple combinations mtry and ntree. We selected an mtry of 0.1p, which was observed to be effective in a similarly large scale RF analysis with sparsity [17], and an ntree of 500 that allowed the error rates to stabilize, but limited computational time.
The initial RF in the recursive implementation was fitted with all CpG sites (p = 254,460) that survived data cleaning, pre-processing, and removal of probes potentially containing SNPs. At each step in the reduction, the dataset was reduced by half; by the 15th iteration the data was reduced to a total of 15 CpG sites. The OOB-ER achieved its lowest point (overall misclassification of 8.67 %) at the 11th iteration, which included 248 CpGs (Table 2). However, we reduced the data further to the 13th iteration, which resulted in the lowest misclassification of atopics (14.47 %) and included 62 CpGs. From the first iteration to the 13th, the OOB-ER improved from 38.42 to 9.50 %, while the misclassification for atopics and non-atopics improved from 78.95 to 14.47 % and 19.87 to 7.22 % respectively. After the 13th iteration, each of the misclassification rates increased, thus the CpGs (p = 62) from this iteration were selected for stage 2 analyses.
All 62 selected CpG sites were annotated with relevant genetic information (Table 3). We used logistic regression to describe the individual associations of all the selected CpG sites. Only cg09570585 and cg10016610 had P-values > 0.05 (P-values = 0.06353 and 0.09771, respectively). Prior to implementing stage 2 analyses, we tested whether any of the selected sites may have been selected due to confounding by cell type. Many of the associations were altered by adjusting for proportions of CD8 + T cells, CD4 + T cells, natural killer cells, B cells, monocytes, and granulocytes (Table 4). Thus all further associations were adjusted for cell type.
Prior to running the stage 2 joint analyses we compared the distribution of methylation levels in the male and female samples stratified by atopy status. The distributions ( Fig. 2 and Additional file 1) were similar between the two samples for most loci and thus we proceeded with pooling the data. However, since the distribution of methylation levels did differ by sex for some loci, we included sex as a covariate in the stage 2 analyses.
For stage 2, we tested each of the 62 CpG sites for their associations with atopy and high serum IgE levels in the joint sample. Only 50 of the 62 sites were present in both samples, because the female and male samples were analyzed as separate batches and 12 of these sites were removed from the male sample during data cleaning and pre-processing. The sites that were only present in the female sample were still analyzed in stage 2.
Of the 62 sites, 22 had statistically significant associations with atopy (P-value range 6.5E−9 to 7.9E−4) ( Table 5). At four of these sites, higher levels of DNA-M were associated with increased likelihood of atopy [odds ratio (OR) range 2.66-8.08]. For the other 18 sites, lower levels of DNA methylation were associated with higher likelihood of atopy (OR range 0.311-0.065). We also found that 12 sites had statistically significant associations with both atopy and high IgE (P-value range 1.1E−5 to 7.1E−4) in the IOW. Set analyses [40] showed a mild but statistically significant joint effect of DNA methylation on atopy at the 10 IgE-associated and atopy-associated sites shared between men and women (estimate 0.0016, 95 % confidence interval 0.0003-0.023).
Finally, 19 of the 22 sites (data on three sites were not available in BAMSE) associated with atopy in IOW were studied in an independent cohort. Of the 19 sites tested, 13 were significantly associated (P-values < 0.05) with atopy in BAMSE and had comparable ORs (Table 5): The 13 replicated sites were investigated for functional annotation in DAVID and for individual biological relevance via literature review. Functional annotation of the 10 genes (ZFPM1, PRG2, KIAA0494, EPX, ATL3, LSM14B, COPA, ZNF862, ASCC1, and PVALB) associated with the 13 replicated CpG sites (Table 6) revealed three statistically significant annotations: polymorphism, eosinophil, and asthma. The most interesting of these findings involved two genes (EPX and PRG2) in the KEGG pathway for asthma (Benjamini P-value = 0.00056) and associated with the eosinophils annotation (Benjamini P-value = 0.0087).

Discussion
Our methodological approach and the biological relevance of our findings are noteworthy to researchers studying epigenetic mechanisms in atopy. We selected 62 CpG sites from a starting set of 254,460, resulting in vastly improved classification of atopics (from 78.95 to 14.47 % error) and non-atopics (from 19.87 % to 7.22 % error) when compared to the RF on the full dataset. Of particular note was the large proportion of CpG loci that were statistically significant at a Bonferroni-adjusted α for atopy (35 %) and high IgE (19 %) within the IOW sample and the large proportion (13 of 19) of sites that were successfully replicated in the BAMSE cohort.
Our findings are the latest in a series of recent work that supports the application of RF for genome-wide association studies (GWAS) and in allergic diseases. The recursive RF process we utilized was similar to methods proposed elsewhere [13,15,17,35]. It has been used by Menze et al. [37] and Anaissi et al. [18] but, to the best of our knowledge, has never been implemented in epigenomics. Goldstein et al. presented one of the first successful applications of RF for GWAS, demonstrating its ability to identify genes known to be associated with the multiple sclerosis as well as genes with previously unknown disease associations [13]. Xu et al. successfully identified SNPs predictive of asthma exacerbations in children via RF [16]. These findings indicate the promising nature of the use of RF for feature selection in future epigenome-wide studies.
The true challenge with high-throughput techniques is in connecting the results to biological processes, which are complex and can involve combinations of many genes working together. We investigated the biological roles of the ten genes associated with the 13 replicated CpGs sites: ZFPM1, PRG2, KIAA0494, EPX, ATL3, LSM14B, COPA, ZNF862, ASCC1, and PVALB. For each of these genes, we performed a search of the literature for possible roles in atopy and conducted functional annotation in DAVID.
Among the replicated loci, a number of their associated genes were involved in intriguing processes that may have a role in atopy. ZFPM1 (also known as FOG-1) is a binding factor for the transcription factor GATA-1 and has been primarily studied for its role in the differentiation of erythroid, megakaryocyte, and mast cells [43]. However the consequences of FOG-1 expression appear to be dependent on its   [45]. Also, two differentially methylated regions in ZFPM1 were recently identified in association with asthma [46]. The multifaceted roles of ZFPM1 in immune-cell activity and allergic disease suggest that this is an interesting yet possibly overlooked gene in atopy and atopic diseases. Eosinophils are subtypes of granulocytes that are heavily involved in inflammatory responses and atopic asthma through the mechanism of airway inflammation [47]. EPX encodes eosinophil peroxidase, a protein expressed by eosinophils. Previous investigations found that both serum and urine levels of EPX were elevated in children who had positive SPTs, as well as those with allergic diseases such as asthma, allergic rhinoconjunctivitis and atopic dermatitis [48][49][50][51]. A recent epigenomewide study found multiple CpG sites, including one within PRG2, which were associated with high versus low total IgE, primarily driven by eosinophils. Interestingly, this study also found that the methylation levels in isolated eosinophils differed among asthmatics with high total IgE, asthmatics with low total IgE, and controls, suggesting that eosinophils from persons with allergic hypersensitivity or asthma may have different epigenetic profiles compared to eosinophils from non-allergic individuals [52]. Also, a recent genome-wide expression study of peripheral blood mononuclear cells found that PRG2 expression was up-regulated in response to dust-mite exposure, suggesting a possible role in the adaptive immune response [53].
A GWAS of atopic asthma implicated SNPs that were in linkage disequilibrium with SNPs in COPA, though these did not achieve genome-wide significance [54]. More recently, four deleterious variants within COPA have been linked to an autoimmune disease characterized by high-titer autoantibodies, interstitial lung disease, and inflammatory arthritis [55]. These mutations may induce stress on the endoplasmic reticulum leading to defective intracellular protein transport between the golgi and the endoplasmic reticulum; such defects have been linked to autoimmune and lung-disease. Interestingly, mutant COPA also appears to drive CD4 + T-cells toward T H 17 phenotype via increased expression of IL-1β, IL-6 and IL-23 [55]. Thus, mutant COPA does appear to affect immune pathways which can lead to autoimmune disease and our findings suggest that differential epigenetic regulation of COPA may play a role in hypersensitivity, though further research is necessary to elucidate this role.
LSM14B may be involved in mRNA translation [56,57]. Some of the genes encode proteins that perform structural roles in different areas of the body. ATL3 participates in tethering, creating a tubular connective network of  membranes in the endoplasmic reticulum, which is the site where ribosomes build proteins from DNA transcripts. The functional annotation results implicated the genes EPX and PRG2 in eosinophil activity and in the KEGG pathway for late hypersensitive responses in asthma. Some of the genes (KIAA0494, ATL3, LSM14B, ASCC1, and PVALB) did not have any apparent role in immune response. These findings should be interpreted within the limitations of the study. Although we provide evidence in support of associations between 13 CpG sites and atopy, variations in methylation at these sites may not cause allergic sensitization. The cross-sectional nature of this sample prohibited us from distinguishing between which DNA-M variations at CpG sites may have caused, been caused by, or just been markers of sensitization. However, associations in any of these directions may yield important insights into the development, persistence, and consequences of allergic sensitization. Some of the CpG sites that were selected could not be replicated and some that were replicated were not involved in any known biological processes related to atopy or allergy. The unsuccessful replication could be due to falsepositive findings from the discovery analyses, or differences in how atopy was assessed between the discovery and replication cohorts. The lack of biological roles for these CpG sites could be explained by selected CpG sites possibly being highly correlated with other CpG sites that truly influence atopy status, or by CpG sites having roles in unknown, but still important, biological pathways involved in atopy.
Correlated predictors may present an issue that we were unable to address [13,17]. If the methylation level at a biologically important CpG site was highly correlated with methylation levels at other unimportant loci, the inclusion of those unimportant loci in a forest would decrease the VIM of the important CpG site and may result in its exclusion during data reduction. This would result in a statistically strong but biologically ambiguous result. It is possible that some of our results that were not biologically consistent with allergic disease were due to this issue. Applying an approach similar to linkage disequilibrium and haplotype identification from genetic studies may improve the prediction accuracy of the forest and save computational time [17], but such applications have not been studied with genome-wide DNA-M arrays at this time. Furthermore, there is no consensus with respect to which VIM is best for large-scale data with correlated predictors. We used MDG, which was also utilized by Menze et al. to recursively eliminate unimportant predictors [37]. Calle and Urrea found that MDA was unstable when there were small alterations to the data, but that MDG was robust to such changes [36]. However, MDG does not perform as well if the Results of 62 logistic regressions between methylation M-values and atopy for each selected CpG. We present crude associations as well as associations adjusted for predicted cell proportions of CD8 + T cells, CD4 + T cells, natural killer cells, B-cells, monocytes, and granulocytes. β 1 represents the value of the regression coefficient for the CpG site in that statistical model. The percent change in β-values (%Diff β ) was calculated as [(crude β 1adjusted β 1 )/crude β 1 ] and was used to evaluate whether cell type influenced the selection of each CpG site scales of the variables differ widely or if they have different numbers of categories [36,58], which would be an important consideration for researchers incorporating both DNA-M and SNP data in a single dataset. More work needs to be done to determine which VIMs perform best under the typical characteristics of genomewide DNA-M studies: sparsity, skewed continuous predictors, very large n and very small p, statistical interactions, or correlations between predictors. Despite the issues of correlated predictors, such variables can still provide useful information. DNA-M loci that are merely surrogates of actual CpGs associated with atopy can still serve as biomarkers of disease, but do not serve to improve our understanding of the etiology of atopy. Some of the CpGs that we identified with the recursive RF but that did not meet our replication criteria may in fact be biologically relevant in atopy. We would not expect all biologically relevant findings to be included in the functional annotation results of our gene list for two reasons: first, our gene list of 10 genes is quite small because DAVID is optimized for lists between 100 and 2,000 genes in length [41]; and second, functional annotation relies on current knowledge of gene functions, and may not correctly classify the functions of novel loci. Also, these sites were selected with RF, which allows for complex interactions to be identified [13]. CpGs that were selected via RF due to unknown interaction effects may not have had an independent association with atopy and thus could not have survived our stage 2 analyses with strict multiple testing adjustments to significance levels.
Despite correcting for cell proportions (CD8 + T cells, CD4 + T cells, natural killer cells, B cells, monocytes, and granulocytes) in our regression analyses, the predicted cell proportions for the low-frequency cell types, such as T-cell subtypes, may be less accurate than those of the higher frequency cell types and these predictions did not distinguish eosinophils from other granulocytes. Given the importance of T-cell subtypes (T H 1, T H 2, and T H 17) and eosinophils (a subset of granulocytes) in atopic responses, this may have resulted in some residual confounding. However, given the inability to collect actual cell differentials in this study, the predictions we used likely accounted for the majority of cellular heterogeneity in our blood samples.
The lack of independence between the samples used for RF feature selection (stage 1) and the samples used for determinations of statistical significance (stage 2) was another limitation, and may have led to some over-fitting during stage 2 analyses. Also, 12 CpG sites that were selected in stage 1 were not present in the male sample. Although these were still evaluated in the stage 2 analyses, the lack of full methylation data reduced the power to identify significant findings at these 12 sites. However, the strong replication results in the BAMSE cohort would suggest that the majority of our findings were not due to random chance or over-fit to the IOW sample.
Not all of our findings were replicated; six sites that were tested did not successfully replicate and three sites could not be tested because the data were unavailable. The six non-replicated sites may represent falsepositives from our stage 2 analyses or could be due to differences in the measurement of atopy status between the two cohorts. One limitation of the replication study was that atopy was defined as at least one positive SPT to any allergen in the IOW; whereas atopy was defined as specific IgE antibody ≥ 0.35 kU A /L to any allergen in BAMSE. The associations with high serum IgE in the IOW support that at least some of the unsuccessful replications may have been due to these differences in measurement. All 13 sites that replicated in BAMSE had at least nominal associations with high serum IgE in IOW (P-values < 0.05), whereas only one of the six sites that did not replicate in BAMSE had an association with high serum IgE in IOW (P-value < 0.05). These findings suggest that the only sites that could be replicated in this study may be involved in IgE-mediated allergic sensitization. Also, some atopy-associated CpG sites in IOW, which were measured at 18 years old, may not have been able to replicate in BAMSE, measured at 8 years of age, because methylation levels can be age dependent [59]. It is possible that some of these six sites may have replicated had the outcome of atopy status been measured with the same method and at the same age in both cohorts. Thus, although these six sites were not considered positive findings in this study, future epigenetic studies that utilize SPTs to evaluate sensitization, and evaluate sensitization in young adults close to age 18, may consider attempting to replicate these sites. The three sites for which data were not available in the replication cohort should also be considered for future replication studies. The CpG site (cg09249800) in ACOT7, which was strongly associated with both atopy and high IgE in the IOW cohort, is particularly interesting because others have identified differentially methylated regions within this gene associated with asthma [46]; thus, it may play a role in allergic sensitization or allergic diseases.

Conclusions
Utilizing a two-stage design with a well-characterized but sparsely implemented RF feature selection method followed by logistic regression for both atopy and an alternate marker of atopy (high IgE), we identified a number of CpG sites associated with atopy. Most importantly, 13 sites were replicated in an independent cohort for atopy status: cg04983687 in the body of ZFPM1, cg12819873 in the 5′UTR of PRG2, cg07908654 (intergenic), cg06824199 in the body of KIAA0494, cg27469152 in the 3′UTR of EPX, cg27468224 (intergenic), cg13233042 in the body of ATL3, cg13197551 in the 3′ UTR of LSM14B, cg09332506 in the body of COPA, cg17041511 (intergenic), cg07970948 in the body of ZNF862, cg25854298 in the body of ASCC1, and cg17971837 in the TSS1500 of PVALB. Three of the 22 sites associated with atopy in IOW were not available for testing in the BAMSE cohort, so may be of interest for follow-up in future studies of DNA-M and atopy: cg09249800 in the body of ACOT7, cg07765167 in the TSS1500 of MRPL45, and cg24836822 in the body of KCNH2. These CpG sites and their associated genes could be treated as under-studied candidates for future studies of atopy; particularly cg04983687 in ZFPM1, cg12819873 in PRG2, cg27469152 in EPX, and cg09332506 in COPA.