Genotype effects contribute to variation in longitudinal methylome patterns in older people

Background DNA methylation levels change along with age, but few studies have examined the variation in the rate of such changes between individuals. Methods We performed a longitudinal analysis to quantify the variation in the rate of change of DNA methylation between individuals using whole blood DNA methylation array profiles collected at 2–4 time points (N = 2894) in 954 individuals (67–90 years). Results After stringent quality control, we identified 1507 DNA methylation CpG sites (rsCpGs) with statistically significant variation in the rate of change (random slope) of DNA methylation among individuals in a mixed linear model analysis. Genes in the vicinity of these rsCpGs were found to be enriched in Homeobox transcription factors and the Wnt signalling pathway, both of which are related to ageing processes. Furthermore, we investigated the SNP effect on the random slope. We found that 4 out of 1507 rsCpGs had one significant (P < 5 × 10−8/1507) SNP effect and 343 rsCpGs had at least one SNP effect (436 SNP-probe pairs) reaching genome-wide significance (P < 5 × 10−8). Ninety-five percent of the significant (P < 5 × 10−8) SNPs are on different chromosomes from their corresponding probes. Conclusions We identified CpG sites that have variability in the rate of change of DNA methylation between individuals, and our results suggest a genetic basis of this variation. Genes around these CpG sites have been reported to be involved in the ageing process. Electronic supplementary material The online version of this article (10.1186/s13073-018-0585-7) contains supplementary material, which is available to authorized users.

Background DNA methylation is a widely studied epigenetic modification with a role in the regulation of gene expression [1]. Local levels of DNA methylation differ within and between individuals. This variation in local methylation is associated with both genetic and environmental factors [2][3][4][5]. The majority of DNA methylation studies in human are based on cross-sectional cohorts. Such studies have reported that methylation levels at many CpG sites in the genome correlate with age [6][7][8][9][10]. Therefore, age is frequently treated as a covariate and adjusted for in a linear regression framework in which differences of DNA methylation between cell types, tissues and diseases are tested [11][12][13]. One implicit assumption behind this correction is that the rate of change at a methylation CpG site across time is constant between individuals, which may not be true. Several studies have revealed that there is a potential change in variability of DNA methylation with age [14,15], indicating that the rate of change of DNA methylation is different between individuals.
Estimation of the variation in such trajectories of DNA methylation with age between individuals is possible in a longitudinal analysis. Previous longitudinal analyses have investigated the relationship between SNPs and longitudinal DNA methylation [16,17]. Other studies focused on the association between DNA methylation and a phenotype measured on the same individual [18][19][20]. Differences between individuals in the pattern of change of DNA methylation over time were not considered in these studies. Here, we estimate the variation in trajectories of DNA methylation change among individuals in a longitudinal analysis. This approach may elucidate how epigenetic marks change differently between individuals and whether this variation is associated with genetic factors and biological function.
In this study, we estimated between-individual variation in the rate of change of DNA methylation at 344,000 loci in a longitudinal sample of older people from the Lothian Birth Cohorts 1921 and 1936. For each CpG site, we estimated the variation in the rate of change in each individual. Furthermore, the identification of such probes facilitates the estimation and partitioning of the variation underlying DNA methylation changes, for example, the contribution of genetic factors. We identified genetic loci that are associated with differences in the longitudinal changes in DNA methylation across individuals.

Methylation data
DNA was extracted from whole blood samples in Lothian Birth Cohort 1921 (LBC1921) at MRC Technology, Western General Hospital, Edinburgh (LBC1921), and the Wellcome Trust Clinical Research Facility (WTCRF), Western General Hospital, Edinburgh (LBC1936), using standard methods. Methylation typing of 485,512 probes was performed at the WTCRF. Bisulphite converted HD Methylation protocol and Tecan robotics (Illumina). Raw intensity data were background-corrected and normalized using internal controls, and methylation M values were generated using the R minfi package [21]. Detailed further quality control steps are given in Additional file 1.

Batch and covariate adjustment
Our analysis was based on the M value of DNA methylation. We regularized the M value by constraining it to be in the interval between − 9.96 and 9.96 (corresponding to the interval 0.001 to 0.999 of the beta-value). Furthermore, for each probe, we removed individuals with DNA methylation three standard deviations above and below the mean M value to exclude outliers. On average, 34 (out of 2894 samples, 1.2%) outliers were removed for each probe. DNA methylation (M value) in most (79.5%) of these outliers is in the range between − 6.6 and 6.6 (corresponding to the interval 0.01 to 0.99 of the beta-value), suggesting the "abnormal" DNA methylation values in the majority of outliers are not extreme values caused by the transformation from beta-value to M value. Covariates including sex, age and cell counts (CC), and batch effects including position in array (PIA), hybridization date (HD), set ID (SI), plate ID (PI) and array ID (AI, both PI and AI were regarded as random effects), were corrected for each probe. We used the residuals after this adjustment for further analysis. If y j is the DNA methylation value for probe j, then we used the residuals from the model

LBC genotype and imputation
Individuals from LBC1921 and LBC1936 were genotyped on Illumina 610-Quad Beadchip arrays. Full details of genotyping procedures are given elsewhere [22]. Standard QC filters were applied, and remaining genotyped SNPs were phased using HAPI-UR [23] and imputed using 1000 Genomes Phase I Version 3 [24] with Impute V2 [25]. Raw imputed SNPs were filtered to remove any SNPs with low imputation quality as defined by an R 2 < 0.8. Subsequent quality control removed SNPs with MAF < 0.01, those with HWE P < 1 × 10 −6 and a missing rate > 10%. After filtering, 7,760,689 SNPs remained for further analysis.

Estimation of random slope effects
For each probe, we fitted a mean level and a rate of change of DNA methylation for each individual and tested whether the variance due to these random effects was significantly larger than zero, using the mixed model where y ij is the methylation residual after QC steps, i is the ith individual, j represents the jth observation in individual i, u 1 is the mean effect, u 2 is the mean age effect, a i and b i are the random intercept (mean level of DNA methylation in each individual) and random slope, t ij represents standardized age (mean = 0 and variance = 1) and e ij is the random error. The random effects of e ij , a i and b i are assumed to follow a normal distribution. σ 2 a and σ 2 b are the variances of a i and b i , respectively. σ ab is the covariance between a i and b i , it was set to be zero under the assumption of independence between a i and b i . The likelihood ratio test (LRT) was used to test if σ 2 a and σ 2 b are equal to zero. We obtained a P value from the LRT using a χ 2 (1) distribution and then dividing the P value by two. This can be justified since under the null hypothesis, in 50% of cases, the test statistic is zero (or, follows a χ 2 (0)), and in 50% of cases it follows a χ 2 (1) [26]. Probes with a P value smaller than 1.5 × 10 −7 (0.05/344,000) for the random slope are defined as rsCpG.

Covariance between a i and b i
We applied an extended model that included the covariance σ ab between the random intercept and random slope to quantify the effect of covariance on the estimation of random effects. We calculated the Pearson corrections between the estimated random slope effects before and after incorporating the covariance term. All correlations were larger than 0.7 in the 1507 rsCpGs (mean correlation = 0.93). These results indicated that the introduction of the covariance term did not alter the results substantially.

Quadratic effect
We investigated the effect of modelling a quadratic average trajectory by adding a squared term for age (t 2 ) in the model ij þ e ij . All the correlations of the probes with and without fitting this additional term were found to be larger than 0.98 in 1507 rsCpGs.

Estimating the confidence interval of the correlation based on bootstrapping
Considering the background correlation of DNA methylation between CpG sites, we used bootstrapping to calculate the 95% confidence interval of the correlation (of the estimated variances) between two groups of individuals. We resampled 344,000 pairs of variances with replacement from the original data and estimated the correlation based on these pairs. We repeated this step 30,000 times to calculate the 95% confidence interval of the correlation.

GWAS on random effects
Based on the random effects estimated from the above mixed linear model, we performed a series of genome-wide association studies by using the random effects as the dependent variables with the software PLINK2 [27]. All QC-ed SNPs were used, and P value threshold for the significance was Bonferroni corrected (P < 5 × 10 −8 /1507).

Permutation analysis of the random slope test statistics
The mean and median test statistic across CpG sites for the effect of the random slope was very large, with a λ inflation value (mean test statistic) of 11.0. To verify if the results are inflated under the null hypothesis, we permuted ages across individuals and waves 500 times. For each round, we re-fitted the full model on the permuted data, and the mean chi-square among the probes was calculated. The mean of this distribution was around 0.73 (SD = 0.32), which shows no significant difference (P = 0.48) with the expected value of 0.5 under the null hypothesis. This indicates the statistical significance of the estimated effects of a random slope is not caused by the violations of the assumptions of the distribution of the test statistic under the null hypothesis.

Mapping CpG Islands and differently methylated region (DMR)
Genomic positions of the CpG island were obtained from the UCSC Genome Browser [28]. Annotation information of the differently methylated region (DMR) was from the Illumina DNA methylation annotation file (GEO ID GPL13534). The significance of enrichment analysis was assessed by permutation.

PANTHER over-representation test
We used all 25,537 genes that are adjacent to the QC-ed 344,000 probes as the background. Eighteen thousand six hundred seven of the genes overlapped with the gene list in the PANTHER database. Ten thousand twenty out of 1235 rsCpG nearby genes were in the PANTHER database. The enrichment test was based on Fisher's exact test, and protein classes with a false discovery rate (FDR) smaller than 0.05 were selected.

Heritability of probes
We utilized the heritability of the significant probes estimated in the Brisbane Systems Genetics Study (BSGS) cohort [29,30] to validate the genetic contribution to these probes. The significance of the difference from the null distribution of the mean heritability was estimated based on the average heritability of 1507 randomly selected probes from 30,000 permutation tests.

SNP by age effect on DNA methylation
The SNP effect on random slope can be defined as slope i = β k × d ik + e i , i is the ith individual, β k is the effect size of SNP k on random slope, and d ik is the dosage of SNP k in individual i. Since the DNA methylation of individual i at time point t ij is y ij = slope i × t ij + e ij = (β k × d ik + e i ) × t ij + e ij = β k × d ik × t ij + e i × t ij + e ij (main effects are ignored here), the SNP effect on random slope can be interpreted as SNP by age effect on DNA methylation. To compare the power in detecting these two effects, we simulated 3000 individuals, each with three age points sampled at round 60, 70 and 80 years old. DNA methylation of individual i at time point t ij was simulated by using y ij = (β k × d ik + e i ) × t ij + e ij . d ik was sampled from (0,1,2) assuming Hardy-Weinberg equilibrium, the minor allele frequency of the SNP ranges from 0.05 to 0.5, the random error e ij and random slope e i (not explained by SNP) are assumed to follow a standard normal distribution. The effect size β k of the SNP on the random slope was simulated in two ways: (1) β k was sampled from a uniform distribution in the range of − 0.1 to 0.1 and (2) β k was set to zero. For each type of data, we obtained P values in three ways: (1) from association between the SNP and the estimated random slope from a mixed linear model, (2) from association between DNA methylation and a SNP by age effect and (3) from association between DNA methylation and a SNP by age effect with random slope fitted as a covariate. We repeated this simulation 300,000 times and compared P values and median chi-square (λ median ) between these associations. Based on the simulated data with no SNP effect on the random slope (β k = 0), we found the P values from the second analysis method were inflated (λ median = 2.01). P values from the other two ways were not inflated (λ median = 1.00), and no difference in the detecting power was identified between these two ways based on the simulated data with SNP effect on the random slope (β k~U (− 0.1,0.1)).

Linkage disequilibrium (LD) clumping
We applied LD clumping on the GWAS significant SNPs using PLINK2 [27], and imputed LBC genotype data was used as the reference. For each significant (P < 5 × 10 −8 ) SNP, LD (R 2 ) between this SNP and other significant (P < 5 × 10 −8 ) SNPs within 1 Mbp distance were calculated and SNPs with LD larger than 0.1 were defined as a clump. Within each clump, only the SNP with smallest P value would be selected during LD clumping.
All analysis was performed by using R package, version 3.2.2 [31]. Figures were generated using ggplot2 [32].

Data
DNA methylation was measured on individuals from Lothian Birth Cohort 1921 (LBC1921) and Lothian Birth Cohort 1936 (LBC1936) [33,34] using Illumina Infinium HumanMethylation450K BeadChip arrays. There were 3471 samples across all waves of data collection (Additional file 1: Table S1), and 344,000 DNA methylation probes remained after removing probes encompassing SNPs annotated by Illumina (GEO ID GPL13534) and probes identified as potentially cross-hybridizing [35] (see details of quality control steps in Additional file 1). Only individuals with DNA methylation measured at two or more different time points were considered, and samples with inconsistent measurements (match rate < 0.8) of control probes within individuals were removed (Additional file 2: Figure  S1), leaving 2894 samples from 954 individuals (Table 1). Among them, 283 individuals had DNA methylation measured at two time points, and 356 and 315 individuals were measured at three and four time points, respectively. The effects of covariates (sex, age and cell counts) and batches (position in the array, hybridization date, set ID, plate ID and array ID) on DNA methylation were removed before further analysis ('Methods' , Additional file 1: Figures  S2 and S3).

Identification of CpG sites with a random slope in methylation
For each CpG site, we estimated the variance of the rate of change (random slope) between individuals in a mixed linear model ('Methods'). A non-zero variance indicates the existence of individual differences in the rate of change in DNA methylation across time. Forty-two thousand two hundred fifty-three probes were found to have a statistically significant random slope (likelihood ratio test, P < 0.05/344,000, Bonferroni corrected) based on 2894 samples. Permutation test analyses indicated that the statistical significance of the estimated effects of the random slope is not caused by the violations of the assumptions of the test statistic ('Methods' , Additional file 2: Figures S4 and S5). Moreover, no substantial impact on the estimation of random effects was found by introducing a covariance ('Methods' , Additional file 2: Figure S6A) between the random slope and the mean level of DNA methylation in each individual into the model, or the inclusion of additional corrections for age effects such as a quadratic term ('Methods' , Additional file 2: Figure S6B). To obtain a robust set of CpG sites with a statistically significant variation in the rate of change, we divided the individuals into two groups according to the number of time points for which they have a measurement. One group contains individuals with two or three time points, and the other group has individuals with four time points. We applied the mixed linear model on each of these two groups and found that the estimated variances of random slopes for each CpG site were correlated between these independent groups (R = 0.41, 95% bootstrap CI 0.40-0.42, bootstrapping was repeated 30,000 times, 'Methods' , Fig. 1a). One thousand five hundred seven CpG sites were identified to have a statistically significant (p < 0.05/ 344,000) variation in the rate of change of DNA methylation in both groups (rsCpGs, Fig. 1b, Additional file 3: Table S2). The overlap is statistically significant (odds ratio = 4.9, P < 3.3 × 10 −5 , permutation test, 30,000 times), and these 1507 rsCpGs were used for further analysis. A summary of chi-square statistics for the variance of the random slope is presented in Table 2. We observed a larger variation in DNA methylation of rsCpGs compared to randomly selected probes in each wave (Additional file 2: Figure S7). Moreover, an increase in the variability of DNA methylation with age can be identified in most of these rsCpGs (Fig. 1c). These rsCpGs overlapped with probes identified by Slieker and colleagues [14]. Based on a cross-sectional study, Slieker et al. identified 6366 positions that showed changes in variably of methylation with age using 3295 whole blood DNA methylation profiles. Among those positions, 540 probes overlap with the 1507 rsCpGs in our study. This highly significant overlap (odds ratio = 45.9, P < 3.3 × 10 −5 , permutation test, 30,000 times) provides an independent confirmation of our results.

Genomic locations of CpG sites with random effects
The dynamicity of DNA methylation varies across the human genome [36,37]. To investigate whether the rsCpGs locate in the more variable genomic regions, we mapped these CpG sites to the genome and applied an enrichment test on these probes ('Methods'). We observed an enrichment of rsCpGs in the Shore region of CpG islands (regions within 2 kb upstream or downstream of a CpG island are called north shore and south shore, respectively). Genomic positions of the CpG island were obtained from the UCSC Genome Browser [28]), with an odds ratio (OR) of 2.0 for both the north shore (p < 3.3 × 10 −5 , permutation test, 30,000 times) and the south shore (P < 3.3 × 10 −5 , permutation test, 30,000 times) (Fig. 2). DNA methylation in the shore region was previously reported to be more dynamic than that in CpG islands [36], and our results indicate that CpG sites with random slopes locate in the regions with more malleable DNA methylation. Similarly, rsCpGs were found to be enriched in reprogramming-specific differently methylated regions (RDMR, regions differentially methylated in the reprogramming process) [37].

Biological enrichment of CpG sites with random effects
To explore the biological function of the rsCpGs, we applied a gene over-representation test on the nearest genes of rsCpGs using PANTHER (version 13.1) [38] ('Methods'). The result showed that 1235 genes around the 1507 rsCpGs were statistically significantly (Fisher's exact test, P = 3.7 × 10 −10 , FDR = 3.9 × 10 −8 ) enriched in Homeodomain (Homeobox) transcription factor (PC00119) protein class (Table 3). We also investigated the significance of these protein classes using a permutation test and found they remained significant (100 repeats). Furthermore, we performed Gene Ontology (GO) analysis on  [46] these genes and found that ectoderm development (GO 0007398, P= 5.9 × 10 −18 , FDR = 1.5 × 10 −15 ) and developmental process (GO 0032502, P = 1.8 × 10 −17 , FDR = 2.2 × 10 −15 ) were the most significantly over-represented biological processes (Additional file 4: Table S3). Cadherin signalling pathway (P00012), Wnt signalling pathway (P00057) and Heterotrimeric G-protein signalling pathway-Gq alpha and Go alpha-mediated pathway (P00027) were found to be significantly enriched pathways for genes around rsCpGs in a PANTHER pathway analysis (Table 3). Since there are only 177 primarily signalling pathways in PANTHER database [38], we further performed the pathway analysis in an integrated pathway database ConsensusPathDB (version 33) [39]. This analysis on the same gene sets showed that the most significant pathway for rsCpGs was "Neuronal System" (P = 1.4 × 10 −9 ). Full details of significant pathway results are given in the Additional file 5: Table S4.
Genetic effects on the random slope rsCpGs were found to be enriched in the CpG sites with large heritability (p < 3.3 × 10 −5 , permutation test, 30,000 times, 'Methods' , Fig. 3a), indicating a substantial genetic component to their variation. To examine the genetic contribution on the random effects of rsCpGs, we performed genome-wide association studies (GWASs) using PLINK2 [27], fitting the predicted random slope for each person (obtained from the mixed model analysis) as the dependent trait ('Methods'). Results showed that there were four significant SNP-probe pairs in total (P < 5 × 10 −8 /1507, after linkage disequilibrium (LD) clumping, 'Methods'), three of them are cis (in same chromosome) ( Table 4, Fig. 3b). In addition, 343 rsCpGs were identified to have at least one genome-wide significant (P < 5 × 10 −8 , after LD clumping) SNP effect (436 SNP-probe pairs).
Ninety-five percent of the SNPs are on different chromosomes from their corresponding probes (Fig. 4, Additional file 6: Table S5). The SNP effect on the random slope can also be interpreted as the SNP by age effect on DNA methylation ('Methods' , Additional file 2: Figure S8). Van Dongen et al. reported 71,894 CpG sites to have an interaction between genetic effects and age (P < 0.05) on DNA methylation [5], and the 343 rsCpGs with a significant SNP effect on the random slope from our analyses were enriched (P < 3.3 × 10 −5 , permutation test, 30,000 times) in these probes (Additional file 2: Figure S9). This provides an independent confirmation of our results.

Relationship between rate of DNA methylation change and covariates
To detect a possible contribution of the covariates in estimating the random slope of DNA methylation, we investigated two covariates that were previously identified to have a change of variation with ageing: body mass index (BMI) and walking speed (the time to walk 6 m) [20,40]. We fitted each of the covariates in the full model and re-estimated the significance of the variance of random slope. No significant changes were observed (Additional file 2: Figure S10), implying no contribution of these two covariates to variation of the random slope.

Discussion
We estimated the variation in the rate of change of DNA methylation for each probe by implementing a mixed linear model in a longitudinal analysis. One thousand five hundred seven probes (rsCpGs) were found to have statistically significant variation in the rate of change between individuals. These rsCpGs were enriched in the shore region of CpG island, which is consistent with more dynamicity of DNA methylation in CpG island shore region than the CpG island itself [37]. We found that the closest genes of rsCpGs were enriched in the Homeobox gene cluster, which was also reported in Slieker et al. to be associated with age-related variably methylated positions (aVMP) in cis [14]. The Homeobox gene cluster is involved in the process of cell development [41], and recent evidence showed that it is related to ageing [42,43]. Pathway analysis on these genes in PANTHER database indicated they were enriched in Wnt signalling pathway (P00057), which was also reported to be related to the ageing process [44,45]. One of the most significantly over-represented Gene Ontology (GO) category in these genes was the developmental process (GO 0032502), which was discovered to be significant for the probes that consistently drift among twins over time [16]. These results indicate the rsCpGs may be involved in the developmental process (such as ageing) by regulating their nearby genes.
There is a significant higher heritability of the 1507 rsCpGs compared to the overall level. GWAS results on the random slope identified 436 SNP-probe pairs (343 rsCpGs) with a genome-wide significant association (P < 5 × 10 −8 ), suggesting a SNP by age effect on the CpG sites. Among them, 95% of the SNPs were on different chromosomes from their probes, which (in the absence of non-identified confounders) indicated a potential major trans SNP by age effect on DNA methylation of rsCpGs.
Our study has several limitations. Although the permutation test indicates our results will not be inflated by the violations of the assumptions of the distribution of the test statistic under the null hypothesis, our results could be inflated by unknown confounding factors. We adjusted for known possible confounders, including the chronological age at which the samples were taken, but cannot exclude the possibility of unknown confounders that have effects on the mean or variance of the measured DNA  . It is significantly larger (P < 3.3 × 10 −5 , permutation test, 30,000 times) than the overall level. No significant correlation (R = − 0.005, P = 0.27) was found between heritability of probes and the distance to their meQTLs [47]. However, there is a small but significant association (R = 0.07, P < 2.2 × 10 −16 ) between the heritability and the mean phenotypic correlation (R 2 ) between a target probe and other probes on the same chromosome. This indicated that CpG sites with substantial heritability could contribute to the estimation of heritability of other CpG sites that they correlate with. b An example to show the significant association between SNP dosage and the random slope of DNA methylation methylation. The effects of covariates including age, sex and cell counts were adjusted in our quality control steps, and we further confirmed that BMI and walking speed have no effects on the rate of change in DNA methylation. However, other exposures, like medication, smoking status and disease status may potentially contribute to this variation, which can influence the estimation of random effects. Nevertheless, the 1507 rsCpGs that have a statistically significant random slope in two separate groups of individuals indicate that these results should be robust. There was no gene expression data on the same individuals available, and we simply assume that the expression of a gene can be regulated by its closest DNA methylation CpG site, which may not be true. Finally, our results are based on older individuals and may not apply to different age ranges.

Conclusions
Ageing is strongly correlated with changes in DNA methylation, and the rates of change over time at one CpG site can differ between individuals. We detected CpG sites with different changing rates (random slope) using a mixed linear model and found 1507 CpG sites that have a statistically significant rate of change in methylation between individuals, and that these different rates of change can be partially explained by genetic effect. Genes around rsCpGs were enriched in Homeobox gene clusters and Wnt signalling pathway, both of which have been reported to be involved in the ageing process. Our results imply that the changing rate of DNA methylation varies between individuals at several CpG sites, and this difference is associated with genetic factors. These CpG sites might be useful markers to better understand individual differences in ageing.