We focused on understanding the underlying biology that protects certain high-risk individuals against AD. We term these individuals “AD resilient individuals” and define them as individuals who are at least 75 years old, cognitively normal, and carry at least one APOE ε4 allele. Our approach consists of three key parts: linkage analysis and fine mapping, genetic analyses, and experimental biological validations. For simplicity, an overview of each step, datasets used, specific criteria applied, and high level results are presented in Fig. 1.
Pedigree selection
We used the Utah Population Database (UPDB) to identify large pedigrees with an evidence of excess AD mortality (i.e., families with a higher number of AD deaths than expected). The UPDB is a population-based resource linking the computerized genealogy of the Utah pioneers, and their descendants, to various electronic health data repositories for the state, including Utah death certificates [18]. The UPDB includes over seven million individuals, 2.5 million of which have at least three generations of genealogical data and are descendants of the original founders of Utah; over one million of these individuals have at least 12 of their 14 immediate ancestors in the database.
Since 1904, Utah death certificates have been coded and linked to individuals in the UPDB, allowing us to identify all individuals where AD is included as a cause-of-death. AD as a specific cause-of-death was first introduced to the International Classification of Disease (ICD) in revision 9 and retained in revision 10. Deaths were only considered an AD death if the death certificate included AD ICD codes (ICD9 331.0; ICD10 F00 or G30) as a primary or contributing cause-of-death. This study used a uniform, consistent source for all diagnoses (AD that contributed to cause of death as evidenced by presence on a death certificate) and is not limited by bias introduced by study designs with inconsistent methods of diagnoses, or family recall of disease symptoms. The most significant limitation of this analysis is that coding for AD diagnosis has been present since 1979 (ICD versions 9 and 10). Given the breadth of our data, this limits our ability to identify cases that might be related across multiple generations (e.g., great grandparent/great grandchild), but our requirement for three generations of genealogy means that very distant relationships within the same generation are possible (Additional file 1: Figures S1 and S2). The most likely misclassification is that a death certificate for an individual who died with AD did not include AD as a cause of death. This would result in an underestimate of the number of AD deaths in a pedigree. Although individuals dying from AD may have been censored from our observation in this resource, the assumption can be made that cases are uniformly censored within cohorts across the resource, leading to conservative, but unbiased, estimates of relative AD mortality within pedigrees.
We used a method previously described by Kauwe et al. [19] to identify large pedigrees with a statistical excess of AD mortality. Briefly, each pedigree in the UPDB consists of all descendants of a set of UPDB founders. We identified pedigrees with an excess of AD deaths by comparing the observed (i.e., number of affected individuals in the pedigree) to the expected numbers of AD-affected individuals within the pedigree. The expected number of AD deaths was estimated using population-based, cohort-specific rates of AD death estimated from all Utah death certificates for individuals in the UPDB genealogy. To calculate the expected number of AD-affected individuals in a pedigree, first we divided all individuals in UPDB into cohorts based on birth year (5-year blocks), sex, and birth state (Utah or somewhere else), and normalized expected AD incidence to adjust for cohort-specific variation in death certificate information. All individuals were assigned to one of the resulting 132 cohorts. The proportion of individuals with AD in a cohort is the cohort-specific AD death rate for the UPDB genealogy population. This approach controls for differences in diagnosis and use of ICD codes for AD over time and space.
Next, we assessed each pedigree individually. To calculate the expected number of AD-affected individuals in a pedigree, we divided all pedigree descendants into cohorts, as described above, and multiplied the number of total descendants from the pedigree within the cohort by the cohort-specific AD rate previously calculated (i.e., proportion of AD individuals in the cohort) and summed the values across all cohorts within the pedigree. Therefore, the expected number of AD-affected individuals in a pedigree is the sum of the expected number of AD-affected individuals from each cohort in the pedigree. Finally, the observed number of AD descendants for a pedigree is calculated by counting individuals in the pedigree with an ICD code that indicates AD as a cause-of-death.
We estimated the relative risk (RR) for AD for each pedigree as the observed number of AD-affected descendants divided by the expected number of AD descendants. One-sided probabilities for the alternative hypothesis testing an RR > 1.0 were calculated under the null hypothesis RR = 1.0, with the assumption that the number of observed cases follows a Poisson distribution (an approximation to a sum of multiple binomial distributions representing the number of expected cases per cohort) with mean equal to the expected number of cases. This Poisson approximation is statistically appropriate for both rare and common phenotypes, being more conservative for a common disease. Pedigrees exhibiting excess AD descendants over expected were defined as high-risk.
Samples
DNA and clinical phenotype data for AD cases and AD resilient samples for the linkage analysis were obtained from the Cache County Study on Memory Health and Aging (CCS), which has been described in more detail previously [20]. Briefly, the CCS was initiated in 1994 to investigate the association of APOE genotype and environmental exposures on cognitive function and dementia. This cohort of 5092 Cache County, Utah, residents (90% of those aged 65 years or older in 1994), has been followed continuously for over 15 years, with four triennial waves of data collection and additional clinical assessments for those at high-risk for dementia. DNA samples were obtained from 97.6% of participants. The Cache County population is exceptionally long-lived and ranked number one in life expectancy among all counties in the 1990 US Census [21]. All but one of the members of the CCS have been linked to the UPDB and their extended genealogies are known. This population was the source of most of the Centre d’Etude du Polymorphisme Humain (CEPH) families that have been used to represent Caucasians in many genetic studies worldwide, including the HapMap project. Recent analyses confirm that these data are representative of the general European-American population [22]. For this study, we needed both AD cases and resilient individuals identified in the same pedigrees.
First, we identified 232 AD resilient individuals (defined as those over age 75, cognitively healthy, and carrying at least one APOE ε4 allele) from the CCS with a strong family history of AD. The set consists of 135 females and 97 males, with mean age of 81 years. As previously mentioned, each of these individuals carries at least one APOE ε4 allele, and nine were homozygous for APOE ε4. We obtained WGS for 212 of these CCS samples using the Illumina HiSeq sequencer to an average depth of 40× and mapped the resulting reads with the Burrows-Wheeler Aligner (BWA) [23]. We performed variant calling using the Genome Analysis Toolkit (GATK) best practices (i.e., HaplotypeCaller) [24, 25]. We also genotyped each sample using the Illumina 2.5 M SNP Array for quality control and for use in linkage analyses.
Next, we identified 581 AD cases from the CCS, 492 of whom were followed from diagnosis to death. Since 2002, CCS participants with incident dementia have been followed prospectively in the Cache County Dementia Progression Study. An expert panel of neurologists, neuropsychologists, neuropsychiatrists, and a cognitive neuroscientist assigned final diagnoses of dementia following standard research protocols (e.g., NINCDS-ADRDA criteria for AD [20] or NINCDS-AIREN criteria for vascular dementia [26]). Each case was genotyped for the variants of interest using Taqman assays.
ADNI data used in the preparation of this article were obtained from the ADNI database (http://adni.loni.usc.edu/). The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For up-to-date information, see http://www.adni-info.org/.
Linkage analysis
Linkage analyses were conducted using pedigrees that included at least four AD resilient individuals and four AD cases. To identify key regions associated with AD resilience, we identified shared chromosomal segments among our AD resilient samples within each pedigree using MCLINK [27]. The set of OmniExpress SNPs considered were reduced to a set of high-heterozygosity markers with low or no pairwise linkage disequilibrium to enable unbiased linkage analysis. Pedigrees were analyzed using a general dominant model that assumed a disease gene frequency of 0.005 with penetrance estimates for carriers and non-carriers of 0.5 and 0.0005, respectively, and we considered different modes of inheritance and corrected for multiple tests [28]. We extracted inheritance information for each pedigree by reconstructing haplotypes using a Monte Carlo Markov Chain methodology with blocked Gibbs sampling [27,28,29]. For parametric analyses, MCLINK calculates robust multi-point linkage scores (theta LODs or TLODs) [29]. We consider TLOD scores > 1.86 (corresponding to a false-positive rate of one per genome) as suggestive evidence for linkage, and scores > 3.30 as significant, as defined by Lander and Kruglyak [30]. Using a conservative cutoff further allowed us to explore biological evidence for the maximal number of genes and variants, which are few by nature for this type of study.
Once linkage evidence was established via these methods, we utilized all SNP markers in the region to provide fine mapping localization evidence. Linkage evidence from each pedigree was considered independently.
WGS variant filtering
Variants within the one LOD interval of the maximum linkage score were analyzed using the Ingenuity Variant Analysis and Tute Genomics Analysis programs (https://www.qiagenbioinformatics.com/products/ingenuity-variant-analysis/). For the Ingenuity Variant Analysis we used version 3.0.20140422 with content versions as follows: Ingenuity Knowledge Base (Arrakis 140408.002), COSMIC (v68) [31], dbSNP (build 138 (08/09/2013)), 1000 Genome Frequency (v3) [32], TargetScan (v6.2) [33], EVS (ESP6500 0.0.21), JASPAR (10/12/2009) [34], PhyloP hg18 (11/2009), PhyloP hg19 (01/2009) [35], Vista Enhancer hg18 (10/27/2007), Vista Enhancer hg19 (12/26/2010) [36], CGI Genomes (11/2011), SIFT (01/2013) [37], BSIFT (01/2013), The Cancer Genome Atlas (09/05/2013), PolyPhen-2 (HumVar Training set 2011_12) [38], Clinvar (02/11/2014).
All variants from the linkage regions were filtered as follows (see Additional file 1: Supplementary Note 1 for the effect each filter had on the number of variants):
-
Included variants that are shared by resilient samples
-
Included variants with call quality at least 20.0 in AD cases or resilient samples, outside the top 0.2% of the most exonically variable 100-base-pair windows in healthy public genomes (based on the 1000 Genomes Project), and outside the top 1% of the most exonically variable genes in healthy public genomes (based on the 1000 Genomes Project)
-
Excluded variants if the allele frequency was at least 3% in the 1000 Genomes Project, the public Complete Genomics genomes, or the NHLBI ESP exomes (http://evs.gs.washington.edu/EVS/).
-
Included variants associated with gain-of-function, or were heterozygous, hemizygous, haploinsufficient, or compound heterozygous
-
Included variants experimentally observed to be associated with a phenotype by any of the following criteria: 1) pathogenic, possibly pathogenic, established gain-of-function in the literature, or inferred activating mutations by Ingenuity; 2) predicted gain-of-function by BSIFT; 3) located in a known microRNA binding site, or frameshift, in-frame indel, stop loss, missense, and not predicted to be benign by SIFT, or disrupt a splice site up to two bases into an intron; 4) deleterious to a microRNA or structural variant; 5) located in a known promoter binding or enhancer site; 6) located in an evolutionarily conserved region, determined by a phyloP p value ≥ 0.01, or 7) in an untranslated region
-
Included variants absent in AD cases in the pedigree and present in a gene within two protein interaction connections upstream, or one connection downstream, of genes that are known, or predicted, to affect susceptibility to late-onset familial or sporadic AD
Genetic validation analyses
We used three independent datasets for genetic validation analyses. First, all SNPs that met the filtering criteria (described above) were evaluated in a set of samples with sequence data. Then, significant markers from those analyses were genotyped and assessed for association in samples from the CCS. Finally, WGS data from the ADNI were analyzed. Our initial validation analysis was conducted using data from an augmented version of the Alzheimer Genetic Analysis Group dataset [12]. These data consist of whole exome sequences (WES) and WGS for 427 AD cases and 798 elderly controls originating from the United Kingdom and North America. The assembly and use of this dataset have been described in several studies (e.g., [39]). Briefly, since our dataset consisted of a mix of exomes captured using different kits, and whole genome sequences, we employed a highly conservative approach to variant selection to increase our confidence that analyzed variants are true positives. We limited our dataset of variants to only those genomic regions we expected to have been sequenced in each of the exomes (based on capture probes used for exome library preparation) and whole genomes. Next, we compiled a list of all the variants present in at least a single sample. We examined each of the variants from the list of total variants in each sample, whether or not the variant was called by the Genome Analysis Toolkit (GATK) best practices, and reassigned the genotype for that variant according to the following criteria. (1) If the variant was called by the GATK and passed all quality filters recommended by the GATK, we used the GATK genotype. (2) If no variant was called at the genomic position in question, we returned to the raw VCF file and if there were reads containing the variant, but the variant was not called because of failing filters or because only a small number of reads contain the variant, we set the genotype to missing for the sample. (3) Finally, if all the reads at this position for the sample indicated reference alleles, we set the genotype to homozygous reference.
Variants that were significant in the first validation analyses were genotyped in 523 AD cases and 3560 controls from the CCS (after exclusion of samples that were included in the linkage analysis). WGS from 191 AD cases and 279 controls from ADNI were used to conduct gene-based tests for association. These samples are described in detail on the ADNI website (http://adni.loni.usc.edu/data-samples/genetic-data/wgs/). Finally, there were no variants in these genes passed quality control in the Alzheimer’s Disease Sequencing Project samples.
We performed association analyses, using PLINK [40], between AD status and the top SNP in each linkage region (based on Ingenuity analyses), using a logistic regression and controlling for age, sex, and site. Given the linkage results, all tests were conducted assuming we were searching for a SNP with a protective effect against AD. We tested a single SNP from the linkage region in each family. As such, the alpha for the single SNP analyzed in each family is 0.05. Next, we used the sequence kernel association test (SKAT)-O to perform gene-based association tests in the ADNI samples to test whether each gene was a potential AD resilience gene [41]. SKAT-O was designed to combine both a burden test and a non-burden sequence kernel association test. It maximizes power from both test types, where burden tests are more powerful when the majority of variants in a region are both causal and in the same direction, and SKAT is adapted to regions with largely non-causal variants or causal variant effects are in different directions [41]. Thus, SKAT-O is ideal when the percentage of causal variants and their directions within a region are not known beforehand.
Gene expression studies
We examined levels of RAB10 and SAR1A expression in the temporal cortex of 80 brains with neuropathologic diagnosis of AD vs. 76 elderly control brains which lacked any diagnosis of neurodegenerative diseases. These brains were part of the Mayo Clinic RNA sequencing (RNAseq) cohort, described previously [42]. All subjects underwent RNAseq using Illumina HiSeq 2000, 101-base-pair, paired-end sequencing at the Mayo Clinic Genomic Core Facility. All the AD and some of the control brains were from the Mayo Clinic Brain Bank; whereas other control brains were from the Banner Sun Health Institute. Following quality control, raw read counts normalized according to conditional quantile normalization (CQN) using the Bioconductor package were used in the analyses. For differential gene expression (DGE) comparing AD vs. controls using the “Simple Model”, multi-variable linear regression analyses were conducted in R, using CQN normalized gene expression measures and including age-at-death, gender, RNA integrity number (RIN), brain tissue source, and flowcell as biological and technical covariates. We also performed DGE including cell-specific gene levels as covariates in addition to all covariates in the “Simple Model”, using the expression levels for the five central nervous system (CNS)-specific genes as follows: ENO2 for neurons, GFAP for astrocytes, CD68 for microglia, OLIG2 for oligodendrocytes, and CD34 for endothelial cells. The rationale for the “Comprehensive Model” is to account for any CNS cell-population changes that occur due to disease pathology. Significance accounting for multiple testing was assigned using q values which are based on false discovery rates [43].
Additionally, RAB10 and SAR1A expression levels were evaluated in publicly available datasets from human AD and age-matched control brains (GSE5281 and syn3159438). The GSE5281 dataset was obtained from laser microdissected neurons from AD and control brains [44]. The syn3159438 dataset was obtained from anterior prefrontal cortex (APC), superior temporal gyrus (STG), parahippocampal gyrus (PHG), and pars opercularis (PO) [45]. RNA expression values were log transformed to achieve a normal distribution. An analysis of covariance, including age and sex as covariates, was used to determine association with disease status as previously described [46, 47].
Biological validation studies
To further investigate the connection between RAB10 and SAR1A and AD risk, we assessed the impact of gene overexpression and silencing on APP and ß-amyloid levels in N2A695 cells.
We used the following plasmids for this study: pCMV6-Rab10 (Origene), pCMV6-Sar1A (Origene), pGFP-V-RS-Rab10 shRNA (Origene), pGFP-V-RS-Sar1A shRNA (Origene), pCMV-GFP, and pGFP-V-RS-scrambled shRNA (Origene). The optimal shRNA for each gene was selected from four possible shRNAs based on the plasmid producing the most robust knockdown in vitro.
Mouse neuroblastoma cells (N2A) expressing human APP-695 isoform (termed N2A695) were used in this study [48]. N2A695 cells were plated and grown in Dulbecco’s modified Eagle medium (DMEM) and Opti-MEM supplemented with 1% L-glutamine, 10% FBS and 1% antibiotic-anti-microbial solution, and 200 μg/mL G418. Upon reaching confluency, cells were transiently transfected using Lipofectamine 2000 (Life Technologies). Culture media was changed 24 h after transfection. After an additional 24 h, cell media and cell pellets were collected for subsequent analysis. Nine independent replicates were performed for each condition.
Cell death following overexpression and knockdown was assessed by measuring LDH release in the cellular medium (Thermo Scientific) according to the manufacturer’s instructions. Percentage of cytotoxicity was then calculated following manufacturer’s recommendations:
$$ \%\mathrm{Cytotoxicity}=\left(\left(\mathrm{Transfected}\;\mathrm{LDH}\hbox{--} \mathrm{Spontaneous}\;\mathrm{LDH}\right)\div \left(\mathrm{Maximum}\;\mathrm{LDH}\hbox{--} \mathrm{Spontaneous}\;\mathrm{LDH}\right)\right)\times 100 $$
To assess overexpression and silencing of RAB10 and SAR1A, total RNA was isolated from N2A695 cells 48 h after transfection using RNeasy (Qiagen). RNA was converted to cDNA using the High-capacity cDNA reverse transcription kit (Thermo Fisher Scientific). Gene expression was analyzed by real-time PCR using an ABI-7900 real time PCR system. Taqman (Thermo Fisher Scientific) real time PCR assays were used to measure the expression of RAB10 (Mm00489481_m1), SAR1A (Mm01150424_m1), and the housekeeping gene GAPDH (Hs02758991_g1). Samples were run in triplicate. To avoid amplification interference, expression assays were run in separate wells from the housekeeping gene.
Real-time data were analyzed using the comparative threshold cycle (CT) method [49]. Briefly, the CT is the PCR cycle at which fluorescence rises above background, allowing us to calculate the original RNA levels. For the comparative CT method, the average CT for RAB10 or SAR1A were normalized to the average CT for GAPDH. The resulting value was then corrected for assay efficiency. Samples with a standard error of 20% or less were analyzed. RAB10 shRNA resulted in a 54% reduction of endogenous RAB10, and SAR1A shRNA resulted in a 26% reduction of endogenous SAR1A.
To assess steady-state levels of RAB10, SAR1A, and APP, cell lysates were extracted in lysis buffer (50 mM Tris pH7.6, 1 mM EDTA, 150 mM NaCl, 1% TritonX-100, protease inhibitor cocktail) on ice. Lysates were centrifuged at 14,000xg for 10 minutes at 4 °C and the resulting supernatant was saved for SDS-PAGE and immunoblotting. Total protein concentration was measured by BCA assay according to manufacturer’s protocol (Thermo Scientific).
Standard sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) was performed using 4–12% Criterion Tris-HCl gels (Bio-Rad). Samples were boiled in Laemmli sample buffer prior to electrophoresis [50]. Immunoblots were probed with 9E10 (myc; Sigma), 6E10 (APP, sAPPα; Covance), 22C11 (APP, sAPPtotal; Millipore), sAPPβ (Clontech), and CT695 (APP, CTF-β and CTF-ɑ; ThermoFisher).
The levels of human Aβ40 and Aβ42 were measured from conditioned cell culture media by sandwich ELISA as described by the manufacturer (Thermo Fisher Scientific). ELISA values were obtained (pg/mL) and corrected for total intracellular protein (μg/mL) based on BCA measurements from cell lysates.
Aβ concentrations are expressed as mean ± standard deviation obtained from at least three separate experiments in each group. Data were assessed by one-way analysis of variance (ANOVA). When ANOVA indicated significant differences, the Student’s t-test was used with Bonferroni correction for multiple comparisons. Results presented are representative and those with p values < 0.05 were considered significant.