Study population
The UCLA Health System includes two hospitals (520 and 281 inpatient beds) and 210 primary and specialty outpatient locations predominantly located in Los Angeles County. The UCLA Data Discovery Repository (DDR) contains de-identified patient EHRs that have been collected since March 2, 2013, under the auspices of the UCLA Health Office of Health Informatics Analytics and the UCLA Institute of Precision Health. Currently, the DDR contains longitudinal records for more than 1.5 million patients (inpatient and outpatient), including basic patient information (height, weight, gender), diagnosis codes, laboratory tests, medications, prescriptions, hospital admissions, and procedures. The UCLA ATLAS Community Health Initiative includes the EHR-linked biobank within the UCLA Health System. Currently, there are more than 37,000 genotyped participants with their de-identified EHR linked through the DDR. Participation is voluntary and privacy is protected by de-identifying the samples. Additional information regarding recruitment, consent, sample processing, and quality control pipelines can be found in previous work [17, 18, 21]. Patient Recruitment and Sample Collection for Precision Health Activities at UCLA is an approved study by the UCLA Institutional Review Board (UCLA IRB). IRB#17-001013.
Self-identified demographic information
Self-identified demographic information is collected as a part of clinical care which is then translated to the EHR. Participants self-identify race and ethnicity via two distinct drop-down fields where there are pre-determined multiple-choice fields for race and ethnicity (see Additional file 2: Table S1, S2 for full list containing exact terminology). At this time, only one selection from each category can be chosen as a patient’s primary race and ethnicity [25]. We group together race/ethnicity pairings to form “self-identified race/ethnicity” (SIRE) groupings (Additional file 2: Table S3). Patients also report their “Preferred Language” from pre-determined multiple-choice fields within the EHR. See the section “Notes on terminology and naming conventions” for a more detailed discussion of terminology used for SIREs.
Notes on terminology and naming conventions
In this section, we explicitly discuss the origin of the terminology and naming conventions used throughout this manuscript with respect to genetic ancestry, race, and ethnicity. We refer to Peterson et al. [26] for a more comprehensive description of terms for GWAS in ancestrally diverse populations.
The term “genetic ancestry” refers to the characterization of the population(s) from which an individual is descended and describes the genetic relationship implied by shared, large segments of genomic DNA between an individual and these ancestors [27]. Throughout this work, we reserve this term to describe individuals with information about the origin of their recent biological ancestors. For instance, we treat populations represented in genetic reference panels (e.g., 1000 Genomes Project [28, 29]) as instances of genetic ancestry since the information describing the origin of the recent biological ancestors represented in the samples is known.
Much of this work involves inferring the genetic ancestry information for a set of individuals. We introduce the term “genetically inferred ancestry (GIA)” to describe the genetic characterization of individuals within a group who likely share recent biological ancestors as inferred by a method of choice. We emphasize that GIA differs from genetic ancestry in that GIA depends on the inference method (e.g., clustering) and choice of reference data (e.g., 1000 Genomes).
The terms “Native American genetic ancestry” and “Native American GIA” refer to ancestry and/or recent biological ancestors from individuals originating from indigenous groups originally from North America, Central American, and South America. The term “Native American race” refers to the definition used by the US Census, “ a person having origins in any of the original peoples of North and South America (including Central America) and who maintains tribal affiliation or community attachment” [30]. We recognize that individuals in this group may prefer other terms such as “American Indians.” To be clear, identification of subjects as Native American GIA is not meant to imply a tribal status.
In the context of this work, the term “African genetic ancestry” describes individuals whose recent biological ancestors originated from the continent of Africa. “African American (AA) GIA” refers to an admixed group of individuals within the USA who have recent biological ancestors inferred to be of African ancestry and thus have partial or total ancestry originating from Africa. The term “Admixed American ancestry” refers to those with recent biological ancestors from European, African, and Native American ancestries that achieved admixture in North America, Central America, and South America. Thus, Admixed American ancestry contains global proportions of all three ancestry groups. “Hispanic Latino American (HL) GIA” refers to the group of admixed individuals within the USA whose recent biological ancestors were inferred to be individuals of Admixed American ancestry. “European ancestry” refers to individuals with recent biological ancestors with origins in continental Europe. “European American (EA) GIA” refers to individuals within the USA with recent biological ancestors inferred to be of European ancestry, thus, partial or total ancestry originating from Europe. “East Asian ancestry” and “South Asian ancestry” refers to individuals with recent biological ancestors from East Asia and South Asia respectively. “East Asian American (EAA) GIA” and “South Asian American (SAA) GIA” refers to individuals within the USA with recent biological ancestors inferred to be of East Asian ancestry or South Asian ancestry.
We disapprove that the term “White/Caucasian” is a preset multiple-choice option under the race field within the medical records and renounce its usage due to its erroneous origins and historically racist implications. We strongly discourage the connection of the term “Caucasian” with the discussion of race, a social construct separate from biology, and emphasize that the term does not have biological implications [31]. For subsequent analyses presented in this work, we use “White” to refer to the “White/Caucasian” category. Although this terminology is still built into the language of many documents and surveys, such as EHRs, we make this change to avoid perpetuating its usage within the field.
Basic genotype quality control
Bio-samples collected from the UCLA ATLAS Community Health Initiative in the form of blood samples were de-identified and then processed for DNA extraction and genotyping. We utilized a “frozen snapshot” of ATLAS data composed of all samples processed up to 6/18/2021. ATLAS participants (N=36,779) were genotyped using a custom genotyping array constructed from the Global Screening Array with the multi-disease drop-in panel [19] under the GRCh38 assembly. Overall, the array measured 700,079 sites for capturing single-nucleotide polymorphisms (SNPs) and short insertions and deletions (indels).
We filtered out poor-quality markers by removing unmapped SNPs, SNPs with >5% missingness, and strand-ambiguous SNPs (M = 19,313 variants removed). We excluded samples with missingness >5% (N=1 individual removed). We identified duplicate individuals (or identical twins/triplets, etc.) using KING 2.2.2 [32] (“--duplicate”) and removed the individual with the lowest missing rate from each pair (N=42 individuals removed). All quality control steps were conducted using PLINK 1.9 [33]. Following sample- and variant-level quality control, M=667,191 genotyped SNPs remained across N=36,736 individuals for downstream analyses. All subsequent genetic analyses in this paper utilize this QC’d set of genotypes. Additional steps of QC may be conducted before running specific analyses, as described in more detail below. A summary of the sample sizes and sets of SNPs used in subsequent analyses is described in Additional file 8: Table S14. We refer to our previous work for a more thorough description of the quality control pipelines constructed for ATLAS [17].
Genotype imputation
After performing array-level genotype quality control, the PLINK-formatted genotypes were converted to VCF format and uploaded to the Michigan Imputation Server [34]. On a variant level, the server removed the variant if it was not an A, C, G, T allele, monomorphic, a duplicate, an allele mismatch between the reference panel and provided data, an insertion-deletion, or if the SNP call rate is less than 90%. The server will additionally correct for any necessary strand flips or allele switches needed to match the reference panel. The server additionally phases the data using Eagle v2.4 [35], and imputation is performed against the TOPMed Freeze5 imputation panel [20] using minimac4 [36]. In summary, the explicit parameters used on the server are “TOPMed Freeze5” for the reference panel, “GRCh38/hg38” for the array build, “off” for the rsq filter, “Eagle v2.4” for phasing, no QC frequency check, and “quality control & imputation” for the mode. After we filtered by R2>0.90 and MAF>1%, the final set of variants contained M=8,048,268 sites.
Genetic relatedness inference
We computed pairwise kinship coefficients to determine family relationships using King 2.2.2 [32]. We performed inference on the set of genotype data passing quality control (see “Basic genotype quality control”) for a total of N=36,736 individuals and M=667,191 SNPs. We identified a set of unrelated individuals (N=35,761) up to degree 2 where individuals with kinship coefficient <0.0884 were included (“king --unrelated --degree 2”). This level of relatedness is expected since members of the same family will often be within the same healthcare system.
Continental genetic inferred ancestry
We estimated GIA membership using a 2-step clustering procedure. First, we performed principal component analysis (PCA) [37] on all individuals in ATLAS (N=36,736) and samples from 1000 Genomes. Specifically, we first filtered genotypes from ATLAS by Mendel error rate (“plink --me 1 1 –set-me-missing”), founders (“--filter-founders”), minor allele frequency (“–maf 0.15”), genotype missing call rate (“--geno 0.05”), and Hardy-Weinberg equilibrium test p-value (“–hwe 0.001”). The filtered genotypes from ATLAS are then merged with the 1000 Genomes phase 3 dataset [28]. We align reference alleles between the two sets of data and filter out SNPs that are not an A, C, T, or G allele. Next, a 2-step LD pruning is performed on the merged dataset: (1) “--indep 200 5 1.15”, (2) “--indep-pairwise 100 5 0.1.” All filtering steps and LD pruning were performed using PLINK 1.9 [38]. This resulted in a total of 22,589 SNPs across 36,736 individuals in ATLAS. We computed the first 10 principal components using the FlashPCA 2.0 software [39] with all default settings.
For the second step, we perform clustering on the principal components to estimate GIA cluster membership for each individual in ATLAS. We use the K-nearest neighbors (KNN) algorithm where we use the superpopulation name of the samples in 1000 Genomes to define the cluster labels. The superpopulations form 5 clusters: European, African, Admixed American, East Asian, and South Asian genetic ancestry. For each ancestry cluster, we run KNN on the pair of PCs that capture the most variation for each genetic ancestry group: European, East Asian, and African ancestry groups utilize PCs 1 and 2, the Admixed American group use PCs 2 and 3, and the South Asian group use PCs 4 and 5. For each ancestry group inference, we run KNN separately. In each analysis, we use 10-fold cross-validation to select the “k” hyper-parameter from k=5, 10, 15, 20. If a sample from ATLAS had >0.50 cluster membership, then the sample is reported as the genetic inferred ancestry represented in that cluster (European genetic ancestry (GA) → European GIA, African GA → African American GIA, Admixed American GA → Hispanic Latino American GIA, East Asian GA → East Asian American GIA, South Asian GA → South Asian American GIA). See “Notes on terminology and naming conventions” for a more in-depth discussion about the naming of GIA clusters. Individuals who did not attain >0.50 membership in any cluster or were matched to multiple clusters were reported as being ‘Ambiguous GIA’. A comparison between the GIA clusters and the genetic ancestry clusters from 1000 Genomes in PC-space is visualized in Additional file 1: Fig. S2.
Subcontinental genetic inferred ancestry
We estimate subcontinental GIA membership for individuals within the East Asian American GIA group using a 2-step clustering procedure similar to the continental GIA clustering discussed in a prior section (“Continental genetic inferred ancestry”). First, we perform PCA on all individuals in the EAA GIA group in ATLAS (N=3,331) and samples from the East Asian ancestry population in 1000 Genomes. Using only the genotyped SNPs, we perform the same filtering steps as described above, namely filtering ATLAS genotypes by Mendel error rate, founders, MAF > 0.15, genotype missing call rate, Hardy-Weinberg equilibrium test, and LD pruning. Following sample- and variant-level quality control, M=36,504 SNPs remained. We also found that not restricting to only unrelated individuals does not bias our estimates (Additional file 8: Table S16). We then compute the first 10 principal components using FlashPCA with all default settings.
For the second step, we perform clustering on the principal components to estimate subcontinental GIA cluster membership for each individual in the East Asian American GIA group in ATLAS. We use the K-nearest neighbors (KNN) algorithm where we use the population name of the East Asian ancestry samples in 1000 Genomes to define the cluster labels. The populations form 5 clusters: Han Chinese, Southern Han Chinese, Dai Chinese, Japanese, Kinh Vietnamese genetic ancestry. We run KNN using PCs 1–4 with 10-fold cross-validation to select the “k” hyper-parameter from k=5, 10, 15, 20. If a sample from ATLAS had >0.90 cluster membership, then the sample is reported as the genetic inferred ancestry represented in that cluster. Individuals who did not attain >0.90 membership in any cluster were reported as being “Ambiguous EAA.” A visualization of the inferred GIA clusters is visualized in Additional file 1: Fig. S3A.
Alternatively, we can define GIA clusters using self-identified information from the samples in ATLAS. We perform a similar approach as above, except we use ATLAS individuals’ self-identified race as the labels to define the clusters. We limit cluster definitions to self-identified race groups with N>20 for a total of 7 clusters: Chinese, Filipino, Japanese, Korean, Taiwanese, Thai, and Vietnamese. Although we do not utilize label information from 1000 Genomes, we still use the PCs computed on the merged ATLAS and 1000 Genomes dataset to keep PCA projections consistent across the 1000 Genomes-based and self-identified race-based clustering methods. We run KNN using the same procedure and thresholds as above. Again, individuals who did not attain >0.90 membership in any cluster were reported as being “Ambiguous EAA.” A visualization of the race-based inferred GIA clusters is visualized in Additional file 1: Fig. S3B. Explicit clusters could not be confidently computed for other continental GIA groups.
IBD calling
For identity-by-descent (IBD) calling, an interim version of the ATLAS data consisting of 24,318 individuals was used. First, ATLAS data was merged with the 1000 Genome Project [28], the Simons Genome Diversity Project [40, 41], and the Human Genome Diversity Project [42, 43]. Data was filtered to remove duplicated sites and individuals or sites with more than 1% missingness. Hardy-Weinberg equilibrium was calculated in the largest SIRE groups (NH-White, HL-Oth, NH-AfAm, NH-Asian) and we removed sites that did not pass a filter of p-value < 1×10−10. We also removed individuals whose EHR sex did not match PLINK estimated genotyped sex and those who had excess heterozygosity. Lastly, we used only sites with a MAF greater than 5%. In total, 418,195 SNPs were kept for IBD analysis. The merged dataset was then statistically phased using Shapeit4 [44]. IBD was called using iLASH using default parameters [2]. For downstream analysis, IBD segments were summed between individuals to create a list of edges, where each row represented a pair of individuals, and each column represented the total genome-wide IBD between those two individuals. This matrix was filtered to remove rows representing individuals who were third degree relatives or closer, calculated using KING. We then created an undirected weighted graph using the R Package iGraph [45] where the nodes were the individuals, and the edges represented the amount of IBD shared between a pair of individuals. The InfoMap community detection algorithm, implemented in iGraph, was used to detect IBD communities [46]. InfoMap was run with default parameters. iGraph was again used for cluster visualization. The 20 largest communities were selected for visualization, and outlier nodes with degree less than 30 were removed. The graph was then visualized with the Fruchterman and Reingold force directed layout, run with 1000 iterations [47]. Each community was assigned a unique color to ease visualization.
Genetic admixture analysis
We inferred the proportion of genetic ancestry using the ADMIXTURE software [48] under the unsupervised clustering mode with the number of clusters k=4, 5, 6. Specifically, we restrict to only SNPs with only an A, C, G, T allele and with MAF > 0.05 (“--maf 0.05 --snps-only ‘just-acgt’”) within ATLAS. We then merge the data from ATLAS with the 1000 Genomes phase 3 dataset and limit inference to only the subset of the overlapping SNPs. We then perform LD pruning every 2 kb on the merged dataset (“--bp-space 2000”). All filtering steps and LD pruning were performed using PLINK. This resulted in a total of 223,095 SNPs across 36,736 individuals in ATLAS which was then used for ancestry inference using ADMIXTURE. We also found that not restricting to only unrelated individuals does not bias our estimates (Additional file 8: Table S16).
Finally, we performed the admixture analysis with “./admixture atlas_1kg_bed_fille k” with k equal to 4, 5, or 6. We compare the ancestry proportions from each SIRE to estimate the ancestry represented in each mixture component. For k=4, we label the component with the majority of NH-White individuals as European ancestry, the component with the majority of NH-AfAm individuals as African ancestry, the component with the majority of NH-Asian individuals as East Asian ancestry, and the component with the highest number of HL-Other and HL-White individuals as Native American ancestry.
Phecodes
Billing codes documented in the medical record were used to generate phenotypes for analysis. The previously described phecode ontology (v1.2) maps the specific ICD-9 and ICD-10 codes from each patient’s chart onto a group of >1800 more general and clinically meaningful phenotype terms [49]. Mapping completed with the PheWAS R package [50] (https://github.com/PheWAS/PheWAS) creates binary phenotypes. Patients with one or more instances of a phecode were considered cases while patients without any instance of the corresponding phecode were considered controls. We limited analyses to phecodes with at least N-Case > 50 in each GIA group for a total of 1568 phecodes meeting this threshold in the EA GIA, 802 in the AA GIA, 1223 in the HL GIA, and 891 in the EAA GIA group.
Role of phecode occurrences for defining cases
We define a phecode occurrence as an encounter containing at least one of the ICD codes specified in the phecode definition. If a corresponding ICD code is found on another separate encounter, we treat this instance as a separate phecode occurrence. We compare two definitions of phecodes. For the first definition, we only require the presence of an ICD code attached to any type of patient encounter (i.e., laboratory tests, hospital, outpatient, medications, telehealth appointments, notes, phone calls). For the second definition, we require the presence of an ICD code attached to only encounters from appointment, office, hospital, or procedure visits. This stricter definition attempts to avoid capturing encounters that may be less indicative of a diagnosis (e.g., patient-physician telehealth messaging). We refer to these two definitions as all-encounter-derived and visit-derived phecodes. Using these two types of definitions, we then vary the number of phecode occurrences required for defining cases and compute the proportion of retained cases compared to the sample sizes if only 1 occurrence was required.
Association between phecodes and genetic ancestry
To test the differential prevalence of phecodes across genetic ancestry group, we performed a marginal association test for each phecode to compare its prevalence in one of the genetic ancestry groups (EA, AA, HL, and EAA) with the other three groups using the following logistic regression model:
$$\mathrm{logit}\left(\mathrm{phecode}\right)={\beta}_0+{\beta}_1\mathrm{genetic}\_\mathrm{ancestry}\_\mathrm{group}+{\beta}_2\mathrm{sex}+{\beta}_3\mathrm{age}\ \left[\mathrm{over}\ \mathrm{all}\ \mathrm{ATLAS}\ \mathrm{individuals}\right]$$
To account for the potential confounding effects of SIRE, we performed additional analyses with the model:
$$\mathrm{logit}\left(\mathrm{phecode}\right)={\beta}_0+{\beta}_1\mathrm{genetic}\_\mathrm{ancestry}\_\mathrm{group}+{\beta}_2\mathrm{sex}+{\beta}_3\mathrm{age}+{\beta}_4\mathrm{SIRE}\ \left[\mathrm{over}\ \mathrm{all}\ \mathrm{ATLAS}\ \mathrm{individuals}\right]$$
Statistical significance was determined after correcting for the number of tested phecodes with Bonferroni correction procedure (p-value<1.12×10−05).
We also applied the method to the East Asian American group to test the phecode prevalence difference across subcontinental ancestry groups including Chinese, Japanese, Filipino, and Korean Americans.
Association between genetic admixture proportions and phecodes
Given the substantial variation of admixture proportion within each SIRE group, we test the association of phecode with admixture proportion (k=4) for 600 phecodes within each of the seven ATLAS SIRE groups (NH-White, NH-AfAm, HL-Other, HL-White, NH-Asian, NH-PI, NH-AmIn) with the following model:
$$\mathrm{logit}\left(\mathrm{phecode}\right)={\beta}_0+{\beta}_1\mathrm{admixture}\_\mathrm{proportion}+{\beta}_2\mathrm{sex}+{\beta}_3\mathrm{age}\ \left[\mathrm{over}\ \mathrm{individuals}\ \mathrm{within}\ \mathrm{a}\ \mathrm{SIRE}\right]$$
Each model is limited to individuals of one SIRE instead of all ATLAS individuals. Only traits with >10 cases per SIRE were tested. Significance is determined after adjusting for the number of tested phenotypes with Bonferroni correction procedure (p-value <2.08×10−05).
GWAS quality control per GIA
When performing GWAS, we stratified individuals by GIA groups and then performed an additional level of QC separately within each GIA group. We limited analyses to the 4 largest GIA groups: European American (N=22,380), African American (N=1995), Hispanic Latino American (N=6073), and East Asian American GIA (N=3331). At this time, we omitted GWAS analyses within the South Asian American GIA group due to the limited sample size (N=625). Individuals who could not be clustered into a specific GIA group (N=2332) were also omitted from GWAS analyses.
For GWAS, we utilized imputed data consisting of 8,048,268 SNPs across N=36,736 individuals. Within each ancestry group, samples identified as heterozygosity outliers (+/− 3 SDs from the mean) were removed and SNPs that failed the Hardy-Weinberg equilibrium test (p-value <1×10−12) were also removed. Finally, we limited analyses to only SNPs with MAF > 1% within each GIA group, yielding a total of N=22,380 individuals and M=6.0 million SNPs within the European American GIA group, N=1995 individuals and M=5.9 million SNPs within the African American group, N=6073 and M=6.3 million SNPs within the Hispanic Latino American group, and N=3331 individuals and M=4.8 million SNPs within the East Asian American group.
Ancestry-specific GWAS
GWAS for all 6 traits were performed separately within each of the 4 continental ancestry groups that met the minimum N>50 cases. Additional GWAS-specific quality control is performed within each GIA group (see GWAS quality control per GIA). Using marginal logistic regression implemented in PLINK, we computed association statistics at each imputed autosomal SNP (“plink --logistic beta”). We additionally used age, sex, and PCs 1–10 as covariates where age is defined as the individuals’ current age within the EHR (as of September 2021). The values used to represent sex in this specific analysis are derived from patients’ self-identified sex as reported in the EHR. Within the EHR, this specific field is labeled as “Sex” and has a list of pre-determined multiple-choice fields where participants select one of the following options: “Male,” “Female,” “Other,” “Unknown,” “*Unspecified,” “X.” We find that 45.1% of individuals self-identify as male and 54.9% self-identify as female.
Meta-analyses
We perform meta-analyses for each trait across all GIA groups. First, we run ancestry-specific GWAS (see “Ancestry-specific GWAS”) within each GIA group with adequate sample size (N-Cases>50). We exclude analyses where very few of the SNPs produced a valid (non-NA) p-value which is likely attributed to a small sample size. The meta-analysis for skin cancer consisted of measurements from the EA and HL GIA groups; EA, AA, HL, and EAA GIA groups for ischemic heart disease; EA, AA, HL, and EAA GIA groups for chronic nonalcoholic liver disease; AA and HL GIA groups for uterine leiomyoma; HL and EAA GIA groups for liver/intrahepatic bile duct cancer; EA, AA, and HL GIA groups for chronic kidney disease. We performed each meta-analysis using a fixed effect model as implemented in PLINK (“--meta-analysis + logscale”). Association statistics computed from the meta-analyses are reported for SNPs that occur in at least two of the GIA groups.
PheWAS
We perform a PheWAS on the top SNPs from each ancestry-specific GWAS analysis that met genome-wide significance (p-value <5×10−8). Only phecodes with at least N-Cases>50 per GIA group were considered, resulting in a total of 1568 phecodes meeting this threshold in the EA GIA and 1223 in the HL GIA. Analyses in the AA and EAA GIA groups were excluded since the top SNPs were not significantly associated in these groups. We additionally stratified individuals by sex for the sex-specific phecodes, which are denoted in the definition of each phecode. This resulted in a total of 24 male- and 113 female-specific phecodes within the EA GIA group, and 12 male- and 87 female-specific phecodes within the HL group after limiting to phecodes with at least N-Cases > 50. We used individuals’ self-identified sex as reported in the EHR for this analysis.
We performed an association test between the top SNP and all phecodes in a given GIA group under a logistic regression model. Age, sex, and PCs 1-10 were used as covariates in the regression where age is defined as the individuals’ current age within the EHR (as of September 2021), and sex is derived from individuals’ EHR. The association test is performed using the logistic regression option implemented in PLINK (“plink --logistic beta”). The PCs used in the regression analysis were re-computed using only on individuals from within each respective GIA group. Phenotype significance was determined as p-value <0.05/(# phecodes), thus each GIA group has a specific significance threshold due to the different number of tested phecodes. A more stringent threshold also accounting for genome-wide significance is also computed where p-value <5×10−8/(# phecodes). Both thresholds are denoted in the PheWAS plots.
Effective sample size of associated phecodes
To assess the power of the PheWas analysis at rs2294915 between the European American (EA) GIA and Hispanic Latino American (HL) GIA groups, we compute the effective sample size (Neff) of each associated phecode, where the effective sample size balances the number of cases and controls when measuring the power of an analysis [51]: Neff = 2 / (1/Ncases + 1/Ncontrols).