Statistical power in COVID-19 case-control host genomic study design

The identification of genetic variation that directly impacts susceptibility to SARS-CoV-2 infection and the severity of COVID-19 is an important step towards risk stratification, personalized treatment, and the development and deployment of therapeutics and vaccines. Given the importance of study design in infectious disease genetic epidemiology, we use simulation and draw on current estimates of exposure, infectivity, and test accuracy for COVID-19 to assess the feasibility of detecting host genetic factors associated with susceptibility and severity in published COVID-19 study designs. We demonstrate that the limited phenotypic data and exposure/infection information available in the early stages of the pandemic substantially reduce the ability to detect most genetic variants with moderate effect sizes, especially when studying susceptibility to SARS-CoV-2 infection. Our insights can aid in the interpretation of genetic findings emerging in the literature and guide the design of future host genetic studies.


Background
Variability in susceptibility and disease severity in response to viral exposure is a hallmark of infectious diseases [1]. This variation is partially due to a complex interaction between the pathogen and the host genome, and it is impacted directly or indirectly by factors specific to both the pathogen (e.g., viral load, viral genotype) and the host (e.g., sex, pre-existing medical conditions, race) [2]. Identifying host genetic factors responsible for this variation can pinpoint susceptible populations, inform personalized treatment, and guide vaccine development [3]. SARS-CoV-2, the virus that causes coronavirus disease 2019 (COVID-19), appears to be no exception [4]. By Spring 2020, several international research groups [5] were proposing or implementing studies to identify the host genetic factors that influence (i) susceptibility to infection with SARS-CoV-2 [5] and (ii) disease severity of COVID-19 once infected [5]. These include sequencing and array-based studies that investigate rare [6] and common [5,7] genetic variants.
Numerous case-control studies of varied designs have emerged to better understand the host genetic factors associated with susceptibility to SARS-CoV-2 and severity of COVID-19. Due to the fast-paced nature of the pandemic, one major challenge is how to best utilize existing resources while accounting for the limitations of the available phenotype and genotype data. The study designs proposed to date include prospective case-control recruitment designs where array-based genotyping or sequencing is planned [6,8], retrospective designs using existing genotyped cohorts with concurrent phenotyping for COVID-19 [9,10], and hybrid designs using prospectively collected cases and population controls with genotype data readily available [7]. Each design has implications for how the susceptibility and disease severity phenotypes are defined [11] and may have associated limitations that are not surmountable given the current state of the pandemic.
The field of infectious disease genetic epidemiology has long described the unique and complex implications that a requirement of exposure to the infectious agent imposes on study design [12]. Here, we draw on current estimates of exposure, infectivity, and test accuracy of COVID-19 to demonstrate the feasibility of detecting host genetic factors associated with COVID-19 susceptibility and severity with current study designs. We demonstrate why studying susceptibility to SARS-CoV-2 infection could be futile at certain stages of the pandemic. Our insights can aid in the interpretation of genetic findings emerging from the literature and guide the design of future host genetic studies.

Identifying host genetic factors of susceptibility to infection with SARS-CoV-2
An important objective of host genetic studies is to determine whether there are rare or common genetic variants that modify the risk of SARS-CoV-2 infection upon exposure. Examples of the existence of such variants for other infectious diseases are sparse in the literature [3], although for SARS-CoV-2, one recent study estimated 50% heritability for infection susceptibility [4]. Case-control studies to identify such genetic contributors to SARS-CoV-2 infection have, for practical reasons, defined cases as individuals with laboratory-confirmed SARS-CoV-2 infection and controls as those who tested negative [11,13]. Given low population exposure rates at the early stage of the pandemic [14], the lack of data on an individual's exposure to SARS-CoV-2 [3], and high infectivity [15], many of these "controls" will be misclassified: an individual who tests negative may simply never have been exposed, yet would have been infected had exposure occurred. Imperfect test accuracy also contributes to misclassification. For example, concerns have been raised about false-negative reverse transcription polymerase chain reaction (RT-PCR) tests in patients showing apparent COVID-19 illness; in some settings, up to 40% of samples were deemed falsely negative [16]. False-positive tests, although less prevalent due to the high specificity of RT-PCR tests [17], lead to misclassification of cases, which can also be problematic if disease prevalence is low. For hypothesis testing in genetic studies, the result of misclassification of either cases or controls is reduced statistical power for detecting common and rare genetic variants associated with susceptibility to infection with SARS-CoV-2 (Fig. 1).
We used simulations to demonstrate the factors that impact statistical power to identify genetic variants associated with SARS-CoV-2 infection susceptibility. Controls are assumed to be selected from persons testing negative for the virus. In addition to sample size, the effect size of a variant (odds ratio, OR, for infection per protective allele), and its minor allele frequency (MAF), critical factors include rates of population-level exposure, test accuracy, and baseline infection susceptibility (Fig. 1 and Additional file 1: Supplementary Methods). We assume that the majority of the population will be infected when exposed, and that carriers of protective alleles [19,20] have reduced infection susceptibility upon exposure. Therefore, without loss of generality, power is estimated over a range of ORs less than 1.
In the simulation, the number of genetic risk variants each individual carries is determined by the MAF under Hardy-Weinberg equilibrium (HWE). Here, we specified a rare (MAF = 0.01) and a common (MAF = 0.2) allele. Risk of SARS-CoV-2 infection upon exposure varies depending on the number of risk variants a person carries and the effect size specified (OR = 0.2, 0.5, 0.7 investigated here). Population-level exposure (specified to be 5%, 20%, and 40%) and test sensitivity (80%, 90%, and 100%) are treated as fixed parameters and varied across scenarios to simulate plausible real-world conditions. Population-level exposure, for instance, determines the proportion of simulated individuals exposed to SARS-CoV-2 and impacts the power of different study designs. Logistic regression is performed, regressing SARS-CoV-2 infection status on the number of protective alleles, and an association between the genetic variant and SARS-CoV-2 infection susceptibility is concluded in the simulation if the corresponding p value is below the genome-wide significance level (p < 5e−8). Further simulation details are provided in Additional file 1: Supplementary Methods.

Figure 1a demonstrates that under perfect test accuracy, not only is it difficult to detect rare variants with very large protective effects (OR = 0.2; MAF = 0.01; blue), it is also extremely challenging to detect common variants with moderate protective effects (OR = 0.7; MAF = 0.2; green) using sample sizes reported in existing genetic studies (e.g., > 2000 cases [7]). Moreover, low population-level exposure (5%; solid line, Fig. 1a) reduces statistical power, as many of the "controls" would have been infected if exposed to SARS-CoV-2, contributing to higher misclassification rates. Given low population exposure rates even in the hardest-hit regions [21], our chances of detecting a true association with infection susceptibility remain slim. The curation of better-defined controls through careful selection of those with a high probability of exposure to SARS-CoV-2 (such as health care workers or household members of positive cases), which is analogous to increasing the population-level exposure, improves power to detect associated genetic variants.

Figure 1b demonstrates that false-negative tests caused by low sensitivity of SARS-CoV-2 infection testing [16] can dramatically reduce statistical power. Increasing the proportion of false-negative tests (sensitivity = 0.9, green; sensitivity = 0.8, blue) substantially reduces study power even when the genetic variant of interest is a common variant with a large protective effect (OR = 0.5, MAF = 0.2; Fig. 1b). Increasing population-level exposure or using better-defined controls can improve power (Fig. 1a), but these power gains are no longer guaranteed when coupled with an inaccurate, low-sensitivity test for SARS-CoV-2 infection. Increasing population exposure does not result in higher statistical power when sensitivity = 0.8 (blue curve; Fig. 1b) because the test misclassifies many cases as controls and offsets any power gain that would be realized by carefully selecting exposed individuals as controls. With even lower test sensitivity (sensitivity = 0.7), increasing population-level exposure is actually detrimental to power for studying infection susceptibility (Additional file 2: Figure S1).
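To make the simulation recipe above concrete, the following is a minimal Python sketch of one way such a power calculation could be implemented. It is not the authors' code; the population size, replicate count, and default parameter values are illustrative assumptions chosen to keep the example runnable. It follows the steps described: genotypes drawn under HWE, Bernoulli exposure, genotype-dependent infection given exposure, an imperfect test with false negatives only, sampling of test-positive cases and test-negative controls, and a per-allele logistic regression tested at the genome-wide significance level.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2020)

def susceptibility_power(n_cases=2000, n_controls=2000, maf=0.2, or_per_allele=0.7,
                         exposure_rate=0.05, baseline_susceptibility=0.80,
                         sensitivity=1.0, pop_size=400_000, n_sims=200, alpha=5e-8):
    """Estimate power of a test-negative-control design for infection susceptibility."""
    hits = valid = 0
    for _ in range(n_sims):
        # Number of protective alleles per person under Hardy-Weinberg equilibrium
        g = rng.binomial(2, maf, size=pop_size)
        # Who is exposed to SARS-CoV-2
        exposed = rng.random(pop_size) < exposure_rate
        # Infection probability given exposure, log-additive per protective allele
        logit_p = (np.log(baseline_susceptibility / (1 - baseline_susceptibility))
                   + np.log(or_per_allele) * g)
        infected = exposed & (rng.random(pop_size) < 1 / (1 + np.exp(-logit_p)))
        # Imperfect test: false negatives only (perfect specificity assumed)
        test_positive = infected & (rng.random(pop_size) < sensitivity)
        case_pool = np.flatnonzero(test_positive)
        control_pool = np.flatnonzero(~test_positive)
        if len(case_pool) < n_cases or len(control_pool) < n_controls:
            continue
        cases = rng.choice(case_pool, n_cases, replace=False)
        controls = rng.choice(control_pool, n_controls, replace=False)
        y = np.r_[np.ones(n_cases), np.zeros(n_controls)]
        x = sm.add_constant(np.r_[g[cases], g[controls]].astype(float))
        try:
            fit = sm.Logit(y, x).fit(disp=0)
        except Exception:
            continue  # skip replicates with separation/convergence problems
        valid += 1
        hits += fit.pvalues[1] < alpha
    return hits / max(valid, 1)

# Common protective variant, low population exposure, perfect test accuracy
print(susceptibility_power(maf=0.2, or_per_allele=0.7, exposure_rate=0.05))
```

Power for a given scenario is the proportion of replicates in which the genotype term reaches p < 5e−8; sweeping exposure_rate, sensitivity, or the effect size reproduces, qualitatively, the kinds of comparisons shown in Fig. 1.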
Figure 1c shows that, combined with a low exposure rate, baseline infection susceptibility given exposure also affects power, with power decreasing as the baseline infection rate increases. For example, assuming a population-level exposure rate of 20% as seen in the hardest-hit geographic regions [22], a high baseline infection rate (e.g., 90%, the purple curve in Fig. 1c) diminishes the chance of detecting an associated variant with OR = 0.5 and MAF = 0.2 in a case-control study with 6000 cases and 6000 "controls." This is because increasing baseline infection susceptibility results in more people being infected if exposed to SARS-CoV-2, which, combined with a low exposure rate, leads to a higher misclassification rate among controls and lower power (e.g., 5268 out of 6000 "controls" are misclassified assuming 20% population exposure and 90% baseline infection rate; Additional file 1: Supplementary Methods). The loss in statistical power is exacerbated by the current (fortunately) low population-level exposure to SARS-CoV-2. With the sample sizes specified, detecting a true genetic association with SARS-CoV-2 infection susceptibility appears achievable only if the baseline infection susceptibility is low. However, recent studies have shown > 80% infectivity for individuals over age 30 [15], suggesting the blue (baseline infection susceptibility 80%) or purple (baseline infection susceptibility 90%) curves in Fig. 1c more closely reflect the current reality.

Fig. 1 Statistical power (i.e., the probability of detecting an association when it truly exists) to detect association between a genetic variant and infection susceptibility at the genome-wide significance level (5e−8) [18]. A 1:1 case-control study design was used for all parameter settings. Reported effect sizes are on the odds ratio (OR) scale, parameterized as log-additive for each additional protective allele. a Assuming perfect test accuracy and baseline infection susceptibility of 80% based on recent estimates [15], there is low statistical power to detect true associations when there is either low population-level exposure to SARS-CoV-2 or a moderate genetic (protective) effect size (OR = 0.7). Detecting rare variants (MAF = 0.01) remains challenging even with a much larger protective effect size (OR = 0.2). b Reducing the sensitivity of testing for SARS-CoV-2 infection not only reduces statistical power but also negates gains that result from increasing population exposure. c Assuming the 20% population exposure rate seen in the hardest-hit regions, baseline infection susceptibility, in the absence of the contributing protective genetic allele, can also severely impact power. Higher infection susceptibility (i.e., higher infectivity) can diminish any chance of detecting true signals with currently available sample sizes. Lower population exposure will further dampen statistical power, as seen in a. MAF, minor allele frequency
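The expected misclassification count quoted above can be verified with a short back-of-the-envelope calculation; the sketch below uses the stated assumptions (20% exposure, 90% baseline infection susceptibility, perfect test accuracy) and ignores the small effect of the protective allele on the aggregate infection rate.

```python
exposure = 0.20          # population-level exposure rate
susceptibility = 0.90    # baseline infection probability given exposure
n_controls = 6000        # test-negative "controls" sampled (1:1 design)

infected = exposure * susceptibility      # test-positive fraction of the population
uninfected = 1 - infected                 # pool from which controls are drawn
# A test-negative control is misclassified if unexposed but susceptible:
# such a person would have been infected had they been exposed.
misclassified = (1 - exposure) * susceptibility / uninfected
print(round(misclassified * n_controls))  # -> 5268
```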
The lack of information on individual-level exposure, coupled with low population-level exposure rates, high infectivity, and inaccurate, low-sensitivity tests, makes studies designed to identify genetic variation contributing to SARS-CoV-2 infection susceptibility largely infeasible. A larger study cohort with misclassification is not guaranteed to yield higher statistical power than a smaller dataset with well-defined cases and controls, as noted elsewhere [23]. Our simulation results indicate that careful selection of controls with a high probability of exposure (e.g., frontline workers or household members of positive cases), rather than sampling test-negative controls, is the preferable strategy. Given the negative effect of a low-sensitivity test on power, it is also preferable to include individuals who were monitored or tested multiple times in a given timeframe; such a design can potentially save costs and yield greater insight into genetic susceptibility to SARS-CoV-2 infection, even with a smaller sample size.

Identifying host genetic factors of disease severity of COVID-19
Another major objective of host genetic studies is to identify genetic variants associated with severity of symptoms given SARS-CoV-2 infection. Such studies may focus on severe respiratory disease [7] or on alternative phenotypes identified prospectively as researchers learn more about the sequelae of infection [24]. Ideally, cases should include infected patients who have the defined phenotype, and controls should be composed of infected patients with mild or no symptoms [3,25]. One commonly used definition of disease severity is hospitalization [11], which we use here for simulating case-control status. However, given the time, effort, and cost of recruiting and genotyping infected but non-hospitalized individuals, along with the urgent need for genetic insights, alternative designs that leverage matched controls from the general population have been proposed and implemented [7,11]. In contrast to the low population exposure rate that drives case-control misclassification in SARS-CoV-2 infection susceptibility studies, it is the low population infection rate (population exposure rate × baseline infection susceptibility) and the lack of information on individual-level infection that lead to misclassification in disease severity studies; some of the "controls" would have developed severe symptoms had they been infected with SARS-CoV-2.
Here, we examine the gain in statistical power and efficiency (in terms of requiring smaller sample size) in hypothesis testing when using test-positive controls over a design that samples controls from an untested, available population sample.
The number of genetic risk variants each individual carries is again determined by the MAF; here, a rare (MAF = 0.01) and a common (MAF = 0.2) allele are investigated under Hardy-Weinberg equilibrium (HWE). Risk of severe COVID-19 upon infection varies depending on the number of risk variants a person carries and the effect size (the odds of severe over mild disease; OR). Population-level infection (varied to be 5%, 40%, and 90%), baseline risk of hospitalization (5%), and test sensitivity (80%, 90%, and 100%) are specified as fixed parameters and varied across scenarios to simulate plausible real-world conditions. Population-level infection determines the proportion of simulated individuals infected with SARS-CoV-2, while the baseline risk of hospitalization dictates the probability of developing severe symptoms upon infection for individuals without the genetic risk variants. Logistic regression is performed, regressing hospitalization on the number of risk alleles, and an association between the genetic variant and COVID-19 disease severity is concluded if the corresponding p value is below the genome-wide significance level (p < 5e−8). Further simulation details are provided in Additional file 3: Supplementary Methods.
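As with the susceptibility analysis, a minimal Python sketch illustrates one way the two control-selection strategies could be compared by simulation. This is not the authors' code; perfect test accuracy is assumed, and the defaults (population size, case count, number of replicates) are illustrative assumptions chosen only to keep the example runnable.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2020)

def severity_power(n_cases=3000, maf=0.2, or_per_allele=1.7,
                   infection_rate=0.05, baseline_hosp=0.05,
                   control_design="test_positive", pop_size=1_500_000,
                   n_sims=100, alpha=5e-8):
    """Power of a 1:1 severity case-control study under two control designs."""
    hits = valid = 0
    for _ in range(n_sims):
        g = rng.binomial(2, maf, size=pop_size)          # risk alleles under HWE
        infected = rng.random(pop_size) < infection_rate
        # Probability of hospitalization given infection, log-additive per risk allele
        logit_p = np.log(baseline_hosp / (1 - baseline_hosp)) + np.log(or_per_allele) * g
        hospitalized = infected & (rng.random(pop_size) < 1 / (1 + np.exp(-logit_p)))

        case_pool = np.flatnonzero(hospitalized)
        if control_design == "test_positive":
            # Confirmed infected with mild/no symptoms: no misclassification
            control_pool = np.flatnonzero(infected & ~hospitalized)
        else:
            # Untested population controls: some would have been hospitalized if infected
            control_pool = np.flatnonzero(~hospitalized)
        if len(case_pool) < n_cases or len(control_pool) < n_cases:
            continue
        cases = rng.choice(case_pool, n_cases, replace=False)
        controls = rng.choice(control_pool, n_cases, replace=False)
        y = np.r_[np.ones(n_cases), np.zeros(n_cases)]
        x = sm.add_constant(np.r_[g[cases], g[controls]].astype(float))
        try:
            fit = sm.Logit(y, x).fit(disp=0)
        except Exception:
            continue
        valid += 1
        hits += fit.pvalues[1] < alpha
    return hits / max(valid, 1)

for design in ("test_positive", "population"):
    print(design, severity_power(control_design=design))
```

Sweeping infection_rate shows the gap between the two designs narrowing as population infection rates rise, mirroring the behaviour described for Fig. 2.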
The common (risk) variant effect sizes in Fig. 2a, b were chosen to reflect those recently reported in genetic studies of COVID-19 [7]. Only one red curve is shown because that study design uses test-positive individuals with mild symptoms as controls and is therefore not influenced by population-level infection rates; there is no misclassification attributable to infection rates. As expected, the higher red curve demonstrates that severity phenotyping without misclassification yields higher power in a case-control study than using controls, subject to misclassification, from the general population. Increasing population infection rates effectively closes the power gap between the two study designs, since most population-based controls would then have been infected, and hospitalized if they were going to be, thereby reducing misclassification. However, as serological tests have shown low population infection rates (e.g., 5% in Spain by May 11, 2020) even in the hardest-hit regions [21], we expect the gap in power demonstrated in Fig. 2 to remain for the foreseeable future. This observation is most relevant for prospective studies: careful selection of individuals infected with SARS-CoV-2 as controls improves power and can reduce spending on genotyping.
The effect size of an associated risk variant can, of course, have a large impact on the power of the study design (Fig. 2a vs. b). Detecting a common variant with moderate effect size (OR = 1.3, MAF = 0.2) is challenging without drastically increasing the number of participants studied. In fact, the sample size required to reach 80% power at the genome-wide significance level exceeds 4 times that required in Fig. 2a (Additional file 2: Figure S2). Similar patterns are seen when detecting a rare variant (Fig. 2c), as power is lower for such variants even with a large effect size (OR = 5, MAF = 0.01). The smaller sample size needed to achieve the same power when selecting test-positive rather than population controls (red vs. blue curves in Fig. 2c) once again demonstrates the potential for significant sequencing cost savings, given the low population infection rates in the current phase of this pandemic.
To quantify the costs saved by using controls with confirmed infection, we defined the relative reduction in sample size as 1 − (n_test-positive controls / n_population controls), where n_test-positive controls and n_population controls are the numbers of cases (1:1 case-control ratio) needed to achieve 80% power at the genome-wide significance level (5e−8) [18] under the two control designs. Figure 2d illustrates the relative reduction in sample size with varying population infection rates when studying a common variant with a large effect size (OR = 1.7, MAF = 0.2); similar qualitative conclusions are gleaned as the parameters are varied (Additional file 2: Table S1). As expected, studies using controls with confirmed infection show the greatest benefit when disease prevalence is low. Given current population infection rates of less than 10% [21], choosing controls with confirmed SARS-CoV-2 infection can save over 10% in genotyping costs. On the other hand, if genotype data for population controls are already available [9,10,26], ~1200 genotyped controls are required to detect a rare variant with a very large effect (OR = 5.0, MAF = 0.01) given the (currently) low population infection rate (5%). The required sample size rises to ~3800 for detecting a common variant with moderate effect (OR = 1.3, MAF = 0.2), and even more samples are needed if we allow the case-control ratio to decrease (Additional file 2: Figure S3). Given the sheer number of population controls available internationally, this seems like a reasonable and accessible strategy.

Fig. 2 Statistical power in hypothesis testing to detect a true association between a genetic variant and COVID-19 disease severity at the genome-wide significance level (5e−8). A 1:1 case-control study design was used for all parameter settings. Only one red curve is shown since the study design uses confirmed infected individuals with mild or no symptoms as controls (test-positive controls), which is unaffected by population-level infection rates and the corresponding case-control misclassification. Effect sizes are reported on the odds ratio (OR) scale for each additional risk allele (log-additive scale). Perfect test accuracy is assumed in all plots. a Assumes a common variant with large effect size (OR = 1.7, MAF = 0.2). Using test-positive controls (red) yields higher power than using population-based (untested) controls (blue). High population infection rates reduce the gap between the two study designs but remain unlikely in the current phase of the pandemic. b Detecting a common variant with moderate effect size (OR = 1.3, MAF = 0.2) is challenging without drastically increasing the number of participants included for either design. c Detecting a rare variant even with a large effect size (OR = 5, MAF = 0.01) is more difficult with currently available sample sizes. Using test-positive controls without misclassification once again demonstrates higher power compared to population-based untested controls with misclassification. d Assumes OR = 1.7 and MAF = 0.2. Relative reduction in sample size, 1 − (n_test-positive controls / n_population controls), from using test-positive controls compared to population-based controls; n_test-positive controls and n_population controls refer to the number of cases (1:1 case-control ratio) needed to achieve 80% power at the genome-wide significance level (5e−8) [18]. Relative reductions in sample size for other settings show similar trends and can be found in Additional file 2: Table S1.
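Once the required case counts under each design have been estimated (for example, with a power simulation like the one sketched earlier), the relative reduction reported in Fig. 2d is a one-line computation. The counts below are placeholders for illustration only, not values from this study.

```python
def relative_reduction(n_test_positive_controls: int, n_population_controls: int) -> float:
    """Relative reduction in sample size: 1 - n_test-positive / n_population."""
    return 1 - n_test_positive_controls / n_population_controls

# Placeholder case counts (1:1 design) needed for 80% power under each design
print(relative_reduction(n_test_positive_controls=5000, n_population_controls=5600))  # ~0.107
```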
Although beyond the scope of this article, the population-control strategy is limited by challenges of confounding and by host-specific factors that are often not available in population-based controls, leading to difficulties in interpreting genetic association results. When sample collection is planned prospectively, these results suggest a strategy of test-positive, matched controls for optimal power and interpretation.

Discussion
Understanding how host genetic factors contribute to variation in disease susceptibility and severity can shed light on heterogeneity in the immune response and the host-pathogen interaction and facilitate the development of therapeutics and vaccines. Genetic studies have already been deployed in several countries to investigate SARS-CoV-2 infection susceptibility and COVID-19 disease severity. However, ongoing studies are hindered by a lack of information on SARS-CoV-2 exposure and infection at the individual level, low population exposure/infection rates, and inaccurate, low-sensitivity tests. In particular, given current population exposure rates, a design that uses test-negative controls will not provide adequate power to detect variants that protect from infection when exposed to SARS-CoV-2; this makes infection susceptibility considerably harder to study than disease severity. A design that focuses on, for example, frontline workers such as healthcare professionals, who are at high risk of exposure and are more likely to receive repeated testing, would have much greater power. For case-control studies investigating the genetics of COVID-19 disease severity, recruiting controls who tested positive for SARS-CoV-2 infection, rather than using existing population-based controls, should be prioritized whenever feasible. Household controls, which are easier to recruit and provide the added benefit of increased exposure probability, could serve as another effective control-selection strategy to enhance study power. Leveraging both genetically related and unrelated household controls can also help disentangle genetic and environmental contributions to SARS-CoV-2 infection susceptibility and COVID-19 severity. Using test-positive controls also provides the added benefit of prospective phenotypic observation, allowing future genetic studies of as-yet-unknown phenotypic sequelae of this disease, even in those who are mildly affected or asymptomatic [24]. A prospective design also facilitates future investigation into the genetics of antibody levels and vaccine responsiveness, which is of major interest in ongoing vaccine development and deployment efforts.
In contrast to the SARS-CoV-2 infection susceptibility studies, we have assumed perfect test sensitivity and specificity for the COVID-19 severity study simulations. For study designs where controls have confirmed infection, the low sensitivity (false negatives) characteristic of RT-PCR tests used for SARS-CoV-2 infection [16] is not a major consideration. Similarly, designs that use population-based controls without leveraging information on test results are also unaffected by false-negative RT-PCR tests. Previous studies have reported near-perfect specificity for RT-PCR tests [27], which leads us to conclude that imperfect test accuracy has limited impact on host genetic studies of COVID-19 disease severity, aside from its effect on meeting recruitment milestones.
Thus far, we have assumed RT-PCR tests were used to identify individuals infected with SARS-CoV-2 and that the exposure rate was constant across the sample. However, RT-PCR cannot indicate whether someone has been previously infected with SARS-CoV-2 due to its short detection period. Serology tests typically have longer detection periods and thus can be used to assess previous exposure to pathogens and contribute to a more refined selection of controls. It is worth noting that, as cases and controls are currently being selected from multiple sites for practical reasons, varying test accuracy and exposure rates over time and location would further impact study power.
Besides exposure, there are other confounders and mediators that we have not addressed in this article which can further reduce power, diminishing our chances of detecting variants that directly impact COVID-19 disease severity. Confounding could, for example, lead to the identification of genetic variants that influence comorbidities such as diabetes or hypertension, which are already known to place individuals in a higher risk category for severe disease; such variants are likely not the most relevant for informing therapeutic and vaccine development. Variation associated with factors specific to the host, rather than with host-pathogen interactions, is a major concern for case-control studies using population-based convenience controls, as opposed to case-control studies in captive, highly exposed populations such as frontline workers. One example of such confounding could be the ABO locus reported in [7], where blood donors were used as a control group and type-O blood may have been over-represented due to its universal donor status. Given the negative impact of ignoring important intrinsic variables on study power, stratification or matching on known intrinsic variables, especially confounders likely to be present when combining individuals from different geographical locations, is a desirable approach that should be considered in COVID-19 case-control host genetic studies moving forward [28]. Estimates of genetic effects, measured as odds ratios, can also be difficult to interpret even if the unadjusted intrinsic variables are not confounders [28]. Future studies, potentially through matching and/or stratification, will also need to take into account the age distribution of cases and controls, given the varying infection susceptibility and disease severity experienced in different age groups [15].
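Where individual-level covariates are available, one straightforward safeguard is to adjust for, or stratify by, such intrinsic variables in the association model. The sketch below shows a covariate-adjusted per-allele logistic regression on a hypothetical, randomly generated analysis table; the column names, covariates, and effect values are illustrative assumptions, not taken from any specific COVID-19 cohort.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000

# Hypothetical analysis table: one row per participant (toy data)
df = pd.DataFrame({
    "risk_alleles": rng.binomial(2, 0.2, n),   # genotype dosage under HWE
    "age": rng.integers(20, 90, n),            # age in years
    "sex": rng.choice(["F", "M"], n),
    "site": rng.choice(["A", "B", "C"], n),    # recruitment site / region
})
# Toy severity outcome depending on genotype and age
logit = -4 + 0.5 * df["risk_alleles"] + 0.03 * df["age"]
df["severe"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Per-allele (log-additive) effect adjusted for age, sex, and recruitment site;
# C(site) enters as fixed effects, absorbing coarse geographic differences.
fit = smf.logit("severe ~ risk_alleles + age + C(sex) + C(site)", data=df).fit(disp=0)
print(fit.summary())
```

Matching at the design stage on the same variables, or stratified analysis by site and age group, follows the same pattern; as noted above, the resulting odds ratios still require careful interpretation [28].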
The identification of genetic variation that directly impacts infection susceptibility and disease severity would be an important step towards personalized treatment, risk stratification, and the development and deployment of therapeutics and vaccines. However, limited phenotypic data and exposure/infection information significantly impair our ability to detect these variants, especially infection susceptibility alleles. As the community moves ahead with the proposal and implementation of these studies, careful consideration of designs that prioritize clarity in the interpretation of the findings, rather than expediency, will prove more beneficial and cost-effective in the long run.