An atlas connecting shared genetic architecture of human diseases and molecular phenotypes provides insight into COVID-19 susceptibility

Background While genome-wide association studies (GWAS) have successfully elucidated the genetic architecture of complex human traits and diseases, understanding the mechanisms that lead from genetic variation to pathophysiology remains an important challenge. Methods are needed to systematically bridge this crucial gap to facilitate experimental testing of hypotheses and translation to clinical utility.
Results Here, we leveraged cross-phenotype associations to identify traits with shared genetic architecture, using linkage disequilibrium (LD) information to accurately capture shared SNPs by proxy and to calculate the significance of enrichment. This shared genetic architecture was examined across differing biological scales by incorporating data from catalogs of clinical, cellular, and molecular GWAS. We have created an interactive web database (interactive Cross-Phenotype Analysis of GWAS database (iCPAGdb)) to facilitate exploration and allow rapid analysis of user-uploaded GWAS summary statistics. This database revealed well-known relationships among phenotypes and generated novel hypotheses to explain the pathophysiology of common diseases. Application of iCPAGdb to a recent GWAS of severe COVID-19 demonstrated unexpected overlap of GWAS signals between COVID-19 and human diseases, including with idiopathic pulmonary fibrosis driven by the DPP9 locus. Transcriptomics from peripheral blood of COVID-19 patients demonstrated that DPP9 was induced in patients with SARS-CoV-2 infection compared to healthy controls or those with bacterial infection. Further investigation of cross-phenotype SNPs associated with both severe COVID-19 and other human traits demonstrated colocalization of the GWAS signal at the ABO locus with plasma protein levels of a reported receptor of SARS-CoV-2, CD209 (DC-SIGN). This finding points to a possible mechanism whereby glycosylation of CD209 by ABO may regulate COVID-19 disease severity.
Conclusions Thus, connecting genetically related traits across phenotypic scales links human diseases to molecular and cellular measurements that can reveal mechanisms and lead to novel biomarkers and therapeutic approaches. The iCPAGdb web portal is accessible at http://cpag.oit.duke.edu and the software code at https://github.com/tbalmat/iCPAGdb.
Supplementary Information The online version contains supplementary material available at 10.1186/s13073-021-00904-z.

are the R statistical programming language (R Core Team, 2020), the R package Shiny for interaction of web pages with R scripts (Chang et al., 2020), Shiny Server as a 24/7 multi-user platform to make Shiny apps publicly accessible (RStudio, 2020), the database environment SQLite for efficient querying of GWAS and CPAG results (Hipp, 2020), and the R package RSQLite to execute SQL queries from within R scripts (Muller et al., 2020). CPAG computations, as described in the iCPAGdb section of the paper and in Wang et al. (2015), are implemented in the Python programming language, external to the web app, and are executed using parameterized system calls constructed from values supplied by the user through on-screen controls. The results of a CPAG execution are read by the R script, processed, and presented to the viewer in tables and graphs on a web page. Shiny conducts the interaction between web pages, R scripts, and the CPAG functions. The SQLite database is primarily used by the CPAG functions, but is also a source of identifiers and other values used to populate selection lists within the web app.
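As an illustration of the last point, the following minimal sketch shows how selection-list values might be drawn from an SQLite database. The app itself does this in R through RSQLite; Python's standard sqlite3 module is used here only for illustration. The table name `gwas`, its columns, and the toy rows are assumptions made for this example and are not the actual iCPAGdb schema or data.

```python
import sqlite3


def list_gwas_sources(conn):
    """Return distinct GWAS source labels, sorted, e.g. for a drop-down list."""
    rows = conn.execute(
        "SELECT DISTINCT source FROM gwas ORDER BY source"
    ).fetchall()
    return [row[0] for row in rows]


def phenotypes_for_source(conn, source):
    """Return the phenotypes of one source; the value is bound as a parameter."""
    rows = conn.execute(
        "SELECT DISTINCT phenotype FROM gwas WHERE source = ? ORDER BY phenotype",
        (source,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical in-memory stand-in for the real database (schema is assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gwas (source TEXT, phenotype TEXT, snp TEXT, p REAL)")
conn.executemany(
    "INSERT INTO gwas VALUES (?, ?, ?, ?)",
    [
        # Toy rows for illustration only; not real GWAS results.
        ("clinical", "phenotype A", "snp1", 1e-9),
        ("clinical", "phenotype A", "snp2", 3e-8),
        ("molecular", "phenotype B", "snp3", 2e-12),
    ],
)

print(list_gwas_sources(conn))
print(phenotypes_for_source(conn, "clinical"))
```

Binding the user's selection with a `?` placeholder, rather than pasting it into the SQL string, keeps values supplied through on-screen controls from being interpreted as SQL.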
Program flow 1 outlines the major processing steps for review mode. The user requests an existing CPAG result set, from which a corresponding table and heatmap are generated and displayed. Filtering and graph construction controls allow iterative subsetting of the data and selection of the significance measure and the number of top significant phenotype pairs to plot. The "Download" button lets the researcher save a local copy of the records appearing in the currently displayed results table. Important packages used in this mode are DT (Xie et al., 2020) for construction of and interaction with tables, and ggplot2 (Wickham, 2016), plotly (Sievert, 2020), and heatmaply (Galili et al., 2017) for basic plotting, interactive plotting (hover labels), and heatmap generation, respectively. Although these packages are relatively full-featured and robust, several custom algorithms had to be developed to overcome specific limitations of some of their functions when used with our data. For example, the clustering algorithm used by the heatmaply() function to add dendrogram lines to a graph generated errors when used with the distance matrix computed by the base R dist() function. A substitute vector distance algorithm was developed to overcome the domain errors reported by heatmaply().
Figure 1 shows an example review session. A results table appears in the "Table" tab (bottom of the image), and a corresponding heatmap that relates pairs of inter-GWAS phenotypes by the selected significance measure (Fisher, Bonferroni, etc.) appears in the "Heatmap" tab. Figures 2 and 3 show the table and heatmap that appear after clicking the "Molecular traits vs. Human disease" row of the selection table, which is accessed by scrolling down.
Program flow 2 outlines the major processing steps for compute mode. Figure 4 shows the corresponding "Upload and Compute" tab of the app.
In this mode, the user browses files on a local computer, selects a properly formatted GWAS result file of interest (containing SNPs and GWAS significance (p) values for a single phenotype), specifies the format and column configuration, then uploads the file. Next, CPAG computation parameter values are specified, including the iCPAGdb GWAS set to be crossed with, significance thresholds for filtering, and the linkage disequilibrium (LD) population. When "Compute CPAG" is pressed, the R script composes a system-level command to execute the CPAG (Python) function. The future() function of the R future package (Bengtsson, 2020), combined with a delaying pipe from the promises package (Cheng, 2020), executes CPAG operations asynchronously, so that the R script resumes processing of a request only once its CPAG results are available.
An important consideration is that R by default executes instructions synchronously, so that one instruction completes entirely before another begins. This is problematic in a multi-user setting when long-running computations are executed, and typical CPAG execution time ranges from thirty seconds to ten minutes. Although (the open source version of) Shiny Server accommodates multiple simultaneous connections, it allocates a single R process to each application, and iCPAGdb is one application. Therefore, with synchronous execution, at most one of the simultaneous iCPAGdb users is being serviced at any given time, and the others must wait until that user's CPAG analysis completes. By executing the CPAG function from within a future() call and promise pipe sequence, however, R spawns an individual, asynchronous process for each CPAG execution, enabling multiple simultaneous users. This behavior must be programmed explicitly, and scripts must be adjusted to account for the delayed and unpredictable arrival of results.
In addition to one CPU being allocated for each asynchronous CPAG call, the CPAG function itself executes as a multi-threaded process on multiple CPUs. It is important to consider expected load (number of simultaneous users and number of CPUs used in parallel) when configuring a server to be used for public compute services.
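The asynchronous pattern described above can be sketched in Python as a rough analog of the R future() and promise pipe combination; it is not the app's actual code. A worker pool launches each external job as a separate process and invokes a callback when results arrive, so one user's long-running job does not block another's. The function names, the argument layout, the "EUR" and "AFR" population labels, and the one-line stand-in for the CPAG program are all assumptions made for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_cpag_command(gwas_file, p_threshold, ld_population):
    """Compose an argument list from user-supplied values (no shell involved)."""
    # Stand-in for the real CPAG program: a one-liner that echoes its arguments.
    stand_in = "import sys; print('CPAG done:', *sys.argv[1:])"
    return [sys.executable, "-c", stand_in,
            gwas_file, str(p_threshold), ld_population]


def run_cpag(command):
    """Blocking call that runs the external process and returns its output."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


def submit_cpag(executor, command, on_result):
    """Start the job in a worker and hand its result to a callback when ready."""
    future = executor.submit(run_cpag, command)
    # Plays the role of the promise pipe: runs only after the job completes.
    future.add_done_callback(lambda f: on_result(f.result()))
    return future


# Two "users" submit jobs; neither submission blocks the other.
results = []
with ThreadPoolExecutor(max_workers=2) as executor:  # cap on simultaneous jobs
    for name, population in [("user_a.tsv", "EUR"), ("user_b.tsv", "AFR")]:
        command = build_cpag_command(name, 5e-8, population)
        submit_cpag(executor, command, results.append)
# Leaving the "with" block waits for both workers and their callbacks to finish.
print(sorted(results))
```

The `max_workers` value is the knob that corresponds to the load consideration above: it bounds how many external jobs run at once, and each of those jobs may itself use several CPUs.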
Program flow 2, compute mode

Upload a GWAS section
Researcher:
1. Browses local computer for a GWAS file and uploads it
2. Specifies file structure (delimiter type, GWAS SNP and significance columns)

Compute section
Researcher:
3. Selects iCPAGdb GWAS study (GWAS source two) to be used for investigation of cross-GWAS associations
4. Selects p-thresholds (to filter phenotypes and SNPs) for each GWAS and specifies a linkage disequilibrium (LD) population

6. Validates the uploaded file (verifies that specified columns and delimiters are present)
Creates an asynchronous future() environment for CPAG execution
9. Passes control to the CPAG function and waits on returned (promise piped) results
Generates a table of results and heatmap using current on-screen configuration values
Researcher:
11. Reviews and interacts with results as on the Review tab

In initial testing, the app was found to perform efficiently, giving researchers convenient access to both precomputed CPAG results and newly computed values for uploaded GWAS data. Cross-phenotype relationships presented by the app agree with those known to researchers with expert knowledge of the data sets employed during verification. It is hoped that the app will become a recognized and useful tool for researchers conducting exploratory cross-phenotype analysis of GWAS.
Complete R and Shiny scripts for the iCPAGdb web app, along with additional design information, are available at https://github.com/tbalmat/iCPAGdb.