Genetic feature engineering enables characterisation of shared risk factors in immune-mediated diseases

Background Genome-wide association studies (GWAS) have identified pervasive sharing of genetic architectures across multiple immune-mediated diseases (IMD). By learning the genetic basis of IMD risk from common diseases, this sharing can be exploited to enable analysis of less frequent IMD where, due to limited sample size, traditional GWAS techniques are challenging. Methods Exploiting ideas from Bayesian genetic fine-mapping, we developed a disease-focused shrinkage approach to allow us to distill genetic risk components from GWAS summary statistics for a set of related diseases. We applied this technique to 13 larger GWAS of common IMD, deriving a reduced dimension “basis” that summarised the multidimensional components of genetic risk. We used independent datasets including the UK Biobank to assess the performance of the basis and characterise individual axes. Finally, we projected summary GWAS data for smaller IMD studies, with less than 1000 cases, to assess whether the approach was able to provide additional insights into genetic architecture of less common IMD or IMD subtypes, where cohort collection is challenging. Results We identified 13 IMD genetic risk components. The projection of independent UK Biobank data demonstrated the IMD specificity and accuracy of the basis even for traits with very limited case-size (e.g. vitiligo, 150 cases). Projection of additional IMD-relevant studies allowed us to add biological interpretation to specific components, e.g. related to raised eosinophil counts in blood and serum concentration of the chemokine CXCL10 (IP-10). On application to 22 rare IMD and IMD subtypes, we were able to not only highlight subtype-discriminating axes (e.g. for juvenile idiopathic arthritis) but also suggest eight novel genetic associations. Conclusions Requiring only summary-level data, our unsupervised approach allows the genetic architectures across any range of clinically related traits to be characterised in fewer dimensions. This facilitates the analysis of studies with modest sample size by matching shared axes of both genetic and biological risk across a wider disease domain, and provides an evidence base for possible therapeutic repurposing opportunities.


Mathematical exposition of basis construction
Our input data are matrices of GWAS regression coefficientsβ ti and their standard errors σ 2 ti where t = 1, . . . , T indexes traits, i = 1, . . . , p indexes SNPs, and there are no missing values by design of picking only SNPs with complete data across all traits.
We first generate trait-specific weights for each SNP, using approaches developed for fine mapping.
Wakefield [1] showed that the Bayes factor for SNP association could be approximated from GWAS summary statistics by BF ti = P (β ti |β ti ∼ N (0, σ 2 ti + W )) P (β ti |β ti ∼ N (0, σ 2 ti )) where W is the variance of a prior on the true effect, and for which we set W = 0.04, a commonly adopted value (see, for example, [2]), corresponding to a prior belief that true odds ratios exceed 1.5 with a probability about 5%.
We assume the genome is partitioned into R non-overlapping sets of SNPs, R 1 , . . . , R R , according to the location of recombination hotspots, such that each SNP belongs to exactly one region, and we write i ∈ R r if the i-th SNP belongs to the r-th region. We use the notation r(i) to denote the region to which SNP i belongs; i.e. r(i) = R j ⇐⇒ i ∈ R j . Then, under the assumption that at most one variant in a region R r is causal, and that each such variant (if it exists) appears in the dataset, the posterior probability for each SNP i ∈ R r to be causal is [3] pp ti = πBF i (1 − m r π) + i∈Rr πBF i where m r = |R r | is the number of SNPs in region R r and we set the prior probability that any SNP in the dataset is causally associated with any trait to be π = 10 −4 , corresponding to a prior belief that 1 in 10,000 SNPs across the genome is causal. [2] Note that the single causal variant assumption allows us to express this probability without reference to LD. [3] Although these trait-specific weights, pp ti , were developed for fine-mapping, we emphasise that we are not fine mapping here, with only a subset of SNPs in our dataset.
However, these weights are attractive because: (i) they tend to 0 forβ ti not distinguishable from 0; and (ii) they deal naturally with LD, since they are a function of the complete data likelihood for any recombinationhotspot defined region under the assumption the region contains a single causal variant.
We estimate the probability that the r-th region contains a causal association for the trait t by summing across SNPs in that region v tr = i∈Rr pp ti and use these to generate a final SNP weight which is a weighted average over pp ti Each SNP also has a different variance in the population, 2f i (1 − f i ), and this contributes to the variance where f i is that SNP's minor allele frequency, and we use this to generate a T × p matrix of shrunk regression coefficients to which we add a T + 1 th row of zeroes, representing controls. A centred version of this matrix forms our final matrix for PCA decomposition Standard PCA may be used to project G C into a T -dimensional space by post-multiplying by the p × T projection matrix Q, whose columns are the first T eigenvectors of (G C ) G C , to obtain To project a new trait into the space, we require a vector of regression coefficients for that trait across the same set of SNPs,β 0 = (β 0i , i = 0, . . . , p). Typically these will be log OR for logistic regression, but they could also be from linear regression or any generalised linear model. Eachβ 0i , has a corresponding estimate of variance ν 2 0i , and the variance-covariance matrix of standardisedβ 0 , Z = (β 0i /ν 0i ) is Σ, the correlation matrix of genotypes indexed by SNPs inβ 0 [4]. The trait is then projected onto the basis according to the linear function The null location in this space (i.e. the projection of a zero vector) is given by X 0 = −C Q, so that the difference between the projection of the new trait and control is given byδ = P 0 − X 0 , which also has variance var(P 0 ) as only P 0 is random. We noteδ = Dβ 0 is thus a weighted average of effect sizes for the test trait. Analogous to standard GWAS hypothesis tests, which test null hypotheses of the formβ 0 = 0, we can testδ = 0 for each component. To simultaneously test the null hypothesis that the estimand δ is zero across all components, a statistic, X 2 overall , can then be defined as if the null hypothesis is true.
Similarly, we may define statistic X 2 j to allow us to test the hypothesis that component j of δ is zero: if the null hypothesis is true.
Due to differential genotype coverage between input traits there is a tension between the number of traits and the density of variants that can be included in the basis. To investigate the dependence of the basis on variant density we created two shrinkage bases using six of the original input studies that used imputation: a 'sparse' basis using a subset of 265,887 SNPs used in the main text basis and a 'dense' basis using 4,623,330 SNPs (non-palindromic, MAF>0.01) common across the six studies. We found that despite the dense basis containing an order of magnitude more SNPs the PC scores for input traits were highly correlated between bases ( Figure SN 1). Genotype coverage varied across the traits we projected onto the basis (Supplementary Table 3). By default, we set missing values to zero, and investigated the effect of this on resultant projections as follows.
For each trait missing at least one basis SNP, we constructed an alternative basis using only SNPs present in all 13 basis traits and the trait to be projected. After projecting the target trait onto this alternative 'tailored' basis we assessed the difference in PC scores across components (Figure SN 2). We found in a majority of traits, PC scores were only nominally affected. Projections of NMO and PsA (Aterido) were substantially different in the tailored basis, mostly due to attenuated and thus conservative PC scores for the main basis.
We concluded that these traits were sensitive to missing data and imputed summary statistics as described in Methods.  Supplementary Table 3. Traits with the largest deviations from x = y are labelled and were subsequently imputed as described in Methods. about their true value, 0, as sample size increases. A vector of 0 is the limit as sample size tends to infinity of any such simulated GWAS. We illustrated this by simulating null case control data for the same SNPs used in the IMD basis with MAF and LD information derived from EUR samples in 1000 Genomes using simGWAS [5]. The distributions, tend to cluster closer to zero ( Figure SN 3), although varying this sample size had only a very limited impact on the structure identified and recovered by the basis, as shown by the dendrograms for different null GWAS sample sizes ( Figure SN 4).

Validity of basis for different ancestry datasets
While our basis was created from predominantly European GWAS, there is an imperative to increase ancestry diversity in GWAS [6].
This was supported by a broader comparison of two releases of UKBB summary statistics. The Neale compendium used above was chosen because the traits analysed are not filtered on case numbers, allowing us to consider IMD with small numbers of cases. Whilst the Neale compendium focuses on the European subset of UKBB (n approx = 360,000) an alternative resource, GeneAtlas, (http://geneatlas.roslin.ed.ac.uk/)(Canela-Xandri, Rawlik, and Tenesa 2018) uses all available UKBB subjects (n approx= 452,000), and a linear mixed model to adjust for population stratification. GeneAtlas projections were generally proportional to Neale (no significant deviation from proportionality identified) but attenuated (median ratio=0.89). This suggests the mix of non-European samples leads to an attenuation of signal compared to European-only. However, the larger sample size in GeneAtlas compensated for the attenuation such that GeneAtlas results showed a tendency to greater significance ( Supplementary Figures). Overall, this suggests that results projecting non-European or GWAS studies on to a European basis may reduce power for the same sample size, but does not lead to invalid results.