Complex host genetics influence the microbiome in inflammatory bowel disease

Background Human genetics and host-associated microbial communities have been associated independently with a wide range of chronic diseases. One of the strongest associations in each case is inflammatory bowel disease (IBD), but disease risk cannot be explained fully by either factor individually. Recent findings point to interactions between host genetics and microbial exposures as important contributors to disease risk in IBD. These include evidence of the partial heritability of the gut microbiota and the conferral of gut mucosal inflammation by microbiome transplant even when the dysbiosis was initially genetically derived. Although there have been several tests for association of individual genetic loci with bacterial taxa, there has been no direct comparison of complex genome-microbiome associations in large cohorts of patients with an immunity-related disease. Methods We obtained 16S ribosomal RNA (rRNA) gene sequences from intestinal biopsies as well as host genotype via Immunochip in three independent cohorts totaling 474 individuals. We tested for correlation between relative abundance of bacterial taxa and number of minor alleles at known IBD risk loci, including fine mapping of multiple risk alleles in the Nucleotide-binding oligomerization domain-containing protein 2 (NOD2) gene exon. We identified host polymorphisms whose associations with bacterial taxa were conserved across two or more cohorts, and we tested related genes for enrichment of host functional pathways. Results We identified and confirmed in two cohorts a significant association between NOD2 risk allele count and increased relative abundance of Enterobacteriaceae, with directionality of the effect conserved in the third cohort. Forty-eight additional IBD-related SNPs have directionality of their associations with bacterial taxa significantly conserved across two or three cohorts, implicating genes enriched for regulation of innate immune response, the JAK-STAT cascade, and other immunity-related pathways. Conclusions These results suggest complex interactions between genetically altered host functional pathways and the structure of the microbiome. Our findings demonstrate the ability to uncover novel associations from paired genome-microbiome data, and they suggest a complex link between host genetics and microbial dysbiosis in subjects with IBD across independent cohorts. Electronic supplementary material The online version of this article (doi:10.1186/s13073-014-0107-1) contains supplementary material, which is available to authorized users.


Supplementary Materials
Complex host genetics influence microbiome profile in inflammatory bowel disease

Microbiome processing and feature extraction
Sequencing and reference mapping We extracted 16S ribosomal RNA from mucosal tissue biopsies as described previously [1]. We clustered sequences into operational taxonomic units (OTUs) using the QIIME [2] pick_otus.py script with method "uclust_ref" and the --suppress_new_clusters option, and all other default settings to map the sequences by references against the 97% similarity reference sequences in the 13_8 Greengenes release [3]. 91.1% of sequences matched the reference set. After mapping to the reference set, the samples contained 14,315,989 sequences, with 28,689.36 ± 29,889.65 sequences per sample (mean ± s.d.). For further analysis we subsampled 2,000 sequences, without replacement, from each sample to control for sequencing effort; this sequence depth is expected to be more than sufficient to measure shifts in dominant taxa [4]. We collapsed OTUs according to the Greengenes reference sequence taxonomy assignments, and carried forward taxon groups at all taxonomic levels.

Patient selection criteria
All patients were diagnosed with Crohn's disease or ulcerative colitis. Patients were between the ages of 18 and 75. For all analyses except the meta--analysis of clinical covariates and NOD2, we excluded patients with a history of antibiotic usage in the month preceding sampling. For the small number of patients that had more than one sequenced sample, we selected a non--inflamed biopsy first. In the rare case that there was more than one non--inflamed biopsy from the same general location we collapsed those samples together; in the rare case that these samples were from different general biopsy locations (terminal ileum versus colon) we selected terminal ileum first.

Pruning by correlation
In order to remove redundant signals from the data prior to performing genome-microbiome association tests, we calculated all--against--all Pearson's correlations within taxa and KEGG pathways. We then clustered these correlations using complete linkage at a threshold of 0.95 Pearson's correlation. Within each cluster we selected the taxon/pathway with the highest average abundance as the representative feature. Supplementary Table 6 contains a list of taxa that were clustered during this step.

Data transformation
To reduce heteroscedasticity while preserving zero--nonzero relationships within taxa, we applied the arcsine--square root transform to all taxon relative abundances, as described previously [5]. In subsequent linear association tests we excluded taxa with zero abundance in greater than 25% of samples to avoid the effects of zero--inflated distributions on the linear model. For all analyses we included only those taxa present in at least 10% of samples.

Calculation of between--subject distances (beta diversity)
We calculated principal coordinates of between--sample (beta) diversity for association testing with host genetic loci using the first three axes of principal coordinates analysis (Poi) via the standard R function cmdscale. We included several measures of between--

Supplementary Materials
Complex host genetics influence microbiome profile in inflammatory bowel disease 3 sample diversity. To measure the phylogenetic distance between samples we used weighted UniFrac [6] distances calculated with QIIME and the Greengenes reference phylogeny.

Immunochip processing and feature extraction NOD2 multiple variant aggregation
For NOD2 tests we extracted six known causal variants described previously [7]: rs2066844 (R702W), rs2066845 (G908R), rs5743277, rs5743293 (fs1007insC), rs104895431 and rs104895467. These variants are almost completely independent, and thus we were able to simply take the sum of risk alleles present at any of the six across the cohorts, and still have only subjects with zero, one, or two total risk alleles. We binned two subjects with 3 total risk alleles across the six variants with the 2--risk allele subjects. We then used these counts in subsequent tests for association with the microbiome. All six variants are location in NOD2 introns.

Immunochip fine mapping and credible set identification
The variants we tested for these genes were the top published variants in the largest GWAS analysis to date [8], or the best proxies by linkage disequilibrium (LD) present on Immunochip; however, as evidenced by our success with fine mapping of additional NOD2 variants, single variants may not represent the full signal of disease risk or microbiome interaction within their respective loci. In a cohort of 33,938 individuals (partially described elsewhere [8], the remainder unpublished) we used regions of high genotyping density on the Immunochip to identify additional independent signals of disease risk near six other SNPs (Supplementary Table 7). Although apart from NOD2 incorporating these additional variants had mixed results, we expect that as the number of testable loci grows with cohort sizes, the approach will be essential in identifying genome--microbiome interactions at complex risk loci. We performed fine mapping using a Bayesian based approach (manuscript in preparation). We constructed a posterior probability function of the independent associations given the observed phenotypes and genotypes. The most probable set of independent associations was identified by maximizing this posterior probability in using a greedy algorithm [9]. Uniform prior and steepest descent approximation were used in the calculations. The credible set of SNPs was defined as the minimal set of SNPs such that their sum of posterior probabilities is greater or equal to 95% [10]. SNPs within the credible set were ranked based on their posterior probabilities.

Clinical covariates
In all association tests we included linear covariates for antibiotic usage within the last month, immunosuppressant usage within the last month, biopsy inflammation status, age, at time of sampling, gender, general biopsy location (ileum, colon, or pre--pouch ileum), CD/UC diagnosis, disease location (Montreal classification [11] locations E1, E2, or E3 for UC; L1, L2, or L3 for CD), cohort membership (i.e. Boston vs. Toronto vs. Groningen). We excluded time since diagnosis due to high correlation with age (Pearson's correlation coefficient = 0.487; p < 2.2×10 --16 ). To avoid including redundant clinical covariates in the linear regression, we identified clusters of redundant variables by the amount of information they shared. To determine the level of informational overlap between 4

Supplementary Materials
Complex host genetics influence microbiome profile in inflammatory bowel disease covariates, we first recoded each categorical variable using multiple binary dummy variables, and discretized each continuous variable using the infotheo package in R [12]. We then calculated the maximum uncertainty coefficient [13] between each pair of variables. The uncertainty coefficient measures the amount of information in one variable that cannot be explained by another. For any group of variables in which each variable contained at least 75% of the information in all the others, we selected the single variable with highest entropy as a representative for the group. At this level all covariates were retained (Supplementary Figure 7).

Linear tests with arcsine--square root transform
Due to common occurrence of samples with zero relative abundance of a given taxon, taxon relative abundances are not readily amenable to power transform for normalization, therefore we apply the arcsine--square--root transform prior to regression, as done in previous microbiome studies [1,5,14]. The arcsine--square root transform applies naturally to values on the unit interval, such as relative abundances. It has the effect of stabilizing variance and reducing heteroscedasticity. It is similar to a log or power transformation in that smaller values are decompressed and larger values are relatively compressed, but it has the additional advantage that it handles zero values by definition. The general form of the model for the relative abundance of a given bacterial taxon in subject i regressed on genetic variant j, while controlling for various covariates, is therefore: where Aij is the risk allele count for genetic variant j in subject i (0, 1, or 2); Ci1, Ci2, … are the clinical covariates described in the main text. After applying the arcsine--square root transformation we performed standard multivariate regression in R[15] to test the null hypothesis that coefficient β1 was equal to zero.

Outlier detection and filtering
In every association test we excluded samples with values of the bacterial taxon relative abundance that exceeded three times the interquartile range beyond the median value, as estimated using the R function boxplot.

Microbiome--wide comparison across cohorts
To test for conservation of associations a particular genetic locus with the entire microbiome across cohorts, we first obtained individual p--values and regression coefficients for interaction of that locus with each individual microbiome feature. For a given locus we then selected the subset of microbiome features that were nominally significant (p < 0.05) in at least one cohort, and compared the directionality (signs) of the coefficients for corresponding microbiome features from one cohort to the other. We assessed the similarity of associations between cohorts using the phi coefficient, the Pearson's correlation of the positive/negative signs of the coefficients. We used the phi coefficient of the signs of the regression coefficients instead of correlation of the actual regression coefficients because the magnitude of the regression coefficients is highly correlated with the mean of the particular microbiome feature being considered. Using the sign tests for conservation of the directionality of the effect across cohorts. We identified genes whose microbiome associations were conserved between at least two cohorts (FDR < 0.05 when comparing Boston to Toronto or Boston to Netherlands), then constructed a gene interaction network from those using GeneMANIA [16]. The gene network displayed

Supplementary Materials
Complex host genetics influence microbiome profile in inflammatory bowel disease 5 excludes those genes with no interactions found (LFNG, VWC2). We used GeneMANIA to test for enrichment of functional modules at FDR < 0.05.

Imputation of missing clinical data
Recent antibiotics usage history was not collected for the Groningen cohort; we imputed these values using a Random Forests classifier [17] with default settings in the R randomForest package [18] trained on all taxa, functions, and clinical data for other subjects, with estimated accuracy of 87.3% (96.74% expected accuracy for true "No" subjects, 58.7% error for true "Yes" subjects); two of 135 Groningen subjects were predicted to have had recent administration of antibiotics.

Clinical covariate sensitivity analysis
We verified that effect sizes and directionality were highly conserved regardless of whether pre-pouch ileum samples were included or excluded (Spearman's rho = 0.92; p= 4.0×10 -6 for bacterial taxa; Supplementary 8) subjects over age 50 were excluded (Spearman's rho = 0.94; p= 4.1×10 -6 ; Supplementary Figure 9). We verified in a subset of 377 samples that the addition of covariates for mesalamine and steroid usage did not alter the directionality of any nominally significant results (directionality was conserved for 6 out of 6 nominally significant taxon associations with NOD2).

Supplementary Materials
Complex host genetics influence microbiome profile in inflammatory bowel disease Supplementary Figure 3. Detailed schematic of host genome--microbiome association testing methodology Host genome--microbiome interaction testing involves potentially thousands or millions of genetic polymorphisms and hundreds or thousands of bacterial taxa and genes. Full feature-by--feature interaction testing is likely to be underpowered in all but the largest cohorts or meta--analyses; therefore our methodology includes careful feature selection from both data types. Raw genetic polymorphisms were derived from Immunochip data and filtered by known IBD associations from large--cohort GWAS studies. NOD2 risk signal was calculated as the sum of 6 independent causal variants as described in the main text. SNPs were further filtered by minor allele frequency and genotyping rate. For each of the 163 previously published IBD--related SNPs, we counted the number of risk alleles for a given subject (0, 1, or 2). Microbiome operational taxonomic units (OTUs) were grouped by lineage at all taxonomic levels using Greengenes. These were filtered by prevalence (rate of presence) and mean relative abundance. Clinical covariates were collapsed when they contained redundant signals (see Methods); categorical variables were recoded as binary dummy variables. Finally, linear tests were performed for association of microbiome features with individual SNPs while controlling for clinical covariates.

Supplementary Materials
Complex host genetics influence microbiome profile in inflammatory bowel disease 9 Supplementary Figure 4

. Comparison of NOD2 association tests in subjects with UC versus subjects with CD
A value on the horizontal axis represents the (negative log) significance of associations between a bacterial feature and NOD2 including only samples from subjects with CD; a corresponding value on the vertical axis represents the (negative log) significance of the same test when including only subjects with UC. Although the two sets of subjects are completely independent, the two sets of test results are highly correlated. a, taxon--NOD2 coefficients (p = .006, Spearman's correlation 0.57).

Supplementary Materials
Complex host genetics influence microbiome profile in inflammatory bowel disease Supplementary Figure 5. Differentially abundant taxa on inflamed biopsies a, the species Bacteroides uniformis is significantly less abundant on inflamed biopsies; a number of unclassified OTUs from the genus Lactobacillus are significantly more abundant on inflamed samples. These effects are true at different biopsy locations. b, a plot of independent OTUs assigned to the unclassified Lactobacillus taxonomic group, showing consistently higher mean relative abundance in inflamed samples than in non--inflamed samples.

Supplementary Materials
Complex host genetics influence microbiome profile in inflammatory bowel disease 11 Supplementary Figure 6. Between--individual genomic distance correlates with microbiome distance Scatterplot of genetic distance between pairs of individuals (Manhattan distance) against distances between microbiomes of those pairs of individuals (Bray Curtis distance of genus-level taxa).