Identifying Crohn’s disease signal from variome analysis

Background After many years of concentrated research efforts, the exact cause of Crohn’s disease remains unknown. Its accurate diagnosis, however, helps in management and even preventing the onset of disease. Genome-wide association studies have identified 140 loci associated with CD, but these carry very small log odds ratios and are uninformative for diagnoses. Results Here we describe a machine learning method – AVA,Dx (Analysis of Variation for Association with Disease) – that uses whole exome sequencing data to make predictions of CD status. Using the person-specific variation in these genes from a panel of only 111 individuals, we built disease-prediction models informative of previously undiscovered disease genes. In this panel, our models differentiate CD patients from healthy controls with 71% precision and 73% recall at the default cutoff. By additionally accounting for batch effects, we are also able to predict individual CD status for previously unseen individuals from a separate CD study (84% precision, 73% recall). Conclusions Larger training panels and additional features, including regulatory variants and environmental factors, e.g. human-associated microbiota, are expected to improve model performance. However, current results already position AVA,Dx as both an effective method for highlighting pathogenesis pathways and as a simple Crohn’s disease risk analysis tool, which can improve clinical diagnostic time and accuracy.


1
AVA,Dx pipeline. We constructed the AVA,Dx pipeline as outlined in Fig. 1 (a 2 more detailed pipeline in Additional file 1: Figure 1) and Methods. Briefly, we 3 cleaned up the data, performed predictions of functional effects of variants [11, 12] 4 and converted the latter into per-gene scores of individual-specific functionality. 5 To build predictive models we considered externally determined disease genes 6 (e.g. GWAS and literature-identified genes) and extracted our own disease-gene 7 sets via computational feature selection (FS). Note that in this manner we 8 identified previously unreported CD-related genes. Support Vector Machine[13] 9 (SVM) models using these gene sets were tested in leave-one-out cross-10 validation on the training data and on (batch effect corrected) previously unseen 11 data. In permutation testing, we showed that our prediction performance was 12 significantly non-random. 13 in both CD-train and CD-test panels. In evaluating this observation, we found that 20 this difference was largely due to the difference in missing variant calls. There 21 are two ways to address missing data [16,17]: remove individuals with missing 1 disease signal, for each gene we calculated scores that (1) give different weights 2 to different variant types, based on the severity of their effects on protein 3 function [11,12] (gene_score). Additionally, (2) easy_gene_score that assigns a 4 fixed value for all types of exonic variants, (3) binary_gene_score that uses 5 binary effect/no-effect predictions for non-synonymous variants, (4) exonic 6 variant burden (EV_gene_score), and (5) non-synonymous variant burden 7 (NSV_gene_score) versions of gene scoring were also evaluated. We built SVM 8 models in leave-one-out cross-validation on the CD-train panel, using randomly 9 chosen subsets of various sizes of all human genes (ALL set), as well as the 10 genes mapping to the known GWAS-established CD loci[3] (GWAS set), 11 literature-annotated CD genes (MeSH set), and Swiss-Prot annotated CD genes 12 (SP set). All model building procedures and data sets are described in detail in 13 the Methods section. 14 As expected, models that were built on the ALL set genes were similar in 15 performance to random guessing with any gene_score scheme. Increasing the 16 number of genes per subset used for model building did not improve 17 performance (Additional file 1: Figure 4 to 8). Interestingly, GWAS and MeSH 18 sets had similar results as the ALL set ( Fig. 2A, Additional file 1: Figure 4 to 8). 19 The best performance was achieved by using all 18 SP genes (SP max ) with 20 NSV_gene_score (ROC AUC = 0.55 and area under precision-recall curve, PR 21 AUC = 0.66, Additional file 1: Figure 8). While disappointing, the inability of these 22 models to differentiate healthy individuals from disease affected ones, suggests 23 that there is no sequencing or scoring artifact that could differentiate CDs and 1 HCs in CD-train panel. built SVM models for leave-one-out cross-validation using top ranked subsets of 8 genes in various sizes. The Pascal-ranked genes achieved much better 9 performance than other external sets (Fig. 2B). Using gene_score our models 10 achieved the highest ROC AUC of 0.70 (PR AUC = 0.73, with PascalGWAS 175 11 genes, i.e. top-ranked 175 genes from PascalGWAS set). By further permuting 12 the CD/HC labels in cross-validation (Methods), we empirically showed that the 13 performance of our models is significantly non-random (ROC/PR permutation p-14 val = 0.001/0.011). The NSV_gene_score had slightly worse but still significant 15 performance (ROC/PR AUC = 0.66/0.74, achieved with PascalGWAS 75 genes, 16 permutation p-val = 0.013/0.001). The other three scoring schemes 17 (easy_gene_score, binary_gene_score and EV_gene_score) had better 18 performance with Pascal-ranked genes than with the other external sets, but they 19 did not perform as well as gene_score and NSV_gene_score (Additional file 1: 20 Figure 4-8). Thus, from here on we only considered the gene_score and the 21 NSV_gene_score for all model building. Note that here and in all fixed gene sets, 22 overfitted to the data, as feature selection was performed in a leave-one-out 1 fashion, i.e. excluding the testing individual. Thus, our results suggest that FS 2 selected genes can differentiate CDs from HCs in our data, particularly using 3 signal that is not available to the statistics-based methods like GWAS. 4 The KS5 genes NSV_gene_score selected genes (28-41 genes per iteration) 5 failed to differentiate CDs from HCs in CD-train (ROC/PR AUC = 0.32/0.47). The 6 DKMcost set was better, but also performed poorly (ROC/PR AUC = 0.65/0.73; 7 Additional file 1: Figure 12). These results suggested that gene_score is more 8 informative than NSV_gene_score, highlighting the importance of using the 9 severity of functional effects of variants in evaluating disease genetics. 10

Feature selection identifies known and previously undescribed CD genes. 11
In the leave-one-out cross-validation, PascalGWAS 175 , DKMcost 125 and KS5 max 12 sets performed well in differentiating CDs from HCs in the CD-train panel (Fig. 2). 13 Interestingly, the KS5 max and DKMcost 125 sets only overlapped with 14 PascalGWAS 175 genes by one to five and two to four genes (average p-val = 15 0. 34 and 0.25, hypergeometric test[20], background of ALL genes,  3) suggested additional genes/proteins, which were not found by FS but may be 21 relevant to CD; e.g the TAF1 and the HNF4A transcription factors regulate many 22 DKMcost 125 genes, including the infamous NOD2 [22]. HNF4A was annotated as 23 1 further evaluation, but preliminary analysis shows that it contains a bromodomain, 2 which may be critical in inflammation in general and bowel inflammation 3 specifically in CD [44,45]. and late onset of CD. To test this, we divided the CD-train panel into CD-train early 8 and CD-train late according to the onset age of CD individuals (Methods); note that 9 both panels contained the same 47 HC individuals. We performed feature 10 selection on CD-train early and CD-train late panels separately and evaluated the 11 performance of models trained with the selected genes (Additional file 1: Figure  12 15 and 16). The top-ranked DKMcost 275 and 200 genes from CD-train early and 13 CD-train late (Additional file 13), respectively, were used for gene pathway 14 overrepresentation analysis. In addition to some of the same pathways identified 15 for the entire CD-train panel, DKMcost 275 genes from CD-train early were 16 significantly enriched in endosomal/vacuolar pathway, antigen presentation, 17 hedgehog pathway, graft-versus-host disease, type 1 diabetes mellitus, 18 proteasome degradation, etc. (Additional file 14). However, no known molecular 19 pathway was significantly enriched in DKMcost 200 genes from CD-train late . These 20 results suggest that genetically encoded pathways of CD pathogenesis are more 21 obvious in the early-onset patients, while late onset CD may be more 22 environmentally (or microbially) driven -a theory that we will test with more data 1 in the future. 2 Batch effect across panels. While filtering removed the overall variant count 3 discrepancies between HCs and CDs within a panel, differences between panels 4 remained. Regardless of health status, individuals from CD-train had significantly 5 more variants than those from CD-test (Additional file 15 and 16). Furthermore, 6 there were differences in variant burdens by variant type; e.g. CD-train had 7 significantly more synonymous and heterozygous non-synonymous SNVs and 8 less stop-losses and heterozygous stop-gains than CD-test. In total, CD-train and 9 CD-test shared 98,795 SNVs, while, additionally, CD-train had 74,218 (42.9%) 10 and CD-test had 30,198 (23.4%) unique SNVs. This "batch effect" may indicate, 11 among other issues, a difference in coverage of sequence regions of different 12 sequencing platforms [50,51]. 13 To evaluate the severity of the batch effect, we combined the CD-train and CD-14 test individual ALL gene_score profiles into one panel and performed clustering 15 using pam [52]. Individuals clustered precisely according to batch (Additional file 1: 16 Figure 17), indicating that obvious batch differences that would be likely to 17 negatively impact prediction. To proceed with testing our CD prediction models 18 on previously unseen data we had to first remove the batch effect. We first 19 evaluated using the ComBat[53] algorithm on the combined panel of CD-train 20 and CD-test (Methods). After adjustment of the gene_score value by ComBat, we 21 were no longer able to separate batches (35.87% agreement between cluster 22 members and batch labels, Additional file 1: Figure 18). Note that the originally 23 selected DKMcost 125 genes still maintained their ability to differentiate CD-train 1 CDs from HCs in leave-one-out cross-validation after ComBat adjustment 2 (ROC/PR AUC = 0.74/0.80, Additional file 1: Figure 19). 3 Clinical application of AVA,Dx. To apply our method in a real-life situation, 4 where one individual is evaluated for CD from his/her exome sequencing data, 5 we built PascalGWAS 175 and the best-performing DKMcost 125 gene-based 6 prediction models using the entire CD-train panel. Note that DKMcost 125 genes 7 selected here and were not identical to the ones used in cross-validation (>77% 8 overlap). For prediction of each individual from the CD-test panel, we applied 9 ComBat[53] to adjust gene_scores of the entire CD-train panel and the one 10 individual, respectively (Methods). 11 To select the default cutoff in AVA,Dx score for calling an individual healthy or 12 CD-affected we plotted the prediction scores for each individual from CD-train in 13 cross-validation and selected a cutoff that best differentiated CDs from HCs 14 After years of study on the subject and numerous promising findings, CD risk 2 prediction from genetic information still remains a problem. We developed 3 AVA,Dx, a machine learning method that uses individual exome data of a panel 4 of Crohn's disease patients and healthy individuals to select CD-relevant genes 5 and, potentially, predict the health status of previously unseen individuals. We 6 first identified the functional effects [11,12] of exome SNVs and combined them 7 to create gene scores, indicative of gene functional deficiencies. This approach 8 efficiently decreases the dimensionality of data from considering all exome 9 variants (173,013 variants) to focusing only on affected genes (13,957 genes). 10 Additionally, FS techniques highlight the disease-related genes thus further 11 reducing the dimensionality of data. While our method currently only considers 12 coding variants, the path to integrating other CD-relevant types of variants [54], 13 e.g. splice site, regulatory, etc., into gene scoring is also clear. 14 The main idea behind AVA,Dx is that disease-causing variation is likely to be 15 functionally detrimental to affected genes/pathway components. To evaluate 16 whether molecular function disruption is an important indicator of gene 17 involvement in disease we tested a number of scoring schemes for evaluating 18 variant effect. As expected, models built with PascalGWAS genes performed 19 significantly better than random regardless of the scoring scheme, particularly for 20 the larger CD-train panel. These results indicate both that GWAS (with Pascal 21 filtering) indeed captures CD-association successfully and that the captured 22 signal is likely that of association, not causation. On the other hand, gene_score-23 based models performed significantly better than all other schemes when used 1 with FS gene sets, indicating that severity of the variant functional effect 2 contributes to annotating genes involved in causing disease. 3 In earlier work, logistic regression on GWAS data from a CD panel of ~13,000 4 individuals attained a ROC AUC of 0.86 [8]. However, using only ~1,300 people, 5 reduced performance quite significantly (ROC AUC = 0.60). The number of 6 individuals in this type of study clearly contributes heavily to the success of CD 7 risk prediction. On the other hand, AVA,Dx reached its peak performance (ROC 8 AUC=0.74) using no non-coding variants and only 111 people -less than a tenth 9 involved in GWAS. Interestingly, the logistic regression model described in Wei 10 et al. [8] (albeit limited by using exonic variation only) was able to correctly identify 11 46 of 64 patients in our cohort, but misidentified 26 of 47 healthy individuals as 12 affected by Crohn's Disease. AVA,Dx identified just one more patient correctly, 13 but it did so at significantly higher accuracy -mislabeling eight people fewer than 14 logistic regression analysis. 15 Large sample size requirements and use of common SNPs only, limit 16 applicability of GWAS. Causative factors of CD remain unidentified by GWAS-17 based models even as their predictive accuracy improves. For instance, some 18 disease-related genes may not be found simply because they contain variants 19 not covered by the SNP-array or because there were not enough people in the 20 study. As mentioned above, GWAS associations are also often markers of 21 disease, rather than causes. Our FS genes (e.g. KS5 max and DKMcost 125 ), on the 22 other hand, are selected based on the functional changes of the genes, 23 descriptive of separating CD-affected individuals from healthy controls. This 1 approach is not only more precise than GWAS, as illustrated by our model 2 performance (0.74 vs. 0.70 for DKMcost 125 and PascalGWAS 175 models, 3 respectively), but also informative of pathogenicity pathways (Additional file 12). 4 Interestingly, while FS and external gene set-based models both perform well, 5 the sets did not have much overlap, suggesting that FS identifies previously 6 unknown CD-related genes. We also note that for other complex or rare diseases, 7 where GWAS data is not available or informative, AVA,Dx may work to predict 8 health-status and identify pathogenicity pathways based on even a small number 9 of whole exome sequences. 10 To summarize, we developed AVA,Dx, a method that uses exome variant-11 caused gene functional changes to identify disease-related genes and makes to 12 make health status predictions. Our method is able to use less than five percent 13 of the people normally involved in a GWAS study to identify disease-genes and 14 to make fairly accurate CD predictions for previously unseen individuals. Notably, 15 AVA,Dx appears to be fairly robust to differences in panels and in 16 sequencing/filtering methods, making it a potentially clinically useful tool. 17 Furthermore, the CD-genes we identify appear to be relevant to CD, as indicated 18 by the matches of our pathways to known work, and yet significantly different 19 from those highlighted by GWAS. Thus, AVA,Dx presents an orthogonal way for 20 identifying disease-related genes, while avoiding the most severe research 21 limitation -the requirement of a large study panel. Note that while our results 22 indicate that a larger panel would be beneficial for our method's predictive 23 capacity, they also suggest that significant gains can be achieved by simply 1 doubling the number of people in the study to use less than a tenth of what a 2 GWAS study would require. Looking ahead, it is clear that increasing panel size 3 and including the effects of regulatory variants is likely to improve performance. 4 Moreover, we suggest that the AVA,Dx approach to model building is not limited 5 to Crohn's Disease, but is rather applicable to a wide spectrum of genetically 6 linked, potentially rare and complex, diseases. employed to identify true polymorphisms in the samples rather than those due to 20 sequencing, alignment, or data processing artifacts. 21 For each VCF file we ran ANNOVAR[57] to identify all variants mapping to 22 Swiss-Prot[58] proteins. Specifically, we extracted the RefSeq mRNA identifiers 23 from ANNOVAR output and mapped these to Swiss-Prot. Note that if a single 1 variant mapped to more than one protein, all proteins were included into the 2 affected set. 3

Individual relationships.
To confirm individual relationships, we used 4 hierarchical cluster analysis (from R package SNPRelate[59]) on the individual 5 SNP dissimilarity matrix to draw dendograms. The dendogram of CD-train 6 implied family relationships for individuals S076&S111 and S087&S110 7 (Additional file 1: Figure 22), which were not initially recorded in the individual 8 information. In the analyses throughout the study, we considered each pair of 9 them as a family and the individuals from one family were left out as one-fold in 10 the cross-validation analysis (see FS candidate gene set extraction and CD 11 model training). The individual relationships from CD-test were confirmed to 12 include 28 families (Additional file 1: Figure 23). 13 Data filtering. We removed all variant calls on the X-and Y-chromosomes, as 14 well as mitochondrial DNA variants. We then filtered the original VCF files with 15 VQSR and retained only the PASS variants. We computed the number of 16 different types of variants in each panel (Additional file 2 to 5) separately for HC 17 and CD individuals and tested the difference between them using the 18

Kolmogorov-Smirnov test[15]. Since we found significant differences between 19
CDs and HCs within the same data panel (Additional file 2 and 3) and across 20 panels (Additional file 15), we further cleaned the data to remove all variants with 21 missing calls (identified as ./.). Removal of these loci ensured that every 22 individual has a confident call at every locus of the same panel. Gene scoring. We first checked the Swiss-Prot protein sequence for 9 correspondence; i.e. we looked for the variant-defined wild-type residue to exist 10 in the variant sequence position. If the position contained the mutant amino acid 11 instead, we assumed allele disagreement between reference databases RefSeq 12 and Swiss-Prot. For these variants, we chose the RefSeq sequence to be correct 13 and replaced the amino acid in the Swiss-Prot sequence to correspond to 14 RefSeq. We then computed the raw SNAP[11] score for each variant, ranged 15 from -100 to 100, where any score less than or equal to zero is classified as 16 neutral, i.e. no protein function change, and non-neutral otherwise. 17 An individual variant score was assigned as follows, for: 18 (1) non-synonymous variants 19 a. SNAP score >=0 (effect): v_score=0.06+(SNAP score/100)*0.94 20 b. SNAP <0 (neutral): v_score = 0.055 21 (2) synonymous variant, v_score = 0.05 22 (3) InDel variants, v_score = 1 23 (4) erroneously mapped variants and variants in 11 genes that could not be 1 handled by SNAP (genes > 6000 amino acids), v_score = 0.055. 2 Individual v_scores of heterozygous variants were multiplied by 0.25 (in Eqn. 1 3 het=0.25 for heterozygous and het=1 for homozygous variants) to approximate 4 the effects of heterozygosity. 5 For every gene in every individual we computed a gene functional deficit score 6 (gene_score) as a sum over all gene-specific v_scores (Eqn. 1). Note that gene 7 scores computed in this fashion are zero only for genes that have no variants at 8 all. However, further comparison between gene scores for different genes is not 9 possible, as the score is highly dependent on gene length and overall tolerance 10 for variability, e.g. longer genes with more variable regions will tend to score 11 higher while remaining relatively functional biochemically. 12 Thus, for each gene, g, the overall variant burden score of all N g variants was: 13

Eqn. (1) 14
We also calculated other versions of the gene_score: 15 (1) an easy_gene_score, as in Eqn. 1, but non-synonymous v_score=1 16 (2) a binary_gene_score, as in Eqn. 1, but SNAP-neutral non-synonymous 17 v_score=0.055 and effect non-synonymous v_score=1 18 (3) EV_gene_score, in which gene_score = # of exonic variants in the gene 19 (4) NSV_gene_score, in which gene_score = # of nonsynonymous variants in 20 the gene 21 In our representation, thus, every individual exome can thus be viewed as a 22 vector of individual gene scores with an associated binary disease class (status: i.e. genes that are not affected by any variants in a particular individual are 2 assigned a zero score. Genes with no variants in any individual in a panel were 3 removed from consideration. Additionally, we also removed genes that have only 4 one non-zero score (one individual with a variant) and genes that have consistent 5 non-zero scores across a panel (all have the same variant). 6 Reference candidate gene set extraction. The ALL set represented all genes 7 where at least one individual in CD-train had at least one variant. 8 For each human protein entry in Swiss-Prot (20,204 entries, September 13 th , 9 2015) set we extracted all gene and protein names, mapped to the protein entry 10 identifier. All single word names longer than two characters were included into 11 the search. The Swiss-Prot identifier, minus the "_HUMAN" suffix was also 12 included as a name. Names were normalized as follows: 13 1. Multi-word names were converted into bags-of-words. The abstracts were 14 searched for a combination of all words. 15 2. In a multi-word name, the word separators were any combination of space 16 " ", parenthesis "(", ")", colon ":", or semicolon ";". 17 3. All single words were normalized by replacing any dash "-", forward slash 18 "/", star "*", comma ",", carrot "^", and period "." with a star "*" in the middle 19 of a word and by removing them at the ends. 20 All PubMed abstracts were normalized in the same fashion as names. We 21 searched these abstracts for the occurrence of names in CD publications ("crohn 22 disease" MeSH term, MeSH set, 2,471 genes, computed on October 27 th , 2015, 1 and 1824 genes from MeSH set were in CD-train, Additional file 19). 2 In addition to mining the literature, we also used 3 (1) a set of all 1,286 validated protein-coding genes in the 163 known CD-4 associated regions[3] (GWAS set, 925 genes from GWAS set were in CD-train, 5 Additional file 7) 6 (2) a Pascal[9] ranked list of CD-related genes, 393 genes with a Benjamini & 7 Hochberg[18] corrected p-val < 0.05 (PascalGWAS set, Additional file 8, 318 8 genes from PascalGWAS set were in CD-train) 9 (3) a set of all proteins annotated as Crohn's disease related (Disease feature) in 10 Swiss-Prot (SP set, 22 genes, February 29 th , 2016, and 18 genes from SP set 11 were in CD-train, Additional file 20) 12 MeSH, GWAS, PascalGWAS and SP gene sets were considered external sets 13 throughout the study. In text, a subscript number following the set name indicated 14 the gene number of top ranked genes from this set used to build models. For 15 example, PascalGWAS 100 indicated building a model using top ranked 100 genes 16 from PascalGWAS set. A subscript letter r before the gene number meant a 17 random selection of genes from this set instead of top ranked genes (e.g. 18 GWAS r100 indicated 100 genes randomly selected from GWAS set were used for 19 model building). 20

Features selection (FS) candidate gene set extraction. We recorded genes 21
affected by at least one variant (ALL set) separately for each CD-train, CD-22 train early, and CD-train late panels. We performed the following gene set selections: 23 (1) Collected genes where at least 3 CDs and no HCs had non-zero 1 gene_scores (Disease set, abbreviated as DIS set). 2 (2) Compared the distribution of five different versions, respectively, of 3 gene_scores for CDs vs. HCs using the t-test (TT5 set) and ks-test (KS5 set) 4 and took the genes that were differently (p < 0.05, no correction for multiple 5 testing) distributed in CDs and HCs. 6 (3) Applied DKMcost[60] feature selection (from R CORElearn package[61]) and 7 ranked genes by their merit. 8 In order to avoid overfitting, we applied the above FS techniques (DIS, KS5, TT5 9 and DKMcost) in a leave one person out fashion, iteratively in each fold of cross-10 validation. Thus, we had built multiple models of CD-train-based AVA,Dx with 11 different gene sets each using the same FS technique, so that no model was 12 trained and tested on the same sample. As a "sanity check", we collected genes 13 as described in method (1) above from the entire CD-train (overfitting), and 14 trained DISO gene models in a leave-one-out fashion. Subscript numbers, e.g. 15 KS5 r100 or DKMcost 125 , meant the random ( r ) or top ranked, respectively, number 16 of genes used in building the model as described in Reference candidate gene 17 set extraction. When all genes from the FS set were used to build a model, the 18 gene name was followed by a subscript max (e.g. KS5 max ). 19 Computing gene set overlap. As described above, the number of FS candidate 20 gene sets for one panel and one extraction technique was equal to the number of 21 unrelated individuals in that panel, e.g. there were 109 different KS5 sets in a 22 109-fold cross-validation on CD-train data. For calculation of overlap between 23 any KS5 set and a gene set with fixed genes, e.g. MeSH set, we computed the 1 overlap and the significance (hypergeometric distribution test against a 2 background of the corresponding variant-affected genes) for all 109 KS5 sets 3 and recorded the mean. For calculation of overlap between two non-fixed gene 4 sets, e.g. between KS5 and DKMcost sets, we computed the overlap and 5 significance when the same test individual was held-out, and recorded the mean. 6 Finding gene networks. We used the ConsensusPath database[27] to identify 7 the enrichment in alterations of the known molecular pathways in the selected 8 CD-train genes. ALL set of CD-train was used as the background list. KS5 max and 9 DKMcost 125 selected from entire CD-train panel, as well as PascalGWAS 175 10 genes, were used as input for the pathway enrichment analysis. All pathways 11 with a q-val < 0.1 were recorded in Additional file 12. Induced network analysis 12 from ConsensusPath database using FS genes as starting points was used to 13 detect additional potentially CD-associated genes. 14

CD models. 15
Training cross-validation. We built CD models using leave one person out cross-16 validation on the CD-train panel. Note that individuals of the pair S087 & S110 17 and the pair S076 & S111 are more genetically similar than others (and 18 potentially related, Additional file 1: Figure 22 (resampling with replacement) to create new training samples in a balanced 1 manner. All models used the Support Vector Machine (SVM) algorithm in R's 2 e1071 package [62]. Note that changing the learning method, i.e. replacing 3 support vector machines with Naïve Bayes, neural networks, etc. or adjusting 4 method parameters, could potentially produce better results. However, as the 5 goal of this experiment was to evaluate the CD-relevance of the selected gene 6 sets, we did not optimize algorithm performance. For evaluating the performance 7 of the different gene selection methods, we: 8 (1) randomly sampled with replacement different numbers (10, 25-200, in steps 9 of 25, 500, or 1000) of genes, 100 times from each cross-validation fold gene-set. 10 The gene number was recorded as a subscript following the gene set name as 11 described in Reference candidate gene set extraction and FS candidate gene set 12 extraction sections. For example, for 100-gene KS5 set (KS5r 100 ) in CD-train this 13 meant that we trained 10,900 models -100 random gene sets for each cross-14 validation fold. Note that when the gene set did not have enough genes for 15 sampling, we used the entire set to build models -one model per fold (max 16 subscript following the gene set name; e.g. KS5 max ). For example, if the KS5 set 17 had 113 genes, models requiring more than that used the whole KS5 (KS5 max ) 18 set in every model training iteration. Note that for all models built using a fixed set 19 of genes, the only source of difference in model performance is the differential 20 resampling of the training individuals of the minor class. 21 (2) we took the top ranked 10 or 25 to 200 (in steps of 25) genes and performed 22 cross-validation with the same top ranked genes for each fold of cross-validation: 23 e.g. DKMcost 50 means we trained 10,900 models -top ranked 50 DKMcost 1 genes for each testing fold. Here as above, the only source of the difference in 2 model performance is the differential resampling of the training individuals of the 3 minor class. 4 For each gene set we computed the various model performance metrics, individuals with CD misidentified as being healthy). 10 Class labels for CD and HC were set at 100 and -100; i.e. more negative scores 14 indicate likely healthy individuals and more positive scores indicate likely CD 15 patients. We obtained ROC and PR curves for by varying the threshold for 16 classifying an individual as CD-affected or healthy from -100 (most healthy) to 17

(most CD). 18
To further test if the performance was achieved by chance, we did permutation 19 test by shuffling the labels of every training fold in the cross-validation 1,000 20 times. Null distributions of ROC and PR AUCs were based on these 1,000 results. 21 The permutation p-values for the "real" cross-validation AUCs were obtained 22 empirically by counting the number of larger permuted AUCs (divided by 1000).
Logistic regression from GWAS data. We obtained the logistic regression from a 1 previous study [8] via personal contact. CD-train has 31 loci among the 573-loci 2 regression model, and we made prediction with the model using the 31 loci 3 (listed in Additional file 21). We obtained predictions of precision=64% and 4 recall=72% with this model. ComBat was developed originally for gene expression data. It is an empirical 8 Bayes framework for adjusting data for batch effect [53]. Here we apply the 9 parametric ComBat to the gene_score, which represents the gene functional 10 change instead of expression change. 11 We first applied ComBat to the combined panel of CD-train and CD-test. We  When predictions of all individuals were made, we evaluated the performance by 20

ROC/PR AUCs and the MCC (Matthews correlation coefficient[66], Eqn. 3) 21
based on the chosen cutoff. The significance of the prediction result was also 22 evaluated by a 1,000-time permutation test as described in cross-validation, 23 except that we shuffled the labels for the entire CD-train panel this time and null 1 distributions of ROC and PR AUCs were drawn from the prediction results on 2   p  o  p  u  l  a  t  i  o  n  -b  a  s  e  d  l  i  n  k  a  g  e  a  n  a  l  y  s  e  s  .   T  h  e  A  m  e  r  i  c  a  n  J  o  u  r  n  a  l  o  f  H  u  m  a  n  G  e  n  e  t  i  c  s   1   2  0  0  7  ,   8  1  :   5  5  9  -5  7  5  .  2  1  8  .  B  e  n  j  a  m  i  n  i  Y  ,  H  o  c  h  b  e  r  g  Y  :   C  o  n  t  r  o  l  l  i  n  g  t  h  e  f  a  l  s  e  d  i  s  c  o  v  e  r  y  r  a  t  e  :  a  p  r  a  c  t  i  c  a  l  a  n  d   3   p  o  w  e  r  f  u  l  a  p  p  r  o  a  c  h  t  o  m  u  l  t  i  p  l  e  t  e  s  t  i  n  g  .   J  o  u  r  n  a  l  o  f  t  h  e  r  o  y  a  l  s  t  a  t  i  s  t  i  c  a  l  s  o  c  i  e  t  y  S  e  r  i  e