Identifying Crohn’s disease signal from variome analysis

Background After years of concentrated research efforts, the exact cause of Crohn’s disease (CD) remains unknown. Its accurate diagnosis, however, helps in management and preventing the onset of disease. Genome-wide association studies have identified 241 CD loci, but these carry small log odds ratios and are thus diagnostically uninformative. Methods Here, we describe a machine learning method—AVA,Dx (Analysis of Variation for Association with Disease)—that uses exonic variants from whole exome or genome sequencing data to extract CD signal and predict CD status. Using the person-specific coding variation in genes from a panel of only 111 individuals, we built disease-prediction models informative of previously undiscovered disease genes. By additionally accounting for batch effects, we were able to accurately predict CD status for thousands of previously unseen individuals from other panels. Results AVA,Dx highlighted known CD genes including NOD2 and new potential CD genes. AVA,Dx identified 16% (at strict cutoff) of CD patients at 99% precision and 58% of the patients (at default cutoff) with 82% precision in over 3000 individuals from separately sequenced panels. Conclusions Larger training panels and additional features, including other types of genetic variants and environmental factors, e.g., human-associated microbiota, may improve model performance. However, the results presented here already position AVA,Dx as both an effective method for revealing pathogenesis pathways and as a CD risk analysis tool, which can improve clinical diagnostic time and accuracy. Links to the AVA,Dx Docker image and the BitBucket source code are at https://bromberglab.org/project/avadx/. Electronic supplementary material The online version of this article (10.1186/s13073-019-0670-6) contains supplementary material, which is available to authorized users.


Section 2 -Ethnicity annotation
We annotated the ethnicity of all individuals in our study by mapping our panels to individuals from the 1000 Genomes Project (3). Specifically, we combined all individuals from our panels with the 1000 Genomes Project panel by their shared variants (98,010 SNPs). R package SNPRelate (4)

Section 3 -Individual relationships
To confirm individual relationships, we used kinship coefficient analysis (5)
In further testing, feature selection (FS) and leave-one-out cross-validation were done exactly as described in Methods using both gene_score and NSV_gene_score-ing schemes. Here, NSV_gene_score with DKMcost or KS5 FS performed much worse than gene_score-based models, suggesting all further use of gene_score-ing.

Section 5 -Known CD genes are poorly informative of individual health status
We performed leave-one-out cross-validation training with several external gene sets: (1) literature-annotated CD genes (MeSH set); (2) genes mapping to known GWAS- To extract the MeSH set, for each human protein entry in Swiss-Prot (20,204 entries, September 13 th , 2015) set we collected all gene and protein names, mapped to the protein entry identifier. All single word names longer than two characters were included into the search. The Swiss-Prot identifier, minus the "_HUMAN" suffix was also included as a name. Names were normalized as follows: 1. Multi-word names were converted into bags-of-words. The abstracts were searched for a combination of all words.
3. All single words were normalized by replacing any dash "-", forward slash "/", star "*", comma ",", carrot "^", and period "." with a star "*" in the middle of a word and by removing them at the ends.
All PubMed abstracts were normalized in the same fashion as names. We searched these abstracts for the occurrence of names in CD publications ("crohn disease" MeSH term, MeSH set, 2,471 genes, computed on October 27 th , 2015; 1,824 genes from the MeSH set were represented in the CD-train individuals).
To create the GWAS set, we extracted a set of all 1,286 validated protein-coding genes in the 163 known CD-associated regions (GWAS set, 925 genes from GWAS set were in CD-train) (8).
Since there is no ranking of the importance of genes in any of these set, we used different numbers of randomly selected genes from each set for the cross-validation, running the selection 1,000 times for each gene number/set combination. In text and figures, a subscript letter r before the gene number means a random selection of genes from this set instead of top ranked genes (e.g. GWASr100 indicates 100 genes randomly selected from the GWAS set).
As expected, models that were built on the ALL set genes were random in performance with any of the gene scoring scheme (Supplementary Figure 2D).
Increasing the number of genes per subset used for model building did not improve performance. Interestingly, GWAS and MeSH sets had similar results as the ALL set (Supplementary Figure 2A and 2B). The best performance was achieved by using all 36 VEO genes gene_score (ROC AUC = 0.578 and area under precision-recall curve, PR AUC = 0.624). While disappointing, the inability of these models to differentiate healthy individuals from disease affected ones, suggests that there is no sequencing or scoring artifact that could differentiate CDs and HCs in the CD-train panel.

Section 6 -Performance of the GWAS-based models
We evaluated the predictions of (1) the previously reported logistic regression model based on 573 CD loci (9) and (2)   Similarly as the logistic regression method, we simply removed the un-covered loci from the PRS equation for risk score calculation. The PRS method also performed better than random for all panels (Supplementary Table 6). But AVA,Dx still achieved better performance in all panels, especially in the large WTCCC-GTEx panel.

Section 7 -Removing batch effect across panels
Different sequencing platforms and variant call algorithms impose batch effects on genetic data. Generally, a prediction model is built using one batch of data, here CDtrain in our study. New sequencing data often does not cover exactly the same loci as the training data (Supplementary Table 2 and 3). Thus, for a generic risk model that uses specific variants, the only solution is to impute the missing loci, which may introduce spurious noise, and to ignore the new loci, which may decrease the model predictive power. Moreover, different types of sequencing/sequence-analysis may introduce variant determination bias even for the shared loci.
By default, AVA,Dx uses the variants of the CD-train panel that cover, in total, 83,879 exonic loci. Our evaluation sets include both whole exome sequencing and whole genome sequencing data from different platforms, each covering >50% of the CD-train loci. In computing our gene scores, we inherently assume that the missing variants may be homozygous reference in additional sets -an assumption that may be false, but not expected to be crucial for our model. In this iteration of AVA,Dx we also ignored the new (not present in CD-train) variants. However, our testing (data not shown) suggests that normalizing gene scores to account for the number of the variants present may allow us to include these in the future. For all other types of differences, we were able to remove batch effects using a statistical approach used in batch effect removal in expression data -ComBat (10). Our results suggest that this method helps generate sufficiently accurate predictions. However, for more accurate predictions going forward, new individual genomes should be sequenced using the exact same protocols as the training set.