iCAGES: integrated CAncer GEnome Score for comprehensively prioritizing cancer driver genes in personal genomes

All cancers arise as a result of the acquisition of somatic mutations that drive the disease progression. A number of computational tools have been developed to identify driver genes for a specific cancer from a group of cancer samples. However, it remains a challenge to identify driver mutations/genes for an individual patient and design drug therapies. We developed iCAGES, a novel statistical framework to rapidly analyze patient-specific cancer genomic data, prioritize personalized cancer driver events and predict personalized therapies. iCAGES includes three consecutive layers: the first layer integrates contributions from coding, non-coding and structural variations to infer driver variants. For coding mutations, we developed a radial support vector machine using manually curated mutations to predict their driver potential. The second layer identifies driver genes, by using information from the first layer and integrating prior biological knowledge on gene-gene and gene-phenotype networks. The third layer prioritizes personalized drug treatment, by classifying potential driver genes into different categories and querying drug-gene databases. Compared to currently available tools, iCAGES achieves better performance by correctly classifying point coding driver mutations (AUC=0.97, 95% CI: 0.97-0.97, significantly better than the second best tool with P=0.01) and genes (AUC=0.93, 95% CI: 0.93-0.94, significantly better than MutSigCV with P<1×10^-15). We also illustrated two examples where iCAGES correctly nominated two targeted drugs for two advanced cancer patients with exceptional response, based on their somatic mutation profiles. iCAGES leverages personal genomic information and prior biological knowledge, effectively identifies cancer driver genes and predicts treatment strategies. iCAGES is available at http://icages.usc.edu.


Introduction
Cancer carries somatic mutations acquired during the lifetime of an individual (Stratton et al. 2009). While the majority of these are "passengers", which arise randomly and are functionally neutral, a small proportion are "drivers", which are causally implicated in oncogenesis (Greenman et al. 2007). For an individual patient, the challenge for molecular diagnosis and treatment lies in rapidly and accurately identifying the driver mutations harbored in his/her tumor cells against a large background of passenger mutations (Zender et al. 2006), and in devising appropriate drug treatment based on the specific driver mutations and genes implicated in this patient.
Next-generation sequencing technology enabled researchers to rapidly identify somatic mutations from a patient by comparing the sequence from his/her tumors with that from blood or non-cancerous tissues (Meyerson et al. 2010). Accordingly, several computational tools were developed to help identify these cancer drivers, using readily available personal cancer genomic information generated from next-generation sequencing (Kaminker et al. 2007;Carter et al. 2009;Carter et al. 2010). Such tools can be classified into three categories: the first category focuses on genomic mutations, the second category focuses on transcriptomic information and the third focuses on post-transcriptomic information.
The first category, tools focusing on genomic mutations, can be further classified into whereas FunSeq2 (Fu et al. 2014) prioritizes non-coding mutations. While these tools paved the way for cancer driver prioritization, there is still significant room for improving prediction accuracy. For example, Gnad et al. found that many current methods or combinations of methods for mutation prioritization fail to exceed 81% accuracy in detecting real cancer driver mutations (Gnad et al. 2013). Tools for prioritizing personal genomic mutations, on the other hand, are far less developed. To our knowledge, Phen-Gen is the only tool for prioritizing personal disease driver genes using only mutations identified from next-generation sequencing (Javed et al. 2014). However, it is a general disease gene prioritization tool that does not focus on cancer, a disease that is generally very different from most genetically inherited diseases (Table 1).
Besides computational tools that use genomic mutations as input, there are other tools that use transcriptomic or post-transcriptomic information as input. Some tools, such as PARADIGM-SHIFT (Ng et al. 2012), DawnRank (Hou and Ma 2014) and ActiveDriver (Reimand and Bader 2013), provide personal cancer driver gene prediction; however, they require gene expression, phosphorylation or copy number variation data from patients, which are often not feasible to obtain because of cost and other practical issues. Moreover, they require complicated data preprocessing and data transformation, a challenge for the average biologist or clinician (Table 1).
Thus, there is a strong need for a robust and user-friendly tool to systematically predict personal cancer drivers, which drove the development of iCAGES (integrated CAncer GEnome Score). iCAGES takes somatic mutation profile as input, and rapidly prioritizes cancer driver mutations, genes and targeted drugs, tailored for individual patients with cancer. iCAGES consists of three consecutive layers. The first layer prioritizes personalized cancer driver mutations, including coding mutations, non-coding mutations and structural variations. We devised the second layer of iCAGES to link these mutation features to genes, using a statistical model and prior biological knowledge on cancer driver genes. We devised the third layer of iCAGES to better serve clinicians and researchers interested in personalized cancer therapy, by generating a prioritized list of drugs targeting the repertoire of these potential driver genes.
iCAGES can help increase the accuracy of cancer driver detection and prioritization, bridge the gap between personal cancer genomic data mining and prior cancer research knowledge, and facilitate cancer diagnostics and personalized cancer treatments.

Overview
A general overview of iCAGES is given in Figure 1. To prioritize driver mutations, iCAGES in its first layer takes somatic mutations from next-generation sequencing (NGS) as input and outputs three kinds of driver potential scores for three kinds of mutations. The first layer annotates each mutation with radial Support Vector Machine (SVM) score, normalized Copy Number Variation (CNV) signal score and FunSeq2 score. To prioritize driver genes, the second layer of iCAGES takes two major sources of input. The first source measures the genomic potential of a gene being a personal cancer driver gene, including the three kinds of mutation driver scores from the first layer. The second source measures the prior knowledge of each gene being a cancer driver gene judged on previous biological research, through Phenolyzer score. Given these two sources of input, iCAGES then models the patterns of putative cancer drivers observed in TCGA data with Logistic Regression (LR) model, and outputs a prioritized list of genes ranked by their cancer driving probabilities, or iCAGES gene scores. To predict potential personalized treatment, iCAGES in its third layer takes the prioritized list of all mutated genes and their iCAGES gene scores as input, searches for drugs targeting these genes and their neighbors and prioritizes them according to their pharmacodynamical activities, relatedness of their targets to the mutated genes and cancer driving probability of these genes, namely, iCAGES gene scores. It outputs a prioritized list of drugs ranked by their probabilities of being the best drugs for the particular patient, or iCAGES drug scores. iCAGES achieved excellent performance in correctly classifying putative cancer driver mutations, genes and drugs, compared to current tools for cancer driver prioritization.
Finally, we implemented iCAGES as both a command-line tool and a web server; the latter enables users without informatics skills to perform personalized cancer mutation analysis, unlike most other tools for cancer genomics (Table 1). In the sections below, we describe the features and performance of each of the three layers in iCAGES, and demonstrate the performance of iCAGES using two real-world examples.
Therefore, we believe that the radial SVM score proved itself to be better than existing tools and other machine learning scoring methods and was a justifiable choice for the first layer of iCAGES (Figure 3).

Layer 2: Gene prioritization

Description
The second layer of iCAGES is gene prioritization, which relies on annotation results from the first layer. Because a gene can harbor variants with different functional effects, distinct kinds of mutations, such as point coding mutations, point non-coding mutations and structural variations, can contribute differently to its cancer driver potential. To model these different contributions, we applied an LR model trained on 6,971 mutated genes from 963 breast cancers from TCGA data, using the four feature scores described as follows.
To characterize each gene, we annotated each of its mutations with one of three genomic feature scores: radial SVM score, normalized CNV signal score or FunSeq2 score. Because one gene may harbor more than one mutation of each category, for each gene we took the maximum of each feature score among all its mutations to generate its three genomic feature scores. In addition, existing knowledge generated from decades of cancer genetic and genomic research can help improve gene prioritization. To quantify such prior knowledge on cancer and its associated genes, we applied the database-mining tool Phenolyzer, which outputs a list of candidate genes weighted by the chance of being associated with cancer; this is the Phenolyzer score, used as the fourth feature score of each gene. The output of this layer is the LR-predicted probability for each gene, named the iCAGES gene score, which measures the cancer-driving probability of the gene.
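The feature construction described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the iCAGES implementation: the function names, the record layout and the feature keys are hypothetical, a gene without a mutation of a given kind is assumed to score 0 for that feature, and the LR coefficients must be supplied by the caller because the fitted values are not given here.

```python
import math
from collections import defaultdict

# Hypothetical keys for the three genomic feature scores named in the text.
FEATURES = ("radial_svm", "cnv", "funseq2")


def gene_features(mutations, phenolyzer):
    """Collapse per-mutation scores into per-gene features.

    mutations:  iterable of dicts with keys "gene", "feature", "score".
    phenolyzer: dict mapping gene -> Phenolyzer score (the fourth feature).
    Each genomic feature is the maximum score over the gene's mutations.
    """
    best = defaultdict(lambda: dict.fromkeys(FEATURES, 0.0))
    for m in mutations:
        if m["feature"] not in FEATURES:
            raise ValueError("unknown feature: %r" % m["feature"])
        current = best[m["gene"]][m["feature"]]
        best[m["gene"]][m["feature"]] = max(current, m["score"])
    for gene, feats in best.items():
        # Assumption: genes absent from the Phenolyzer list get a score of 0.
        feats["phenolyzer"] = phenolyzer.get(gene, 0.0)
    return dict(best)


def lr_probability(feats, coef, intercept):
    """Logistic regression probability from caller-supplied coefficients."""
    z = intercept + sum(coef[name] * feats[name] for name in coef)
    return 1.0 / (1.0 + math.exp(-z))
```

Taking the maximum rather than the sum means that a gene is characterized by its single most damaging mutation of each kind, so a gene with many low-scoring mutations does not outrank a gene with one high-scoring mutation.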

Performance evaluation
To our knowledge, there is no large-scale benchmark data for personal cancer driver analysis, yet batch analysis data from TCGA data and personalized analysis data of two recently published targeted therapy patient cases are available and could be used to estimate the performance of iCAGES. As for batch analysis, we applied TCGA data with 3,281 tumors across 12 tumor types and compared the performance of iCAGES gene score with a popular batch cancer driver gene analysis tool, MutSigCV, using significantly mutated genes (SMGs) generated from MuSiC in the original publication as the gold standard (Kandoth et al. 2013 PubChem database, averaged them over each drug and normalized them to 0-1 as the active probability for this drug. The final step calculates the joint probability of a drug being an effective drug for this particular patient by multiplying iCAGES gene score for its target, its relatedness probability and its active probability, generating iCAGES drug score for each drug.

Performance evaluation
To our knowledge, few data on personalized cancer treatment using whole genome sequencing (WGS)/whole exome sequencing (WES) are available, except for the two aforementioned case studies of extraordinary drug response. Therefore, we applied our iCAGES drug scoring system to the data of these two patients and replicated both original findings. The first patient, with lung adenocarcinoma, demonstrated an extraordinary response to SORAFENIB, which directly targets the ARAF gene mutated in her tumor. Through the iCAGES drug prioritization pipeline, we replicated the original finding by nominating SORAFENIB as the top drug candidate out of 122 drugs (top 0.8%) (Supplementary Table 4). For the second patient, with a solid tumor, we also replicated the original finding by nominating EVEROLIMUS, which directly targets the MTOR gene, as the 3rd best candidate out of 199 drugs (top 1.5%) (Supplementary Table 7). Therefore, we believe that the iCAGES drug score is an effective tool for prioritizing personalized therapies in real-world cases.

Discussion
In the current study, we established a statistical pipeline, iCAGES, to rapidly analyze patient-specific cancer genomic data, prioritize personalized cancer driver events and predict personalized therapies. To the best of our knowledge, iCAGES is the first tool that sequentially prioritizes cancer driver mutations, genes and targeted drugs in three consecutive layers.
Compared to currently available tools, iCAGES achieves better performance by correctly predicting cancer driver mutations, genes and targeted drugs in analyses of both population-based cohorts and personal genomes. Below we discuss several unique aspects of iCAGES.
To our knowledge, iCAGES is the first (or one of the first) personalized cancer driver prioritization tool(s) based on genomic mutations, so it fills a practical role that other tools cannot address. While similar tools, such as MutSigCV (Lawrence et al. 2013) and MuSiC (Dees et al. 2012), do require only genomic mutations as input, they do not allow single patient analysis due to the nature of their algorithms. Indeed, they define cancer drivers as genes, whose mutation rate is significantly higher in tumors than in normal tissues among a group of patients. Since it is not feasible to calculate mutation rate given only one patient, we cannot apply these tools for personalized cancer driver prioritization, which makes iCAGES a complementary option for studying driver events at patient-level resolution.
To facilitate researchers with only genomic mutation data in hand, we designed iCAGES to have the least requirement on input information. While other personalized cancer driver prioritization tools, such as DawnRank (Hou and Ma 2014), often require patient's cancer signature data, including genomic mutation, tumor gene expression data and normal gene expression data, iCAGES only requires patient's genomic mutation data (in ANNOVAR input format or in VCF format, the format of natural output of many variant detection pipelines) and handles all data preprocessing steps for users.
To our knowledge, iCAGES is the only personalized cancer driver prioritization tool that considers point coding mutations, point non-coding mutations and structural variations. While Phen-Gen (Javed et al. 2014) also allows for point coding mutation prioritization, it ignores noncoding mutations and structural variations, which are known to contribute to cancer genesis and are known to have a significantly different mutation pattern compared to coding mutations.
To better serve clinical researchers interested in personalized cancer therapy, we designed iCAGES to be the first tool for prioritizing personal cancer therapies. DGIdb (Griffith et al. 2013) offers similar functionality, because it also generates a list of drugs interacting with a list of genes; indeed, the iCAGES drug prioritization module uses DGIdb to query drugs interacting with genes of interest. However, the iCAGES drug prioritization layer has two advantages over using DGIdb alone: (1) to enhance the specificity of drug queries, iCAGES first classifies genes into cancer suppressor genes, oncogenes and other genes and, for each category, queries DGIdb for drugs interacting with these genes based on their cancer evolutionary properties, so that, compared to the original DGIdb drug list, the iCAGES drug list is much smaller and contains far fewer false positives; (2) unlike DGIdb, iCAGES prioritizes drugs, using the driver potential of a given target gene (iCAGES gene score), the relatedness probability from BioSystems and the drug active probability from PubChem.
Such a personalized prioritization process not only proved effective in two real-world cases, but is also useful for researchers and clinicians who want to make the most of their time and resources. Therefore, compared to DGIdb, iCAGES may be a more practical tool for personalized drug prioritization.
To utilize valuable prior biological knowledge generated from numerous research studies on cancer, we integrated one of the largest biological knowledge databases on gene-gene interaction and gene-phenotype interaction networks in the iCAGES pipeline. In the second layer of iCAGES, to score genes based on their prior association with cancer, we applied Phenolyzer, a database-mining tool, which integrates fifteen different biological knowledge databases. Such large-scale integration of biological knowledge enhances the accuracy of iCAGES, as the prioritization process is not only based on personal genomic context, but also is guided by experts' knowledge from decades of research.
To facilitate cancer genetics and genomics research for the average biologist and clinician, the iCAGES web server offers one of the most user-friendly interfaces among cancer driver prioritization tools. For example, it includes a well-documented introduction and video tutorials so that a general user can easily learn to employ this package in his/her daily research. Moreover, to enhance user experience, the results page features graphics rendering.
Figure 5). This result indicates that potential contamination and random noise exist in the new dataset. Indeed, the major difference between these two versions of the data lies in the inclusion of 1,385,270 mutations, mostly collected from large-scale sequencing projects, such as TCGA. These data made version 68 ~5.7 times larger than version 57. We caution that the amount of information from large-scale data comes with extra noise, which demands a strict filtering process to ensure high data quality.
In summary, we demonstrated the superior performance of iCAGES in both batch analysis and personal analysis, and we hope that this tool can complement current cancer driver detection tools, pave the way for the development of such comprehensive statistical frameworks and shed light on cancer driver gene discovery and new avenues for personalized cancer therapy.

Training data composition
Two kinds of training datasets were collected for training radial SVM and iCAGES gene scores, respectively. As for training radial SVM score, we retrieved benchmark dataset from  which is used as the fourth feature score of the gene, we downloaded Phenolyzer cancer scores from its webserver (http://phenolyzer.usc.edu) and used them to annotate all mutated genes.

Testing data composition
Three kinds of testing datasets were collected for testing the performance of iCAGES.
The first testing dataset was curated for testing the performance of radial SVM score and the latter two were curated for testing the performance of iCAGES gene score on batch and  Correlation Coefficients of all continuous and binary variables were calculated to examine potential collinearity between predictor variables and to roughly assess the predictive power of each predictor. Having observed strong collinearity between the HumDiv-trained and HumVar-trained PolyPhen-2 predictors, we chose the PolyPhen-2 model trained on HumDiv data, as recommended by the developers.
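The collinearity screen mentioned above can be illustrated with a short Python sketch. Two points are assumptions rather than details from the text: the type of correlation coefficient is not specified, so Pearson's r is used here, and the 0.8 cut-off for flagging a pair is purely illustrative. The function names and the column layout are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold.

    columns: dict mapping predictor name -> list of values.
    threshold: illustrative cut-off, not a value taken from the paper.
    """
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged
```

When a pair is flagged, one of the two predictors is dropped before model fitting, which is the kind of decision described above for the two PolyPhen-2 models.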

Radial SVM modeling
In order to test the hypothesis that a non-linear combination of predictors can better model the patterns of cancer driver mutations, two linear machine-learning algorithms (LR and linear SVM) and one non-linear algorithm (radial SVM) were evaluated using the R package "e1071" (David Meyer 2014-02-13). After radial SVM was selected to model the patterns of cancer driver mutations, its parameters were further tuned to enhance its performance. In radial SVM,

Searching for iCAGES drugs
Mining for targeted therapies can be enhanced if the functions of their targets are known. As many genes are known to play a role in oncogenesis either as a "cancer suppressor gene" or as an "oncogene", we incorporated such prior knowledge and functionally annotated each gene as a "cancer suppressor gene", an "oncogene" or an "other gene" (Vogelstein et al. 2013;Zhao et al. 2013). Cancer is known to be a Darwinian process played out in somatic tissues, so in searching for effective drugs for a patient, we focused on drugs that can potentially disrupt the evolutionary advantage conferred by mutated genes in cancer tissue. For example, if a cancer patient harbors a mutated MTOR, which is an oncogene, then we should search for drugs that inhibit its function, because to gain an evolutionary advantage MTOR tends to harbor activating mutations. On the other hand, for mutated cancer suppressor genes, we should search for drugs that activate the function of the gene, because to achieve an evolutionary advantage these genes tend to harbor loss-of-function mutations. Therefore, for each candidate cancer driver gene predicted in the second layer, iCAGES queries the drug-gene interaction database DGIdb for expert-curated drugs that activate "cancer suppressor genes", inhibit "oncogenes" and interact with "other genes".
Given a list of potential cancer driver genes, each with an iCAGES gene score, we searched for targeted drugs as follows. First, to find neighboring genes, we queried the BioSystems database for each gene and retrieved its top four most related neighbors, judged by their normalized BioSystems relatedness probability. Second, we classified each gene as a cancer suppressor gene, an oncogene or another kind of gene by querying the TSGene database and UniProt oncogenes. Third, we used the DGIdb database to query targeted drugs that activate cancer suppressor genes, suppress oncogenes or interact with other genes, respectively, with different parameter settings.
More specifically, for cancer suppressor genes, we queried DGIdb for expert-curated drugs that serve as activators, inducers or stimulators of these genes. For oncogenes, we queried for expert-curated drugs that serve as inhibitors, suppressors, antibodies, antagonists or blockers. For genes that are neither cancer suppressor genes nor oncogenes, we queried for expert-curated drugs with any kind of interaction with the target. Finally, for each drug, given the BioSystems relatedness probability between its direct target and the originally mutated gene, the PubChem active probability for the drug and the iCAGES gene score for the originally mutated gene, we calculated the joint probability of the drug being the best candidate therapy for the patient by multiplying these three probabilities together, generating the iCAGES drug score.
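The filtering and scoring rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the iCAGES code: the category labels, function names and record layout are hypothetical, no real DGIdb, BioSystems or PubChem call is made (the candidate records are assumed to have been retrieved already), and when a drug is reached through several genes the highest score is kept, which is a choice made here because the text does not say how such ties are resolved.

```python
# Interaction types accepted per gene category, as listed in the text.
# The category labels themselves are hypothetical identifiers.
INTERACTION_TYPES = {
    "cancer_suppressor": {"activator", "inducer", "stimulator"},
    "oncogene": {"inhibitor", "suppressor", "antibody", "antagonist", "blocker"},
    "other": None,  # any kind of interaction is accepted
}


def keep_interaction(gene_category, interaction_type):
    """True if this drug-gene interaction matches the rule for the category."""
    allowed = INTERACTION_TYPES[gene_category]
    return allowed is None or interaction_type in allowed


def drug_score(gene_score, relatedness, active_probability):
    """Joint probability: product of the three probabilities in the text."""
    for p in (gene_score, relatedness, active_probability):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return gene_score * relatedness * active_probability


def rank_drugs(candidates):
    """Rank drugs by score, highest first.

    candidates: iterable of dicts with keys "drug", "gene_category",
    "interaction_type", "gene_score", "relatedness", "active_probability".
    Assumption: a drug reached via several genes keeps its highest score.
    """
    best = {}
    for c in candidates:
        if not keep_interaction(c["gene_category"], c["interaction_type"]):
            continue
        score = drug_score(c["gene_score"], c["relatedness"],
                           c["active_probability"])
        best[c["drug"]] = max(best.get(c["drug"], 0.0), score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Because the score is a product, a drug ranks highly only when all three factors are high: a strongly active drug aimed at a distant neighbor of a weak driver candidate is pushed down the list.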

Software package
Statistical analysis and LR modeling were conducted using R (version 3.0.1). ROC curves were drawn using the R package "ROCR". 95% CIs were calculated using "pROC". SVM modeling was conducted using "e1071". The iCAGES package was written in Perl and the user interface was written in Ruby on Rails, JavaScript and HTML5.

References
Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422): 56-65.
Gnad F, Baucom A, Mukhyala K, Manning G, Zhang Z. 2013. Assessment of computational me
(Database issue): D191-198.
Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Jr., Kinzler KW. 2013. Cancer genome landscape