Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations

While tumor genome sequencing has become widely available in clinical and research settings, the interpretation of tumor somatic variants remains an important bottleneck. Here we present the Cancer Genome Interpreter, a versatile platform that automates the interpretation of newly sequenced cancer genomes, annotating the potential of alterations detected in tumors to act as drivers and their possible effect on treatment response. The results are organized in different levels of evidence according to current knowledge, which we envision can support a broad range of oncology use cases. The resource is publicly available at http://www.cancergenomeinterpreter.org.

Electronic supplementary material: The online version of this article (10.1186/s13073-018-0531-8) contains supplementary material, which is available to authorized users.


Ia. Overview
The Cancer Genome Interpreter (CGI) first identifies the genomic alterations (mutations, i.e. point substitutions or small insertions/deletions; copy number alterations; and/or gene translocations) driving tumor growth. In detail, each mutation is classified as (i) a known oncogenic mutation in the tumor; (ii) a known oncogenic mutation in other cancers; (iii) a predicted driver mutation of the tumor (these are further divided into two tiers); (iv) a predicted passenger event; (v) a variant which does not affect the protein sequence; or (vi) a polymorphism (i.e. allele frequency greater than 1% across healthy donors 1 ). Each gene amplification or deletion is classified as (i) a known oncogenic copy number alteration (CNA) of the tumor; (ii) a known oncogenic CNA in other cancers; (iii) a predicted driver CNA of the tumor; or (iv) a predicted passenger event. Finally, each translocation is classified as (i) a known oncogenic event of the tumor; (ii) a known oncogenic event in other cancers; or (iii) a translocation of uncertain significance. These analyses are supported by an ensemble of databases and bioinformatics methods based on several existing or newly developed resources (see the Catalog of Cancer Genes, the Catalog of Validated Oncogenic Mutations and the OncodriveMUT method sections in the present document for further details). Of note, the system assumes that all genomic alterations entered by the user are correctly called; it does not account for issues such as genes with unclear copy number status boundaries or mutations with low-quality calls.
Thereafter, the CGI explores potential therapeutic opportunities offered by the tumor's genomic makeup.
Tumor alterations are compared with genomic biomarkers of anti-cancer drug response (sensitivity, resistance and toxicity) annotated in the Cancer Biomarkers database (see section V for further details). The CGI matches this information to the alterations observed in the tumor taking into account several considerations. First, it detects and groups co-occurring alterations that are known to interact in the response to a given drug. This includes the co-occurrence in the tumor of biomarkers of resistance and sensitivity to the same drug. Second, the match between the observed genomic alteration and the biomarker of drug response takes into account the level of detail of the latter; e.g. in the case of mutations, the system distinguishes whether the biomarker refers to any mutation in the gene, to one particular exon (or domain) or to a specific amino acid change. The results of the alteration analysis step are considered here; e.g. the OncodriveMUT classification of a variant is taken into account for the in silico prescription in the case of biomarkers that are solely defined as an oncogenic mutation of a given gene. Finally, the in silico prescription considers two types of repurposing of anti-cancer drugs. Cancer-type repurposing applies to cases in which the alteration observed in the tumor has been described as a biomarker of response to the drug in a tumor type different from that of the sample(s) under analysis, following the hierarchy of the tumor type taxonomy. Alteration-type repurposing describes cases in which an alteration different from the one described in the biomarker, but with the same putative effect, is observed in the tumor (e.g. a deletion of a tumor suppressor when the biomarker is a loss-of-function mutation).
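The level-of-detail matching described above can be illustrated with a minimal Python sketch. The class names, field names and drug labels are hypothetical and do not reflect the actual CGI data model; the sketch only shows how a biomarker defined at the gene, exon or amino acid change level could be compared with an observed mutation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mutation:
    gene: str
    exon: int
    protein_change: str  # e.g. "V600E"


@dataclass(frozen=True)
class Biomarker:
    gene: str
    drug: str                             # hypothetical drug label
    exon: Optional[int] = None            # set when the biomarker is exon-specific
    protein_change: Optional[str] = None  # set when it names one specific change


def matches(mutation: Mutation, biomarker: Biomarker) -> bool:
    """Return True if the mutation satisfies the biomarker at its level of detail."""
    if mutation.gene != biomarker.gene:
        return False
    if biomarker.protein_change is not None:
        # Most specific level: one particular amino acid change.
        return mutation.protein_change == biomarker.protein_change
    if biomarker.exon is not None:
        # Intermediate level: any mutation within the stated exon.
        return mutation.exon == biomarker.exon
    # Least specific level: any mutation in the gene.
    return True


observed = Mutation(gene="BRAF", exon=15, protein_change="V600E")
specific = Biomarker(gene="BRAF", drug="drug_A", protein_change="V600E")
exon_level = Biomarker(gene="BRAF", drug="drug_B", exon=11)
gene_level = Biomarker(gene="BRAF", drug="drug_C")
print([matches(observed, b) for b in (specific, exon_level, gene_level)])
```

Biomarkers defined solely as "an oncogenic mutation of a given gene" would additionally require the driver classification of the variant, which this sketch does not model.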
Furthermore, the CGI also flags as potentially interesting those compounds that have been shown experimentally to bind to the products of genes with driver alterations in the tumor sample. This is based on the information of the Cancer Bioactivities database, which collects data on gene-compound chemical interactions (see section VI for further details). This process takes into account (i) the experimentally measured strength of the reported interaction; and (ii) whether the mechanism of action of the compound on the targeted gene is coherent with the mode of action of the latter (i.e. inhibitors for oncogenes, and agonists for tumor suppressors).
All CGI analyses are cancer-specific and thus the tumor type of the sample(s) to analyze is required as an input. The CGI uses an in-house cancer taxonomy which takes into account the disease hierarchy (e.g. mutations that are known to be oncogenic in non-small cell lung carcinomas will produce a 'tumor type match' when observed in a lung adenocarcinoma sample). Therefore, the more generic the tumor type supplied for the sample to be analyzed, the less specific the results of the CGI will be.
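As an illustration of how a hierarchical taxonomy yields a 'tumor type match', the following Python sketch walks up a child-to-parent map. The taxonomy fragment and its labels are invented for the example and are not the in-house CGI classification.

```python
# Hypothetical fragment of a tumor-type hierarchy, stored as child -> parent.
PARENT = {
    "lung_adenocarcinoma": "non_small_cell_lung",
    "lung_squamous": "non_small_cell_lung",
    "non_small_cell_lung": "lung",
    "lung": "solid_tumor",
}


def lineage(tumor_type):
    """Return the tumor type followed by all of its ancestors, most specific first."""
    path = [tumor_type]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tumor_type_match(sample_type, annotation_type):
    """True if the annotation applies to the sample's type or to one of its ancestors."""
    return annotation_type in lineage(sample_type)


# An annotation made for non-small cell lung carcinoma applies to an adenocarcinoma sample.
print(tumor_type_match("lung_adenocarcinoma", "non_small_cell_lung"))
# A sample submitted with the generic type "lung" does not match the more specific annotation.
print(tumor_type_match("lung", "non_small_cell_lung"))
```

The second call shows why a generic input type gives less specific results: annotations attached to more specific subtypes are no longer reachable from the sample's position in the tree.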

Ib. Pipeline annotations
The input of the CGI consists of a list of genomic alterations detected in one (or more) tumor sample(s). The CGI is able to analyze mutations (point substitutions and small insertions/deletions), gene CNAs and/or translocations. The system accepts and automatically recognizes several formats, including Human Genome Variation Society (HGVS, either in genomic or protein coordinates) and Variant Call Format (VCF) for mutations. Direct or inverse mapping between genomic and protein coordinates of mutations is supported by the TransVar method 2 . To annotate the mutations, the CGI selects the transcript with the longest CCDS sequence (or the longest cDNA sequence if multiple CCDS transcripts of the same length exist or the gene has no CCDS transcript), according to data retrieved from Ensembl v70, except for a set of 109 genes whose canonical transcript was manually selected. The CGI reports include several mapping attributes such as the exon and the Pfam 3 domain affected by the mutated residue. Importantly, data provided by the different databases included in the CGI (e.g. the data aggregated to build the Catalog of Validated Oncogenic Mutations) are consistently re-annotated using identical syntax and versions in order to guarantee internal compatibility. Accordingly, the CGI pipeline re-maps the mutations introduced by the user in the same way to guarantee appropriate cross-matches.
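The transcript selection rule can be sketched as follows. This is an illustrative Python reading of the rule stated above, not the CGI implementation: the `Transcript` fields and identifiers are hypothetical, ties in CCDS length are assumed to be broken by cDNA length among the tied transcripts, and the 109 manually curated genes are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transcript:
    transcript_id: str                 # hypothetical identifier
    cdna_length: int
    ccds_length: Optional[int] = None  # None when the transcript has no CCDS


def select_transcript(transcripts: List[Transcript]) -> Transcript:
    """Pick the transcript with the longest CCDS; fall back to the longest cDNA."""
    with_ccds = [t for t in transcripts if t.ccds_length is not None]
    if with_ccds:
        longest = max(t.ccds_length for t in with_ccds)
        # Several transcripts may share the longest CCDS length.
        candidates = [t for t in with_ccds if t.ccds_length == longest]
    else:
        # The gene has no CCDS transcript at all.
        candidates = list(transcripts)
    # With a single candidate this returns it; otherwise the longest cDNA wins.
    return max(candidates, key=lambda t: t.cdna_length)


gene_transcripts = [
    Transcript("T1", cdna_length=3000, ccds_length=1500),
    Transcript("T2", cdna_length=3400, ccds_length=1500),
    Transcript("T3", cdna_length=5000),
]
print(select_transcript(gene_transcripts).transcript_id)
```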

Ic. Web interface
The CGI framework is freely available on the web at http://www.cancergenomeinterpreter.org. As stated before, the input of the CGI is (a) the genomic alterations of the tumor(s); and (b) the cancer type. The latter can be selected from a taxonomy tree that follows the in-house cancer classification. Note that several tumor samples can be analyzed in a single CGI run as long as they belong to the same tumor type, since the analysis is cancer-specific. The list of alterations may be provided as (i) one (or more) tab-separated files; and/or (ii) via a free text box. Once the user executes a new analysis, the process may be tracked using the identifier assigned to it by the system and, once completed, the results are stored for 48 hours. The execution time of each analysis depends on (i) how long the job takes to get a slot in our computer cluster, (ii) the time required to load the data structures used by the CGI, and (iii) the number of entries to be analyzed. With the aim of reducing the overall time of some of these analyses, the results for the most frequent alterations observed in tumors are pre-computed. Once finished, the resulting CGI output is provided via a web report that can be interactively browsed and filtered. This report is divided into two parts. The first presents the results of the alterations analysis (which may be further divided into three tabs containing the results for mutations, CNAs and/or translocations as appropriate) and the second presents the in silico prescription (organized in one tab showing the match of the tumor with the Cancer Biomarkers database and another tab showing the match with the Cancer Bioactivities database). If the user logs into the system, these reports are stored in a Results page within the CGI website associated with that user's account. The login process only requires a valid email address, and access is granted immediately thereafter.
The CGI reports may be shared by creating a unique link and the results may be downloaded as tab-separated files. To prevent unauthorized access or disclosure, to maintain data accuracy, and to ensure the appropriate use of information, the CGI uses a range of reasonable physical, technical, and administrative measures to safeguard the information, in accordance with current technological and industry standards. In particular, all connections to and from our website are encrypted using Secure Sockets Layer (SSL) technology. The CGI never has access to users' passwords and uses a trusted third-party protocol to authenticate the user. While the analyses are running, they are stored on our private servers. The results can be downloaded, shared or deleted, and they are organized by an editable title. When a CGI analysis is deleted, it is completely and permanently removed from the servers.

Id. Application Programming Interface
The CGI resource can also be accessed programmatically through a REST API. Only registered users can make use of the API, since a token is needed for any communication between the end user and the REST API. Further details can be found at https://www.cancergenomeinterpreter.org/rest_api

II. Catalog of Cancer Genes
The CGI focuses the analysis on genomic alterations that affect the genes thought to be potentially involved in the pertinent cancer type. Although all the variants are annotated with relevant attributes (see section IV), only those affecting cancer genes qualify for further consideration as potential driver events. The Catalog of Cancer Genes is a collection of genes that drive tumorigenesis in one or more tumor types upon a given alteration type (mutation, CNA and/or gene translocation). This information is supported by (a) validated data; and/or (b) bioinformatics prediction. For the former, known cancer genes are collected from the following manually curated resources: (i) the Cancer Gene Census 4 ; (ii) genes bearing mutations known to [...] At present, these analyses have been carried out across a pan-cancer cohort of 6,792 samples comprising 28 different tumor types.
Finally, the mode of action (loss-of-function versus gain-of-function) of each cancer gene has also been included in the Catalog of Cancer Genes. This information can be (a) validated and, as such, obtained from manually curated resources; or (b) predicted via bioinformatics analyses 8 . Of note, the mode of action includes an 'ambiguous' role, which is assigned when the role is not known and cannot be reliably predicted by the computational methods employed, or when the gene acts as both a tumor suppressor and an oncogene in a context-dependent manner.

III. Catalog of Validated Oncogenic Mutations
Not all mutations identified in cancer genes are capable of driving tumorigenesis. Consequently, the CGI considers not only whether a gene is mutated, but also which particular variant occurs. Therefore, we first compiled an inventory of mutations in cancer genes that are demonstrated to drive tumor growth or predispose to cancer. This was retrieved by combining the data contained in the DoCM 9 , ClinVar 10 and OncoKB 11 databases as well as the results of several published experimental assays, such as those compiled by Martelotto et al. 12 . We also considered as oncogenic the mutations reported to increase sensitivity to targeted drugs included in the Cancer Biomarkers Database (see below). Germline variants found to predispose to cancer, which we retrieved from the ClinVar and IARC resources 10,13 , were also included. Contradictory data (i.e. a variant stated as oncogenic and neutral by different resources) were flagged and filtered out. In all, 24 variants (0.4% of the total) were filtered out for this reason. The current version of the Catalog of Validated Oncogenic Mutations includes 5,610 somatic/germline oncogenic variants. This dataset is available at https://www.cancergenomeinterpreter.org/mutations. When this information was matched to the somatic mutations identified by exome sequencing in the pan-cancer cohort of 6,792 samples (see main text of the manuscript), we found that only a minority of the mutations observed across cancer genes were validated oncogenic events. Thus, a majority of the protein-affecting mutations (~88%) observed in tumors, even if they occur in well-known cancer genes, are of unknown significance, highlighting the need for tools to classify them (see the OncodriveMUT section).
Of note, we observed some of these validated oncogenic events in cancer types in which they had not been described before, such as DNMT3A p.R882H, SF3B1 p.K700R and JAK2 p.V617F mutations (known in blood malignancies 14-16 ) in breast, renal and glioblastoma tumors in the pan-cancer cohort, respectively. These rare events may be further relevant when they are involved in the response to anti-cancer drugs.
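The aggregation and conflict-filtering step can be illustrated with a short Python sketch. The record layout, source labels and variant identifiers are hypothetical; the sketch only shows the logic of discarding variants that different resources report as both oncogenic and neutral.

```python
from collections import defaultdict


def merge_annotations(records):
    """Merge (variant, source, label) records, where label is 'oncogenic' or 'neutral'.

    Returns a pair (oncogenic, conflicting): the variants kept as oncogenic and
    the variants filtered out because the sources contradict each other.
    """
    labels = defaultdict(set)
    for variant, _source, label in records:
        labels[variant].add(label)
    conflicting = {v for v, seen in labels.items() if {"oncogenic", "neutral"} <= seen}
    oncogenic = {v for v, seen in labels.items() if "oncogenic" in seen} - conflicting
    return oncogenic, conflicting


# Hypothetical records; the source names and variants are placeholders.
records = [
    ("GENE1:p.A10V", "source_a", "oncogenic"),
    ("GENE1:p.A10V", "source_b", "oncogenic"),
    ("GENE2:p.R20Q", "source_a", "oncogenic"),
    ("GENE2:p.R20Q", "source_c", "neutral"),
    ("GENE3:p.L30P", "source_c", "neutral"),
]
kept, flagged = merge_annotations(records)
print(sorted(kept), sorted(flagged))
```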

IVa. Overview
We have developed a novel method, OncodriveMUT, with the aim of gaining further insights into the oncogenic potential of the mutations of unknown significance. OncodriveMUT is used by the CGI to analyze the mutations in cancer genes that are not found in the Catalog of Validated Oncogenic Mutations.
OncodriveMUT combines measurements performed at the level of each individual mutation with knowledge about the driver genes (or regions thereof) in which these mutations are found. This knowledge is retrieved from the analysis of large cohorts of sequenced tumors and healthy donors, which provides the statistical power to discover gene features that are relevant to assess the importance of particular mutations. At present, we have analyzed cohorts of tumors (6,792 samples across 28 cancer types 5 ) and samples from healthy donors (60,706 unrelated individuals) 17 . In detail, the knowledge retrieved from cohorts of healthy donors comprises the allele frequency of variants and the protein domains depleted of functional variants in the general population. The latter point to protein regions that may be less tolerant to functional variants. To identify them, we searched for protein domains (from the Pfam 3 database) enriched for very rare (1 out of 10,000 samples) variants according to ExAC data 17 . As a result, we identified 94 genes exhibiting 24 types of so-called 'delicate domains', which include the tyrosine phosphorylation, the protein kinase, the homeobox and the SH2 domains. On the other hand, the analysis of sequenced tumor cohorts yielded: (i) the signals of positive selection of each gene in each tumor type 5 , which is the cornerstone to identify cancer genes; (ii) the mode of action of each cancer gene in tumorigenesis, i.e. loss-of-function, oncogene or ambiguous 8 ; and (iii) protein sites with an unexpectedly high concentration of somatic mutations, i.e. mutation clusters 18 . Finally, as mutation-centric features, OncodriveMUT uses (i) the consequence type, i.e. missense, inframe indel, or truncating mutation (e.g. a mutation within a canonical splice site, a frameshift variant or the insertion of a premature stop codon); the location of the mutation in terms of (ii) the domain (to match it to the list of delicate domains, see above) and (iii) the protein site, in detail whether it occurs before the last exon-intron junction (which is more likely to trigger the nonsense-mediated decay pathway in case of a truncated protein) or in the last portion of the protein (since disrupting mutations may be less deleterious if they occur at the very last protein sites); and (iv) the estimated deleteriousness of the mutation, measured by the Combined Annotation Dependent Depletion score 19 .
OncodriveMUT combines these measurements using a set of heuristic rules, which are shown in Supplementary Table 2. We compared the performance obtained by these rules with a machine-learning approach; to do so, we built a random forest classifier 20 . Using bona fide oncogenic mutations and neutral events observed across cancer genes (see below), a random forest classifier with 1,000 estimators was trained in a ten-fold cross-validation with 70% of the features in order to predict the remaining 30% (data not shown). Both the machine-learning and the heuristic approaches exhibited similar performance. We therefore decided to use the latter, since the rationale behind the classification of each variant by OncodriveMUT is then human-readable and the critical review of these results is facilitated. To empower the user to carry out such a review, the measurements and attributes of each variant employed by the OncodriveMUT classification are included in the CGI output reports. In addition, these data can support further exploration of the mode of action of each mutation. For instance, most inframe indels detected as driver events in tumor suppressor genes (whose effect is more difficult to estimate than that of their clearly more deleterious frameshift relatives) occur within regions where somatic mutations tend to cluster in these genes.
This may suggest a loss-of-function mechanism driven by the disturbance of critical protein sites (e.g. inframe deletions in CDKN2A binding sites within the second exon 21 ), or the acquisition of dominant-negative phenotypes driven by the creation of particular protein fragments (e.g. inframe indels in the 5th exon of TP53 22 ). The incorporation of additional computational measurements developed in the future, as well as the study of novel data and experimental results, will help to further improve OncodriveMUT analyses.

IVb. Benchmarking
First, bona fide driver and passenger mutations in cancer genes were collected to be used as positive and negative data sets, respectively, to benchmark the OncodriveMUT approach. The former was composed of the entries gathered in the Catalog of Validated Oncogenic Mutations (n=5,314). For the latter, we collected a set of protein-affecting mutations observed in cancer genes and found to be non-pathogenic and/or neutral in terms of oncogenesis (according to ClinVar and OncoKB annotations 10,11 , n=670) or common polymorphisms (allele frequency larger than 1% in the general population according to ExAC 17 , n=1,006). As a result, we found that OncodriveMUT separates the variants of these two data sets with 86% accuracy (Matthews correlation coefficient, 0.64) (Suppl. Figure 1A). OncodriveMUT outperformed other methods developed with a similar purpose 19,23-26 (Suppl. Figure 2). In addition, several data sets were collected to assess whether the mutations classified as drivers by OncodriveMUT follow a priori expected behaviors of oncogenic mutations. First, we downloaded the frequency of somatic protein-affecting mutations in cancer genes observed across tumor samples from COSMIC v76 27 . Second, the allele frequency across the general population of germline variants leading to a change of protein sequence in cancer genes was retrieved from ExAC 17 . Third, the cancer cell fraction of mutations observed in cancer genes was calculated using their variant allele frequency corrected by the estimated tumor purity and gene copy number 5 . As a result, we observed that mutations classified as drivers by OncodriveMUT are enriched amongst recurrent COSMIC mutations (Suppl. Fig. 1B). They are also enriched for rare germline variants across healthy donors (Suppl. Fig. 1C). Both results are expected from oncogenic events.
However, a certain degree of circularity in this validation must be noted, as one of the features used by OncodriveMUT is whether the mutation under analysis falls within a cluster of somatic mutations previously identified using available sequenced tumor cohorts. On the other hand, mutations in cancer genes classified as drivers by OncodriveMUT exhibit larger cancer cell fraction than those classified as passengers (Suppl. Fig. 1D), as expected from events that undergo positive selection within the cancer cell clonal population. Of note, only protein-affecting mutations in cancer genes were considered in these tests, which highlights the ability of the OncodriveMUT method to point out those with more oncogenic potential.
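For readers unfamiliar with the cancer cell fraction, the following Python sketch shows one commonly used way of correcting the variant allele frequency by tumor purity and local copy number. It is an assumption-laden illustration, not necessarily the calculation used in ref. 5: it assumes a diploid normal genome and, by default, that the mutation sits on a single copy per mutated cancer cell (the hypothetical `multiplicity` parameter).

```python
def cancer_cell_fraction(vaf, purity, tumor_copy_number, multiplicity=1):
    """Estimate the fraction of cancer cells that carry a mutation.

    vaf: variant allele frequency observed in the sequenced sample.
    purity: estimated fraction of cancer cells in the sample, in (0, 1].
    tumor_copy_number: total copy number of the locus in cancer cells.
    multiplicity: assumed number of mutated copies per mutated cancer cell.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    # Average number of copies of the locus per cell in the tumor/normal mixture,
    # assuming normal cells carry two copies.
    mean_copies = purity * tumor_copy_number + 2 * (1 - purity)
    ccf = vaf * mean_copies / (purity * multiplicity)
    # Sampling noise can push the raw estimate above 1, so it is capped.
    return min(ccf, 1.0)


# A heterozygous clonal mutation and a subclonal one in a diploid region at 80% purity.
print(cancer_cell_fraction(0.4, 0.8, 2))
print(cancer_cell_fraction(0.2, 0.8, 2))
```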
Finally, we gathered results from several available experimental assays evaluating the effect of cancer mutations to assess the agreement of OncodriveMUT with the experiments in completely independent test sets. First, we used all possible missense mutations along the protein sequence of TP53 and their functional effect evaluated in yeast assays 28 . In detail, this study measured the transactivation of the TP53 mutants on several reporter genes. Only activities lower than 140% (activity of the mutant in relation to the wild-type) were included. Second, the effect of rare mutations (i.e. lowly recurrent across cancer patients) in several oncogenes was collected from three recent studies. We considered oncogenic (i) PIK3CA mutants exhibiting activity in all six experiments provided in ref. 29 , regardless of their strength; (ii) mutations in oncogenes leading to sustained tumor growth before 130 days in the in vivo experiments provided in ref. 30 ; and (iii) mutations in oncogenes validated as tumorigenic in the functional screens performed in ref. 31 . Of note, any mutation included in the positive or negative sets described in the first paragraph was filtered out from this step to avoid redundancy between the two evaluations. As a result, (a) TP53 mutants classified by OncodriveMUT as driver mutations exhibited larger impairment of the gene activity than those predicted as passengers (Suppl. Figure 1E); and (b) the OncodriveMUT classification of rare mutations in cancer genes reached 82% agreement with the experiments (Suppl. Figure 1F).
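The two summary metrics reported for the first benchmark, accuracy and the Matthews correlation coefficient, are computed from the counts of a binary confusion matrix. The Python sketch below uses their standard definitions; the counts in the usage example are invented for illustration and are not the benchmark results.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of variants whose predicted class agrees with the reference class."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to 1 (0 means chance level)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventional value when one row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented counts: drivers are the positive class, passengers the negative class.
tp, tn, fp, fn = 90, 80, 20, 10
print(round(accuracy(tp, tn, fp, fn), 3), round(matthews_corrcoef(tp, tn, fp, fn), 3))
```

Unlike accuracy, the Matthews correlation coefficient remains informative when the positive and negative sets differ in size, as is the case for the sets described above.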

V. Cancer Biomarkers Database
The Cancer Biomarkers Database is a manually curated resource collecting genomic biomarkers of drug response found in cancer patients or in pre-clinical assays. This database follows the organization proposed in the Gene Drug Knowledge Database (GDKD) 32 , which requires, among other fields, the evidence supporting each alteration-drug association. In detail, five distinct levels of supporting evidence are employed: (a) clinical guidelines, which include FDA-approved indications and recommendations from international organizations such as NCCN; (b) late clinical trials (i.e. phases III-IV); (c) early clinical trials (i.e. phases I-II); (d) clinical case reports; and (e) pre-clinical data. Genomic alterations in the database may be biomarkers of increased sensitivity, resistance or toxicity to anti-cancer therapies. Of note, negative evidence, i.e. alterations that do not affect the response to a given drug (e.g. the use of BRAF V600 inhibitors as single agents in colorectal cancers bearing that mutation), was also included in the database and labeled as 'nonresponsive'. Absence of an event (e.g. a wild-type allele) and multi-marker entries (e.g. PIK3CA oncogenic mutation + ERBB2 amplification for Everolimus + Trastuzumab + Chemotherapy treatment in breast cancer) are also contemplated. Each entry also includes the cancer type(s) in which the association has been demonstrated and the reference (e.g. PubMed identifier or conference abstract reference) of that study. The data are collected by a board of clinical oncologists and research experts organized by cancer type expertise, who are in charge of filling in the minimum required fields for each new entry following the data model.
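The five evidence levels lend themselves to an ordered representation. The Python sketch below encodes them as an ordered enumeration so that biomarker matches can be sorted from the strongest to the weakest supporting evidence. Treating the (a) to (e) listing as a ranking, as well as the entry layout and names used here, are assumptions made for illustration.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Evidence levels (a)-(e); a lower value is taken to mean stronger support."""
    CLINICAL_GUIDELINES = 1
    LATE_CLINICAL_TRIALS = 2   # phases III-IV
    EARLY_CLINICAL_TRIALS = 3  # phases I-II
    CASE_REPORTS = 4
    PRE_CLINICAL = 5


def sort_by_evidence(entries):
    """Sort (biomarker, drug, effect, evidence) tuples, strongest evidence first."""
    return sorted(entries, key=lambda entry: entry[3])


# Hypothetical matches for one tumor; biomarker and drug names are placeholders.
entries = [
    ("GENE_A mutation", "drug_1", "sensitivity", Evidence.PRE_CLINICAL),
    ("GENE_B amplification", "drug_2", "sensitivity", Evidence.CLINICAL_GUIDELINES),
    ("GENE_C mutation", "drug_3", "resistance", Evidence.EARLY_CLINICAL_TRIALS),
]
for biomarker, drug, effect, evidence in sort_by_evidence(entries):
    print(evidence.name, biomarker, drug, effect)
```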

VI. Cancer Bioactivities Database
The Cancer Bioactivities Database was built from ChEMBL v21 33 data on compound assays. Ensembl v70 gene symbols were mapped to UniProt IDs through the Ensembl BioMart API, and then mapped to ChEMBL target IDs through the mapping file provided with ChEMBL v21, downloaded from its FTP server. Only genes with a valid HGNC symbol were considered. Next, we retrieved all bioactivity data associated with the target-molecule interactions reported by all assays probing the interaction. We included assays with a confidence score higher than or equal to 4 when this information was available, and entries suggesting errors in the annotations (data validity comment field) were filtered out. We considered bioactivities concerning the affinity of binding, the effective concentration, the efficacy of inhibition and the efficacy of competitive antagonism (IC50, EC50, Ki, Kd and Kb), whose values were converted to pActivity as appropriate. Each target-compound bioactivity was finally obtained by averaging the values across the available assays meeting our inclusion criteria. The resulting values were then grouped into three categories: (i) highly potent, with a binding affinity higher than 1 nM (pActivity >= 9); (ii) potent, with a binding affinity between 1 μM and 1 nM (9 > pActivity >= 6); and (iii) weak, with a binding affinity between 1 mM and 1 μM (6 > pActivity >= 3). Additional information on chemical compounds was collected, including their market status (e.g. approved or pre-clinical) and their mechanism of action (MOA). If the MOA was not available, we considered the compound as an inhibitor of the target. We grouped all MOA categories into two groups depending on whether they have a positive effect on the target (e.g. agonist or opener labels) or a negative one (e.g. inhibitor or blocker labels).
The CGI in silico prescription includes a match column stating whether the MOA of the compound is coherent with the mechanism (known or predicted) by which the cancer gene drives tumorigenesis, i.e. tumor suppressors for positive MOAs and oncogenes for negative MOAs.
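The potency grouping and the MOA match described in this section can be sketched in Python as follows. The sketch assumes that pActivity is the negative base-10 logarithm of the molar value, that assay values are supplied in nM, and that the averaging is done on the pActivity scale; the function names and role labels are hypothetical.

```python
import math
from statistics import mean


def p_activity(value_nm):
    """Convert an activity value in nM (e.g. IC50 or Ki) to -log10 of the molar value."""
    return 9.0 - math.log10(value_nm)


def potency_category(assay_values_nm):
    """Average pActivity across assays and assign one of the three categories."""
    p = mean(p_activity(v) for v in assay_values_nm)
    if p >= 9:
        return "highly potent"
    if p >= 6:
        return "potent"
    if p >= 3:
        return "weak"
    return None  # weaker than 1 mM: outside the three categories


def moa_match(gene_role, compound_effect):
    """True when a negative MOA targets an oncogene or a positive MOA a tumor suppressor."""
    expected = {"oncogene": "negative", "tumor suppressor": "positive"}
    return expected.get(gene_role) == compound_effect


print(potency_category([0.5]), potency_category([10.0, 1000.0]), potency_category([1e5]))
print(moa_match("oncogene", "negative"), moa_match("tumor suppressor", "negative"))
```

Genes with an 'ambiguous' mode of action have no expected MOA in this sketch, so `moa_match` returns False for them.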

VII. Use of the CGI in pan-cancer sequenced cohorts
We exemplify the ideas on the interpretation of cancer genomes described in this commentary through their application to a pan-cancer cohort of 6,792 exome-sequenced tumors 5 [...] Fig. 3). Overall, the CGI analysis found that 40% of the protein-affecting mutations (PAMs) observed in cancer genes are estimated to be passengers, with wide variation between genes. Of note, we found that the proportion of driver mutations in a tumor sample decreases as the total [...] Finally, the sample-centric analysis supported by the CGI empowers the identification of alterations that are uncommon in particular tumor types but are nevertheless considered actionable in other cancers in which that alteration is observed more frequently. These events may provide potential re-purposing opportunities whose outcome is currently not known (and thus not included as positive or negative evidence in the Cancer Biomarkers database). Among these events, some of the most frequently observed include the possibility of targeting loss-of-function alterations of DNA damage genes and the use of rapalogs for tumors with TSC1/2 loss-of-function. Another compelling example is PTCH1, a member of the patched gene family involved in the response to hedgehog inhibitors, which are currently approved for clinical use in basal cell carcinoma and in medulloblastoma 35,36 . PTCH1 is not routinely contemplated among the genes of potential interest in other tumor types since it is rarely mutated. However, 82 samples across 19 other tumor types of the analysed pan-cancer cohort harbored mutations estimated as drivers in this gene. Moreover, most of these tumors did not exhibit any other actionable alteration supported by strong clinical evidence. This observation may point to these PTCH1-mutated tumors as suitable candidates to be included in a potential basket trial.
Next, we compared these results with the therapeutic opportunities identified for 17,462 cancer patients profiled by the GENIE project. In comparison with the 6,792 exome-sequenced patients, the GENIE cohort is enriched for biomarkers employed by molecular oncology boards, since (a) the tumors were profiled by targeted panels designed to support the clinical programs at the participating medical centers; and (b) the project included a higher proportion of recurrent/relapsing patients and/or later stage cancers. The CGI identified 8% and 6% of tumors with biomarkers of drug response supported by clinical guidelines and late clinical trials, respectively. Biomarkers of drug response supported by data obtained in early clinical trials, case reports and pre-clinical studies were found in 49.7%, 18.7%, and 60% of patients, respectively. Overall, the CGI identified at least one biomarker of drug sensitivity supported by evidence spanning from clinical guidelines to pre-clinical data in 72% of GENIE patients, a percentage that varies across cancer types. Of note, these tumors also exhibited a considerable number of biomarkers of drug resistance, as expected from a cohort with a larger share of recurrent/relapsing patients and in contrast to the 6,792 pan-cancer cohort, which is mostly composed of tumors profiled at diagnosis. Among the most recurrent events, the CGI identified EGFR T790M mutations in lung tumors (providing resistance to several EGFR inhibitors), BRAF V600E mutations in colorectal tumors (resistance to Cetuximab) and ESR1 oncogenic mutations in breast tumors (resistance to aromatase inhibitors). In addition, the CGI also identified several putative loss-of-function mutations in the JAK1, JAK2 and B2M genes, which have recently been reported to confer resistance to PDL1/PD1 axis inhibitors. These mutations were found in tumors with high mutation burden and/or presenting co-occurring putative biomarkers of response to these immunotherapies (e.g. NF1- and PTEN-mutant melanomas).
In summary, the CGI provides a systematic and rapid interpretation of the genomic alterations profiled for large tumor cohorts. These analyses provide a comprehensive catalog of cancer driver variants and the in silico prescription refines the landscape of genomic-guided therapeutic opportunities as it stands today in newly diagnosed and advanced cancers.

VIII. The CGI in the support of clinical decision-making
The CGI has been used to support the clinical decision-making process in two clinical oncology centers that are early adopters of the resource. Vall d'Hebron Institute of Oncology is a reference medical cancer institution that routinely applies a targeted next-generation sequencing panel of -at the moment of writing [...] re-purposing SHH inhibitors, whose outcome was subsequently tested in a mouse model (data not shown). In summary, the CGI is a useful tool to support the decision-making of molecular tumor boards, such as those aimed at allocating patients to the most appropriate clinical trial or at comprehensively exploring off-label opportunities for genome-guided therapies in patients unresponsive to standard-of-care treatment.
Informed consent was obtained from all subjects participating in these projects, which were approved by the [...]

Of note, not all the variants (e.g. indels) can be analyzed by these methods (the percentage of the variants that could not be analysed by each method is detailed in the panel legend as appropriate). CanDrA results were retrieved via the Version (Plus) pipeline. Of note, all the variants classified as passengers or drivers by the method were considered regardless of their significance value (calculated as the fraction of mutations that have more extreme scores in the same class in the training data), since the use of any threshold on this value drastically reduced the number of variants that could be evaluated (e.g. 5.3% of the variants are classified with a CanDrA significance value lower than 0.05). CADD scores were retrieved via their pipeline v1.3; FATHMM (v2.3), CHASM (v4.0) and Mutation Assessor (release 3) results were retrieved by using the corresponding web sites (http://fathmm.biocompute.org.uk/cancer.html , http://www.cravat.us/, http://mutationassessor.org/r3/). Of note, we used a general configuration for those methods in which the cancer type can be stated as a parameter of the analysis. This is due to the fact that the cancer type is not annotated for all the variants (especially in the negative data set); and even if this information is available, some methods do not take all the cancer types into consideration for their classification.

Suppl. Figure 3
The catalog of driver mutations retrieved by the CGI analysis of a pan-cancer cohort of 6,792 tumors is available as a resource at http://www.intogen.org
(a) The results can be browsed at the level of tumor type. In the example, the most frequently mutated genes of breast adenocarcinoma are shown.
(b) The results can be browsed at the level of gene variant, including whether it is a validated oncogenic event (based on the Catalog of Validated Oncogenic Mutations) or, otherwise, whether it is classified as a putative driver versus passenger event (based on the OncodriveMUT analysis). In the example, the results for the set of PIK3CA mutations observed in breast adenocarcinomas are shown.
(c) The distribution of variants across protein domains can be seen in an interactive graphic. In the example, mutations observed in breast adenocarcinoma tumors across the PIK3CA protein are shown.

Supplementary legend for Figure 3
Cancer acronyms of the tumors gathered by the GENIE project. The cancer acronyms used in the main Figure 3 are detailed above. Note that tumors were grouped according to the most specific subtype available in the patient information.