The CGI employs existing or newly developed resources and computational methods to annotate and analyze the alterations in a tumor according to distinct levels of evidence (Fig. 1a; details in Additional file 2: Note I). The tool is freely available through an API or a web interface at http://www.cancergenomeinterpreter.org, under an open license, with the aim of facilitating its use by cancer researchers and clinical oncologists (Fig. 1b–d). In the following sections we present the blueprint for the interpretation of cancer genomes implemented by the CGI, describe the resource, and discuss its utility.
A comprehensive catalog of cancer genes across tumor types
One of the main aims of the interpretation of cancer genomes is to identify the alterations responsible for tumorigenic traits. In the CGI, this process begins with a focus on alterations that affect the genes capable of driving the cancer hallmarks of a particular tumor type. Therefore, we compiled a catalog of genes involved in the onset and progression of different types of cancer, obtained via different methods and from different sources (Additional file 2: Note II). First, from manually curated resources [2, 7, 8, 12, 13] and the literature we collected genes that have been experimentally or clinically verified to drive tumorigenesis. Second, we incorporated the results of bioinformatics analyses of large tumor cohorts re-sequenced by international initiatives such as The Cancer Genome Atlas (http://cancergenome.nih.gov/abouttcga) and the International Cancer Genome Consortium [14] (specifically, we identified genes whose somatic alterations exhibit signals of positive selection across 6792 tumors representing 28 types of cancer [4]). Each of these cancer genes was annotated with its mode of action in tumorigenesis (i.e., whether it functions as an oncogene or a tumor suppressor), on the basis of either experimentally verified sources or in silico prediction [15]. The resulting Catalog of Cancer Genes currently comprises 837 genes with evidence of a tumorigenic role in 193 different cancer types (Fig. 2a). Each entry in the catalog thus includes, along with the name of the driver gene, (i) the malignancies it drives, organized according to available evidence; (ii) the types of alterations involved (mutations, copy number alterations, and/or gene translocations); (iii) the source(s) of this information; (iv) the context (germline or somatic) in which these alterations are tumorigenic; and (v) the gene’s mode of action in cancer as appropriate. The catalog is available for download through the CGI website (https://www.cancergenomeinterpreter.org/genes).
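To make the structure of a catalog entry concrete, the five annotated fields listed above can be sketched as a small record type. This is only an illustration: the class and field names, the evidence labels, and the placeholder entry (GENE_X, "cancer type A", "source 1") are assumptions introduced here and do not reflect the actual format of the downloadable catalog.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Set


class ModeOfAction(Enum):
    ONCOGENE = "oncogene"
    TUMOR_SUPPRESSOR = "tumor suppressor"
    UNKNOWN = "unknown"


class Context(Enum):
    GERMLINE = "germline"
    SOMATIC = "somatic"


@dataclass
class CancerGeneEntry:
    """Hypothetical record mirroring fields (i)-(v) of a catalog entry."""
    gene: str
    # (i) malignancies driven, grouped by an evidence label (labels assumed)
    malignancies: Dict[str, List[str]]
    # (ii) alteration types: mutation, copy number alteration, translocation
    alteration_types: Set[str]
    # (iii) sources supporting the entry
    sources: List[str]
    # (iv) context in which the alterations are tumorigenic
    contexts: Set[Context]
    # (v) mode of action, when known
    mode_of_action: ModeOfAction = ModeOfAction.UNKNOWN


def drives(entry: CancerGeneEntry, cancer_type: str) -> bool:
    """Return True if the entry lists the cancer type under any evidence label."""
    return any(cancer_type in types for types in entry.malignancies.values())


# Placeholder entry with made-up values, for illustration only.
example = CancerGeneEntry(
    gene="GENE_X",
    malignancies={"validated": ["cancer type A"], "predicted": ["cancer type B"]},
    alteration_types={"mutation"},
    sources=["source 1"],
    contexts={Context.SOMATIC},
    mode_of_action=ModeOfAction.ONCOGENE,
)
```

With such a record, restricting the analysis of a tumor to the genes relevant for its cancer type amounts to filtering entries with `drives(entry, cancer_type)`.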
Most mutations affecting cancer genes are of uncertain significance
The focus on cancer genes described above is necessary but not sufficient to identify the tumorigenic variants in a tumor, since not all variants observed in a cancer gene are necessarily capable of driving tumorigenesis. Therefore, the CGI next focuses on annotating and analyzing protein-affecting mutations (PAMs) that occur in genes of the Catalog of Cancer Genes. First, validated tumorigenic mutations may confidently be labeled as drivers when detected in a newly sequenced tumor. We compiled an inventory that currently contains 5314 such validated mutations, including cancer-predisposing variants, from dedicated resources [7, 8, 9, 12, 13, 16] and the literature (Fig. 2b; Additional file 2: Note III). This Catalog of Validated Oncogenic Mutations is available for download through the CGI website (https://www.cancergenomeinterpreter.org/mutations). Across a pan-cancer cohort of 6792 tumors sequenced at the whole-exome level (mostly at diagnosis) [4] we observed that only 5360 (916 unique variants) of the 44,601 PAMs found in cancer genes appear in this catalog. In other words, 88% of all PAMs that affect cancer genes in this cohort are currently of uncertain significance for tumorigenesis, a proportion that varies widely per gene and tumor type (Fig. 2c; Additional file 2: Note VII). It is therefore crucial to assess the tumorigenic potential of these variants, especially when they affect genes that are, or may be, therapeutic targets. We reasoned that several features of each specific mutation, as well as of the gene it affects, could help address this question. Moreover, we propose that some of these features of interest can be extracted from the analyses of large sequenced cohorts of healthy and tumor tissue [4, 17].
Examples of relevant attributes include the following: (i) the mode of action of the gene in the cancer type (oncogene or tumor suppressor); (ii) the consequence type of the mutation (e.g., synonymous, missense, or truncating); (iii) its position within the transcript; (iv) whether it falls in a mutational hotspot or cluster; (v) its predicted functional impact; (vi) its frequency within the human population; and (vii) whether it occurs in a domain of the protein that is depleted of germline variants. The CGI assesses the tumorigenic potential of the variants of unknown significance via OncodriveMUT, a newly developed rule-based approach that combines the values of these features (Fig. 2d; Additional file 2: Note IVa). We assessed the performance of OncodriveMUT in the task of classifying driver and passenger mutations, using the Catalog of Validated Oncogenic Mutations (n = 5314) and a collected set of likely neutral (i.e., non-tumorigenic) PAMs affecting cancer genes (n = 1676). We found that OncodriveMUT separated the variants of these two data sets with 86% accuracy (Matthews correlation coefficient, 0.64), outperforming other methods employed for this goal (Additional file 2: Note IVb). In addition, for several features, the variants classified as drivers by OncodriveMUT followed the trend expected for oncogenic mutations (e.g., they exhibited larger clonal fractions among all mutations in cancer genes), and OncodriveMUT’s predictions on a set of recently probed uncommon cancer mutations exhibited a high concordance with experimental evidence [18–21] (Additional file 2: Note IVb). Of note, the attributes employed by OncodriveMUT to classify each variant are detailed in the CGI output, which facilitates the user’s review of the results.
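The general idea of a rule-based combination of the seven attributes listed above can be illustrated with a toy classifier. The sketch below is not OncodriveMUT: the actual rules are described in Additional file 2: Note IVa, and every rule, threshold, field name, and score scale used here is an assumption chosen only to show how such features might be combined and how the triggering rule could be reported alongside the label.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Mutation:
    gene_role: str                     # "oncogene", "tumor_suppressor" or "unknown"
    consequence: str                   # "synonymous", "missense" or "truncating"
    near_transcript_end: bool          # position within the transcript (assumed flag)
    in_hotspot_or_cluster: bool
    functional_impact: float           # assumed score in [0, 1]
    population_frequency: float        # allele frequency in the human population
    in_germline_depleted_domain: bool


def classify(m: Mutation,
             max_population_frequency: float = 0.01,   # illustrative threshold
             min_impact: float = 0.9) -> Tuple[str, str]:
    """Return (label, reason); rules are evaluated in order, first match wins."""
    if m.population_frequency > max_population_frequency:
        return "predicted passenger", "common in the human population"
    if m.consequence == "synonymous":
        return "predicted passenger", "does not change the protein"
    if (m.consequence == "truncating" and m.gene_role == "tumor_suppressor"
            and not m.near_transcript_end):
        return "predicted driver", "truncation of a tumor suppressor"
    if m.consequence == "missense" and m.in_hotspot_or_cluster:
        return "predicted driver", "missense in a hotspot or cluster"
    if (m.consequence == "missense" and m.functional_impact >= min_impact
            and m.in_germline_depleted_domain):
        return "predicted driver", "high impact in a germline-depleted domain"
    return "predicted passenger", "no driver rule matched"
```

Returning the reason together with the label mirrors the design choice noted above, whereby the attributes behind each classification are exposed so that the user can review them.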
In summary, the CGI annotates the mutations affecting cancer genes with features relevant to their potential role in cancer, identifying validated oncogenic events and predicting the most likely drivers among the variants of unknown significance.
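The quantities quoted in this section can be reproduced with a few lines of arithmetic. The sketch below recomputes the share of PAMs of uncertain significance from the counts given above (5360 of 44,601 PAMs found in the catalog) and defines accuracy and the Matthews correlation coefficient in their standard form. The function names are ours, and because the confusion matrix of the OncodriveMUT benchmark is not reported in this section, no attempt is made to reproduce the 86% and 0.64 figures.

```python
import math


def uncertain_fraction(validated: int, total: int) -> float:
    """Fraction of PAMs in cancer genes absent from the validated catalog."""
    return 1.0 - validated / total


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of correctly classified variants."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Counts taken from the text: 5360 of 44,601 PAMs appear in the catalog.
share = uncertain_fraction(5360, 44601)
print(f"{share:.1%} of PAMs are of uncertain significance")
```

The Matthews correlation coefficient is a useful complement to accuracy here because the two benchmark sets differ in size (5314 validated versus 1676 likely neutral mutations), and accuracy alone can look favorable on imbalanced data.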
A database of genomic determinants of anti-cancer drug response
The second major aim of the effort to interpret cancer genomes is to identify which tumor alterations may shape the response to anti-cancer therapies. Knowledge of the influence of genomic alterations on drug response is continuously generated and reported through publications, clinical trials, and conference communications. Nevertheless, it is challenging to collect and curate the relevant information into an easy-to-use resource that supports comparison with newly sequenced tumors and organizes the results according to the needs of different users. The CGI employs two resources to explore the associations between gene alterations and drug response. The first is the Cancer Biomarkers database, an extension of a previous collection of genomic biomarkers of anti-cancer drug response [12], which currently contains information on 1624 genomic biomarkers of response (sensitivity, resistance, or toxicity) to 310 drugs across 130 types of cancer. Negative results of clinical trials, e.g., the unsuccessful use of BRAF V600 inhibitors as a single therapeutic agent in colorectal cancers bearing that mutation, are also included in the database. Importantly, these biomarkers are organized according to the level of clinical evidence supporting each one, ranging from pre-clinical data, case reports, and clinical trials in early (I/II) and late phases (III/IV) to standard-of-care guidelines. The database is continuously updated by a board of medical oncologists and cancer genomics experts (Fig. 3a; Additional file 2: Note V). As explained in the “Introduction”, the Cancer Biomarkers database is only one of the resources currently annotating the biomarkers of tumor response to drugs (Additional file 1: Table S1). The leading institutions developing these knowledgebases were recently integrated into the Variant Interpretation for Cancer Consortium (http://cancervariants.org/) under the umbrella of the Global Alliance for Genomics & Health [22].
Besides the aggregation of the data collected by each individual resource, the aim of this project will be to establish community standards to represent and share this information.
The second resource is the Cancer Bioactivities database, which currently contains information on 20,243 chemical compound–protein product interactions that may support novel research applications. We built this database by compiling a catalog of available results from bioactivity assays of small molecules interacting with cancer proteins. This information was obtained by querying several external databases (Additional file 2: Note VI). The CGI matches the alterations observed in newly sequenced tumors to the biomarkers or target genes in these two databases. This process supports the identification of biomarkers at different levels of gene resolution, ranging from variants affecting a gene region to specific amino acid changes. Of note, the CGI also reports co-occurring alterations that affect the response to a given treatment as appropriate. This includes the co-existence of biomarkers of resistance and sensitivity to the same drug, and biomarkers of drug sensitivity that depend upon simultaneous genomic events. In summary, these two databases constitute comprehensive repositories of genome-guided therapeutic actionability in cancer according to current supporting evidence. Both resources are available for download through the CGI website (https://www.cancergenomeinterpreter.org/biomarkers, https://www.cancergenomeinterpreter.org/bioactivities).
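The matching step described above can be sketched as follows, under stated assumptions. A biomarker is modeled at one of three resolutions (any alteration of a gene, a residue range, or a specific amino acid change), hits are ordered by an assumed ranking of evidence levels, and drugs with co-existing sensitivity and resistance biomarkers are flagged. The record layout, the evidence labels, and all names in the code are hypothetical and are not taken from the CGI implementation or database schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# Assumed ordering of evidence levels, weakest to strongest.
EVIDENCE_ORDER = ["pre-clinical", "case report", "early trials",
                  "late trials", "guidelines"]


@dataclass(frozen=True)
class Alteration:
    gene: str
    protein_change: str   # e.g. "V600E"
    position: int         # residue number


@dataclass(frozen=True)
class Biomarker:
    gene: str
    drug: str
    effect: str           # "sensitivity", "resistance" or "toxicity"
    evidence: str         # one of EVIDENCE_ORDER
    cancer_type: str
    protein_change: Optional[str] = None             # specific amino acid change
    residue_range: Optional[Tuple[int, int]] = None  # gene region, inclusive


def matches(b: Biomarker, a: Alteration) -> bool:
    """Match at the finest resolution the biomarker specifies."""
    if b.gene != a.gene:
        return False
    if b.protein_change is not None:
        return b.protein_change == a.protein_change
    if b.residue_range is not None:
        low, high = b.residue_range
        return low <= a.position <= high
    return True  # gene-level biomarker: any alteration in the gene qualifies


def match_tumor(alterations: List[Alteration], biomarkers: List[Biomarker],
                cancer_type: str) -> List[Tuple[Alteration, Biomarker]]:
    """Return (alteration, biomarker) hits, strongest evidence first."""
    hits = [(a, b) for a in alterations for b in biomarkers
            if b.cancer_type == cancer_type and matches(b, a)]
    hits.sort(key=lambda hit: EVIDENCE_ORDER.index(hit[1].evidence), reverse=True)
    return hits


def conflicting_drugs(hits: List[Tuple[Alteration, Biomarker]]) -> List[str]:
    """Drugs for which the tumor carries both sensitivity and resistance biomarkers."""
    effects_by_drug: Dict[str, Set[str]] = {}
    for _, b in hits:
        effects_by_drug.setdefault(b.drug, set()).add(b.effect)
    return sorted(drug for drug, effects in effects_by_drug.items()
                  if {"sensitivity", "resistance"} <= effects)
```

This sketch matches only within the same cancer type and does not model biomarkers that require several simultaneous genomic events; both would need additional fields in a fuller treatment.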