Mutation signatures of carcinogen exposure: genome-wide detection and new opportunities for cancer prevention

Exposure to environmental mutagens is an important cause of human cancer, and measures to reduce mutagenic and carcinogenic exposures have been highly successful at controlling cancer. Until recently, it has been possible to connect the chemical characteristics of mutagens to actual mutations observed in human tumors only indirectly. Now, next-generation sequencing technology enables us to observe in detail the DNA-sequence-level effects of well-known mutagens, such as ultraviolet radiation and tobacco smoke, as well as endogenous mutagenic processes, such as those involving activated DNA cytidine deaminases (APOBECs). We can also observe the effects of less well-known but potent mutagens, including those recently found to be present in some herbal remedies. Crucially, we can now tease apart the superimposed effects of several mutational exposures and processes and determine which ones occurred during the development of individual tumors. Here, we review advances in detecting these mutation signatures and discuss the implications for surveillance and prevention of cancer. The number of sequenced tumors from diverse cancer types and multiple geographic regions is growing explosively, and the genomes of these tumors will bear the signatures of even more diverse mutagenic exposures. Thus, we envision development of wide-ranging compendia of mutation signatures from tumors and a concerted effort to experimentally elucidate the signatures of a large number of mutagens. This information will be used to link signatures observed in tumors to the exposures responsible for them, which will offer unprecedented opportunities for prevention.

Mutagenic environmental exposures are important causes of human cancer. This was first understood from Percival Pott's 18th century epidemiological observation of scrotal cancer in chimney sweeps [1]. Causality was eventually confirmed experimentally by using coal tar to induce cancer in rabbits [2]. Soon thereafter, polycyclic aromatic hydrocarbons were identified as carcinogens in coal tar [3]. Much later, once the role of DNA as an information molecule was understood, the biochemical mechanisms for polycyclic aromatic hydrocarbon mutagenesis were elucidated [4]. This led to a broader appreciation of the roles of DNA damaging agents in mutagenesis and to extensive study of numerous other mutagens [5,6]. Subsequently, assays for mutagenicity became proxies for tests of carcinogenicity, with the Ames test, performed in a bacterial system, as a well-known example [7]. However, tests of mutagenicity in artificial systems do not fully connect mutagenic exposures to the patterns of mutation observed in cancers.
More recently, it has become clear that specific mutagens produce characteristic patterns of somatic mutations in the DNA of malignant cells. We describe these patterns, called 'mutation signatures', in detail below. Briefly, mutation signatures usually include the relative frequencies of the various nucleotide mutations (such as A > C, A > G, A > T, C > A) plus, ideally, their trinucleotide contexts, that is, the identities of the bases on both sides of the mutated nucleotides. Previously, our knowledge of these signatures was based on short lengths (such as a few kilobases) of DNA sequence. With the advent of next-generation sequencing, it is now possible to infer these signatures from the sequences of all the exons in the genome ('whole exome') or from the sequence of the entire genome ('whole genome').
Characterization of mutagenicity based directly on observed mutations across whole exomes or genomes offers several advantages over previous approaches, including that many more mutations can be detected, which provides far greater statistical power and allows the parsing of the superimposed mutation signatures stemming from several exposures. Actual mutation signatures are the end result of a series of biochemical and biological processes, including the metabolism of pro-mutagens to active forms, biochemical damage to DNA, the efforts of the cell to repair the damage, and, rarely, selection for or against the resulting mutations. Thus, while not obviating the need for mechanistic studies of the biochemical mechanisms of mutagenicity, cataloging mutations by next-generation sequencing provides information about a critical endpoint: the actual mutations that occur in cell lines or in human cancers in response to mutagenic exposures.
The long-term promise is that the epidemiological connection of specific mutagens to signatures actually observed in tumors will indicate which mutagenic exposures are true substantial contributors to the burden of human cancer (Figure 1). Signatures Figure 1 Linking mutation signatures to exposures or endogenous mutational processes. One can either (a) focus on signatures of one mutagen at a time or (b) study mixtures of signatures. One can study signatures of single mutagens either (c) via experimental approaches, or (d) via observation of mutation signatures in the exome or genome sequence of tumors with a known predominant mutational exposure. Some tumor exomes harbor only a handful of somatic point mutations, and presumably these tumors arise from causes other than mutagenesis. For many cancers, typical numbers of somatic point mutations in exomes are 60 to 300 [9][10][11][12]. Highly mutated cancers sometimes have >3,500 mutations per exome [13]. Typical numbers for genomes of cancers such as those of the lung or stomach are >15,000 [14,15], and a few highly mutated genomes harbor >400,000 somatic point mutations [16]. Among experimental approaches, one can use (e) reporter constructs and observe mutations in short sequences. This allows inference of relatively simple signatures, for example (g) signatures involving only single nucleotide mutations. (f) By sequencing the exome, or, ideally, the genome of a mutagen-exposed, clonal cell line, one can (h) infer a more informative, extended signature, for example one that includes the trinucleotide contexts of single-nucleotide mutations. One can also infer extended signatures by sequencing the exomes or genomes of tumors with known, predominant exposures (d). Extraction of mutation signatures from mixtures of signatures (b) requires somatic mutation catalogs from the exomes or genomes of large numbers of tumors. The most recent studies have looked at thousands of catalogs. (i) Procedures based on NMF (non-negative matrix factorization) allow (j) simultaneous inference of a set of extended mutation signatures and the contributions of each inferred signature to each tumor's mutations. (k) Extended signatures derived from mixtures of signatures (j) can be matched to extended signatures that were experimentally determined or inferred from known predominant exposures (h), thereby providing information on (l) exposures that contributed to tumors with mixtures of mutation signatures. (m) Alternatively, extended signatures extracted from mixtures can be correlated with information on mutagenic exposures or on endogenous mutagenic factors, allowing inference of the causes of mutation signatures (l). (n) The causes of some signatures will remain unknown and require further research.

(a)
The observation of known signatures in tumors might also implicate previously unsuspected exposures in particular cancers. This information on causal exposures then could provide foci for prevention efforts. Nevertheless, despite the low cost and ubiquity of next-generation sequencing, detailed mutation signatures of only a few known carcinogens have been elucidated experimentally so far. Indeed, in a recent groundbreaking survey of mutation signatures across many types of human cancers, most mutation signatures are ascribed to particular exposures by statistical association rather than recapitulation of the signatures in experimental systems [8].
We describe below the state of the art for determining mutation signatures by next-generation sequencing, the implications of this approach for detecting the carcinogenic impacts of mutagenic exposures, and its promise for prevention. We start by describing signatures of single mutagens. We then describe approaches for teasing apart superimposed signatures from multiple mutagenic processes, and conclude with a vision of how this could improve prevention.

Signatures of single mutagens
To date, the signatures of carcinogenic mutagens have been established either in vitro or in human cancers that are primarily caused by one exposure (Table 1). We elaborate first on the mutation signature of aristolochic acid (AA), which has been established both in vitro and in human cancers [16][17][18][19]. AA is a powerful mutagen that is found in some herbal remedies and that causes upper urinary-tract urothelial cancer (UTUC) [16][17][18][19]. It also probably contributes to liver cancer (hepatocellular carcinoma, HCC) [16]. Thus, in addition to providing an example of the signature of a single mutagen, AA also illustrates the use of signatures to detect likely carcinogenic exposures that were previously unsuspected.  [16][17][18][19] Widely used in traditional medicines [20]; exposure to AA is widespread in Taiwan [17] No unusual challenges UV radiation C > T;strand bias; CC > TT TC > TT (C|T)C > (C|T)T [6,8,9,[21][22][23] Prevalence of signature: 87% of melanoma [8] No unusual challenges

Mutation signature of aristolochic acid
AA is a natural compound found in plants in the genus Aristolochia ( Figure 2a). These plants are used in traditional herbal remedies for weight loss and a plethora of health problems, including menstrual symptoms, snakebites, rheumatism, arthritis, and gout [43,44]. Although challenging to document, the use of these plants probably remains widespread [20,45]. Indeed, 99 species in the genus are known to be used medicinally, and although the AA content of most species is unknown, 23 of the medicinally used species contain AA [46]. AA is metabolized to aristolactam nitrenium ions, which form covalent adducts with adenosines in DNA (Figure 2b-e) [45,47]. These adducts then lead to A > T mutations (mutations from adenine to thymine; Figure 2f ). These were initially observed as somatic mutations in the   tumor suppressor gene TP53 in UTUCs in the UK, Taiwan, and the Balkans [17,47]. This was highly unusual, as A > T mutations are rare in other types of human cancer, including UTUCs unrelated to AA exposure [18]. Furthermore, the AA-associated mutations in TP53 tended to occur in the context of CAG > CTG (C followed by the mutated A followed by G, in 5' to 3' order) [19]. However, analysis of approximately 1 kb of sequence in a single gene (TP53) [19] offers limited statistical power to determine the sequence contexts in which the A > T mutations occur. In addition, the approach of assessing physical mutation signatures in TP53, a key tumor suppressor gene, runs the risk of bias caused by conflation of physical mutation signatures with the effects of intense selection during tumor evolution.
Recently, high-throughput next-generation sequencing has provided the means to catalog and analyze somatic mutations far more completely, whether by wholeexome or whole-genome sequencing. Recent work has shown a remarkable preponderance of A > T mutations in AA-associated UTUCs from Taiwan ( Figure 3a) [16][17][18][19]. For comparison, in gastric cancer or other non-AA-associated cancers, A > T somatic mutations are rare ( Figure 3b) [8,14,24].
By way of technical explanation, if we consider a single DNA strand as a point of reference, there are 12 possible single-nucleotide mutations: four nucleotides times three possible mutations for each nucleotide. In some parts of the genome, it makes sense to use a particular strand as the reference sequence. In particular, in regions of the genome that are transcribed, we can use the transcribed strand, that is, the strand that serves as a template for the RNA polymerase, as the point of reference. However, in the non-transcribed regions, neither strand in particular is the obvious choice for the reference sequence. Therefore, the usual practice in the study of mutation signatures has been to not distinguish complementary mutations, but rather to group them together. For example, A > C mutations are grouped with the complementary T > G mutations, A > G mutations are grouped with T > C mutations, and so on.
With the availability of catalogs of somatic mutations from sequencing data, it has become possible to investigate the nucleotides that neighbor AA-induced A > T mutations. The trinucleotide sequence contexts of AA-associated mutations show a dramatic overrepresentation of cytosines and thymines immediately 5′ of mutated adenines (that is, [C|T]A; mutated adenine in bold) and overrepresentation of guanines 3′ of mutated adenines (that is, AG) ( Figure 3c) [16][17][18][19]. This preference of A > T mutations for the (C|T)AG context has not been observed in non-AA-associated cancers (such as gastric cancer; Figure 3d), suggesting that this sequence context is a particular characteristic of AA mutagenesis.
In addition, the A > T mutations in AA-associated UTUCs are less common on the transcribed strands of genes than on the non-transcribed strands (Figure 3e). This strand bias suggests that AA adducts occurring on the transcribed strand were often corrected by transcription-coupled nucleotide excision repair. Similar strand bias is not seen for the relatively infrequent A > T mutations seen in other cancers, such as gastric cancer (Figure 3f ).

An AA-like signature in liver cancer
Unexpectedly, recent examination of mutation signatures in hepatitis B virus-exposed human HCCs revealed some with obvious AA-like signatures (Figure 3g,h) [16], although this cancer type apparently was not previously linked to AA exposure [48]. The signature shows a large proportion of A > T somatic mutations with strand bias (as seen in AA-exposed UTUCs; Figure 3e,g) and a trinucleotide context that strongly resembles that in AAassociated UTUC (compare Figure 3c and Figure 3h). It is possible that exposure to AA in conjunction with hepatitis B virus infection may contribute synergistically to HCC formation, much as hepatitis and aflatoxin do (see below). As AA had not been previously implicated as a risk factor for HCC, this finding may represent a new paradigm, in which environmental exposures contributing to specific cancers are deduced from observations of mutation signatures. It is likely that Aristolochia-containing herbal remedies are the source of AA exposure in these cancers. If so, appropriate measures to minimize exposure should be taken -for example, through education and more aggressive enforcement of bans on Aristolochiacontaining remedies.

Ultraviolet radiation
Ultraviolet (UV) radiation induces several kinds of mutations, primarily C > T (Figure 2g-k, Table 1) [6,9]. It also induces double mutations CC > TT, in which adjacent cytosines mutate to thymines as a result of cytosine dimers generated by UV light. Earlier studies indicated that UV-induced C > T mutations often occur after a pyrimidine (C or T) [9,21,22]. Analysis of mutation catalogs from melanomas indicates that the trinucleotide context is often TCC [8]. As with AA-induced A > T mutations, there is strand bias: UV-induced mutations are less likely to occur on the transcribed strand [8].

Tobacco smoke
Tobacco smoking causes the vast majority of lung cancers and contributes strongly to many other cancers, including liver, colorectal, breast, prostate, and bladder cancers [49]. Tobacco smoke contains many mutagenic carcinogens, including polycyclic aromatic hydrocarbons and N-nitrosamines [25,50,51]  of tobacco smoke was studied primarily in the context of the TP53 gene, in which exposure to tobacco-smoke mutagens often results in G > T mutations [25]. Only a few studies extended the mutation signature to a trinucleotide context, and the preference for particular nucleotides 5' or 3' of the mutated nucleotides is weak (Table 1) [8,24], possibly reflecting the complex mix of mutagens present in tobacco smoke. There are challenges in dissecting the tobacco-smoke mutation signature, because the signatures from different constituent mutagens are likely to differ, and their effects on different organs and tissues are also likely to differ [51]. Thus, it would be highly informative to examine experimentally the signatures of individual mutagenic components of tobacco smoke in the genomes of exposed cell lines from different tissues (Figure 1f ).

Aflatoxin B1
Aflatoxins are byproducts of mold growing on food [52], and among the aflatoxins, aflatoxin B1 (AFB1) is thought to be the most carcinogenic and is the most studied [53]. The International Agency for Research on Cancer (IARC) classifies AFB1 as a Group I carcinogen (an agent that is definitely carcinogenic to humans) [54]. AFB1 is metabolized to an epoxide compound that can form a covalent bond with the N7 atom of guanine, thereby leading to G > T mutations (Table 1) [35]. In addition, AFB1 can induce 8-hydroxy-2'-deoxyguanosine, which also produces predominantly G > T mutations in in vitro experimental models [52]. The mutation signature of AFB1 has been primarily studied in the TP53 gene, and indeed particular somatic mutations in TP53 are used as biomarkers for aflatoxin exposure in tumors [55,56]. However, the extended mutation signature of AFB1 has not been studied (Table 1). Exposure to AFB1 is through food, but unfortunately, its contamination in food is difficult to detect. Consequently, convincing evidence that AFB1 is carcinogenic relied on studies showing that people with AFB1-derived adducts were more likely to develop cancer [29,35]. The predominant cancer associated with AFB1 is HCC, and the risk associated with combined AFB1 exposure and hepatitis infection is far greater than each individual risk [29,35].

Temozolomide
Temozolomide is an alkylating agent commonly used for chemotherapeutic treatment of melanoma and central nervous system tumors [57,58]. Temozolomide is quickly absorbed and undergoes spontaneous breakdown to form an active compound (methyltriazen-1-yl imidazole-4-carboxamide), which forms several DNA adducts: N7-methylguanine (70%), N3-methyladenine (9%), and O6-methylguanine (5%) [59]. Both the N7methylguanine and N3-methyladenine lesions are rapidly repaired by base excision repair [60]. However, the O6-methylguanine adducts sometimes are not repaired, leading to point mutations [61,62]. Although the mechanisms of temozolomide genotoxicity have been intensively studied in a therapeutic context, to our knowledge, the mutation signature of temozolomide has not been studied in experimental systems. However, Alexandrov et al. [8] detected a clear association between a CC > TC signature and temozolomide treatment in glioblastoma and melanoma patients (Table 1).

Benzene
Occupational exposure to benzene is of particular concern, as it is widely used in a variety of industries, including manufacture of petrochemicals and other chemicals, as well as in manufacture of shoes, lubricants, dyes, detergents, drugs, and pesticides [63]. Nonoccupational exposures occur from automobile exhaust and gasoline fumes, industrial emissions, and especially cigarette smoking and second hand smoke [63]. Benzene is classified as a Group 1 carcinogen by IARC [64]. It is benzene's metabolites, such as phenol, hydroquinone, and related hydroxyl metabolites, that have been linked to leukemia in experimental models in vitro and in vivo [65,66]. Benzene metabolites can exert their genotoxic effect through the formation of DNA adducts, oxidative stress, damage to the mitotic apparatus, and inhibition of topoisomerase II function [65]. Although the (g) Probable AA-exposed HCCs show a preponderance of A > T mutations with strand bias similar to that observed in AA-associated UTUCs (n = 11). (h) Trinucleotide context for mutations in probable AA-exposed HCC is highly similar to that for AA-associated UTUCs (c). Plotted using data from [16].
genotoxic mechanisms of benzene have been studied, its mutation signature is poorly understood. Thus far, research using a reporter gene has found a preponderance of C > T and C > A mutations [38] (Table 1). However, there has been no genome-wide analysis of benzene's mutation signature in cell line models or in benzeneassociated leukemias.

Endogenous mutagenic processes
There are also endogenous mutagenic processes, which are sometimes unleashed during cancer development. For example, the APOBEC genes encode DNA cytidine deaminases that, when upregulated, promote C > G and C > T mutations especially in the TC(A|T) context (mutated base in bold; Figure 4a, Table 1) [41,42,67]. Endogenous mutagenic processes arising in cancer development can also consist of inactivation of DNA repair or proofreading mechanisms. A well-known example is microsatellite instability, caused by defects in the DNA mismatch repair mechanism [68]. As another example, it was recently shown that, in some cancers, inactivation of the proofreading domain of DNA polymerase delta 1 or epsilon (POLD1 or POLE) leads to very high mutation rates [13]. POLE mutations were associated with very high rates of TCT > TAT and TCG > TTG mutations [8] (  Figure 4 Mutation signatures in lung adenocarcinoma. (a-d) Four signatures that are prominent in lung adenocarcinoma [8]. The x axes are organized as in Figure 3c. However, unlike in Figure 3c, the y axes in these plots represent proportions of mutations in inferred rather than observed signatures. (a) Signature of APOBEC-induced mutagenesis. (b) 'Age 1B', one of two signatures that correlated with age. (c) Tobacco smoke. (d) 'Signature 5' from [8], due to an unknown exposure or mutational process. (e) Almost all lung adenocarcinomas have mutations that are overlays of several of the signatures above. Signature R2 is an additional signature that may partly represent sequencing errors [8]. All panels adapted from [8] with permission from Macmillan Publishers Ltd.
Aging by itself is a major risk factor for cancer development, and the majority of tumors are diagnosed in older patients [69][70][71]. DNA damage and mutations accumulate with age [72]. Interestingly, there are different age-related mutation patterns in different tissues due to differences in functional characteristics such as mitotic rate, transcriptional activity, metabolism, and specific DNA repair mechanisms [73]. Two distinct yet similar age-related mutation signatures have been detected in cancers (Table 1), and at least one of the two is present in the overwhelming majority of tumors [8].

Mixtures of signatures
In most tumors, somatic mutation catalogs comprise the superimposed results of several mutational exposures and processes. For example, lung adenocarcinomas usually show the signature of tobacco smoke [8,14,24] (Figure 4c). In addition, these tumors often simultaneously show mutation signatures due to exposure to endogenous activated DNA cytidine deaminases (APOBECs; Figure 4a), signatures of mutations that accumulate with age (Figure 4b), and other signatures of unknown origin [8] (Figure 4d).
Given that the catalog of somatic mutations in a tumor often represents an overlay of several mutational processes, a key challenge is to dissect out and assess the contribution of each process. Building on initial work [74], recent strides have been made in computational techniques for meeting this challenge. Specifically, it is now possible to simultaneously discover the existence of multiple signatures and assess the relative contribution of each signature to each tumor's catalog of somatic mutations [8,75,76]. Figure 5 explains the process of combining three mutation signatures to reconstruct a close approximation to the observed somatic mutation catalog of a tumor.
Discovering the signatures relies on a computational analysis called non-negative matrix factorization (NMF). The input to NMF consists of the observed catalogs of somatic mutations from tens [75] to several thousands [8] of tumors. For each of the observed catalogs (one for each tumor), NMF sets up an equation such as the one shown in Figure 5. Then, for a pre-specified number, N, of undefined component signatures, NMF finds the N specific signatures and the contributions of each specific signature (the 'pie chart' circle, Figure 5b) that, for all the tumors simultaneously, provide the closest reconstructions of the observed catalogs. In its mathematical formulation, the collection of mutation catalogs (Figure 5d) is the approximate product of the matrix representing the mutation signatures (Figure 5a) and the matrix representing the contributions of each signature to each tumor (Figure 5b). In other words, Figure 5a and Figure 5b are factors that, when multiplied, yield an approximation of Figure 5d. These factors are constrained to be non-negative, because one cannot have a negative contribution of a mutation signature to a tumor, and because a mutation signature cannot have a negative proportion of mutations of a given class; this is the origin of the term non-negative matrix factorization. We emphasize that NMF simultaneously detects the signatures present in the somatic mutation catalogs of multiple tumors and determines the contribution of each signature to the somatic mutations in each tumor.
There are, of course, numerous fine points, salient among which is the question of how to find the right number, N, of signatures. This depends on the number of mutation catalogs (and the number of mutations) available for analysis, as well as on the actual diversity of mutational processes represented in the sampled tumors. A large international effort recently generated somatic mutation catalogs from 7,042 tumors encompassing 30 cancer types, and these catalogs allowed discernment of 21 mutation signatures [8]. Across all the tumors analyzed, every cancer type had at least two mutation signatures; the cancers with the most signatures were those of the liver (seven signatures) and stomach and uterus (six signatures each). Figure 4e shows the example of lung adenocarcinomas, which usually show mixtures of several mutational processes.
Based on association with clinically documented exposures or correspondence to previously known mutational profiles, the origins of 11 of the 21 signatures in [8] were identifiable. Three signatures were attributed to exogenous exposures: tobacco smoke, UV radiation, and temozolomide. Other signatures were attributed to endogenous processes, including activation of APOBEC genes, mismatch repair deficiency, mutations in the POLE gene, and mutations in the BRCA1 or BRCA2 breast cancer genes. Finally, there were two signatures for which the level of the contribution to mutations in tumors was strongly correlated with the patient's age.
Despite the power conferred by analysis of mutation signatures across the 7,042 tumors, the environmental or biological factors underlying 10 of the 21 signatures could not be identified, and indeed only three signatures were linked to exogenous exposures [8]. Furthermore, over two-thirds of the cancer types studied harbored signatures of unknown source. Thus, there is a large gap in our understanding of the environmental exposures and mutational processes that contribute to common human cancers. Conversely, there are mutagens with wellstudied biochemistry -for example, aflatoxins [30], benzene [66], and AA -that were not detected in these tumors. Possibly none or few of the 7,042 tumors analyzed had been exposed to these mutagens. Indeed, it seems likely that none were exposed to AA, which has a very distinctive signature that would have been detected had it been present. This would suggest that still other important environmental exposures were not represented among the 7,042 tumors. Because environmental exposures vary widely by geography, it will be important to determine somatic mutation catalogs from a diversity of geographic regions. For example, we previously showed that different genes are mutated in cholangiocarcinomas from different geographical regions and with different etiologies [10,11]. In addition, it is crucially important to have detailed clinical information associated with somatic mutation catalogs. It is possible that the mutagenic exposures responsible for some signatures in previous studies [8] could not be identified because the relevant clinical information was not available. For example, exposures to compounds such as aflatoxins would probably not be captured in clinical records. It is also possible that the mutation signatures of some exposures were not detected because the trinucleotide context and other characteristics of the mutations have not been determined from biochemical studies.
The examples of signatures described above focus on single-nucleotide mutations within trinucleotide contexts as the main distinguishing features of signatures. However, other characteristics of mutation catalogs can also be included as features of mutation signatures and analyzed by NMF [8,76]. For example, strand bias could and 4a-d. The mathematical procedures for approximating observed catalogs from mixtures of trinucleotide signatures are the same, but the trinucleotide context provides far more useful information: for example, the spikes in AA-exposed UTUCs show that the AAinduced A > T mutations tend to occur in a (C|T)AG trinucleotide context. Reproduced from [76] with permission from Elsevier.
be included by considering the two strands separately for each class of mutation in transcribed regions; in this case one would consider C > T on the transcribed strand to be distinct from G > A (the complementary mutation).
Other types of mutations, including small insertions and deletions and dinucleotide mutations such as those that occur as a result of UV exposure (CC > TT mutations), can also be included as features of mutation signatures. The framework can also be expanded to consider more bases adjacent to the mutated nucleotide -for example, a pentanucleotide rather than a trinucleotide context. The framework can also be applied to specific regions of the genome. For example, the APOBEC signature (Figure 4a) shows strand bias in exons, but not in introns [76]. Given that both exons and introns are transcribed, the exonic strand bias does not seem to be the result of transcription-coupled repair, and the underlying mechanism remains unknown. However, by distinguishing mutations according to whether they occur in exons or introns, this information could be used to generate a more informative mutation signature. The utility of these possible extensions remains untested, but is likely to increase as additional tumor genomes, which capture about 50 times more mutation information than exomes, are sequenced.

Mutation signatures for surveillance and prevention
Much of cancer is associated with exogenous exposures, and therefore in principle amenable to control by avoidance of those exposures. Examples include tobacco smoke, UV light, and many infectious exposures, such as hepatitis B and C, human papilloma virus, and Helicobacter pylori [77][78][79]. IARC lists 422 known or likely exogenous carcinogens [80]. Indeed, prevention by avoidance of exogenous carcinogenic exposures has been an effective longterm strategy for the control of cancer, with tobacco smoking as the most salient example [49,81]. However, evidence from recent work [8] indicates that many exogenous exposures remain unidentified. Notably, as described earlier, of the 21 mutation signatures identified in [8], 10 lacked any known underlying mutational process or exposure, and over two-thirds of cancer types were affected by signatures due to unknown causes. Furthermore, only three exogenous mutagens were identified: tobacco smoking (12% of all tumors), UV light (5% of all tumors), and temozolomide (0.5% of all tumors), and the cause of Signature 5 (found in 14% of all tumors) is unknown. Some cancers were disproportionately affected by signatures with unknown causes. For example, 89% of HCCs showed Signature 12, and 90% showed Signature 16, both with unknown causes. Conversely, the signatures of some well-known mutagens were not detected ( Table 1), suggesting that cancers due to these mutagens were rare or non-existent among the 7,042 tumors studied. This implies that the signatures of many exposures have yet to be captured in sequenced tumor exomes or genomes. Thus, the analysis of mutation signatures in catalogs of somatic mutations from tumors is promising but in its infancy. To realize this promise, we must extend our knowledge in two aspects. The first is to expand the diversities of tumor types and of their geographical origins. There is already rapid growth in the number of sequenced cancer genomes and their catalogs of somatic mutations. An important advantage of next-generation sequencing in this endeavor is that it is based on an inexpensive, commodity technology, the price of which will continue to drop. In addition, next-generation sequencing provides direct readouts of the mutations that actually occur in tumors. In this context, we note that using whole-exome or whole-genome sequencing to detect mutations (rather than sequencing targeted, cancer-related genes) ensures that most mutations detected are selectively inconsequential passengers. Even though a few somatic mutations in whole-exome or whole-genome sequence are drivers, they are so few that they have negligible influence on the signature. Finally, the large amount of data generated by whole-exome and especially whole-genome sequencing provides optimal statistical power to tease apart the signatures of different mutational processes or exposures.
The second aspect in which we must extend our knowledge consists of establishing connections between specific mutagens and their mutation signatures. This is likely to require experimental exposure of cells or animals to mutagens or their biochemically active metabolites, followed by next-generation sequencing of either clonal populations of exposed cells or of tumors that develop in exposed animals. Sequencing of the exposed genomes will connect specific mutagens to their mutation signatures in far more detail than is currently available. When mutation signatures cannot be found among the signatures of known mutagens, this would suggest the effects of an unknown exposure or mutational process, and point to the need for further epidemiological, toxicological, or biological research.
To our knowledge, there has been little work toward this goal, and our work on the mutation signature of AA and its application to detect AA exposure in HCC is an example [16].

Conclusions and future directions
We envision that the groundbreaking technical advances for detection of signatures in genome-and exome-wide catalogs of somatic mutations from thousands of tumors will enable the assembly of a wide-ranging compendium of mutation signatures from diverse cancer types and multiple geographical regions. This compendium would contain many more whole-genome catalogs of somatic mutations (as opposed to exome catalogs) than are currently available, and would encompass tumors from many more geographical regions, thus capturing a much wider range of mutagenic exposures. This compendium could be combined with experimental determination of the extended signatures of known and suspected mutagens, including, when necessary, their signatures in different tissues or cell types. Signatures with known causes would represent future opportunities for prevention. Signatures with unknown causes would point to the need for further investigation of exogenous mutagens or endogenous mutation processes.
The first part of this vision, the assembly of a compendium of mutation signatures from ever more cancer genomes, seems certain to happen because of the plummeting cost of sequencing and the many ongoing efforts to sequence tumor genomes. Nevertheless, there are many open questions on how best to deploy NMF or NMF-related procedures to assemble this compendium.
For example, what factors determine the power of these procedures to distinguish similar mutation signatures? As the number of genome-wide somatic mutation catalogs increases, will it become worthwhile to include additional information, such as strand bias or pentanucleotide context, in mutation signatures? Fortunately, NMFrelated procedures are an active area of machine learning research. For example, enhanced NMF procedures that prefer sparser solutions -solutions in which the mutation catalog of a given tumor is modeled as the mixture of a relatively small number of signatures -have been recently proposed [82][83][84][85]. Other proposed enhanced NMF procedures could favor solutions with fewer mutation signatures contributing to each tumor, leading to more interpretable results [85][86][87].
The second part of the vision, the experimental elucidation of signatures and the investigation of possible causes of signatures with unknown causes, will require concerted effort. There will surely be challenges in understanding the signatures of complex mutagens such as tobacco smoke, and challenges in understanding the differences in the mutagens' metabolisms and mutagenic activity across different tissues and cell types. Nevertheless, in the near term it will be possible to dissect and refine the worldwide repertoire of signatures and to assign some of these signatures to known causes as experimental studies advance. Of course, not all cancer is due to mutagenic exposures, but linking somatic mutation catalogs generated by next-generation sequencing to specific exposures via the mutation signatures of these exposures could substantially reduce the burden of avoidable cancer.