Mutation signatures of carcinogen exposure: genome-wide detection and new opportunities for cancer prevention
© The Author(s) 2014
Published: 31 March 2014
Exposure to environmental mutagens is an important cause of human cancer, and measures to reduce mutagenic and carcinogenic exposures have been highly successful at controlling cancer. Until recently, it has been possible to connect the chemical characteristics of mutagens to actual mutations observed in human tumors only indirectly. Now, next-generation sequencing technology enables us to observe in detail the DNA-sequence-level effects of well-known mutagens, such as ultraviolet radiation and tobacco smoke, as well as endogenous mutagenic processes, such as those involving activated DNA cytidine deaminases (APOBECs). We can also observe the effects of less well-known but potent mutagens, including those recently found to be present in some herbal remedies. Crucially, we can now tease apart the superimposed effects of several mutational exposures and processes and determine which ones occurred during the development of individual tumors. Here, we review advances in detecting these mutation signatures and discuss the implications for surveillance and prevention of cancer. The number of sequenced tumors from diverse cancer types and multiple geographic regions is growing explosively, and the genomes of these tumors will bear the signatures of even more diverse mutagenic exposures. Thus, we envision development of wide-ranging compendia of mutation signatures from tumors and a concerted effort to experimentally elucidate the signatures of a large number of mutagens. This information will be used to link signatures observed in tumors to the exposures responsible for them, which will offer unprecedented opportunities for prevention.
Mutagenic environmental exposures are important causes of human cancer. This was first understood from Percival Pott's 18th-century epidemiological observation of scrotal cancer in chimney sweeps. Causality was eventually confirmed experimentally by using coal tar to induce cancer in rabbits. Soon thereafter, polycyclic aromatic hydrocarbons were identified as carcinogens in coal tar. Much later, once the role of DNA as an information molecule was understood, the biochemical mechanisms for polycyclic aromatic hydrocarbon mutagenesis were elucidated. This led to a broader appreciation of the roles of DNA-damaging agents in mutagenesis and to extensive study of numerous other mutagens [5, 6]. Subsequently, assays for mutagenicity became proxies for tests of carcinogenicity, with the Ames test, performed in a bacterial system, as a well-known example. However, tests of mutagenicity in artificial systems do not fully connect mutagenic exposures to the patterns of mutation observed in cancers.
More recently, it has become clear that specific mutagens produce characteristic patterns of somatic mutations in the DNA of malignant cells. We describe these patterns, called 'mutation signatures', in detail below. Briefly, mutation signatures usually include the relative frequencies of the various nucleotide mutations (such as A > C, A > G, A > T, C > A) plus, ideally, their trinucleotide contexts, that is, the identities of the bases on both sides of the mutated nucleotides. Previously, our knowledge of these signatures was based on short lengths (such as a few kilobases) of DNA sequence. With the advent of next-generation sequencing, it is now possible to infer these signatures from the sequences of all the exons in the genome ('whole exome') or from the sequence of the entire genome ('whole genome'). Characterization of mutagenicity based directly on observed mutations across whole exomes or genomes offers several advantages over previous approaches, including that many more mutations can be detected, which provides far greater statistical power and allows the parsing of the superimposed mutation signatures stemming from several exposures. Actual mutation signatures are the end result of a series of biochemical and biological processes, including the metabolism of pro-mutagens to active forms, biochemical damage to DNA, the efforts of the cell to repair the damage, and, rarely, selection for or against the resulting mutations. Thus, while not obviating the need for mechanistic studies of the biochemical mechanisms of mutagenicity, cataloging mutations by next-generation sequencing provides information about a critical endpoint: the actual mutations that occur in cell lines or in human cancers in response to mutagenic exposures.
We describe below the state of the art for determining mutation signatures by next-generation sequencing, the implications of this approach for detecting the carcinogenic impacts of mutagenic exposures, and its promise for prevention. We start by describing signatures of single mutagens. We then describe approaches for teasing apart superimposed signatures from multiple mutagenic processes, and conclude with a vision of how this could improve prevention.
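Table 1. Examples of exogenous mutagens and endogenous mutagenic processes

Aristolochic acid (AA). Mutation: A > T. Sequence context: (C|T)AG > (C|T)TG. Challenges: no unusual challenges.
Ultraviolet (UV) radiation. Mutation: C > T; strand bias; CC > TT. Sequence context: TC > TT; (C|T)C > (C|T)T. Prevalence of signature: 87% of melanoma. Challenges: no unusual challenges.
Tobacco smoke. Mutation: primarily C > A, some C > G and C > T. Sequence context: CG > AG; CG > TG; CG > GG. Challenges: contains multiple carcinogens with individually unknown signatures.
Aflatoxin B1 (AFB1). Mutation: primarily G > T; some G > A. Challenges: signature in extended context not known.
Temozolomide. Mutation: C > T. Sequence context: CC > TC; CT > TT. Prevalence of signature: present in 10% of glioblastomas; 9% of melanoma. Challenges: no unusual challenges.
Benzene. Mutation: C > T; C > A. Challenges: several mutagenic metabolites, and signature in extended context not known.
APOBECs. Mutation: C > T. Sequence context: TCA > TTA. Prevalence of signature: present in 16 tumor types. Challenges: Signatures 2 and 13 are similar, except 2 has C > T and 13 has higher C > G.
Mutated DNA polymerase epsilon. Mutation: C > T. Sequence context: TCG > TTG; TCT > TAT. Prevalence of signature: present in 13.7% of uterus cancer and 36.7% of colorectal cancer. Challenges: no unusual challenges.
Mismatch repair deficiency (MSI). Mutation: C > T; C > A. Sequence context: CG > TG; CT > AT; homopolymer and microsatellite length changes. Prevalence of signature: present in 9 tumor types. Challenges: no unusual challenges.
Correlated with patient age. Mutation: C > T. Sequence context: CG > TG. Prevalence of signature: a majority of tumors of most types have this signature. Challenges: interpretation of the two similar signatures is not clear.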
However, analysis of approximately 1 kb of sequence in a single gene (TP53) offers limited statistical power to determine the sequence contexts in which the A > T mutations occur. In addition, the approach of assessing physical mutation signatures in TP53, a key tumor suppressor gene, runs the risk of bias caused by conflation of physical mutation signatures with the effects of intense selection during tumor evolution.
By way of technical explanation, if we consider a single DNA strand as a point of reference, there are 12 possible single-nucleotide mutations: four nucleotides times three possible mutations for each nucleotide. In some parts of the genome, it makes sense to use a particular strand as the reference sequence. In particular, in regions of the genome that are transcribed, we can use the transcribed strand, that is, the strand that serves as a template for the RNA polymerase, as the point of reference. However, in the non-transcribed regions, neither strand in particular is the obvious choice for the reference sequence. Therefore, the usual practice in the study of mutation signatures has been to not distinguish complementary mutations, but rather to group them together. For example, A > C mutations are grouped with the complementary T > G mutations, A > G mutations are grouped with T > C mutations, and so on.
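The grouping of complementary mutations described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular study: the function names are hypothetical, and reporting every mutation from the pyrimidine (C or T) of the base pair is a convention assumed here, one common choice among several.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_class(context, alt):
    """Map a mutation to its strand-independent class.

    context: trinucleotide with the mutated base in the middle.
    alt: the base that replaces the middle base.
    Assumption: mutations at A or G are reported from the opposite
    strand, so that the reference base is always C or T.
    """
    if context[1] in "AG":
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def mutation_catalog(mutations):
    """Count mutations per canonical class; input is (context, alt) pairs."""
    return Counter(canonical_class(context, alt) for context, alt in mutations)


# All 4 x 4 x 4 contexts times 3 alternative bases give 192 strand-specific
# mutations, which collapse to 96 classes once complements are grouped.
ALL_CLASSES = {
    canonical_class(five + ref + three, alt)
    for five, ref, three in product("ACGT", repeat=3)
    for alt in "ACGT"
    if alt != ref
}
```

Under this convention, an A > T mutation in a CAG context is counted in the same class as the complementary T > A mutation in a CTG context.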
With the availability of catalogs of somatic mutations from sequencing data, it has become possible to investigate the nucleotides that neighbor AA-induced A > T mutations. The trinucleotide sequence contexts of AA-associated mutations show a dramatic overrepresentation of cytosines and thymines immediately 5′ of mutated adenines (that is, [C|T]A, where A is the mutated adenine) and overrepresentation of guanines 3′ of mutated adenines (that is, AG, where A is the mutated adenine) (Figure 3c) [16–19]. This preference of A > T mutations for the (C|T)AG context has not been observed in non-AA-associated cancers (such as gastric cancer; Figure 3d), suggesting that this sequence context is a particular characteristic of AA mutagenesis.
In addition, the A > T mutations in AA-associated UTUCs are less common on the transcribed strands of genes than on the non-transcribed strands (Figure 3e). This strand bias suggests that AA adducts occurring on the transcribed strand were often corrected by transcription-coupled nucleotide excision repair. Similar strand bias is not seen for the relatively infrequent A > T mutations seen in other cancers, such as gastric cancer (Figure 3f).
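One simple way to quantify such a strand bias is to compare the number of mutations on the transcribed strand with the number on the non-transcribed strand. The sketch below uses an exact two-sided binomial test against a 50:50 expectation. Both the test and the expectation are assumptions made for illustration (the expectation ignores differences in base composition between the strands), the function names are hypothetical, and the counts are invented rather than taken from any tumor.

```python
from math import comb


def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k out of n trials.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(q for q in probs if q <= cutoff))


def strand_bias(transcribed, non_transcribed):
    """Return (fraction on transcribed strand, two-sided p-value)."""
    n = transcribed + non_transcribed
    if n == 0:
        raise ValueError("no mutations to test")
    return transcribed / n, two_sided_binomial_p(transcribed, n)


# Hypothetical counts for illustration only.
fraction, p_value = strand_bias(transcribed=30, non_transcribed=70)
```

A fraction well below 0.5 together with a small p-value would be consistent with preferential repair of lesions on the transcribed strand.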
Unexpectedly, recent examination of mutation signatures in hepatitis B virus-exposed human HCCs revealed some with obvious AA-like signatures (Figure 3g,h), although this cancer type apparently was not previously linked to AA exposure. The signature shows a large proportion of A > T somatic mutations with strand bias (as seen in AA-exposed UTUCs; Figure 3e,g) and a trinucleotide context that strongly resembles that in AA-associated UTUC (compare Figure 3c and Figure 3h). It is possible that exposure to AA in conjunction with hepatitis B virus infection may contribute synergistically to HCC formation, much as hepatitis and aflatoxin do (see below). As AA had not been previously implicated as a risk factor for HCC, this finding may represent a new paradigm, in which environmental exposures contributing to specific cancers are deduced from observations of mutation signatures. It is likely that Aristolochia-containing herbal remedies are the source of AA exposure in these cancers. If so, appropriate measures to minimize exposure should be taken - for example, through education and more aggressive enforcement of bans on Aristolochia-containing remedies.
Ultraviolet (UV) radiation induces several kinds of mutations, primarily C > T (Figure 2g-k, Table 1) [6, 9]. It also induces the double mutation CC > TT, in which adjacent cytosines mutate to thymines as a result of cytosine dimers generated by UV light. Earlier studies indicated that UV-induced C > T mutations often occur after a pyrimidine (C or T) [9, 21, 22]. Analysis of mutation catalogs from melanomas indicates that the trinucleotide context is often TCC. As with AA-induced A > T mutations, there is strand bias: UV-induced mutations are less likely to occur on the transcribed strand.
Tobacco smoking causes the vast majority of lung cancers and contributes strongly to many other cancers, including liver, colorectal, breast, prostate, and bladder cancers. Tobacco smoke contains many mutagenic carcinogens, including polycyclic aromatic hydrocarbons and N-nitrosamines [25, 50, 51]. The mutation signature of tobacco smoke was studied primarily in the context of the TP53 gene, in which exposure to tobacco-smoke mutagens often results in G > T mutations. Only a few studies extended the mutation signature to a trinucleotide context, and the preference for particular nucleotides 5' or 3' of the mutated nucleotides is weak (Table 1) [8, 24], possibly reflecting the complex mix of mutagens present in tobacco smoke. There are challenges in dissecting the tobacco-smoke mutation signature, because the signatures from different constituent mutagens are likely to differ, and their effects on different organs and tissues are also likely to differ. Thus, it would be highly informative to examine experimentally the signatures of individual mutagenic components of tobacco smoke in the genomes of exposed cell lines from different tissues (Figure 1f).
Aflatoxins are byproducts of mold growing on food, and among the aflatoxins, aflatoxin B1 (AFB1) is thought to be the most carcinogenic and is the most studied. The International Agency for Research on Cancer (IARC) classifies AFB1 as a Group 1 carcinogen (an agent that is definitely carcinogenic to humans). AFB1 is metabolized to an epoxide compound that can form a covalent bond with the N7 atom of guanine, thereby leading to G > T mutations (Table 1). In addition, AFB1 can induce 8-hydroxy-2'-deoxyguanosine, which also produces predominantly G > T mutations in in vitro experimental models. The mutation signature of AFB1 has been primarily studied in the TP53 gene, and indeed particular somatic mutations in TP53 are used as biomarkers for aflatoxin exposure in tumors [55, 56]. However, the extended mutation signature of AFB1 has not been studied (Table 1). Exposure to AFB1 is through food, but unfortunately, its contamination in food is difficult to detect. Consequently, convincing evidence that AFB1 is carcinogenic relied on studies showing that people with AFB1-derived adducts were more likely to develop cancer [29, 35]. The predominant cancer associated with AFB1 is HCC, and the risk associated with combined AFB1 exposure and hepatitis infection is far greater than each individual risk [29, 35].
Temozolomide is an alkylating agent commonly used for chemotherapeutic treatment of melanoma and central nervous system tumors [57, 58]. Temozolomide is quickly absorbed and undergoes spontaneous breakdown to form an active compound (methyltriazen-1-yl imidazole-4-carboxamide), which forms several DNA adducts: N7-methylguanine (70%), N3-methyladenine (9%), and O6-methylguanine (5%). Both the N7-methylguanine and N3-methyladenine lesions are rapidly repaired by base excision repair. However, the O6-methylguanine adducts sometimes are not repaired, leading to point mutations [61, 62]. Although the mechanisms of temozolomide genotoxicity have been intensively studied in a therapeutic context, to our knowledge, the mutation signature of temozolomide has not been studied in experimental systems. However, Alexandrov et al. detected a clear association between a CC > TC signature and temozolomide treatment in glioblastoma and melanoma patients (Table 1).
Occupational exposure to benzene is of particular concern, as it is widely used in a variety of industries, including manufacture of petrochemicals and other chemicals, as well as in manufacture of shoes, lubricants, dyes, detergents, drugs, and pesticides. Non-occupational exposures occur from automobile exhaust and gasoline fumes, industrial emissions, and especially cigarette smoking and second-hand smoke. Benzene is classified as a Group 1 carcinogen by IARC. It is benzene's metabolites, such as phenol, hydroquinone, and related hydroxyl metabolites, that have been linked to leukemia in experimental models in vitro and in vivo [65, 66]. Benzene metabolites can exert their genotoxic effect through the formation of DNA adducts, oxidative stress, damage to the mitotic apparatus, and inhibition of topoisomerase II function. Although the genotoxic mechanisms of benzene have been studied, its mutation signature is poorly understood. Thus far, research using a reporter gene has found a preponderance of C > T and C > A mutations (Table 1). However, there has been no genome-wide analysis of benzene's mutation signature in cell line models or in benzene-associated leukemias.
Aging by itself is a major risk factor for cancer development, and the majority of tumors are diagnosed in older patients [69–71]. DNA damage and mutations accumulate with age. Interestingly, there are different age-related mutation patterns in different tissues due to differences in functional characteristics such as mitotic rate, transcriptional activity, metabolism, and specific DNA repair mechanisms. Two distinct yet similar age-related mutation signatures have been detected in cancers (Table 1), and at least one of the two is present in the overwhelming majority of tumors.
In most tumors, somatic mutation catalogs comprise the superimposed results of several mutational exposures and processes. For example, lung adenocarcinomas usually show the signature of tobacco smoke [8, 14, 24] (Figure 4c). In addition, these tumors often simultaneously show mutation signatures due to exposure to endogenous activated DNA cytidine deaminases (APOBECs; Figure 4a), signatures of mutations that accumulate with age (Figure 4b), and other signatures of unknown origin (Figure 4d).
Discovering the signatures relies on a computational analysis called non-negative matrix factorization (NMF). The input to NMF consists of the observed catalogs of somatic mutations from tens to several thousands of tumors. For each of the observed catalogs (one for each tumor), NMF sets up an equation such as the one shown in Figure 5. Then, for a pre-specified number, N, of undefined component signatures, NMF finds the N specific signatures and the contributions of each specific signature (the 'pie chart' circle, Figure 5b) that, for all the tumors simultaneously, provide the closest reconstructions of the observed catalogs. In its mathematical formulation, the collection of mutation catalogs (Figure 5d) is the approximate product of the matrix representing the mutation signatures (Figure 5a) and the matrix representing the contributions of each signature to each tumor (Figure 5b). In other words, Figure 5a and Figure 5b are factors that, when multiplied, yield an approximation of Figure 5d. These factors are constrained to be non-negative, because one cannot have a negative contribution of a mutation signature to a tumor, and because a mutation signature cannot have a negative proportion of mutations of a given class; this is the origin of the term non-negative matrix factorization. We emphasize that NMF simultaneously detects the signatures present in the somatic mutation catalogs of multiple tumors and determines the contribution of each signature to the somatic mutations in each tumor.
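The factorization described above can be sketched as follows. This is a minimal, dependency-free illustration using one standard NMF algorithm (multiplicative updates that reduce the squared reconstruction error); it is not necessarily the procedure used in the studies cited here, and the function names, iteration count, and random initialization are assumptions. The catalog matrix V has one row per mutation class and one column per tumor, and is approximated by W (signatures) times H (contributions).

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def nmf(v, n_signatures, iterations=500, seed=0, eps=1e-12):
    """Factor v (classes x tumors) into w (classes x N) and h (N x tumors)."""
    rng = random.Random(seed)
    n_classes, n_tumors = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_signatures)] for _ in range(n_classes)]
    h = [[rng.random() + eps for _ in range(n_tumors)] for _ in range(n_signatures)]
    for _ in range(iterations):
        # Update contributions: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_tumors)]
             for k in range(n_signatures)]
        # Update signatures: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_signatures)]
             for i in range(n_classes)]
    # Rescale so each signature sums to 1; contributions absorb the scale.
    for k in range(n_signatures):
        total = sum(w[i][k] for i in range(n_classes)) or 1.0
        for i in range(n_classes):
            w[i][k] /= total
        h[k] = [x * total for x in h[k]]
    return w, h


def reconstruction_error(v, w, h):
    """Frobenius norm of the difference between v and w times h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb)) ** 0.5
```

Because the updates only multiply non-negative numbers, W and H stay non-negative throughout. In practice one would run the factorization for several values of N and from several random starts and compare the reconstruction errors, since the result depends on both.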
There are, of course, numerous fine points, salient among which is the question of how to find the right number, N, of signatures. This depends on the number of mutation catalogs (and the number of mutations) available for analysis, as well as on the actual diversity of mutational processes represented in the sampled tumors. A large international effort recently generated somatic mutation catalogs from 7,042 tumors encompassing 30 cancer types, and these catalogs allowed discernment of 21 mutation signatures. Across all the tumors analyzed, every cancer type had at least two mutation signatures; the cancers with the most signatures were those of the liver (seven signatures) and stomach and uterus (six signatures each). Figure 4e shows the example of lung adenocarcinomas, which usually show mixtures of several mutational processes.
Based on association with clinically documented exposures or correspondence to previously known mutational profiles, the origins of 11 of the 21 signatures from that analysis were identifiable. Three signatures were attributed to exogenous exposures: tobacco smoke, UV radiation, and temozolomide. Other signatures were attributed to endogenous processes, including activation of APOBEC genes, mismatch repair deficiency, mutations in the POLE gene, and mutations in the BRCA1 or BRCA2 breast cancer genes. Finally, there were two signatures for which the level of the contribution to mutations in tumors was strongly correlated with the patient's age.
Despite the power conferred by analysis of mutation signatures across the 7,042 tumors, the environmental or biological factors underlying 10 of the 21 signatures could not be identified, and indeed only three signatures were linked to exogenous exposures. Furthermore, over two-thirds of the cancer types studied harbored signatures of unknown source. Thus, there is a large gap in our understanding of the environmental exposures and mutational processes that contribute to common human cancers. Conversely, there are mutagens with well-studied biochemistry - for example, aflatoxins, benzene, and AA - that were not detected in these tumors. Possibly none or few of the 7,042 tumors analyzed had been exposed to these mutagens. Indeed, it seems likely that none were exposed to AA, which has a very distinctive signature that would have been detected had it been present. This would suggest that still other important environmental exposures were not represented among the 7,042 tumors. Because environmental exposures vary widely by geography, it will be important to determine somatic mutation catalogs from a diversity of geographic regions. For example, we previously showed that different genes are mutated in cholangiocarcinomas from different geographical regions and with different etiologies [10, 11]. In addition, it is crucially important to have detailed clinical information associated with somatic mutation catalogs. It is possible that the mutagenic exposures responsible for some signatures in previous studies could not be identified because the relevant clinical information was not available. For example, exposures to compounds such as aflatoxins would probably not be captured in clinical records. It is also possible that the mutation signatures of some exposures were not detected because the trinucleotide context and other characteristics of the mutations have not been determined from biochemical studies.
The examples of signatures described above focus on single-nucleotide mutations within trinucleotide contexts as the main distinguishing features of signatures. However, other characteristics of mutation catalogs can also be included as features of mutation signatures and analyzed by NMF [8, 76]. For example, strand bias could be included by considering the two strands separately for each class of mutation in transcribed regions; in this case one would consider C > T on the transcribed strand to be distinct from G > A (the complementary mutation). Other types of mutations, including small insertions and deletions and dinucleotide mutations such as those that occur as a result of UV exposure (CC > TT mutations), can also be included as features of mutation signatures. The framework can also be expanded to consider more bases adjacent to the mutated nucleotide - for example, a pentanucleotide rather than a trinucleotide context. The framework can also be applied to specific regions of the genome. For example, the APOBEC signature (Figure 4a) shows strand bias in exons, but not in introns. Given that both exons and introns are transcribed, the exonic strand bias does not seem to be the result of transcription-coupled repair, and the underlying mechanism remains unknown. However, by distinguishing mutations according to whether they occur in exons or introns, this information could be used to generate a more informative mutation signature. The utility of these possible extensions remains untested, but is likely to increase as additional tumor genomes, which capture about 50 times more mutation information than exomes, are sequenced.
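To give a sense of how quickly these extensions enlarge the feature space, the short sketch below counts single-nucleotide mutation classes for a given amount of flanking sequence, with or without separating the two strands. The function name is hypothetical; the counts follow from the 12 possible single-nucleotide mutations, which collapse to 6 when complementary mutations are grouped.

```python
def n_mutation_classes(flank, strand_aware=False):
    """Number of single-nucleotide mutation classes.

    flank: number of bases considered on each side of the mutated base
           (1 for a trinucleotide context, 2 for a pentanucleotide context).
    strand_aware: if True, complementary mutations are kept distinct,
                  as when transcribed and non-transcribed strands are separated.
    """
    substitutions = 12 if strand_aware else 6
    return substitutions * 4 ** (2 * flank)


trinucleotide = n_mutation_classes(1)                        # 96
trinucleotide_by_strand = n_mutation_classes(1, True)        # 192
pentanucleotide = n_mutation_classes(2)                      # 1536
```

Each added feature dimension spreads the same mutations over more classes, which is one reason such extensions are likely to become more useful as the number of mutations per catalog grows.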
Much of cancer is associated with exogenous exposures, and therefore in principle amenable to control by avoidance of those exposures. Examples include tobacco smoke, UV light, and many infectious exposures, such as hepatitis B and C, human papilloma virus, and Helicobacter pylori [77–79]. IARC lists 422 known or likely exogenous carcinogens. Indeed, prevention by avoidance of exogenous carcinogenic exposures has been an effective long-term strategy for the control of cancer, with tobacco smoking as the most salient example [49, 81]. However, evidence from recent work indicates that many exogenous exposures remain unidentified. Notably, as described earlier, of the 21 mutation signatures identified in the 7,042-tumor analysis, 10 lacked any known underlying mutational process or exposure, and over two-thirds of cancer types were affected by signatures due to unknown causes. Furthermore, only three exogenous mutagens were identified: tobacco smoking (12% of all tumors), UV light (5% of all tumors), and temozolomide (0.5% of all tumors), and the cause of Signature 5 (found in 14% of all tumors) is unknown. Some cancers were disproportionately affected by signatures with unknown causes. For example, 89% of HCCs showed Signature 12, and 90% showed Signature 16, both with unknown causes. Conversely, the signatures of some well-known mutagens were not detected (Table 1), suggesting that cancers due to these mutagens were rare or non-existent among the 7,042 tumors studied. This implies that the signatures of many exposures have yet to be captured in sequenced tumor exomes or genomes. Thus, the analysis of mutation signatures in catalogs of somatic mutations from tumors is promising but in its infancy. To realize this promise, we must extend our knowledge in two aspects.
The first is to expand the diversities of tumor types and of their geographical origins. There is already rapid growth in the number of sequenced cancer genomes and their catalogs of somatic mutations. An important advantage of next-generation sequencing in this endeavor is that it is based on an inexpensive, commodity technology, the price of which will continue to drop. In addition, next-generation sequencing provides direct readouts of the mutations that actually occur in tumors. In this context, we note that using whole-exome or whole-genome sequencing to detect mutations (rather than sequencing targeted, cancer-related genes) ensures that most mutations detected are selectively inconsequential passengers. Even though a few somatic mutations in whole-exome or whole-genome sequence are drivers, they are so few that they have negligible influence on the signature. Finally, the large amount of data generated by whole-exome and especially whole-genome sequencing provides optimal statistical power to tease apart the signatures of different mutational processes or exposures.
The second aspect in which we must extend our knowledge consists of establishing connections between specific mutagens and their mutation signatures. This is likely to require experimental exposure of cells or animals to mutagens or their biochemically active metabolites, followed by next-generation sequencing of either clonal populations of exposed cells or of tumors that develop in exposed animals. Sequencing of the exposed genomes will connect specific mutagens to their mutation signatures in far more detail than is currently available. When mutation signatures cannot be found among the signatures of known mutagens, this would suggest the effects of an unknown exposure or mutational process, and point to the need for further epidemiological, toxicological, or biological research.
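As a rough illustration of how a signature observed in tumors might be compared against experimentally determined signatures, the sketch below scores similarity with the cosine of the angle between two signature vectors and reports no match when the best score falls below a cutoff. The choice of cosine similarity, the 0.9 cutoff, the function names, and the three-class toy signatures are all assumptions made for illustration, not values from the work discussed here.

```python
from math import sqrt


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length non-negative vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def best_match(observed, known, threshold=0.9):
    """Return (name, score) of the most similar known signature.

    If no known signature reaches the threshold, name is None, which
    would point to an unknown exposure or mutational process.
    """
    name, score = max(
        ((label, cosine_similarity(observed, sig)) for label, sig in known.items()),
        key=lambda item: item[1],
    )
    return (name, score) if score >= threshold else (None, score)


# Toy signatures over three mutation classes (hypothetical values).
KNOWN = {
    "mutagen_A": [0.8, 0.1, 0.1],
    "mutagen_B": [0.1, 0.1, 0.8],
}
match_name, match_score = best_match([0.75, 0.15, 0.10], KNOWN)
unknown_name, unknown_score = best_match([0.34, 0.33, 0.33], KNOWN)
```

With real data the vectors would span all mutation classes in the signature (for example, 96 trinucleotide classes), and the cutoff would need to be calibrated against how similar distinct known signatures are to one another.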
To our knowledge, there has been little work toward this goal; our work on the mutation signature of AA and its application to detect AA exposure in HCC is one example.
We envision that the groundbreaking technical advances for detection of signatures in genome- and exome-wide catalogs of somatic mutations from thousands of tumors will enable the assembly of a wide-ranging compendium of mutation signatures from diverse cancer types and multiple geographical regions. This compendium would contain many more whole-genome catalogs of somatic mutations (as opposed to exome catalogs) than are currently available, and would encompass tumors from many more geographical regions, thus capturing a much wider range of mutagenic exposures. This compendium could be combined with experimental determination of the extended signatures of known and suspected mutagens, including, when necessary, their signatures in different tissues or cell types. Signatures with known causes would represent future opportunities for prevention. Signatures with unknown causes would point to the need for further investigation of exogenous mutagens or endogenous mutation processes.
The first part of this vision, the assembly of a compendium of mutation signatures from ever more cancer genomes, seems certain to happen because of the plummeting cost of sequencing and the many ongoing efforts to sequence tumor genomes. Nevertheless, there are many open questions on how best to deploy NMF or NMF-related procedures to assemble this compendium. For example, what factors determine the power of these procedures to distinguish similar mutation signatures? As the number of genome-wide somatic mutation catalogs increases, will it become worthwhile to include additional information, such as strand bias or pentanucleotide context, in mutation signatures? Fortunately, NMF-related procedures are an active area of machine learning research. For example, enhanced NMF procedures that prefer sparser solutions - solutions in which the mutation catalog of a given tumor is modeled as the mixture of a relatively small number of signatures - have been recently proposed [82–85]. Other proposed enhanced NMF procedures could favor solutions with fewer mutation signatures contributing to each tumor, leading to more interpretable results [85–87].
The second part of the vision, the experimental elucidation of signatures and the investigation of possible causes of signatures with unknown causes, will require concerted effort. There will surely be challenges in understanding the signatures of complex mutagens such as tobacco smoke, and challenges in understanding the differences in the mutagens' metabolisms and mutagenic activity across different tissues and cell types. Nevertheless, in the near term it will be possible to dissect and refine the worldwide repertoire of signatures and to assign some of these signatures to known causes as experimental studies advance. Of course, not all cancer is due to mutagenic exposures, but linking somatic mutation catalogs generated by next-generation sequencing to specific exposures via the mutation signatures of these exposures could substantially reduce the burden of avoidable cancer.
IARC, International Agency for Research on Cancer
NMF, non-negative matrix factorization; UTUC, upper urinary-tract urothelial cancer
We thank Ioana Cutcutache, Weng Khong Lim, and Iain Beehuat Tan for comments on the manuscript.
This article is published under license to BioMed Central Ltd. The licensee has exclusive rights to distribute this article, in any medium, for 12 months following its publication. After this time, the article is available under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.