Identification of somatic mutations in EGFR/KRAS/ALK-negative lung adenocarcinoma in never-smokers

Background Lung adenocarcinoma is a highly heterogeneous disease with various etiologies, prognoses, and responses to therapy. Although genome-scale characterization of lung adenocarcinoma has been performed, a comprehensive somatic mutation analysis of EGFR/KRAS/ALK-negative lung adenocarcinoma in never-smokers has not been conducted. Methods We analyzed whole exome sequencing data from 16 EGFR/KRAS/ALK-negative lung adenocarcinomas and an additional 54 tumors in two expansion cohort sets. Candidate loci were validated by target capture and Sanger sequencing. Gene set analysis was performed using Ingenuity Pathway Analysis. Results We identified 27 genes potentially implicated in the pathogenesis of lung adenocarcinoma. These included targetable genes involved in PI3K/mTOR signaling (TSC1, PIK3CA, AKT2) and receptor tyrosine kinase signaling (ERBB4) and genes not previously highlighted in lung adenocarcinomas, such as SETD2 and PBRM1 (chromatin remodeling), CHEK2 and CDC27 (cell cycle), CUL3 and SOD2 (oxidative stress), and CSMD3 and TFG (immune response). In the expansion cohort (N = 70), TP53 was the most frequently altered gene (11%), followed by SETD2 (6%), CSMD3 (6%), ERBB2 (6%), and CDH10 (4%). In pathway analysis, the majority of altered genes were involved in cell cycle/DNA repair (P <0.001) and cAMP-dependent protein kinase signaling (P <0.001). Conclusions The genomic makeup of EGFR/KRAS/ALK-negative lung adenocarcinomas in never-smokers is remarkably diverse. Genes involved in cell cycle regulation/DNA repair are implicated in tumorigenesis and represent potential therapeutic targets.


Background
Lung cancer is the leading cause of cancer deaths worldwide [1]. In 2008, 1.38 million deaths were attributed to lung cancer, accounting for approximately 20% of cancer-related deaths. Lung cancer is a highly heterogeneous disease with regard to its etiology, prognosis, and response to therapy, complicating both prevention and treatment [1]. Non-small cell lung cancer accounts for approximately 85% of newly diagnosed lung cancers and can be classified into two major histologic subtypes: adenocarcinoma (approximately 50% of cases) and squamous cell carcinoma (approximately 30%). Although the majority of lung cancer cases are attributed to tobacco smoke, approximately 25% of lung cancer cases worldwide occur in never-smokers. Lung cancers in never-smokers have distinct clinicopathologic characteristics and clinical outcomes [2,3].
The discovery of driver alterations in genes such as epidermal growth factor receptor (EGFR) and anaplastic lymphoma kinase (ALK) has led to remarkable improvements in personalized therapies for lung adenocarcinoma [4]. For example, erlotinib and gefitinib have been particularly efficacious in patients with EGFR mutations, and crizotinib in patients with ALK fusions [5,6]. Actionable genetic alterations that are treatable with therapeutic agents have been identified in approximately 50% of lung adenocarcinomas and include mutations in EGFR, ERBB2, KRAS, ALK, BRAF, PIK3CA, AKT1, ROS1, NRAS, and MAP2K1 [4]. Therefore, identification of novel druggable targets in the remaining 50% of lung adenocarcinomas is a top research priority.
In this study, we analyzed exome sequencing data from 16 EGFR/KRAS/ALK-negative tumors with paired normal samples and from an expansion cohort of 54 EGFR/KRAS-negative lung adenocarcinomas to identify novel somatic mutations in lung adenocarcinomas of never-smokers.

Preparation of clinical samples
Fresh tumor and adjacent normal lung tissues were obtained by surgical procedures. Clinical information including age, sex, smoking history, tumor histology, and tumor stage based on the seventh edition of the American Joint Committee on Cancer staging system was collected. Never-smokers were defined as patients who had a lifelong history of smoking fewer than 100 cigarettes. All patients provided informed consent. The study was approved by the Institutional Review Board of Severance Hospital (4-2011-0891) and conducted in accordance with the Helsinki Declaration [15].

Whole exome sequencing
Extracted DNA was sheared and a genomic library prepared using the NEBNext kit (New England BioLabs, Inc., Ipswich, MA, USA) according to the manufacturer's instructions. Exon sequences were captured using the TruSeq Exome Enrichment Kit (Illumina, San Diego, CA, USA). Whole exome sequencing was performed using the Illumina HiSeq2000 platform. Sequencing data are accessible at the Sequence Read Archive [16] (accession number SRA: SRP022932).
Mutated loci were annotated using ANNOVAR and PolyPhen-2. Only non-synonymous single nucleotide variants and indels in coding exons and splicing sites were included. Known single nucleotide polymorphisms with a minor allele frequency >5% in the 1000 Genomes Project Phase I East Asian population (April 2012) or the NHLBI Exome Sequencing Project 6500 (October 2012) were annotated and removed by ANNOVAR.

Validation by molecular-inversion probe capture
The molecular-inversion probe (MIP) capture method was used to validate 1,401 candidate loci identified in whole exome sequencing. We designed 3,726 probes to capture the candidate loci (Additional file 2: Table S7). The microarray-based MIP preparation and capture experiment followed MIP standard-operating procedures with modifications in the preparation of probes (manuscript in preparation) [17].
For MIP probe hybridization, 1 μg of genomic DNA, 1.5 μl of Ampligase buffer (Epicentre, Madison, WI, USA), 1 μl of probe mixture (genomic DNA to probe ratio, 1:90), and distilled H2O were combined to give a total volume of 15 μl. The reaction was carried out for 5 minutes at 95°C and then the temperature was decreased to 60°C at a rate of 0.1°C per second followed by incubation for 24 hours at 60°C. After addition of 0.2 μl of Phusion polymerase (New England BioLabs, Inc.), 1 μl of deoxyribonucleotide triphosphate (New England BioLabs, Inc.), 0.2 μl of Ampligase buffer (Epicentre), 4 units of Ampligase DNA ligase (Epicentre), and 0.3 μl of distilled H2O, the mixture was incubated for an additional 24 hours.
PCR products were purified with a Qiagen gel extraction kit, mixed equally based on concentrations determined using a Qubit 2.0 fluorometer (Invitrogen, Carlsbad, CA, USA), and sent for Illumina HiSeq2000 sequencing.

Molecular-inversion probe capture data analysis
The raw data were aligned to the human reference genome (hg19) by Novoalign. Aligned data on the position of candidate loci were selected and transformed to pileup format by SAMtools. The candidate loci were defined as validated loci if the variant base was the same as that of whole exome sequencing and the following criteria were satisfied: variant allele frequency in tumor ≥5%, reads supporting variant allele in normal ≤2.
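The validation rule described above can be illustrated with a minimal sketch. It assumes that per-locus read counts have already been extracted from the pileup data; the `LocusCounts` record, its field names, and the `is_validated` helper are hypothetical and are not part of the original pipeline. Only the two thresholds (tumor variant allele frequency ≥5%, variant-supporting reads in normal ≤2) and the requirement that the variant base match the exome call come from the text.

```python
from dataclasses import dataclass


@dataclass
class LocusCounts:
    """Hypothetical per-locus summary derived from pileup output."""
    exome_alt: str          # variant base called by whole exome sequencing
    mip_alt: str            # variant base observed in MIP capture data
    tumor_alt_reads: int    # tumor reads supporting the variant allele
    tumor_total_reads: int  # all tumor reads covering the locus
    normal_alt_reads: int   # normal reads supporting the variant allele


def is_validated(locus, min_tumor_vaf=0.05, max_normal_alt_reads=2):
    """Apply the three validation criteria stated in the text."""
    if locus.tumor_total_reads == 0:
        return False  # no coverage, so the locus cannot be assessed
    if locus.mip_alt != locus.exome_alt:
        return False  # variant base differs from the exome call
    tumor_vaf = locus.tumor_alt_reads / locus.tumor_total_reads
    return (tumor_vaf >= min_tumor_vaf
            and locus.normal_alt_reads <= max_normal_alt_reads)


# Illustrative counts only, not data from the study.
example = LocusCounts("T", "T", tumor_alt_reads=18, tumor_total_reads=150,
                      normal_alt_reads=1)
print(is_validated(example))
```

Keeping the thresholds as keyword arguments makes it straightforward to examine how sensitive the validated set is to the chosen cut-offs.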

Validation by Sanger sequencing
Candidate driver mutations and randomly selected validated loci were chosen for additional validation and appropriate primer pairs for Sanger sequencing were designed (Additional file 2: Table S8). PCR products were purified and sent for Sanger sequencing (Macrogen, Seoul, Korea). Sequencing data were analyzed with SeqMan (DNASTAR, Madison, WI, USA).

Canonical pathway analysis
Pathway analysis was performed using Ingenuity Pathway Analysis (Ingenuity® Systems [18]). Pathways associated with a set of focus genes were identified from the Ingenuity Pathway Analysis library of canonical pathways. A right-tailed Fisher's exact test was used to calculate a P-value estimating the probability that the association between the focus genes and a given canonical pathway was due to random chance. The more focus genes a pathway contains, the less likely the association is due to random chance, and thus the more significant the P-value.
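For readers unfamiliar with the test, a right-tailed Fisher's exact test on a gene-by-pathway 2 × 2 table is equivalent to the upper tail of a hypergeometric distribution. The sketch below shows that calculation using only the Python standard library. It is not the Ingenuity implementation: the function name, the argument names, and the choice of background gene set are assumptions made for illustration.

```python
from math import comb


def right_tailed_fisher(overlap, n_focus, n_pathway, n_background):
    """Return P(X >= overlap) for X ~ Hypergeometric.

    overlap      : focus genes that belong to the pathway
    n_focus      : number of focus (altered) genes
    n_pathway    : number of genes annotated to the pathway
    n_background : number of genes in the reference set (an assumption;
                   the appropriate background depends on the tool used)
    """
    max_overlap = min(n_focus, n_pathway)
    total = comb(n_background, n_focus)
    # Sum the probabilities of observing `overlap` or more pathway genes.
    # comb(n, k) is 0 when k > n, so impossible tables contribute nothing.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_focus - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Illustrative numbers only, not values from the study.
print(right_tailed_fisher(overlap=12, n_focus=200, n_pathway=100,
                          n_background=20000))
```

Because only the upper tail is summed, the P-value decreases as the overlap grows, which is the behavior described in the text.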

Statistical analyses
Continuous clinical data (for example, age) were compared using independent Student's t-tests. Categorical data (for example, sex, ethnicity, stage, mutation frequency) were compared using a chi-squared test. P-values were corrected for multiple comparisons by controlling the false discovery rate using the method of Benjamini and Hochberg. All statistical tests were two-tailed, and a P-value ≤0.05 was considered statistically significant. Data analyses were performed using R statistical software version 2.15.3 [19].
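The Benjamini and Hochberg adjustment can be sketched as follows. The analyses in this study were run in R; this Python version is only an illustration of the step-up procedure, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative P-values only, not values from the study.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw P-value is scaled by m/rank, and the running minimum ensures that a smaller raw P-value never receives a larger adjusted value than a larger one.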

Exome sequencing of EGFR/KRAS/ALK-negative tumors in never-smokers
We screened 230 surgically resected lung adenocarcinoma samples to identify EGFR/KRAS/ALK-negative tumors in never-smokers. A total of 16 tumors (7% of all non-small cell lung cancers) were eligible for exome sequencing. Tumor and normal samples had an average sequencing depth of 51.9× and 52.0×, respectively, with average coverage of 94.6% for each (Additional file 1: Table S1). Somatic variants were validated by target capture sequencing (154-fold depth; average coverage of 94%) and Sanger sequencing, which had a concordance rate of 94% with exome sequencing (Additional file 1: Tables S2 and S3). We detected a median number of 10 non-synonymous mutations and indels per tumor (range 3 to 27; Additional file 1: Table S4). The median rate of non-synonymous mutation was 0.32 mutations per megabase, which was comparable to that in previous reports for never-smokers [11,12]. The average ratio of transitions to transversions was 1.95; G:C → A:T transitions (37%) were the most frequent followed by A:T → G:C transitions (21%), consistent with a previous lung cancer exome study [11]. Validated loci were further analyzed for functional prediction of amino acid changes using four different prediction algorithms (SIFT, PolyPhen-2, LRT, and Mutation Taster) (Additional file 1).
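The transition-to-transversion ratio and the base-pair classes reported above (for example, G:C → A:T) can be derived from a list of single nucleotide substitutions as in the following sketch. The helper names are hypothetical, the input is assumed to be (reference base, variant base) pairs on either strand, and the example substitutions are illustrative rather than taken from the study. The code writes the arrow as `>` for simplicity.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_transition(ref, alt):
    """Transitions swap purine for purine or pyrimidine for pyrimidine."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def base_pair_class(ref, alt):
    """Collapse a substitution into a strand-independent base-pair class."""
    if ref in PYRIMIDINES:
        # Report the change from the strand carrying the purine reference.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


def ti_tv_ratio(substitutions):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    ti = sum(1 for ref, alt in substitutions if is_transition(ref, alt))
    tv = len(substitutions) - ti
    return ti / tv if tv else float("inf")


# Illustrative substitutions only.
calls = [("G", "A"), ("C", "T"), ("A", "G"), ("G", "T")]
print(ti_tv_ratio(calls))
print([base_pair_class(ref, alt) for ref, alt in calls])
```

Collapsing complementary changes (for example, C>T and G>A) into one class is what allows the six base-pair categories to be compared across studies regardless of the strand on which a variant was called.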

Investigation of functional domains in altered genes
Of 32 loci shown in Figure

Somatic mutations in an expansion cohort of EGFR/KRAS-negative tumors in never-smokers
To validate and expand our mutation analysis in lung adenocarcinoma in never-smokers, we collected an expansion dataset from five available lung adenocarcinoma studies [9-13] and a lung carcinoma study from The Cancer Genome Atlas [21], with no overlap with the study of Imielinski et al. [11]. Clinical information of all patients including sex, age, tumor stage, and ethnicity is given in Table 1. A total of 54 EGFR/KRAS-negative tumors from never-smokers were analyzed. Information on non-synonymous and splicing site mutations was extracted from a pooled dataset. The median rate of non-synonymous mutations in EGFR/KRAS-negative never-smokers was approximately 0.65 mutations per megabase and the median number of non-synonymous mutations per patient was 19.0. The average ratio of transitions to transversions was 1.07 and G:C → A:T transitions (40%) were the most frequent, consistent with our data. Comparison of altered genes among the three cohorts is shown in Figure 2. SETD2 and CSMD3 were altered in all three cohorts (Figure 2A). Commonly altered genes with information on affected loci, amino acid changes, and functional predictions are summarized in Table 2 (full information is provided in Additional file 1: Table S6). The most frequently mutated gene was TP53, which was altered in 11% of tumors, followed by SETD2 (6%, 4 of 70 cases), CSMD3 (6%, 4 of 70 cases), and ERBB2 (6%, 4 of 70 cases). PTPRC, SYNE2, GRIN2A, CDH10, and SMAD4 were each altered in 3 of 70 cases (4%). SETD2 interacts with p53 and regulates genes downstream of p53 in addition to increasing p53 stability [22]. Mutations in SETD2 were nonsense mutations in three cases and a missense mutation in one case. The missense mutation V1576F is located in the SET domain; one nonsense mutation, R839*, is a truncating mutation upstream of the SET domain, and two nonsense mutations, Q1981* and K2067*, are truncating mutations upstream of the WW domain.
In addition to known cancer driver genes such as ERBB2 (6% of cases), NRAS (3%), MET (3%), PIK3CA (1%), AKT2 (1%), TSC1 (1%), and ERBB4 (1%), several putative cancer genes were identified, such as PTPRC [23], SYNE2 [24], GRIN2A [25], and CDH10 [26]. The mutation pattern is summarized in Figure 2B.
Pathway analysis of 1,760 genes that were altered in 70 EGFR/KRAS-negative tumors of never-smokers revealed alterations in genes related to DNA repair and the cell cycle, including components of p53/ATM signaling, G1/S or G2/M checkpoint regulation, and non-homologous end joining (Figure 2C). The most significantly enriched pathway was cAMP-dependent protein kinase A signaling, which can activate the mitogen-activated protein kinase cascade in lung adenocarcinoma [27]. Other enriched functions of altered genes were calcium transport (P = 0.006), axonal guidance (P = 0.015), and Ephrin A signaling (P = 0.031).

Discussion
The somatic mutation profile in lung adenocarcinomas lacking targetable EGFR or KRAS mutations or ALK rearrangements in never-smokers is highly complex. Our exome analysis of 70 tumors identified several common mutations involving the known cancer genes TP53, NRAS, ERBB2, PIK3CA, and CTNNB1, but also mutations in SETD2, CSMD3, PTPRC, and SYNE2 (Figure 3). SETD2 (mutated in 6% of cases) is a histone methyltransferase that is involved in transcriptional elongation and chromatin remodeling. Interaction with p53 is facilitated by the SET and WW domains and might increase p53 stability [22]. Interestingly, SETD2 and TP53 mutations were mutually exclusive in lung adenocarcinoma of never-smokers (Figure 2B). CSMD3 (mutated in 6% of cases) is a transmembrane protein with CUB and sushi multiple domains that is thought to function in protein-protein interactions and the immune response. Recent studies showed that loss of CSMD3 increases proliferation of airway epithelial cells [9] and may be involved in tumorigenesis in lung cancer. In our study, missense mutations (P667S, M1440I, K1928N, and Y2028C) in CSMD3 were predicted to be deleterious to protein function. PTPRC (mutated in 4% of cases) is a member of the protein tyrosine phosphatase family and regulates a variety of cellular processes including cell growth, differentiation, and tumorigenesis. PTPRC regulates the JAK/STAT signaling pathway and functional defects can activate JAK/STAT signaling [23]. We observed three missense mutations (Y444N, T453M, T1176M) in PTPRC, all of which were predicted to be deleterious. SYNE2 plays a role in cadherin-mediated cell-cell adhesion and regulates the Wnt signaling pathway [24].
More than 200 putative cancer-causing genes have been identified in recent genomic landscape studies using next-generation sequencing technology, and several cellular processes not previously implicated in cancer have been revealed, such as chromatin remodeling, splicing, and ubiquitination [31,32]. We identified alterations in genes involved in chromatin remodeling (PBRM1, SETD2), oxidative stress (CUL3, SOD2), immune response (CSMD3, SYK), and gamma-aminobutyric acid receptor signaling (GABRD, GABRG1) in lung adenocarcinoma. Interestingly, although somatic mutation is rare in EGFR/KRAS/ALK-negative lung adenocarcinoma of never-smokers, the PCDHB14 (cell adhesion) Y670S mutation and the YTHDF1 (RNA binding) I492V mutation were each found in two cases (12.5%). Future studies to elucidate the role of these newly implicated functions in tumorigenesis are warranted.

Conclusions
We identified novel somatic mutations in EGFR/KRAS/ALK-negative lung adenocarcinoma in never-smokers and investigated the mutation frequency of altered genes. EGFR/KRAS/ALK-negative lung adenocarcinoma in never-smokers is highly heterogeneous at the somatic mutation level. However, most of the altered genes were involved in the cell cycle, and might represent novel therapeutic targets in lung adenocarcinoma. Future research on the functional role of chromatin remodeling, oxidative stress/differentiation, and the immune response will enhance our understanding of the mechanisms of tumorigenesis.

Additional files
Additional file 1: Figure S1. Analysis flow chart for exome-sequencing data. Table S1. Summary of depth and coverage of whole exome sequencing. Table S2. Summary of depth and coverage in target capture sequencing for validation. Table S3. Validation results using target capture sequencing and Sanger sequencing. Table S4. Summary of validated somatic exonic mutations in EGFR/KRAS/ALK-negative lung adenocarcinomas. Table S5. Somatic mutations in EGFR/KRAS/ALK-negative lung adenocarcinoma exomes. Table S6. Mutated genes and loci information in EGFR/KRAS-negative lung adenocarcinoma.
Additional file 2: Table S7. Sequences of molecular inversion probes. Table S8. Sequences of primers used for Sanger sequencing.