Clinical and molecular characterization of virus-positive and virus-negative Merkel cell carcinoma

Merkel cell carcinoma (MCC) is a highly aggressive neuroendocrine carcinoma of the skin caused by either the integration of Merkel cell polyomavirus (MCPyV) and expression of viral T antigens or by ultraviolet-induced damage to the tumor genome from excessive sunlight exposure. An increasing number of deep sequencing studies of MCC have identified significant differences between the number and types of point mutations, copy number alterations, and structural variants between virus-positive and virus-negative tumors. However, it has been challenging to reliably distinguish between virus positive and UV damaged MCC. In this study, we assembled a cohort of 71 MCC patients and performed deep sequencing with OncoPanel, a clinically implemented, next-generation sequencing assay targeting over 400 cancer-associated genes. To improve the accuracy and sensitivity for virus detection compared to traditional PCR and IHC methods, we developed a hybrid capture baitset against the entire MCPyV genome and software to detect integration sites and structure. Sequencing from this approach revealed distinct integration junctions in the tumor genome and generated assemblies that strongly support a model of microhomology-initiated hybrid, virus-host, circular DNA intermediate that promotes focal amplification of host and viral DNA. Using the clear delineation between virus-positive and virus-negative tumors from this method, we identified recurrent somatic alterations common across MCC and alterations specific to each class of tumor, associated with differences in overall survival. Finally, comparing the molecular and clinical data from these patients revealed a surprising association of immunosuppression with virus-negative MCC and significantly shortened overall survival. These results demonstrate the value of high-confidence virus detection for identifying molecular mechanisms of UV and viral oncogenesis in MCC. Furthermore, integrating these data with clinical data revealed features that could impact patient outcome and improve our understanding of MCC risk factors.


Background
Merkel cell carcinoma (MCC) is a highly aggressive neuroendocrine carcinoma of the skin. Risk factors for developing MCC include advanced age, light skin color with excessive sunlight exposure, and a variety of immunocompromised conditions [1]. In 2008, Merkel cell polyomavirus (MCPyV) was first detected by Southern blot in some but not all MCC tumors with integration of viral DNA occurring at several different chromosomal sites. Importantly, an identical clonal integration pattern was detected in one primary tumor and corresponding metastatic lymph node [2]. This important insight implied that integration of the viral DNA was an early if not initiating event in virus-positive MCC oncogenesis. MCPyV infects most people, typically at an early age, and results in an asymptomatic and lifelong infection indicated by the persistent presence of antibodies to the viral coat protein VP1 [3,4]. Although MCPyV DNA can be readily detected on the skin, the cell types where the virus replicates in vivo have not been determined [5].
Since the original discovery of MCPyV, it has become increasingly clear that virus-positive MCC has a different etiology than virus-negative, UV-associated, MCC [1]. Viruspositive MCC expresses the viral oncogenes large T antigen (LT) and small T antigen (ST) and the tumor genome usually contains very few mutations in cellular oncogenes and tumor suppressor genes. In contrast, studies using whole exome or targeted hybrid capture sequencing have revealed that virus-negative MCC has an exceptionally high somatic mutation load predominated by UV-mediated mutations with frequent mutations in RB1, TP53, NOTCH1, and FAT1 [6,7]. Whole genome sequencing (WGS) of MCC confirmed virus-positive MCC exhibits a globally lower, non-UV-mediated, mutation burden as well as few somatic copy number amplifications, deletions, and rearrangements compared to virus-negative MCC, while providing new insights into the structure and mechanism of virus integration [8].
Accurate detection of the presence of MCPyV and distinguishing between virus-positive and virus-negative MCC is important for insight into the oncogenesis, cellof-origin, and therapeutic options. Currently, there is no routine clinical effort to distinguish between viruspositive MCC and virus-negative MCC. Several recent studies have suggested differences between virus-positive MCC and virus-negative MCC in presentation, age, and response to immunotherapy [9][10][11][12][13][14][15]. However, current techniques for determining viral status have yielded either inaccurate or ambiguous results. Although WGS provides much more genetic information on the tumor and viral genome compared to targeted approaches, it remains impractical for clinical evaluation of MCC.
The most common methods for detection of MCPyV in MCC include PCR amplification of MCPyV DNA from DNA isolated from MCC tumors or immunohistochemistry (IHC) staining for MCPyV LT using monoclonal antibodies CM2B4 and Ab3 [16,17]. However, both PCR and IHC have been shown to be unreliable in distinguishing between virus-positive from virus-negative MCC. For example, a recent study of 282 cases of MCC evaluated virus positivity by IHC with monoclonal antibodies CM2B4 and Ab3 or by PCR with a previously validated primer set [18]. Notably, there was concordance for all three assays in only 167 of 282 (59.2%) cases with an additional 62 cases positive for two of the three tests. The remaining 53 (18.8%) were positive for one test or none. This study assigned the MCC to be virus-positive if two or three tests were positive, implying that detection of viral DNA by PCR alone was not sufficient for a tumor to be called virus-positive MCC. Furthermore, because of the sensitivity of PCR in detecting DNA, a lower limit of 0.01 copy of MCPyV DNA per tumor cell was called virus-positive MCC. Tumors containing < 0.01 viral copies/cell were called virus-negative. A different study using RNA-ISH to detect mRNA specific for MCPyV LT and ST found this method to be as sensitive as qPCR when using two primer sets and the viral copy number was set to > 0.004/cell [19]. The AMERCK test detects circulating antibodies against the MCPyV ST [20]. The sensitivity of this test is low for detection of virus-positive MCC but, when positive, can be used as a biomarker for disease status [20].
The high somatic mutation burden in virus-negative MCC is predicted to yield more tumor neoantigens than melanomas or non-small cell lung cancers (median of 173, 65, and 111 neoantigens/sample, respectively) [21] [22]. As observed for other tumor types, the high neoantigen burden in virus-negative MCC corresponds to a higher degree of tumor infiltrating lymphocytes in some tumors, but these tumors also express PD-L1 rendering these lymphocytes ineffective [7]. Despite the numerous observed differences in mutation rate and number of predicted neoantigens, both virus-positive MCC and virus-negative MCC tumors have shown high response rates to PD-L1 and PD1 checkpoint blockade therapy [14,15].
For further advancements to be made in understanding MCC, especially for patients not responsive to current therapies, clear and accurate determination of the MCPyV virus status and actionable variants in these tumors are required. In this study, we developed a viral hybrid capture next-generation sequencing (NGS) method to detect the presence of integrated MCPyV DNA in FFPE clinical specimens for routine use in a clinical setting. This approach was combined with targeted sequencing of several hundred cancer-related genes to assess oncogenic changes in the tumor genome. Lastly, we compared the sensitivity and accuracy of this viral hybrid capture approach to more traditional approaches, PCR detection of viral DNA, IHC for detection of MCPyV LT, and synoptic assessment of MCC pathology.

Study design and participants
This study included all patients (n = 71) with a reported diagnosis of MCC at Dana-Farber/Brigham and Women's Cancer Center who underwent comprehensive genomic profiling by OncoPanel between May 2013 and April 2018. OncoPanel version 3 (POPv3) is a custom hybrid capture assay targeting the exons of 447 genes and 191 regions across 60 genes commonly rearranged in cancer [23,24]. A retrospective chart review collected demographic, clinical, disease, treatment, and outcome variables on all 71 patients. For 40 patients, sufficient DNA remained from the initial OncoPanel profiling or from additional FFPE tumor specimens to perform POPv3/ViroPanel. When available, FFPE sections were sectioned for immunohistochemistry with antibodies CM2B4 and Ab3 [17]. Sections stained with hematoxylin and eosin were evaluated by synoptic review [25].

Nucleic acid isolation, library preparation and sequencing
To perform ViroPanel with and without supplementation with the OncoPanel (v3) bait set, purified DNA was quantified using a Quant-iT PicoGreen dsDNA assay (Thermo Fisher). Library construction was performed using 200 ng of DNA, which was first fragmented to~250 bp using a Covaris LE220 Focused ultrasonicator (Covaris, Woburn, MA) followed by size-selected cleanup using Agencourt AMPureXP beads (Beckman Coulter, Inc. Indianapolis, IN) at a 1:1 bead to sample ratio. Fragmented DNA was converted to Illumina libraries using a KAPA HTP library kit using the manufacturer's recommendations (Thermo Fisher). Adapter ligation was done using xGen dual index UMI adapters (IDT, Coralville, IA).
Samples were pooled in equal volume and run on an Illumina MiSeq nano flow cell to quantitate the amount of library based on the number of reads per barcode. All samples yielded sufficient library (> 250 ng) and were taken forward into hybrid capture. Libraries were pooled at equal mass (3 × 17-plex and 1 × 18-plex) to a total of 750 ng. Captures were done using the SureSelect XT Fast target enrichment assay (Agilent, Technologies, Santa Clara, CA) with ViroPanel with and without supplementation with the OncoPanel (v3) bait set. Captures were sequenced on an Illumina 2500 in rapid run mode (Illumina Inc., San Diego, CA).

Sequence alignment and somatic variant calling
Pooled samples were de-multiplexed and sorted using Illumina's bcl2fastq software (v2.17). Reads were aligned to the reference sequence b37 edition from the Human Genome Reference Consortium as well as viral genomes targeted by the Virus Capture Baitset v2 using bwa mem (http://bio-bwa.sourceforge.net/bwa.shtml) [26]. The viral genomes and human genome were combined into one alignment reference so reads could map to the closest matching reference sequence.
Duplicate reads were identified using unique molecular indices (UMIs) and marked using the Picard tools. The alignments were further refined using the Genome Analysis Toolkit (GATK) for localized realignment around indel sites and base quality score recalibration [27,28].
Mutation analysis for single nucleotide variants (SNV) was performed using MuTect v1.1.4 (CEPH control was used as the "project normal") and annotated by Variant Effect Predictor v 79 (VEP) [29,30]. We used the Somati-cIndelDetector tool that is part of the GATK for indel calling. After initial identification of SNVs and indels by MuTect and GATK respectively, the variants were annotated using OncoAnnotate to determine what genes were impacted and their effect on the amino acid sequence. OncoAnnotate also applied additional filters using the Exome Sequencing Project (ESP) and gnomAD datasets to flag common SNPs.
Variants that affect protein coding regions underwent further filtering/classification based on frequency in the gnomAD, ESP, and COSMIC (version 80) databases. If the frequency of the variant was less than or equal to 1% in all gnomAD and ESP populations, the variant was flagged as "REVIEW_REQUIRED". If the frequency of the variant was greater than 1% and less than or equal to 10% in all gnomAD and ESP populations and present in "COSMIC" database at least two times, the variant was flagged as "RE-VIEW_REQUIRED". If the frequency of the variant was between 1% and less than or equal to 10% in all gnomAD and ESP populations and not present in "COSMIC" database at least two times, the variant is flagged as "NO_RE-VIEW_GERMLINE_FILTER". If the frequency of the variant was greater than 10% in any gnomAD and ESP populations, the variant was flagged as "NO_REVIEW_ GERMLINE_FILTER". Variants with a frequency greater than 10% in any gnomAD or ESP population were considered to be a common SNP irrespective of presence in the COSMIC database.
Variants in the viral genomes were called using samtools mpileup and bcftools from the aligned bam files. Called variants were filtered to have a minimum coverage of 5 reads and minimum allele frequency of 1% of total reads covering that base in a single sample. Variants were annotated based on the NC_010277.2 reference sequence in GenBank using SnpEff [31].

Recurrent copy number analysis
Copy number variant calling was performed using a combination of VisCap Cancer and CNVkit as previously described [32,33]. All resulting gene copy number variants from all patients were compared against each other with UV status and significant mutual exclusivity/co-occurrence was calculated using Fisher's exact test corrected by FDR for multiple comparisons in the R statistical environment. Using the network and iGraphs packages the significantly cooccurrent variants were clustered into networks. The genes belonging to each distinct network cluster with more than five member genes were then labeled and extracted. Using these gene lists as cluster definitions, each patient was evaluated for presence or absence of each CNV cluster. Presence of a CNV cluster was determined if more than 50% of the member genes of that cluster were modified in the same patient. Copy number variants from TCGA were retrieved from cBioPortal (https://www.cbioportal.org/) and plotted using ggplot2 in the R statistical environment [34,35].

Viral integration analysis
A custom perl script was written to extract, assemble, annotate, and visualize viral reads and determine viral integration sites. Viral reads and their mates were first identified and extracted by those that have at least one mate map to the viral genome. Additional reads containing viral sequence were identified by a bloom filter constructed of unique, overlapping 31 bp k-mers of the MCPyV genome [36]. The human genome positions for any read with a mate mapping to the viral genome were output into a bed file and the orientation of viral and human pairs was stored to accurately deconvolute overlapping integration sites. This bed file was then merged down into overlapping ranges based on orientation counting the number of reads overlapping that range. Skewdness in coverage of integration junctions was calculated by the difference in the fraction of virus-host read pairs overlapping the first and second halves of the aforementioned ranges. This skewdness value was used to determine the orientation of the viral-host junction (i.e., positive values, junction is on the 3′ end of the range; negative values, junction is on the 5′ end of the range), which was validated from the results of de novo assembly. Integrated viral genomes were assembled from extracted reads using SPAdes with default parameters [37]. The assembly graphs from SPAdes were annotated using blastn against hg19 and the MCPyV reference genome with an e-value cutoff of 1 × 10 − 10 . Annotated assembly graphs were visualized using the ggraph R package.
Integrations sites confirmed by reference guided alignment and assembly data were analyzed for stretches of microhomology between the human and viral genomes by selecting 10 bp upstream and downstream of the integration junction on the viral and human genomes. Within these sequences stretches of identical sequence at the same position longer than two base pairs were counted. Overall homology between the sequences was calculated by Levenshtein distance. Three integration junctions with indeterminate DNA sequence ranging from 1 to 25 bp inserted between viral and human DNA were excluded from analysis. Expected microhomology was calculated by randomly selecting 1000 20 bp pairs of non-N containing sequence from the human and MCPyV genomes.
Integration site proximity to repeat elements were determined using bedtools closest and repeatmasker annotations acquired from the UCSC genome browser [38]. Expected frequency of integration near repeat elements was determined by randomly selecting 1000 sites in the human genome. Sites within 2 kb of a repeat element were counted as close proximity.
Functional annotation of somatic mutations and viral integration events was performed using PANTHER (www.pantherdb.org) [39].

Statistics
The association between relapse and genomic characteristics are tested with Fisher's exact test using all patient sequencing data regardless of primary or recurrence biopsy. Overall survival (OS) is defined as the time from initial diagnosis to death, and patients who did not die are censored at the last follow-up date. The 95% confidence intervals of the median OS times are estimated using log(−log(OS)) methodology. Statistical significance is defined as p ≤ 0.05.
Associations between recurrent CNV, TMB, or viral copies and overall survival were calculated and graphed using GraphPad Prism 7. Fisher's exact test and Kaplan-Meier curves were computed with the R statistical environment. Significant enrichment of microhomology and repeat elements at integrations sites was determined using Fisher's exact test between observed and expected events.

Human subjects
This study was conducted according to the Declaration of Helsinki principles and approved by the Dana-Farber Cancer Institute institutional review board. Written informed consent was received from participants prior to inclusion in the study.

Summary of patient cohort
A total of 71 patients diagnosed with MCC were included in this study ( Table 1). The median (95% CI) follow-up duration from initial diagnosis of MCC was 47 (95% CI: 38-60) months based on inverse Kaplan-Meier estimation. Overall, 69 enrolled patients were white and two were black. Forty (56%) patients were male. The median age was 70 years (range < 50 to 93). The initial site of MCC presentation was in the head and neck (27%), upper extremity (20%), lower extremity (21%), and trunk (32%). The seventh edition TNM staging system of the American Joint Committee on Cancer (AJCC) was used to classify the initial presentation of MCC with 27% presenting at stage I, 14% stage II, 42% stage III, and 17% stage IV.
Copy number variants (CNVs) were examined individually as well as against each other and other likely functional somatic changes for significant co-occurrence or mutual exclusivity (Additional file 1: Table S2). Clusters of significantly co-occurrent CNVs were determined via network analysis (Fig. 2a, Additional file 2: Fig. S2 & Fig. S3). From these analyses, two distinct CNV clusters were each found to be altered in more than 36% of cases ( Fig. 2b, c). Chromosome 10 (cluster 14) had frequent copy number loss with 26 tumors showing heterozygous or homozygous loss of the chromosome (Fig. 2b) [41]. Some cancer-relevant genes on chromosome 10 include PTEN and SUFU, negative regulators of PI3K and Hedgehog signaling respectively, with deletions reported in prior studies of MCC [41,42]. A region of Chr1q (cluster 13) was amplified in 28 cases. This region includes MDM4 (also known as MDMX), whose protein product cooperates with MDM2 to promote the ubiquitination and subsequent degradation of p53 ( Fig. 2b) [43,44]. In addition, we observed a focal amplification of MYCL within a greater amplification of Chr1p (cluster 4), which was reported in an earlier study of MCC [45].
CNV clusters 13 and 14 were observed at nearly equal frequencies in both TMB-high and TMB-low cases (Fig. 2b, c). Six other CNV clusters were strongly associated with UV signature and high TMB (Fig. 2c). Functional annotation of the clusters revealed that the two largest UV-associated CNV clusters (1 and 3) had significant enrichment for genes related broadly to DNA damage response and S-phase DNA damage checkpoint likely enhancing tolerance for UV mutagenesis.
Cluster 5, corresponding to 6p22.3 to 6q26 and likely representing a gain of the entire chromosome 6, was the only cluster more than twice as frequent in TMB-low tumors than TMB-high tumors (Fig. 2c). Interestingly, 33.3% (6/18) of metastatic tumors carried cluster 5 and all but one of these metastatic tumors were TMB-low MCC. Furthermore, CNV cluster 5 was 2.5 times more frequent in TMB-low (25%, 11/44) than TMB-high (11%, 2/18) tumors in primary tumors. Both TMB-low and TMB-high patients with amplification of CNV cluster 5 had significantly improved overall survival compared to wild type carrying patients (p = 0.005). Restricting this analysis to only primary tumors, revealed that there were no deaths at the time of this study in patients carrying this amplification (p = 0.007)  Fig. S4).
The recurrent copy number events on chromosomes 1, 6, and 10 were compared within The Cancer Genome Atlas (TCGA) for similarities to other tumor types ( Fig. 2f-h). This analysis revealed that the chromosome 1 (cluster 13) amplification was also frequently observed in ovarian, breast, and bladder cancers, whereas the chromosome 10 (cluster 14) loss was most frequently seen in prostate cancer. Gain of chromosome 6 (cluster 5) was most frequently seen in ovarian, bladder, and esophago-gastric cancers.

Analysis of viral sequences in tumors
Of the 71 tumors analyzed by OncoPanel, 48 with sufficient remaining material were re-analyzed by OncoPanel (Profile/OncoPanel version 3, POPv3) combined with a hybrid-capture probe bait set targeting the entire genome of MCPyV and other known oncogenic viruses (ViroPanel). For the 48 cases, the number of MCPyV reads ranged from 0 to 21,095,751 with only a single case having zero MCPyV reads (Fig. 3a). In total, 28 cases had substantial reads (> 6800) mapping to the MCPyV genome that also supported integration of the virus into the host genome through reads and read pairs that span integration junctions. For the remaining 20 cases without evidence of integration, the number of viral reads ranged from 0 to 971. Generally, these cases had reads that covered less than 10% of the viral genome with the normalized coverage less than two logs compared to samples with evidence for virus integration (Fig. 3b, c). Concordantly, the viral reads from most of these cases were unable to be assembled into larger viral contigs. Two cases, MCC011 and MCC015, had 212 and 177 MCPyV reads that could be assembled into nine and five contigs each smaller than 761 base pairs, respectively. Case MCC007 had the most reads of any likely virus-negative sample and could be assembled into a single 5343 bp contig. However, analysis of the point and deletion variants in these aforementioned viral contigs revealed that they were identical to the virus sequence from patient MCC037 indicating that the viral reads resulted from low-level contamination (< 0.005% of MCC037 MCPyV reads were detected in other samples).
For the 28 cases with evidence for integration of the viral DNA into the tumor, the number of reads mapping to the viral genome ranged from 6824 to 21,095,751 (median 28,726). Consistent with previous reports, the integrated viral genome had undergone extensive mutagenesis a Representative network analysis clusters of significantly co-modified genes in MCC on chromosomes 1 (red), 6 (yellow), and 10 (blue). b Frequency of amplifications (red) and deletions (blue) for the genes comprising representative CNV clusters and their occurrence in each patient with UV, RB1, and TP53 status clustered by all variants. c Counts of each CNV cluster colored by TMB-low (blue), TMBintermediate (gray), and TMB-high (red) categories. Clusters that are nearly equivalent between TMB-low and TMB-high (< 2:1 ratio are highlighted by open triangles). The cluster that is more frequent in TMB-low than TMB-high is highlighted by a black-filled triangle. d Kaplan-Meier plot of overall survival stratified by chromosome 6 amplification for all patients. e Kaplan-Meier plot of overall survival stratified by chromosome 6 amplification for primary tumors. f-h Analysis of TCGA cancers for the two most abundant CNV clusters (13, 14, and 6, respectively) in MCC with large deletions (> 100 bp) particularly in the 3′ half of LT as well as in the viral coat protein genes VP1 and VP2 (Fig. 4). In 10 cases, approximately half of the total viral genome was deleted, 6 cases had approximately 25% of the viral genome deleted, while 12 cases had sequences corresponding to the entire or nearly complete genome (Figs. 3c and 4). In all but one of the cases with a nearly complete coverage of the viral genome, there was a clonal point mutation which inserted a premature stop codon in LT resulting in truncated proteins between 208 and 771 amino acids (Fig. 5a) similar to what has previously been seen in MCC cell lines and clinical cohorts. In a single case (MCC054), LT was truncated by a 5-bp deletion resulting in a frame shift that introduced a premature stop codon in frame. In all cases, the non-coding control region, the N-terminal 208 residues of LT, and an intact ST region of the viral genome were conserved.
Beyond indels and nonsense mutations, LT also carried numerous novel clonal missense mutations (Fig. 5a) unique to the patients in this cohort. In stark contrast, ST only had missense mutations at three residues, and the amino acid change A20S is consistent with a previously observed MCPyV strain difference (GenBank identical protein accession number: ACI25295.1). The other missense mutations occurred clonally at H41Y and N100S once in the entire cohort (Fig. 5b). Neither of these mutations are present in any of the ST sequences in GenBank and have not been previously reported.
The integration sites were mapped using the oncovirus tools suite (https://github.com/gstarrett/oncovirus_tools) (Fig. 6a, Table 2) [46]. As previously reported, integrations primarily fell into two categories: either those that appear as a single integration event or as two events separated by > 10 kilobases (kb) [8]. Interestingly, two cases had integration events in non-identical but overlapping sites in chromosome 1 (Fig. 6b). These represent the first reported cases of recurrent viral integration sites in MCC.
Based on previous MCC WGS, MCPyV integration sites frequently coincide with focal amplifications in the human genome. We can therefore infer that the regions between distant (> 10 kb) viral integration sites were amplified; however, no targeted exon were within these regions [8]. Because of the limited number of capture targets sequence by the OncoPanel platform, determining the exact boundaries of the expected virus-mediated amplifications in cases with junctions < 10 kb apart was  Table 2). When annotating these regions, we observed that they frequently contain enhancer regions that may contribute to oncogenesis as seen in HPV-associated tumors [47]. Uniquely, patient MCC026 had integrations in chromosomes 9, 16, and 18, all of which had integration sites separated by between 107.5 and 129.9 kbp appearing to be distinct events.
Using automated computational methods, we could not confidently determine an integration site for case MCC037 with the highest viral genome copy number in this study. Manually interrogating the human sequence hits from the assembly revealed that it matched a tandem repeat sequence flanked by MLT1H2 ERVL-MaLR elements. Based on the estimated copy number and the assembly graph, the viral component of this fusion DNA structure is likely larger than 10 Mbp (Additional file 3: Fig. S6).
With the high depth of coverage facilitated by the targeted NGS method, high-resolution assemblies for the integrated virus were generated. Many integrations that appeared as a single linear contig contained a single copy of the viral genome flanked by the host genome (Fig. 6c, Additional file 3: Fig. S6). However, other integrations generated more complex assembly graphs with a multiple contigs linked together in a "pigtails" conformation ( Fig. 6d, Additional file 3: Fig. S6). Based on coverage and conformation, this graph likely represents an integration event containing partially duplicated viral genome concatemers fused to different segments of the human genome. For samples with distant integration sites, the directionality of the virus-host junctions strongly supports a circular virus-host DNA fusion intermediate prior to reintegration into the host chromosome. This model is further supported by assemblies in which one arm of the fusion contains sequences from both distant sites of the human genome (Fig. 6e, Additional file 3: Fig. S6).
To address a possible mechanism for integration, we looked for microhomology between the human and MCPyV genomes at fusion junctions. We found significant enrichment for 4, 5, and 7 bp sequence microhomology at the site of integration compared to randomly selected sites in the human and MCPyV genomes  (Fig. 6f). There was no significant increase in overall homology between MCPyV and human DNA at integration sites versus randomly selected sites. Patient MCC027 had the integration site with the longest stretch of homology, and MCC041 had both the integration site with the greatest overall homology on its 3′ end and lowest homology with no microhomology greater than 1 bp on its 5′ end (Fig. 6g). Additionally, we annotated integration sites for proximity to repeat elements, including LINEs, SINEs, LTR retrotransposons, and simple repeats in the human genome. No type of repeat element was significantly enriched, but all integration sites were within 1.5 kb of a repeat element and there was a trend towards integrations near LTR retrotransposons and low-complexity regions (Fig. 6h).

Distinguishing virus-positive MCC from virus-negative MCC using somatic variants in comparison to immunohistochemistry and PCR
Given the striking differences in the number of mutations and mutational signature we observed in the Viro-Panel dataset that strongly correlated with virus integration, we compared the data from the OncoPanel and POPv3/ViroPanel datasets to determine the viral status of all 71 tumors studied (Table 3). From the OncoPanel sequencing, we identified off-target reads for MCPyV in a total of 18/71 cases, ranging from 1 to 194 reads total. When compared to the ViroPanel data, there was a rough correlation between the number of offtarget reads and the number of MCPyV reads in the Vir-oPanel dataset. There were 8 samples with MCPyV reads in the OncoPanel dataset that were not also analyzed by ViroPanel. None of these 8 cases have any evidence for a UV mutational signature.
We assessed the total number of mutations, TMB, UV signature, and detection of MCPyV reads to characterize each tumor as either virus-positive MCC or virus-negative MCC. Using these criteria, we called 25 tumors as virusnegative. All but one of the virus-negative MCC tumors had a UV mutational signature and had higher number of total mutations (18-73), higher TMB, and absence of integrated MCPyV compared to virus-positive MCC. The virus-negative MCC without a UV signature (MCC007) originally presented as a subcutaneous breast mass [48]. A total of 46 MCC tumors of the 71 analyzed were characterized as virus-positive. These virus-positive MCC had an absence of UV mutational signature, a lower number of total mutations (0-16), and lower TMB than did any of the virus-negative MCC. The TMB-low and -high categories had perfect concordance with virus-positive and virusnegative MCC determined by sequencing, respectively. The TMB-intermediate samples were mostly virusnegative (7/9), but the lowest two TMB patients in this category are likely virus-positive based on ViroPanel sequencing and absence of UV mutation signature.
FFPE sections were available for 28 of the 71 cases to assess for MCPyV LT by IHC with antibodies CM2B4 and Ab3. For 8 of the virus-negative MCC, all were negative by IHC with both antibodies. For 20 viruspositive MCC cases, we observed 16 stained positive with both antibodies and 4 were negative (Table 3). In addition, DNA was tested by PCR with 5 primer sets for 15 cases. In 9 virus-positive MCC cases, all returned positive results with 2 to 5 primer sets (Table 3). For 6 virus-negative cases, PCR was negative for 5 primer sets and one was positive with one primer set. Interestingly, the virus-negative MCC (MCC007) with one PCR primer set positive also ranked at the TMB borderline  (9.58) between virus-negative and virus-positive and did not score as having a UV mutational signature; rather, the majority of mutations were classified as APOBECassociated.
A synoptic review of dermatopathology was available for 19 cases (Additional file 1: Table S4) [25]. Criteria evaluated included procedure, site, size (mm), thickness (mm), lymphovascular invasion, tumor extension, mitotic rate, tumor infiltrating lymphocytes (TILs), growth pattern, neurotropism, and necrosis (%). TILS were largely absent in both virus-positive and virus-negative samples. An infiltrative growth pattern was observed in virus-positive MCC and nodular or nodular infiltrative observed in both forms of MCC. Neurotropism was present in three cases of virus-positive MCC and necrosis which ranged from 0 to 40%.

Statistical comparison of clinical and molecular characteristics
Overall, 28 patients remained disease free after initial therapy and 43 developed one or more relapses or persisted as stage IV (Additional file 2: Fig. S5). According to the biopsy type and first relapse status, patients could be grouped into primary biopsy with no further recurrence (N = 30), primary biopsy with further recurrence (N = 22), and recurrence biopsy (N = 19). For all biopsies annotated as a recurrence, the first recurrence occurred before the biopsy was obtained. Among the 19 recurrence biopsies, 15 were distant metastatic biopsies, one local recurrence (MCC027), one unspecified recurrence (MCC063), one second recurrence (MCC057), and one local recurrence with no prior chemo/XRT (MCC026). Regardless of the biopsy type, all patients and sequencing data were grouped into either no relapse (N = 30) or relapse (N = 41). Table 4 shows the association between relapse and genomic characteristics. Among 71 patients, 30 (42.3%) patients had no relapse and 41 (57.7%) had relapse after initial diagnosis. From Fisher's exact test results, UV, RB1 status, TP53 status, and virus status were all not significantly associated with relapse (Table 4). If the OncoPanel data obtained after relapse (and prior treatment) was excluded and restricted to the 52 patients with primary biopsy, UV, RB1 status, TP53 status, and virus status were all not significantly associated with relapse (Additional file 1: Table S5).
Consistent with known risk factors of MCC, 10 of the 71 cases had immunosuppression diagnosed prior to developing MCC. Remarkably, 8 of the 10 (80%) of the immunosuppressed cases were identified as virus-negative MCC with relatively high TMB compared to the 28% virusnegative MCC in immunocompetent patients (Fig. 7a, Table 5). Virus-negative MCC was present in three patients with solid organ transplantation; three with autoimmune diseases including myasthenia gravis, rheumatoid arthritis, and granulomatosis with polyangiitis; one with monoclonal gammopathy of undetermined significance (MGUS); and another with Waldenstrom's macroglobulinemia. In contrast, virus-positive MCC was identified in a patient with mantle zone lymphoma having been treated with Rituximab for 3 years and another with germline mutations in NF1 and GATA2 [49]. The median OS for patients with immunosuppression was 17.5 months (95% CI 5.6-24.4 months), significantly shorter than patients without immunosuppression (48.5 months, 95% CI 35.4-113.3 months, p < 0.01) (Fig. 7b, Table 5). Immunosuppressed patients also exhibited significantly shorter recurrence-free survival, 7.5 months (95% CI 3.5-20.1 months) and 20.2 months (95% CI 12.9-50.2 months, p = 0.01), respectively. We acknowledge that some cases have relatively short follow-up times that may impact survival analysis; however, the association of OS and immunosuppression only shifts slightly after keeping patients with follow-up times greater than 6 months (n = 63) remaining statistically significant. Under this criterion, the median OS for immunocompetent and immunosuppressed cases were 48.5 months (95% CI 35.4 to 113.3 months) and 21.6 months (95% CI 6.9 to 30.7 months, p value < 0.01), respectively.

Discussion
We undertook this study to develop an assay to more accurately distinguish between virus-positive and virusnegative MCC by genetic features. We built upon an  NGS platform that has been instituted as a routine part of clinical care at the Dana-Farber Cancer Institute, Brigham and Women's Hospital, and Boston Children's Hospital. The viral hybrid capture assay, ViroPanel, acquired a high number of MCPyV reads for many samples. Importantly, evidence for specific integration was associated with all cases with a high number of reads (> 6000). Spurious MCPyV reads were also detected in 19 of 20 MCC cases that were deemed to be virus-negative by TMB and UV mutations. There was no evidence for integration in these cases; rather, these reads could be traced to be extremely low-level contamination from MCC037 during library preparation or sequencing. In contrast, true virus-positive MCCs have low TMB with clear assemblies of virus-host junctions with MCChallmark deletions in the MCPyV genome.
Integration sites were observed in 12 different chromosomes with the most occurring on chromosome 5. In addition, two fully overlapping integration sites from two different tumors were observed on chromosome 1 separated by only 10-20 kb. Based on the clonality of deletions and point mutations in the MCPyV genome, these events most likely occurred before or during integration as was similarly determined from another study on MCC cell lines [50]. For both MCPyV and HPV, it has previously been proposed that integration initiates after DNA double strand breaks in the host genome and viral genomes, likely during viral genome replication as integrated viral concatemers are common [8,51]. In this study, we identified that integration is then likely mediated through erroneous DNA repair at sites of microhomology between the host and viral genomes. This is   [53]. The resulting large linear DNA then reintegrates into the chromosome and appears as amplified regions of the host genome in a tandem head-to-tail conformation interspersed with the viral genome [8].
Case MCC026 has three apparently separate integration events occurring on different chromosomes. The integration event on chromosome 16 only contains a small section of the viral genome from positions 2853-3521, which would only encode the helicase domain of LT and therefore is unlikely to contribute to tumor survival. Conversely, the event on chromosome 18 has a full copy of the viral genome while the event on chromosome 9 contains the NCCR, ST, and a truncated LT, likely sufficient to contribute to oncogenesis. Based on the assembly graphs and coverage, one or both of these integration events have more than one copy of ST and LT ( Fig. 4 and Additional file 3: Fig. S6). Distinct sequences derived during assembly and the distances between the intrachromosomal junctions (107-129kbp) indicate that these likely are separate events, but only long-read sequencing of this tumor can definitively determine that these are not part of a larger interchromosomal translocation.
The most common chromosomal copy number changes involved chromosomes 1 and 10. Amplification of 1p (cluster 4) involving MYCL was observed more commonly in virus-negative cases, but was identified in a few viruspositive cases. Interestingly, in virus-positive MCC, MCPyV ST binds MYCL and the EP400 chromatin modifying complex to activate transcription of several hundred target genes [54]. Amplification of MYCL is likely to be an oncogenic event that contributes to MCC aggressiveness. Amplification of Chr1q (cluster 13) was also observed in both virus-positive and virus-negative MCC. This region includes MDM4, whose protein product cooperates with MDM2 to promote the ubiquitination and subsequent degradation of p53 [43,44]. There may be additional prooncogenic genes in this cluster that contribute to MCC oncogenesis when p53 is mutated. Heterozygous loss of chromosome 10 (cluster 14) was observed in 26/71 MCC including both virus-positive and virus-negative tumors. Loss of chromosome 10 likely reduces PTEN levels contributing to activation of the PI3K signaling pathway. A recurrent amplification of chromosome 6 has previously been observed for MCC; however, this observation predated the discovery of MCPyV and was not associated with morphology or outcome [55]. In other cancers, such as basal cell carcinoma and ovarian cancer, this amplification is typically associated with worse outcome [55]. Although the chromosome 6 amplification in this study was significantly associated with better overall survival, it was also more frequent in metastasis. This amplification contains genes such as VEGFA, which promotes angiogenesis and has been observed to be expressed at higher levels in distant ovarian cancer metastases [56]. Interestingly, further analysis of genes that are more abundantly mutated in the sequenced metastatic lesions, revealed statistically significant (q = 2.47 × 10 − 7 ) pathway enrichment for angiogenesis as well as EGFR signaling and p53 dysregulation. FANCE is also carried on this chromosome and as a DNA damage response gene act to limit number of point mutations and copy number changes observed in virus-positive MCC, hindering tumor evolution. Together, this amplification cooccurring with MCPyV may represent a less fatal, but more metastatic subtype of MCC. Additionally, this result could be impacted by diagnosis, treatment, or study recruitment of metastatic MCC.
Unexpectedly, we observed that 8 of 10 cases with immunosuppression were virus-negative MCC. While it was recognized in the early 1990s that individuals with hematologic malignancies that developed MCC had a poor prognosis [57], it was not until 1997 when a direct link between immunosuppression and MCC was postulated [58]. At that time, a correlation was noted between medically induced immunosuppression with azathioprine and cyclosporine and the development and rapid spread of MCC. Early reports highlighted a prolonged period of immunosuppression prior to MCC development. Notably, the search for a viral pathogen in MCC was initiated because of reports linking MCC with immunosuppression and with HIV-1/AIDS [2]. A subsequent report has shown similar rates of MCPyVpositivity in immunocompetent and suppressed patients, but relied on PCR and IHC for virus detection [18].
In the present report, three solid organ transplant recipients, three with chronic autoimmune diseases, and two with hematologic malignancies developed virus-negative MCC. It is well established that the risk for developing MCC is increased in patients with chronic inflammatory disorders such as rheumatoid arthritis or medically induced immunosuppression for solid organ transplantation [58][59][60][61]. Within the latter population, skin cancers account for 40-50% of all posttransplant malignancies with squamous cell carcinoma (SCC) and basal cell carcinoma (BCC) comprising 90-95% of these skin cancers [62].
Importantly, some therapeutics used in organ transplantation are known to further increase risk for developing skin cancers. Azathioprine can sensitize cells to UVinduced damage through the incorporation of a metabolite into DNA that generates reactive oxygen species upon exposure to UV light [63]. In patients with rheumatoid arthritis, methotrexate and anti-TNF drugs were associated with an increased risk of nonmelanoma skin cancer  [64]. The increased risk for skin cancers in organ transplant recipients and rheumatoid arthritis is associated with UV-light-induced mutagenesis for SCC and BCC. Therefore, the increased risk for UV-induced skin cancers may also extend to virus-negative MCC. Although this study is one of the largest molecular studies on MCC genetics to date, the small sample size and inherent confounding factors of studying outcome in a cancer that manifests in older populations are important limitations. This also highlights the need for accurate determination of virus status and importance to continue to study this rare cancer to fully address the involvement of immune suppression on the etiology and outcome of this aggressive cancer. Despite the significant differences in the TMB between virus-positive and virus-negative MCC, there were few phenotypic differences in the two types of MCC. Based on histopathological features alone, two subtypes of MCC can be recognized: pure neuroendocrine tumors and combined tumors with neuroendocrine and divergent (mainly squamous) differentiation. Most pure tumors are MCPyV-positive and CK20-positive while combined tumors are uniformly MCPyV-negative and occasionally CK-20 negative [9,65]. Virus-negative MCC can also present as pure neuroendocrine-type MCC.
While genomic sequencing has revealed that virusnegative MCC has evidence for a high degree of UV damage, this does not exclude a role for UV exposure in the development of virus-positive MCC. The relative lack of UV damaged DNA in virus-positive MCC indicates that the etiologies are clearly different, suggesting that the precursor to virus-negative MCC was a recipient of lifelong intense UV exposure while the virus-positive MCC were not exposed to sunlight for the same degree or for as long. It was reported that the early promoter of MCPyV responds to UV exposure and that levels of ST mRNA increased in UV exposed skin from a healthy human volunteer [66]. Transient UV exposure could affect the immune response to virus-negative and viruspositive MCC etiology. The effect of UV radiation in the pathogenesis of MCC has been suggested to be more likely a result of immune modulation rather than direct effects on DNA itself [67].

Conclusions
Here we present a comprehensive characterization of the Merkel cell carcinoma genetics using a clinically implemented sequencing platform. This platform was augmented using a hybrid capture baitset against Merkel cell polyomavirus. From our analyses, we identified CNV clusters unique to and common to virus-negative and virus-positive, which reflect the evolutionary mechanisms of the tumors. We also accurately reconstructed the viral integration events providing clear evidence for a circular host-fusion DNA intermediate initiated by recombination at 4+ bp microhomology enriched at lowcomplexity regions in the human genome. Lastly, we observed a surprising number of virus-negative tumors in immunosuppressed patients in our cohort potentially reflecting a previously misunderstood risk population.
Additional file 1: Table S1. SNV data for all patients. Table S2. CNV data for all patients. Table S3. CNV cluster definitions. Table S4. Synoptic review of dermatopathology. Table S5. Association between relapse and genomic sequencing (N = 52).
Additional file 2: Fig. S1. Oncoprint for all genes in this study. Oncoprint summarizing point mutations and CNVs for all genes and patients in this study. Sample are in order of descending TMB and genes are in order of highest point mutations to least. Fig. S2. Network graph for recurrent CNVs. Network graph with all significantly (q < 0.05) cooccurrent CNVs showing how they cluster into reoccurring groups. Fig.   S3. Additional file 4.pdf: CNV frequency by cluster for all patients. Cluster number is shown in gray bars above the bar plots representing amplifications/gains (red) and deletions/losses (blue). Below the bar plots is a heat map of all CNVs (genes, x-axis) (amplifications/gains, red; deletions/losses, blue; no change, gray) across all samples (y-axis) annotated by cluster and chromosome. On the left side pRB, p53 shown in gray and black for 1 or 2 copy loss/mutant, respectively. Presence of UV mutations are shown in black. Fig. S4. RFS survival divided by CNV cluster 5. Kaplanmeier plot for MCC patients split by presence or absence of CNV cluster 5 showing no difference in regression free survival (RFS). Fig. S5: Plot of tumor sampling by patient for sequencing. Plot of all patients along the y axis and time since diagnosis on the x axis. Initial diagnosis, primary biopsy sequencing, recurrence biopsy sequencing, death, and last contact times are plotted as applicable.
Additional file 3: Fig. S6. Assembly graphs for all integration events. 28 assembly graphs annotated by MCPyV genome position (colors labeling each segment are under the header "as.factor(V9)"). The "V3" variable represents the coverage of each contig as determined by SPAdes and contigs are scaled to reflect this value. Ethics approval and consent to participate This study was conducted according to the Declaration of Helsinki principles and approved by the Dana-Farber Cancer Institute institutional review board (DFCI Protocol No. . Written informed consent was received from participants prior to inclusion in the study.

Consent for publication
Not applicable.
Competing interests JAD has received honoraria for participation in an advisory board with Merck & Co., Inc. JAD has received research funding from Constellation Pharmaceuticals, Inc. The remaining authors declare that they have no competing interests.