High-throughput analysis of chromosome translocations and other genome rearrangements in epithelial cancers

Genes that are broken or fused by structural changes to the genome are an important class of mutation in the leukemias and sarcomas but have been largely overlooked in the common epithelial cancers. Large-scale sequencing is changing our perceptions of the cancer genome, and it is now being applied to structural changes, using the 'paired end' strategy. This reveals more clearly than before the extent to which many cancer genomes are rearranged and how much these rearrangements contribute to the mutational burden of epithelial tumors. In particular, there are probably many fusion genes, analogous to those found in leukemias, to be found in common cancers, such as breast carcinoma, and some of these will prove to be important in cancer diagnosis and treatment.


Introduction
Somatic structural variations in the genome (referred to by cytogeneticists as translocations, inversions, duplications and insertions) can be powerful events in tumor evolution because they can create fusion genes. Fusion genes are formed when part of one gene is juxtaposed to another by a structural rearrangement, creating a hybrid transcript, or sometimes simply inserting a novel promoter upstream of a gene. These can be very powerful oncogenic mutations, not only increasing expression of a protein but also changing its activity, subcellular localization or binding specificity [1,2]. Such fusion genes are also clinically important, because some can predict outcome and determine management, and some may be targets for therapy [1]. For example, the BCR-ABL fusion gene defines a group of leukemias and is the target of treatment with the kinase inhibitor imatinib (Glivec).
In stark contrast to leukemias, lymphomas and sarcomas, in which many important oncogenes have been identified at translocation breaks, we have a poor understanding of how structural variations contribute to carcinogenesis in common epithelial tumors [1,2]. Although we have relatively good knowledge of which genes can be point-mutated, amplified or deleted in these cancers, the sheer number and complexity of their genome rearrangements have made it difficult to identify genes at chromosome breakpoints [2]. We have known for several years that recurrent gene fusions are found in common epithelial cancers, following the discovery of the TMPRSS2-ERG and related fusions in prostate cancer [3] and EML4-ALK in lung cancer [4]. However, these fusions were discovered by essentially one-off methods and it remains to be seen whether they are isolated examples or the tip of an iceberg.
Stephens et al. [5] recently presented the first large-scale survey of somatically acquired structural variation in the genomes of cancers, with the explicit goal of discovering genes disrupted and fused at chromosome breakpoints. The authors [5] used massively parallel paired end sequencing to find genome rearrangements in 24 breast cancers, 9 of which were immortal cell lines and 15 of which were primary tumors. Although these data pertain to breast cancer, we think many of the findings will also be relevant to other common cancers, and certainly they are consistent with a preceding pilot study of two lung cancer cell lines [6]. The Stephens et al. [5] study revealed that structural variants contribute significantly to the mutational burden of many breast cancers, but also that genes are often fused or otherwise disrupted by mechanisms we have, so far, not appreciated.

Massively parallel paired end sequencing
Massively parallel sequencing techniques generate very large numbers of sequence reads, but the reads are generally much shorter than in traditional sequencing, typically only tens of base pairs. To use these short sequence 'tags' efficiently to find structural rearrangements, 'paired end read' strategies have been developed (also known as 'mate pair' and 'end sequence profiling' strategies; Figure 1) [6]. The genome is broken into DNA fragments of selected size, for example 500 base pairs (bp) [5], and a short sequence, for example 37 bp, is read from each end of each DNA fragment to give paired sequences. Most of the fragments are normal, and their paired reads map back to the reference genome about 500 bp apart and in the correct orientation. Structural variants are discovered when read-pairs map unexpectedly, for example to two different chromosomes (translocation), too far apart (deletion), or in the wrong orientation (tandem duplication or inversion) (Figure 1). Considerable bioinformatic processing is required to interpret the huge volume of sequence data, but millions of paired reads are pruned down to a hundred or so structural variants per tumor, most of which can be confirmed by PCR.
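The mapping rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used by Stephens et al. [5]: the `ReadPair` type, the `classify_pair` function, the 150 bp tolerance and the assumption of an inward-facing library (left read on the forward strand, right read on the reverse strand) are all hypothetical choices made for the example. A real analysis would also require several independent read pairs supporting each candidate before calling a rearrangement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Mapped positions of the two ends of one DNA fragment (hypothetical type)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair, expected_insert=500, tolerance=150):
    """Classify one read pair, assuming an inward-facing (+/-) library."""
    # Ends on different chromosomes suggest a translocation.
    if pair.chrom1 != pair.chrom2:
        return "translocation"

    # Order the two ends by genomic position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )

    # Both ends on the same strand: one end has been flipped.
    if left_strand == right_strand:
        return "inversion"

    # Outward-facing ends (-/+) are the signature of a tandem duplication.
    if left_strand == "-" and right_strand == "+":
        return "tandem_duplication"

    # Correct orientation (+/-) but too far apart suggests a deletion.
    if right_pos - left_pos > expected_insert + tolerance:
        return "deletion"

    return "normal"


if __name__ == "__main__":
    print(classify_pair(ReadPair("chr1", 1000, "+", "chr1", 1463, "-")))
    print(classify_pair(ReadPair("chr1", 1000, "+", "chr8", 5000, "-")))
```

Pairs that map closer together than expected (which could indicate a small insertion) are treated as normal here to keep the sketch short.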
Stephens et al. [5] estimate that 50% of structural variations were detected in their study. This may seem like a low figure but, as the authors showed, it was sufficient to identify hundreds of structural variants and tens of fusion genes. The main reason for missing structural variants was that the amount of sequencing was not enough to sample all rearrangements. Also, breakpoints flanked by repeats may have been missed because reads from repetitive regions are currently discarded. We expect the proportion of structural variants detected to increase in the future as more sequencing reads are generated, the reads used are longer, and bioinformatic analysis is refined.
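To see why sequencing depth limits detection, it can help to treat the number of fragments spanning a given breakpoint as a Poisson variable. The sketch below rests on that assumption, which is a common simplification and not the method used in [5] to estimate the 50% figure. The function names, the requirement of at least two spanning pairs and the example numbers are hypothetical. Tumor purity, copy number and repeat content, which all reduce detection further, are not modeled.

```python
import math


def physical_coverage(n_pairs, fragment_bp, genome_bp):
    """Average number of fragments spanning any given point in the genome."""
    return n_pairs * fragment_bp / genome_bp


def detection_probability(coverage, min_pairs=2):
    """Probability that at least `min_pairs` fragments span a breakpoint,
    assuming the number of spanning fragments is Poisson-distributed."""
    p_fewer = sum(
        math.exp(-coverage) * coverage ** i / math.factorial(i)
        for i in range(min_pairs)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Illustrative values only: 6 million pairs of 500 bp fragments
    # on a 3 Gb genome give a physical coverage of 1.
    cov = physical_coverage(6_000_000, 500, 3_000_000_000)
    for depth in (cov, 2 * cov, 4 * cov):
        print(depth, round(detection_probability(depth), 3))
```

Under this model the detection probability rises steeply with physical coverage, which is consistent with the expectation that more reads will recover a larger share of rearrangements.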

Rearrangements in breast cancers are more numerous than expected
There were many more structural variants than most in the field would have anticipated [5]. For cell lines, the median number of rearrangements per sample was 101 (range 58 to 245). For the tumors, the median was 38 (range 1 to 231). Approximately 85% were intrachromosomal and spanned less than 2 Mb [5], which explains why earlier molecular cytogenetic approaches, such as spectral karyotyping, array comparative genomic hybridization (CGH) and array painting [7], underestimated the number of rearrangements. These aberrations would not have been visible in metaphase chromosomes and many were copy-number neutral or too small to have shown up in most array CGH experiments.

Many fusion genes were predicted and several were expressed
Many of the structural changes that Stephens et al. [5] found juxtaposed the coding regions of two genes. An important observation, extending earlier studies [2,7,8], was that some breast cancers can express several fused genes. Stephens et al. [5] showed that 21 novel fusion genes were expressed and in frame, and so potentially produced a functional fusion protein. Allowing for the estimated 50% detection rate, this would equate to approximately two functional fusion genes per case. Most of the fusion genes were of unknown function but several involved known or likely cancer genes, such as ETV6, which is a known target of translocations and encodes a member of the oncogenic Ets transcription factor family, and EHF, which also encodes an Ets family member. Some genes seemed to be rearranged in several of the 24 samples but no recurrent gene fusions were identified by fluorescence in situ hybridization (FISH) or RT-PCR in a larger second set of tumors [5]. This may simply be a reflection of the heterogeneity of breast cancer (the samples used were chosen to represent a range of different tumor subtypes), or it may be that aberrant expression of an important 3' gene can be driven by several different 5' fusion transcript partners, as happens, for example, to the Ets-related gene ERG in prostate cancers.
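The per-case estimate quoted above can be reproduced with simple arithmetic. The short calculation below assumes that the 21 expressed in-frame fusions are spread over all 24 samples and that the 50% detection rate applies equally to fusion-generating rearrangements. Both are our reading of the figures, not statements taken from [5], and the variable names are illustrative.

```python
# Figures quoted in the text.
expressed_in_frame_fusions = 21
samples = 24
detection_rate = 0.5  # estimated fraction of structural variants detected

# Fusions actually observed per sample.
observed_per_case = expressed_in_frame_fusions / samples

# Correct for the rearrangements that were missed.
estimated_per_case = observed_per_case / detection_rate

print(f"observed per case:  {observed_per_case:.3f}")
print(f"estimated per case: {estimated_per_case:.2f}")
```

The corrected value is a little under two, and it is an average: individual tumors carried anywhere from very few to many rearrangements.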

Unanticipated classes of structural variation
An unexpected finding [5] was a number of somatically acquired tandem duplications, a kind of structural change that has rarely been detected until recently but is interesting because it can lead to gene fusion [9]. A tandem duplication occurs when a region, here ranging from 3 kb to more than 1 Mb, is duplicated, usually in a head-to-tail orientation. Some tumors showed a distinctly higher number of tandem duplications than the others, which led the authors [5] to suggest that they were generated by a specific repair defect. The BRCA1 and BRCA2 mutant tumors had fewer tandem duplications than average, so the aberrant mechanism was probably not related to these pathways. The second surprising finding [5] was that many small tandem duplications, inversions and deletions were entirely within genes. In many cases this affected the exon structure at the transcript level and novel isoforms were observed. Some of these rearrangements were in putative oncogenes, such as the transcription-factor-encoding gene RUNX1, so it is plausible that oncogenic activation could have occurred by removing or reshuffling exons that encode a repressive protein domain. Well-characterized tumor suppressor genes such as the retinoblastoma gene RB also had internal rearrangements and it is possible these genes were inactivated through a frameshift in the transcript or by removal of important protein domains.
Two questions arise from these observations [5]: firstly, whether the roles of genes such as RUNX1 and RB have been underestimated in breast cancer, because these kinds of mutation would not be detected by Sanger sequencing studies on individual coding exons; and secondly, whether there are numerous small rearrangements of this kind in other, karyotypically normal, cancers.

Drivers and passengers?
It is remarkable how many mutations, whether sequence-level, epigenetic or structural, are now being discovered in cancer genomes [5,10,11]. Many are probably 'passenger' mutations, that is, random mutational noise, but some must be selected 'driver' events and, as the number and variety of known mutations increases, estimates for the number of 'driving' mutations in cancer are tending to increase [2,12].
The problem of distinguishing driver and passenger mutations is as acute for structural mutations as it is for point mutations [10-13]. Stephens et al. [5] estimate that approximately 2% of genome rearrangements of the types they found would generate an in-frame fusion gene by chance. They observed 1.6%, no more than expected by chance, which suggests that the majority of gene fusions, like the majority of point mutations, are not selected events.

Conclusions
The Stephens et al. [5] study is the first indication that genome-wide structural analysis of a relatively large number of samples, including primary tumors, is already an achievable goal. More importantly, it illustrates that such studies are worthwhile as they can yield a large number of new candidate oncogenes and tumor suppressor genes.
Clearly, the next step is to find genes or gene families that are recurrently fused or rearranged in a subset of tumors. Thanks to the methodologies and bioinformatic tools already validated by pilot studies [5,6] we can expect large surveys of several cancer types to appear within 2 or 3 years. This will allow us to address the question of recurrence and move on to establish the clinical relevance and potential for targeted intervention.
For the time being, massively parallel paired end sequencing will remain a research tool, but the basic cost of an analysis like that of Stephens et al. [5] is already down to a few thousand euros per case, so it is conceivable that we will see it used in the clinic in the not too distant future. Indeed, while this article was in press, Velculescu and colleagues [14] announced a possible clinical application, using paired end reads to find a structural 'fingerprint' of a tumor that could be detected in the patient's serum and so used to monitor progression.