Isoform level expression profiles provide better cancer signatures than gene level expression profiles
© Zhang et al.; licensee BioMed Central Ltd. 2013
Received: 15 January 2013
Accepted: 17 April 2013
Published: 17 April 2013
Skip to main content
© Zhang et al.; licensee BioMed Central Ltd. 2013
Received: 15 January 2013
Accepted: 17 April 2013
Published: 17 April 2013
The majority of mammalian genes generate multiple transcript variants and protein isoforms through alternative transcription and/or alternative splicing, and the dynamic changes at the transcript/isoform level between non-oncogenic and cancer cells remain largely unexplored. We hypothesized that isoform level expression profiles would be better than gene level expression profiles at discriminating between non-oncogenic and cancer cellsgene level.
We analyzed 160 Affymetrix exon-array datasets, comprising cell lines of non-oncogenic or oncogenic tissue origins. We obtained the transcript-level and gene level expression estimates, and used unsupervised and supervised clustering algorithms to study the profile similarity between the samples at both gene and isoform levels.
Hierarchical clustering, based on isoform level expressions, effectively grouped the non-oncogenic and oncogenic cell lines with a virtually perfect homogeneity-grouping rate (97.5%), regardless of the tissue origin of the cell lines. However, gene levelthis rate was much lower, being 75% at best based on the gene level expressions. Statistical analyses of the difference between cancer and non-oncogenic samples identified the existence of numerous genes with differentially expressed isoforms, which otherwise were not significant at the gene level. We also found that canonical pathways of protein ubiquitination, purine metabolism, and breast-cancer regulation by stathmin1 were significantly enriched among genes thatshow differential expression at isoform level but not at gene level.
In summary, cancer cell lines, regardless of their tissue of origin, can be effectively discriminated from non-cancer cell lines at isoform level, but not at gene level. This study suggests the existence of an isoform signature, rather than a gene signature, which could be used to distinguish cancer cells from normal cells.
The past decade has witnessed unprecedented developments in high-throughput technologies, and their application has led to the molecular classification of many cancers . Molecular profiling of gene expression, using microarrays, has shown that heterogeneity in outcome and survival of patients with cancer can be explained, in part, by genomic variation within the primary tumor. These technologies have helped identify many genetic and epigenetic modifications involved in the initiation and progression of various cancers, but their precise molecular mechanisms remain unclear. Furthermore, novel drugs have been developed to target some of the molecular pathways underlying the carcinogenic processes and maintenance of cancer phenotypes [2, 3] yet, these drugs have provided limited survival benefits to only a small subset of patients with cancer, and only a small number of practically useful biomarkers are presently available. Improved molecular classification of cancers is essential to identify highly sensitive and specific biomarkers and therapeutic targets that reflect the molecular mechanisms functionally involved in tumor type-specific survival, drug resistance, tumor relapse, and patient outcome .
One of the reasons for the limited success in the quest for genomic-based, personalized medicine is the assumption of a 'one gene → one protein → one functional pathway' paradigm in most of the studies investigating molecular classification or therapeutic targets for cancer . Recently, by making use of chromatin immunoprecipitation sequencing (ChIP-seq) and mRNA sequencing (mRNA-seq) approaches, we and others have discovered widespread use of alternative promoters and alternative splicing in mammalian genes in various tissues, developmental stages, and cell lines [6–9]. In fact, numerous genes displaying complex gene regulation via use of alternative promoters and alternative splicing, have been known for some time, and recent evidence suggests that almost all multi-exon human genes have more than one mRNA isoform. During alternative splicing, the coding and non-coding regions of a single gene are rearranged to generate several messenger RNA transcripts, yielding distinct protein isoforms with differing biological functions. Notably, there is growing evidence linking aberrant use of alternative mRNA isoforms with cancer formation; several oncogenes and tumor-suppressor genes (for example, LEF1, TP63, TP73, HNF4A, RASSF1, and BCL2L1) are already known to have multiple promoters and alternative splice forms [10–16]. Moreover, it is known that the aberrant use of one isoform over another in some of these genes is directly linked to cancer cell growth . Although the prevalence of alternative splicing in cancer genomes has been discussed in the literature [18–20], and it has been shown that use of splice forms provides better classification of normal and cancerous prostate tissue, it is not clear whether the use of genome-wide isoform level gene-expression profiles can provide a better global discriminative signature for cancer and normal cells.
Microarray expression profiling remains a powerful tool for identifying different subtypes of cancers. However, almost all microarray-based studies reported to date have measured the expression of the gene at gene level in a given locus, although a few exceptions in recent years have used exon arrays to measure differences at the exon and/or at transcript variant level. The recent application of exon arrays  and the advent of massive parallel sequencing is allowing whole cancer genomes and transcriptomes to be sequenced with extraordinary speed and accuracy, providing insight into the bewildering complexity of isoform level expression of gene transcripts . The Encyclopedia of DNA Elements (ENCODE) consortium, a collective effort to facilitate and accelerate the annotation of functional elements in the human genome, is generating genome-wide expression data in various human cell lines through the use of exon microarrays . Among the available data are gene-expression datasets, generated by the ENCODE consortium using an Affymetrix platform (GeneChip Human Exon 1.0 ST Array), across various cell lines that can be classified as either oncogenic (tumor/cancer) or non-oncogenic (normal). The arrays interrogate transcripts across their entire length, which can detect splicing differences between various types of samples [22–24]. Exons within a gene are represented on the microarray by multiple probe sets. The exon expression can thus be obtained by summarizing all the probe sets for this exon on the microarray. Once the exon-level expressions are obtained, the individual transcript expression of the gene and the total expression of the gene itself then can be inferred from the calculated exon expression, based on assumptions that the isoform structures and number of isoforms of the gene are known beforehand.
With genome-wide isoform level and gene level expression profiles in hand, it is natural to ask how the isoform level expression profiles of different oncogenic and non-oncogenic samples will cluster together, and whether isoform level expression profiles can lead to more accurate discriminators between oncogenic and non-oncogenic samples compared with gene level expression profiles. If the answer is yes, it is important to know which genes and pathways contribute to the improvement of discrimination at isoform level compared with gene level.
In the present study we analyzed Affymetrix exon-array data-sets collected from the public domain, primarily the ENCODE project from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database, which comprises 160 datasets from various cell lines of either non-oncogenic or oncogenic tissue origin. These data-sets were used to test the hypothesis that isoform level expression analysis provides abetter discriminator between non-oncogenic and oncogenic cell types than gene level expression analysis.
Unprocessed gene-expression datasets, generated using a whole-transcript GeneChip platform (Human Exon 1.0 ST Array; Affymetrix Inc., Santa Clara, CA, USA), were downloaded from the GEO public data depository, deposited mainly by the ENCODE project. The GEO records GSE15805 , GSE17778, GSE19090  and GSE17349  contain, respectively, 79, 36, 83, and 8 samples of various cell lines. After excluding samples that were related to blood, progeria fibroblast, and stem cells, we had a total of 160 exon-array datasets, corresponding to 87 non-oncogenic and 73 oncogenic cell lines of various tissue origins. From the 160 datasets included in the analysis, we used 8 melanoma samples and 4 non-oncogenic melanocyte samples to form the first matched non-oncogenic and oncogenic pair, and used 4 datasets representing non-oncogenic human mammary epithelial cells (HMEC) and 8 datasets from a human breast adenocarcinoma cell line (MCF7) to form the second matched non-oncogenic and oncogenic pair. The complete classification and labeling information of cell lines used in this study are summarized in the supplementary information (see Additional file 1: Table S1).
The isoform level (transcript-level) and gene level expression estimates were obtained by the Multi-Mapping Bayesian Gene eXpression (MMBGX) algorithm for Affymetrix whole-transcript arrays , based on the Ensembl database (version 56) , which contains a total of 114,930 different transcript annotations that correspond to 35,612 different gene models. We set the burn-in iteration at 8,192 and real iteration at 16,384 for both gene and isoform levels. All other parameters were set to their default values in the stand-alone algorithm. The algorithm gave a stable estimation of both gene level and isoform level expressions. For example, two independent runs on the same sample provided almost identical expression levels even with different seeds for the algorithm (correlation coefficient > 0.999, data not shown), whereas runs on different samples gave comparable results, but with much lower correlation (correlation coefficient of about 0.97). Expression estimates across all the samples were then normalized using the locally weighted scatterplot smoothing (loess) algorithm [30, 31], also incorporated in the package.
We used the general hierarchical cluster algorithm to cluster the samples, using Euclidean distance as a measurement for dissimilarities . We also applied consensus hierarchical clustering to assess the stability of the clustering results by multiple runs of the clustering algorithm on resampled data [32, 33], and calculated consensus index as reported previously . Briefly, the consensus index is defined for each pair of samples, that is, the consensus index of sample pair (i, j) records the number of times that samples i and j are assigned to the same cluster, divided by the total number of times both samples are selected. To find the differential genes between two conditions, we used the limma method [34, 35]. An isoform or gene was selected if both its fold change was greater than a cut-off value of 2, and the false discovery rate (FDR)-adjusted P value was smaller than a cut-off value of 0.01 for all comparisons between the two conditions. Ingenuity Pathway Analysis (IPA)  was used to associate the identified gene sets with biological functions, canonical pathways, and networks. To identify pathway differences arising from gene sets identified at either isoform or gene level, we used the counting method on the P values of pathways from the IPA analysis; the P values were used as an indicator of association strength between the gene sets and pathways. For the three pairwise oncogenic/non-oncogenic comparisons (all oncogenic cell lines versus non-oncogenic cell lines, melanocyte versus melanoma, HMEC versus MCF7), a pathway was selected and reported if it was significantly associated with the gene sets identified at isoform-level in all three pairs of comparisons, but was not significantly associated with the gene sets identified at gene-level in all three pairs of comparisons, or vice versa. The significance level was set at P < 0.05 for all comparisons. All calculations were performed using Bioconductor (version 2.8 or above; Open Source software for bioinformatics, http://www.bioconductor.org) and R platform (version 2.10; The R Project for Statistical Computing, http://www.r-project.org) .
The study protocol was approved by the institutional review board, and all data collection and analyses adhered to the protocols approved by the institutional review board. Informed consent was obtained from all participants.
Women with primary operable breast cancer undergoing breast surgery at the Hospital of the University of Pennsylvania were asked to participate in our tissue-banking protocol. The study cohort included four women diagnosed with breast cancer between 2010 and 2011. Clinical characteristics, including age at diagnosis, ethnicity, histology, tumor size, tumor grade, and number of involved (positive) axilla nodes are provided (see Additional file 2: Table S2A).
After completion of surgery, the breast cancer within the surgical specimen was examined by surgical pathologists. Upon completion of gross examination and inking of the tumor specimen, fresh tumor tissue was taken from the center of the tumor without interfering with margin assessment as determined by the pathologists. A small portion of the tumor tissue and a small portion of normal adjacent breast tissue were collected, then immediately immersed in liquid nitrogen and stored at -80°C. RNA was isolated using this snap-frozen tumor tissue.
Expression of transcripts/isoforms for seven genes in HMEC, MCF7, MDA-MBA-231, and T47D cell lines and expression of two TPM4 isoforms in primary breast-cancer tissues were measured by reverse transcriptase -quantitativePCR (RT-qPCR). Total RNA from cells and tissues were using TRI reagent (Sigma-Aldrich Inc., St. Louis, MO, USA) in accordance with the manufacturer's instructions. For breast-cancer and normal breast tissues, up to 50 mg of frozen tissue was transferred to 1 ml of TRI reagent, then the tissue was immediately homogenized and RNA extraction protocol was followed. Briefly, 0.5 μg of total RNA was reverse-transcribed in a 20 μl reaction with SuperscriptII reverse transcriptase (Invitrogen Inc.) in accordance with the manufacturer's instructions. RT-qPCR was performed using a master mix (Power SYBR Green; Applied Biosystems Inc., Foster City, CA, USA) and fold expression was calculated using the 2-ΔΔCT method. RT-qPCR results were normalized based on the expression of GAPDH for TPM4 and TBP for the other transcripts. The measured isoforms and the primers used for the isoform-specific PCR are presented (see Additional file 2: Table S2B).
We expected the clustering of samples to result in a first-level grouping of different tissues, followed by a second-level grouping of cancer and non-oncogenic cell lines within each tissue type. Unexpectedly, however, we found almost uniform grouping of cancer and non-oncogenic cell lines into two large clusters, with an overall cluster purity of 97.5% at isoform level (Figure 1A). Further, the samples belonging to same cell/tissue type within each cancer/non-oncogenic group were clustered together, confirming the discriminatory power of isoform level gene-expression profiles. For example, the paired non-oncogenic melanocyte and cancerous melanoma samples, and the matched pairs of MCF7 (breast-cancer cell line) and the HMEC samples (non-oncogenic origin) were separated correctly into either non-oncogenic or cancer groups, with one exception from MCF7 samples (Figure 1A). Overall, only four samples, two each from non-oncogenic and cancer cell lines, were clustered into the wrong group. The cancer cell lines that were clustered into the non-oncogenic group were one MCF7/mammary gland adenocarcinoma (GEO accession number GSM472936) and one pancreatic carcinoma cell line (GEO GSM472939), and the non-oncogenic samples that were assigned to the cancer group were one adult non-oncogenic human epidermal keratinocyte (NHEK) sample (GEO GSM472937) and one non-oncogenic umbilical vein endothelial cell (HUVEC) sample (GEO GSM472935).
Although the clustering at gene-level showed some power of discrimination between non-oncogenic and cancer cell lines, the overall grouping was significantly less efficient than the clustering at isoform-level. The gene level cluster purity was 75%, with 20 samples from the non-oncogenic and cancer cell lines clustered into the wrong group (Figure 1B). The better separation of non-oncogenic and cancer cell lines at isoform-level (97.5% cluster purity) compared with gene -(75% cluster purity) implies that gene-expression profiles in cancer cells are more specifically altered at isoform-level for numerous genes, which could not be detected using gene level analysis.
We next focused our analysis on two specific cancers, breast cancer, and melanoma, for which we had matched oncogenic and non-oncogenic cell lines, in addition to the comparison of all oncogenic versus all non-oncogenic cell lines.
Isoform expression in breast-cancer cell lines as measured by reverse transcriptase -quantitative PCR (RT-qPCR).
These 14 genes are WNT5B, CCND2, SERPINB7, GPR176, INHBA, EFNB1, PTRF, CDH11, ZBTB16, GJA1, COL5A2, NID1, PRDM1, and TCF4. Except for CCND2 and GPR176, all other genes in our database have more than one isoform. Four genes (SERPINB7, INHBA, GJA1, and NID1) have two isoforms that have significantly different expression between the cancer and non-oncogenic groups. Interestingly, 12 of these 14 genes belong to the same gene network, involved in hematological system development and function, tissue morphology, and cellular development. According to the Ingenuity Pathway Knowledge Base, the network consists of a total of 27 different genes, which suggests that almost 50% of the genes belonging to this network are dysregulated either at the gene or isoform level between non-oncogenic and cancer cells (Figure 5C). Moreover, most of these genes have already been implicated both in tumorigenesis and in several developmental processes [43–48]. For example, it was shown that the phosphorylation-dependent interaction between c-Jun and TCF4 regulates intestinal tumorigenesis by integrating c-Jun kinase (JNK) and adenomatous polyposis coli (APC)/β-catenin, two distinct pathways activated by Wnt signaling . Multiple alternatively spliced transcript variants that may encode different protein isoforms of these genes (for example, TCF4, WNT5B) have been described. Therefore, it would be interesting to evaluate the components of this gene network in different cancers at isoform level.
Human cancer is a complex disease. It is known that most of the genes in mammalian genomes generate different transcript variants and protein isoforms, which often function in a cell/tissue type-specific and developmental stage-specific manner in non-oncogenic cells. Cancer results from the sequential acquisition of a number of genetic and epigenetic alterations, and these mutations may alter the expression of specific isoforms but not the others of a gene. Despite this growing knowledge, most biomarker and drug-discovery studies still evaluate expression differences and study gene regulatory mechanisms at gene level rather than at isoform level. In this study, we have shown that oncogenic cell lines could be more accurately discriminated from non-oncogenic cell lines, regardless of their cells of origin, by gene-expression profiling at isoform level compared with gene level. In spite of the differences in tissues of origin, the cell lines were broadly clustered into two groups, non-oncogenic and oncogenic, by isoform level gene-expression profiles. We noted that numerous genes showed differential expression at isoform level but not at gene level. For some of these genes, the differential expression of alternative transcripts occurred in the opposite direction; while some of the isoforms of the same gene were upregulated, others were downregulated, resulting in them cancelling each other out and producing insignificant expression differences at gene level between cancer and non-oncogenic groups. Our findings are in agreement with a previous study on prostate cancer that investigated the expression of 1532 splice forms for 364 prostate cancer-related genes, using data from a customized exon junction array . The authors found that many genes were differentially expressed at splice-form level but not at gene level and this increase in the number of differentially expressed variables at splice-form level contributed to a 92% accuracy for a 128 splice-form-based classifier for normal and cancerous prostate tissue, whereas the accuracy was 87% using a classifier based on 32 genes. That study profiled 1532 mRNA splice forms from 364 potential prostate cancer-related genes, whereas in the current study, we used genome-wide exon-array data that identified the expression of 114,930 transcripts/isoforms corresponding to 35,612 different genes, including all known non-coding genes in the Ensembl database. In addition, our study focused on discriminating oncogenic and non-oncogenic cells in general, irrespective of their tissue of origin. Using genome-wide isoform level versus gene-level expression information, we found that oncogenic and non-oncogenic cells could be segregated using isoform level information with 97.5% accuracy versus 75% accuracy for gene level information, and even a smaller signature of 18 isoforms was effective in separating the two groups, with equal accuracy. These subtle differences at isoform level in discriminating non-oncogenic and oncogenic cell lines reflect the fact that gene level expression measurements, whose estimates are generally the summation of all the isoform level expression estimates of individual genes, are less accurate in characterizing cancer and non-oncogenic cells.
The pathway-enrichment analysis of genes that are differentially expressed in cancer cell lines at isoform level but not at gene level produced three interesting pathways that have been implicated in various cancers. It is well known that protein phosphorylation and protein ubiquitination regulate most aspects of cell life, and defects in these control mechanisms cause cancer and many other diseases . Similarly, abnormalities in purine metabolism and over-expression of Stathmin 1 (STMN1) are characteristic features of many human tumors [51, 52]. The key genes of these pathways (for example, STMN1, PNP, RPS27A and UBA52) transcribe different transcript variants, some of which encode different protein isoforms. It is therefore necessary to evaluate the gene-expression differences and to study gene regulatory mechanisms at isoform level rather than at gene level between non-oncogenic and disease conditions, such as cancer. Recent advances in cancer genomics have shown that gene-expression signatures are useful for biomarker identification and drug discovery . In this regard, the present study highlights the importance of studying gene-expression signatures at isoform level rather than at gene level, and makes a strong case for isoform level gene/protein-expression profiling methods for improved cancer biomarker and therapeutic discovery.
In conclusion, we have identified a common, isoform level signature that can be used to discriminate effectively between cancer and non-cancer cell lines. We found numerous genes for which the differential expression of alternative transcripts occurred in opposing directions, with some of the isoforms of the same gene being upregulated while others were downregulated, resulting in insignificant expression differences at gene level between cancer and non-oncogenic groups. This is supported by our experimental validation of opposing expression patterns for TPM4 gene isoforms in non-oncogenic and oncogenic tissue samples from breast cancer patients. The present study highlights the importance of studying gene-expression signatures at isoform level rather than at gene level in characterizing the cancer transcriptome.
This work was supported by American Cancer Society Research Scholar Grant (number RSG-07-097-01) to RD. RD holds a Philadelphia Healthcare Trust Endowed Chair Position; research in his laboratory is partially supported by the Philadelphia Healthcare Trust. The use of resources in the Bioinformatics Shared Facility of Wistar Institute Cancer Center (grant number P30 CA010815) is gratefully acknowledged. We thank Dr Zhiyan Fu for his reading of the manuscript.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.