Isoform level expression profiles provide better cancer signatures than gene level expression profiles

Background: The majority of mammalian genes generate multiple transcript variants and protein isoforms through alternative transcription and/or alternative splicing, and the dynamic changes at the transcript/isoform level between non-oncogenic and cancer cells remain largely unexplored. We hypothesized that isoform level expression profiles would be better than gene level expression profiles at discriminating between non-oncogenic and cancer cells. Methods: We analyzed 160 Affymetrix exon-array datasets, comprising cell lines of non-oncogenic or oncogenic tissue origins. We obtained the transcript-level and gene level expression estimates, and used unsupervised and supervised clustering algorithms to study the profile similarity between the samples at both gene and isoform levels. Results: Hierarchical clustering, based on isoform level expressions, effectively grouped the non-oncogenic and oncogenic cell lines with a virtually perfect homogeneity-grouping rate (97.5%), regardless of the tissue origin of the cell lines. However, this rate was much lower, being 75% at best, based on the gene level expressions. Statistical analyses of the difference between cancer and non-oncogenic samples identified the existence of numerous genes with differentially expressed isoforms, which otherwise were not significant at the gene level. We also found that canonical pathways of protein ubiquitination, purine metabolism, and breast-cancer regulation by stathmin1 were significantly enriched among genes that show differential expression at isoform level but not at gene level. Conclusions: In summary, cancer cell lines, regardless of their tissue of origin, can be effectively discriminated from non-cancer cell lines at isoform level, but not at gene level. This study suggests the existence of an isoform signature, rather than a gene signature, which could be used to distinguish cancer cells from normal cells.


Background
The past decade has witnessed unprecedented developments in high-throughput technologies, and their application has led to the molecular classification of many cancers [1]. Molecular profiling of gene expression, using microarrays, has shown that heterogeneity in outcome and survival of patients with cancer can be explained, in part, by genomic variation within the primary tumor. These technologies have helped identify many genetic and epigenetic modifications involved in the initiation and progression of various cancers, but their precise molecular mechanisms remain unclear. Furthermore, novel drugs have been developed to target some of the molecular pathways underlying the carcinogenic processes and maintenance of cancer phenotypes [2,3]; yet these drugs have provided limited survival benefits to only a small subset of patients with cancer, and only a small number of practically useful biomarkers are presently available. Improved molecular classification of cancers is essential to identify highly sensitive and specific biomarkers and therapeutic targets that reflect the molecular mechanisms functionally involved in tumor type-specific survival, drug resistance, tumor relapse, and patient outcome [4].
One of the reasons for the limited success in the quest for genomic-based, personalized medicine is the assumption of a 'one gene, one protein, one functional pathway' paradigm in most of the studies investigating molecular classification or therapeutic targets for cancer [5]. Recently, by making use of chromatin immunoprecipitation sequencing (ChIP-seq) and mRNA sequencing (mRNA-seq) approaches, we and others have discovered widespread use of alternative promoters and alternative splicing in mammalian genes in various tissues, developmental stages, and cell lines [6][7][8][9]. In fact, numerous genes displaying complex gene regulation via use of alternative promoters and alternative splicing have been known for some time, and recent evidence suggests that almost all multi-exon human genes have more than one mRNA isoform. During alternative splicing, the coding and noncoding regions of a single gene are rearranged to generate several messenger RNA transcripts, yielding distinct protein isoforms with differing biological functions. Notably, there is growing evidence linking aberrant use of alternative mRNA isoforms with cancer formation; several oncogenes and tumor-suppressor genes (for example, LEF1, TP63, TP73, HNF4A, RASSF1, and BCL2L1) are already known to have multiple promoters and alternative splice forms [10][11][12][13][14][15][16]. Moreover, it is known that the aberrant use of one isoform over another in some of these genes is directly linked to cancer cell growth [17]. Although the prevalence of alternative splicing in cancer genomes has been discussed in the literature [18][19][20], and it has been shown that use of splice forms provides better classification of normal and cancerous prostate tissue, it is not clear whether the use of genome-wide isoform level gene-expression profiles can provide a better global discriminative signature for cancer and normal cells.
Microarray expression profiling remains a powerful tool for identifying different subtypes of cancers. However, almost all microarray-based studies reported to date have measured the expression of the gene at gene level in a given locus, although a few exceptions in recent years have used exon arrays to measure differences at the exon and/or at transcript variant level. The recent application of exon arrays [21] and the advent of massively parallel sequencing are allowing whole cancer genomes and transcriptomes to be sequenced with extraordinary speed and accuracy, providing insight into the bewildering complexity of isoform level expression of gene transcripts [7]. The Encyclopedia of DNA Elements (ENCODE) consortium, a collective effort to facilitate and accelerate the annotation of functional elements in the human genome, is generating genome-wide expression data in various human cell lines through the use of exon microarrays [20]. Among the available data are gene-expression datasets, generated by the ENCODE consortium using an Affymetrix platform (GeneChip Human Exon 1.0 ST Array), across various cell lines that can be classified as either oncogenic (tumor/cancer) or non-oncogenic (normal). The arrays interrogate transcripts across their entire length, which can detect splicing differences between various types of samples [22][23][24]. Exons within a gene are represented on the microarray by multiple probe sets. The exon expression can thus be obtained by summarizing all the probe sets for this exon on the microarray. Once the exon-level expressions are obtained, the individual transcript expression of the gene and the total expression of the gene itself can then be inferred from the calculated exon expression, based on the assumption that the isoform structures and number of isoforms of the gene are known beforehand.
With genome-wide isoform level and gene level expression profiles in hand, it is natural to ask how the isoform level expression profiles of different oncogenic and non-oncogenic samples will cluster together, and whether isoform level expression profiles can lead to more accurate discriminators between oncogenic and non-oncogenic samples compared with gene level expression profiles. If the answer is yes, it is important to know which genes and pathways contribute to the improvement of discrimination at isoform level compared with gene level.
In the present study, we analyzed Affymetrix exon-array datasets collected from the public domain, primarily the ENCODE project from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database, which comprises 160 datasets from various cell lines of either non-oncogenic or oncogenic tissue origin. These datasets were used to test the hypothesis that isoform level expression analysis provides a better discriminator between non-oncogenic and oncogenic cell types than gene level expression analysis.

Summary of exon-array datasets
Unprocessed gene-expression datasets, generated using a whole-transcript GeneChip platform (Human Exon 1.0 ST Array; Affymetrix Inc., Santa Clara, CA, USA), were downloaded from the GEO public data depository, deposited mainly by the ENCODE project [18]. The GEO records GSE15805 [25], GSE17778, GSE19090 [26] and GSE17349 [27] contain, respectively, 79, 36, 83, and 8 samples of various cell lines. After excluding samples that were related to blood, progeria fibroblast, and stem cells, we had a total of 160 exon-array datasets, corresponding to 87 non-oncogenic and 73 oncogenic cell lines of various tissue origins. From the 160 datasets included in the analysis, we used 8 melanoma samples and 4 non-oncogenic melanocyte samples to form the first matched non-oncogenic and oncogenic pair, and used 4 datasets representing non-oncogenic human mammary epithelial cells (HMEC) and 8 datasets from a human breast adenocarcinoma cell line (MCF7) to form the second matched non-oncogenic and oncogenic pair. The complete classification and labeling information of cell lines used in this study are summarized in the supplementary information (see Additional file 1: Table S1).

Estimation of isoform level and gene level expression values from exon-array data
The isoform level (transcript-level) and gene level expression estimates were obtained by the Multi-Mapping Bayesian Gene eXpression (MMBGX) algorithm for Affymetrix whole-transcript arrays [28], based on the Ensembl database (version 56) [29], which contains a total of 114,930 different transcript annotations that correspond to 35,612 different gene models. We set the burn-in iteration at 8,192 and real iteration at 16,384 for both gene and isoform levels. All other parameters were set to their default values in the stand-alone algorithm. The algorithm gave a stable estimation of both gene level and isoform level expressions. For example, two independent runs on the same sample provided almost identical expression levels even with different seeds for the algorithm (correlation coefficient > 0.999, data not shown), whereas runs on different samples gave comparable results, but with much lower correlation (correlation coefficient of about 0.97). Expression estimates across all the samples were then normalized using the locally weighted scatterplot smoothing (loess) algorithm [30,31], also incorporated in the package.

Clustering and pathway analyses
We used the general hierarchical cluster algorithm to cluster the samples, using Euclidean distance as a measurement for dissimilarities [32]. We also applied consensus hierarchical clustering to assess the stability of the clustering results by multiple runs of the clustering algorithm on resampled data [32,33], and calculated the consensus index as reported previously [33]. Briefly, the consensus index is defined for each pair of samples, that is, the consensus index of sample pair (i, j) records the number of times that samples i and j are assigned to the same cluster, divided by the total number of times both samples are selected. To find the differential genes between two conditions, we used the limma method [34,35]. An isoform or gene was selected if its fold change was greater than a cut-off value of 2 and its false discovery rate (FDR)-adjusted P value was smaller than a cut-off value of 0.01 for all comparisons between the two conditions. Ingenuity Pathway Analysis (IPA) [36] was used to associate the identified gene sets with biological functions, canonical pathways, and networks. To identify pathway differences arising from gene sets identified at either isoform or gene level, we used the counting method on the P values of pathways from the IPA analysis; the P values were used as an indicator of association strength between the gene sets and pathways. For the three pairwise oncogenic/non-oncogenic comparisons (all oncogenic cell lines versus non-oncogenic cell lines, melanocyte versus melanoma, HMEC versus MCF7), a pathway was selected and reported if it was significantly associated with the gene sets identified at isoform level in all three pairs of comparisons, but was not significantly associated with the gene sets identified at gene level in all three pairs of comparisons, or vice versa. The significance level was set at P < 0.05 for all comparisons.
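The consensus index defined above can be illustrated with a minimal sketch. It is not the implementation used in this study (which relied on R/Bioconductor); the function name, the `cluster_fn` callback, the resampling fraction, and the number of runs are all hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def consensus_index(n_samples, cluster_fn, n_runs=100, frac=0.8, seed=0):
    """Return {(i, j): consensus index} over resampled clustering runs.

    cluster_fn is a hypothetical callback: it receives a list of sample
    indices and returns a dict mapping each index to a cluster label.
    n_runs and frac are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    together = {}  # times samples i and j were assigned to the same cluster
    selected = {}  # times samples i and j were both drawn in a resample
    k = max(2, int(frac * n_samples))
    for _ in range(n_runs):
        subset = sorted(rng.sample(range(n_samples), k))
        labels = cluster_fn(subset)
        for i, j in combinations(subset, 2):
            selected[(i, j)] = selected.get((i, j), 0) + 1
            if labels[i] == labels[j]:
                together[(i, j)] = together.get((i, j), 0) + 1
    # Consensus index: co-clustering count divided by co-selection count.
    return {pair: together.get(pair, 0) / n for pair, n in selected.items()}
```

A pair of samples that always falls into the same cluster whenever both are drawn has an index of 1, and a pair that never does has an index of 0; values in between indicate unstable assignments.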
All calculations were performed using Bioconductor (version 2.8 or above; Open Source software for bioinformatics, http://www.bioconductor.org) and R platform (version 2.10; The R Project for Statistical Computing, http://www.r-project.org) [37].

Ethics approval
The study protocol was approved by the institutional review board, and all data collection and analyses adhered to the protocols approved by the institutional review board. Informed consent was obtained from all participants.

Clinical characteristics of study cohort
Women with primary operable breast cancer undergoing breast surgery at the Hospital of the University of Pennsylvania were asked to participate in our tissue-banking protocol. The study cohort included four women diagnosed with breast cancer between 2010 and 2011. Clinical characteristics, including age at diagnosis, ethnicity, histology, tumor size, tumor grade, and number of involved (positive) axillary nodes, are provided (see Additional file 2: Table S2A).

Sample collection
After completion of surgery, the breast cancer within the surgical specimen was examined by surgical pathologists. Upon completion of gross examination and inking of the tumor specimen, fresh tumor tissue was taken from the center of the tumor without interfering with margin assessment as determined by the pathologists. A small portion of the tumor tissue and a small portion of normal adjacent breast tissue were collected, then immediately immersed in liquid nitrogen and stored at -80°C. RNA was isolated using this snap-frozen tumor tissue.

RNA isolation and reverse transcriptase-quantitative PCR experiment
Expression of transcripts/isoforms for seven genes in HMEC, MCF7, MDA-MB-231, and T47D cell lines and expression of two TPM4 isoforms in primary breast-cancer tissues were measured by reverse transcriptase-quantitative PCR (RT-qPCR). Total RNA from cells and tissues was isolated using TRI reagent (Sigma-Aldrich Inc., St. Louis, MO, USA) in accordance with the manufacturer's instructions.
For breast-cancer and normal breast tissues, up to 50 mg of frozen tissue was transferred to 1 ml of TRI reagent, then the tissue was immediately homogenized and the RNA extraction protocol was followed. Briefly, 0.5 μg of total RNA was reverse-transcribed in a 20 μl reaction with SuperscriptII reverse transcriptase (Invitrogen Inc.) in accordance with the manufacturer's instructions. RT-qPCR was performed using a master mix (Power SYBR Green; Applied Biosystems Inc., Foster City, CA, USA) and fold expression was calculated using the 2^(-ΔΔCT) method. RT-qPCR results were normalized based on the expression of GAPDH for TPM4 and TBP for the other transcripts. The measured isoforms and the primers used for the isoform-specific PCR are presented (see Additional file 2: Table S2B).
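As a worked illustration of the 2^(-ΔΔCT) calculation, the following minimal sketch computes fold expression from threshold-cycle (CT) values. The function and argument names are hypothetical, and the CT values in the usage line are invented for illustration only; they are not measurements from this study.

```python
def fold_expression(ct_target_sample, ct_ref_sample,
                    ct_target_control, ct_ref_control):
    """Fold expression of a target transcript by the 2^(-ddCT) method.

    "ref" is the normalizing transcript (for example GAPDH or TBP) and
    "control" is the calibrator sample (for example the HMEC cell line).
    """
    # Delta CT: target normalized to the reference within each sample.
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    # Delta-delta CT: test sample relative to the calibrator sample.
    delta_delta = delta_sample - delta_control
    return 2.0 ** -delta_delta


# Illustrative values only: the target crosses threshold two cycles earlier
# (relative to the reference) in the test sample than in the calibrator.
print(fold_expression(24.0, 18.0, 26.0, 18.0))  # 4.0
```

A value above 1 indicates higher expression in the test sample than in the calibrator, and a value below 1 indicates lower expression.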

Results
Clustering of samples using isoform level expression estimates provided more homogeneous grouping of oncogenic and non-oncogenic cell lines than gene level expression estimates
Initial processing of the exon-array datasets generated expression estimates for a total of 114,930 different transcripts and 35,612 different genes in 160 different datasets or samples. To test our hypothesis that the isoform level expression profiles are better than the gene level expression profiles at discriminating non-oncogenic and cancer cells, we performed unsupervised clustering of 160 samples. Hierarchical clustering was performed by selecting the transcripts/genes showing the most variable expression, as determined by coefficient of variation for the estimated isoform/gene level expression values across all samples. The dendrograms showed more homogeneous clustering of samples for isoform level expression analysis (Figure 1A) than for gene level expression analysis (Figure 1B). Similar clustering results were obtained by selecting different sets of the isoforms/genes with the greatest variation (see Additional file 3: Figures S1 and S2).
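The selection of the most variable transcripts or genes by coefficient of variation can be sketched as follows. This is an illustrative outline rather than the code used in the study: the function name and the input layout are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was applied.

```python
from statistics import mean, pstdev


def top_variable_features(expr, n_top):
    """Rank features by coefficient of variation (SD / |mean|), highest first.

    expr maps a feature id (isoform or gene) to its list of expression
    values across all samples. Population SD is an assumption here.
    """
    def cv(values):
        m = mean(values)
        return pstdev(values) / abs(m) if m != 0 else 0.0

    ranked = sorted(expr, key=lambda feature: cv(expr[feature]), reverse=True)
    return ranked[:n_top]


# Hypothetical feature ids and values, for illustration only.
toy = {"isoform_a": [1, 1, 1], "isoform_b": [1, 5, 9], "isoform_c": [4, 5, 6]}
print(top_variable_features(toy, 2))  # ['isoform_b', 'isoform_c']
```

The retained features would then be passed to the hierarchical clustering step; varying `n_top` corresponds to the different feature sets mentioned above.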
We expected the clustering of samples to result in a first-level grouping of different tissues, followed by a second-level grouping of cancer and non-oncogenic cell lines within each tissue type. Unexpectedly, however, we found almost uniform grouping of cancer and non-oncogenic cell lines into two large clusters, with an overall cluster purity of 97.5% at isoform level (Figure 1A). Further, the samples belonging to the same cell/tissue type within each cancer/non-oncogenic group were clustered together, confirming the discriminatory power of isoform level gene-expression profiles. For example, the paired non-oncogenic melanocyte and cancerous melanoma samples, and the matched pairs of MCF7 (breast-cancer cell line) and the HMEC samples (non-oncogenic origin) were separated correctly into either non-oncogenic or cancer groups, with one exception from MCF7 samples (Figure 1A). Overall, only four samples, two each from non-oncogenic and cancer cell lines, were clustered into the wrong group. The cancer cell lines that were clustered into the non-oncogenic group were one MCF7/mammary gland adenocarcinoma (GEO accession number GSM472936) and one pancreatic carcinoma cell line (GEO GSM472939), and the non-oncogenic samples that were assigned to the cancer group were one adult non-oncogenic human epidermal keratinocyte (NHEK) sample (GEO GSM472937) and one non-oncogenic umbilical vein endothelial cell (HUVEC) sample (GEO GSM472935).
Although the clustering at gene level showed some power of discrimination between non-oncogenic and cancer cell lines, the overall grouping was significantly less efficient than the clustering at isoform level. The gene level cluster purity was 75%, with 20 samples from the non-oncogenic and cancer cell lines clustered into the wrong group (Figure 1B). The better separation of non-oncogenic and cancer cell lines at isoform level (97.5% cluster purity) compared with gene level (75% cluster purity) implies that gene-expression profiles in cancer cells are more specifically altered at isoform level for numerous genes, which could not be detected using gene level analysis.
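For readers unfamiliar with cluster purity, a minimal sketch is given below. It assumes the common majority-label definition (the fraction of samples that carry the majority label of their cluster), which the text does not spell out. The toy arrangement reproduces the isoform level counts reported above (87 non-oncogenic and 73 oncogenic samples, with two samples of each type in the wrong cluster), but the cluster names and ordering are illustrative.

```python
from collections import Counter


def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples carrying the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(true_labels)


# Cluster A: 85 non-oncogenic + 2 cancer; cluster B: 71 cancer + 2 non-oncogenic.
clusters = ["A"] * 87 + ["B"] * 73
labels = (["non-oncogenic"] * 85 + ["cancer"] * 2
          + ["cancer"] * 71 + ["non-oncogenic"] * 2)
print(cluster_purity(clusters, labels))  # 0.975
```

With four of 160 samples misplaced, the purity is 156/160 = 97.5%, matching the isoform level result.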
We also applied consensus hierarchical clustering to compare the stability of the isoform-based approach to the gene-based approach [33,38]. The empirical cumulative distribution function (CDF) plot of the consensus index (Figure 2A) indicated that isoform-based clustering gives more stable results than gene-based clustering. We further plotted the silhouette width for isoform-based and gene-based clustering (Figure 2B and 2C, respectively) [39]. A larger silhouette width for a sample indicates higher similarity to its own group than to any other group. The average silhouette width for isoform-based clustering was 0.22, which was larger than the gene-based width of 0.18, indicating that the clustering based on isoforms is more homogeneous than that based on genes.
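The silhouette width can be sketched in a few lines. The version below uses Euclidean distance and the standard definition s = (b - a) / max(a, b), where a is the mean distance from a sample to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. It is an illustrative sketch with hypothetical names, not the R routine used to produce Figure 2.

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width per sample, using Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    widths = []
    for i, p in enumerate(points):
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if i == j:
                continue
            sums[labels[j]] = sums.get(labels[j], 0.0) + dist(p, q)
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts or len(counts) < 2:
            # Singleton cluster, or only one cluster in total: width set to 0.
            widths.append(0.0)
            continue
        a = sums[own] / counts[own]
        b = min(sums[c] / counts[c] for c in counts if c != own)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths
```

Averaging the returned values over all samples gives the average silhouette width; values near 1 indicate tight, well-separated clusters, and values near 0 indicate overlapping clusters.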
We next focused our analysis on two specific cancers, breast cancer and melanoma, for which we had matched oncogenic and non-oncogenic cell lines, in addition to the comparison of all oncogenic versus all non-oncogenic cell lines.
Transcript variants of numerous genes were differentially expressed between non-oncogenic and cancer cell lines
After performing the three comparisons independently, we overlapped the identified gene sets to identify those genes or gene isoforms that were consistently upregulated or downregulated in cancer compared with non-oncogenic cell lines (see Additional file 7: Table S6). We denoted the genes that were found to be significantly different between non-oncogenic and cancer groups in all three comparisons as the core set of genes/gene isoforms (Figure 3D). Interestingly, we found numerous genes that were significantly differentially expressed at isoform level but not at gene level. A gene was declared as differentially expressed at isoform level if at least one of its isoforms showed significant differential expression between non-oncogenic and cancer groups. For example, 29 and 13 genes were found to be significantly upregulated and downregulated, respectively, at isoform level but not at gene level in all three pairwise comparisons (Figure 3D). Overall, we found a total of 260 different transcript variants or gene isoforms (Figure 3E) of 182 unique genes (Figure 3F) that had significant gene-expression differences at either isoform or gene level.
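The rule for calling a gene significant at isoform level but not at gene level can be sketched as below. The thresholds (fold change above 2, FDR-adjusted P below 0.01) follow the Methods. The function name, the input layout, the treatment of fold change as a signed value compared by magnitude, and the identifiers in the usage example are all hypothetical.

```python
def isoform_only_genes(isoform_stats, gene_stats, isoform_to_gene,
                       fc_cut=2.0, fdr_cut=0.01):
    """Genes with at least one significant isoform but no gene-level signal.

    isoform_stats and gene_stats map an id to (signed fold change, adjusted P).
    Comparing the absolute fold change to fc_cut is an assumption.
    """
    def significant(stat):
        fold_change, adj_p = stat
        return abs(fold_change) > fc_cut and adj_p < fdr_cut

    # A gene counts at isoform level if any one of its isoforms is significant.
    by_isoform = {isoform_to_gene[iso]
                  for iso, stat in isoform_stats.items() if significant(stat)}
    by_gene = {gene for gene, stat in gene_stats.items() if significant(stat)}
    return by_isoform - by_gene


# Hypothetical ids and statistics, for illustration only.
iso = {"isoA1": (3.0, 0.001), "isoA2": (-4.0, 0.002), "isoB1": (2.5, 0.001)}
gene = {"geneA": (1.1, 0.6), "geneB": (2.6, 0.001)}
mapping = {"isoA1": "geneA", "isoA2": "geneA", "isoB1": "geneB"}
print(isoform_only_genes(iso, gene, mapping))  # {'geneA'}
```

Intersecting the sets returned for each of the three comparisons would give a core set of the kind shown in Figure 3D.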
In each pair of comparisons, of the total genes identified to be significant at isoform level, at least 30% (range from 30% to 55%) were found to be significant only at isoform level. In other words, more than one isoform of these genes displayed differential expression between non-oncogenic and cancer samples, but the gene level expression differences were cancelled out by the combined effect of various isoforms of the same gene. These genes displayed alternative splicing between non-oncogenic and cancer cell lines. This observation strongly supports previous reports such as that by David and Manley [18]. For example, the MITF (micro-ophthalmia transcription factor) gene uses at least nine different promoters and first exons to generate a remarkably diverse set of mRNAs and protein isoforms that differ at the N-terminus. The array platform we used (Affymetrix GeneChip Human Exon 1.0 ST Array) has probe sets corresponding to 16 different transcript variants of this gene, based on Ensembl gene annotations. The alternative promoters of MITF reflect the tissue specificity of its isoforms, which are selectively expressed in melanocytes, macrophages, osteoclasts, heart muscle, or retinal pigmented epithelium. MITF, generally believed to play a primary role in melanocyte stem-cell proliferation and expression of a set of pigment-related genes [40], has been shown to be amplified in a small percentage (10 to 20%) of melanomas, and seems to confer a poor prognosis when overexpressed [41]. In the comparative analysis between non-oncogenic melanocytes and melanoma cell lines (Figure 3C), no differential expression of MITF was found by the gene level analysis. However, the MITF isoform ENST00000352241 was found to be significantly overexpressed in melanoma compared with melanocytes (FC = 3.4), whereas the transcript variants ENST00000433517 (FC = -5.6) and ENST00000472437 (FC = -3.4) were underexpressed in melanoma (Figure 4A).
Similarly, the TPM4 gene was seen to have weak differential expression in MCF7 compared with HMEC cell line samples. However, although one of the TPM4 isoforms (ENST0000030933) was found to be strongly overexpressed (FC = 5.47), another isoform (ENST00000344824) was found to be significantly underexpressed in MCF7 samples (FC = -7.75) (Figure 4B). These two isoforms thus cancelled each other out, resulting in the overall gene expression not being significantly different between the non-oncogenic and oncogenic cell lines. TPM4 has been reported to be differentially expressed in breast cancer [42]. Our analysis suggests that whereas gene level expression estimates of TPM4 and MITF contribute little to the discrimination of cancer cell lines from non-oncogenic cells, expression estimates specific to one or more isoforms of these genes have a better discriminating power. Interestingly, we found a total of 294 isoforms, corresponding to 110 genes in melanoma (see Additional file 6: Table S5), and 75 transcript isoforms, corresponding to 16 genes in breast cancer (see Additional file 5: Table S4), that showed opposing expression patterns at isoform level.
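The cancellation effect can be made concrete with a small numerical sketch. It rests on two simplifying assumptions: that the gene level estimate is the sum of the isoform level estimates, and that a negative fold change denotes the negative reciprocal of the expression ratio. The expression values and isoform names are invented for illustration and are not measurements from this study.

```python
def signed_fold_change(cancer, normal):
    """Signed fold change: ratio if up, negative reciprocal ratio if down."""
    ratio = cancer / normal
    return ratio if ratio >= 1 else -1.0 / ratio


# Hypothetical linear-scale expression of two isoforms of one gene.
normal = {"isoform_1": 100.0, "isoform_2": 800.0}
cancer = {"isoform_1": 800.0, "isoform_2": 100.0}

for name in normal:
    print(name, signed_fold_change(cancer[name], normal[name]))
# isoform_1 8.0
# isoform_2 -8.0

# Gene level taken as the sum of isoform levels (simplifying assumption).
gene_fc = signed_fold_change(sum(cancer.values()), sum(normal.values()))
print("gene", gene_fc)  # gene 1.0
```

Both isoforms change eightfold, in opposite directions, yet the summed gene level signal is unchanged, which is the pattern described above for MITF and TPM4.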

Experimental validation of differentially expressed transcript variants in breast-cancer cell lines and samples
To validate the existence of the opposing expression patterns of isoforms for various genes, we measured isoform expression by RT-qPCR for two opposing isoforms of seven genes in three breast-cancer cell lines relative to the non-oncogenic HMEC cell line (Table 1). The expression patterns of isoforms obtained from exon-array and RT-qPCR experiments were similar for all seven genes in MCF7 cell lines. However, in the MDA-MB-231 and T47D cell lines, isoforms of four and two of the seven genes, respectively, had an expression pattern similar to that seen in the exon-array data for MCF7. To further validate opposing transcript expression in patient samples between non-oncogenic and cancer tissues, we selected the TPM4 gene in breast cancer as an example. The opposing expression patterns of the TPM4 isoforms were confirmed in the estrogen receptor-positive and triple-negative breast cancer samples. Although the Her2+ sample did not show the opposing pattern of expression, one isoform had the highest fold change of all the samples (Figure 4C). In all samples, the simple Student t-test results between the averaged fold expressions of the two isoforms were all significant (P < 0.001). These results strongly support our hypothesis that isoforms of multi-transcript genes can have opposing roles in cancer.

Supervised analysis identified an isoform set able to separate the tumor lines from normal lines in an almost perfect pattern
We performed IPA (version 6.0; Ingenuity® Systems, Redwood City, CA, USA) [36] to find significant molecular functional categories enriched in the differentially expressed gene sets, and transformed the target genes into a set of relevant networks by using literature-based records that are maintained in the IPA Base. We first performed this analysis using the core gene set of 182 genes that were consistently upregulated or downregulated in cancer cell lines (Figure 3F). The analyses found 10 molecular and cellular functions that were significantly enriched in the core gene set (see Additional file 8: Table S7). Interestingly, the top five molecular and cellular functions identified by IPA were 'Role of BRCA1 in DNA damage Therefore, we considered that the core genes involved in these pathways might also be useful to effectively separate cancer from non-oncogenic samples. To test this hypothesis, we repeated the clustering analysis, using the core genes and their isoforms that belonged to the five most significant pathways (Figure 5). The clustering analysis performed by filtering out the isoforms for which there was relatively little variation in expression estimates across all the samples produced an almost identical result (18 isoforms, corresponding to 14 unique genes, Figure 5A) to that obtained by using all the gene isoforms (Figure 1A). However, at gene level, the clustering based on these 14 genes produced a comparable result, but with relatively lower cluster purity (92.5%, or 12 of 160 cell lines were grouped in the wrong cluster) than at isoform level (97.5% or 4 of 160 were grouped in the wrong cluster) (Figure 5B). These 14 genes are WNT5B, CCND2, SERPINB7, GPR176, INHBA, EFNB1, PTRF, CDH11, ZBTB16, GJA1, COL5A2, NID1, PRDM1, and TCF4. Except for CCND2 and GPR176, all other genes in our database have more than one isoform.
Four genes (SERPINB7, INHBA, GJA1, and NID1) have two isoforms that have significantly different expression between the cancer and non-oncogenic groups. Interestingly, 12 of these 14 genes belong to the same gene network, involved in hematological system development and function, tissue morphology, and cellular development. According to the Ingenuity Pathway Knowledge Base, the network consists of a total of 27 different genes, which suggests that almost 50% of the genes belonging to this network are dysregulated either at the gene or isoform level between non-oncogenic and cancer cells ( Figure 5C). Moreover, most of these genes have already been implicated both in tumorigenesis and in several developmental processes [43][44][45][46][47][48]. For example, it was shown that the phosphorylation-dependent interaction between c-Jun and TCF4 regulates intestinal tumorigenesis by integrating c-Jun kinase (JNK) and adenomatous polyposis coli (APC)/β-catenin, two distinct pathways activated by Wnt signaling [49]. Multiple alternatively spliced transcript variants that may encode different protein isoforms of these genes (for example, TCF4, WNT5B) have been described. Therefore, it would be interesting to evaluate the components of this gene network in different cancers at isoform level.

Gene level and isoform level analyses identified interesting pathways associated with cancer
To test whether the genes that are differentially expressed at isoform level but not at gene level could reveal interesting pathways associated with cancer, we focused the pathway-enrichment analysis on two different gene sets: 1) genes that are significant at isoform level only and 2) genes that are significant at gene level only (Figure 3A-C; genes without overlaps in the middle). We performed this analysis separately for each of the three comparisons (all cancer versus all non-oncogenic; HMEC versus MCF7; and melanocytes versus melanoma cell lines) between the non-oncogenic and cancer pairs. This analysis led to the identification of three canonical pathways (protein ubiquitination, purine metabolism, and breast-cancer regulation by stathmin1) that were significantly enriched in isoform level gene sets, but not in gene level gene sets (Figure 6).
Note to Table 1: The fold expression of the indicated transcripts in the three cell lines was calculated relative to HMEC (N) cells.
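The pathway selection rule can be sketched as follows. The sketch takes one reading of the rule described in the Methods, namely that a pathway is reported when it is significant (P < 0.05) for the isoform level gene set in every comparison and non-significant for the gene level gene set in every comparison; this reading, the function name, the input layout, and the treatment of a missing pathway as non-significant are all assumptions. Pathway and comparison names in the usage example are placeholders.

```python
def isoform_specific_pathways(isoform_p, gene_p, alpha=0.05):
    """Pathways enriched at isoform level in all comparisons, never at gene level.

    isoform_p and gene_p map a comparison name to {pathway: P value}.
    A pathway absent from a result table is treated as non-significant.
    """
    comparisons = list(isoform_p)
    pathways = set()
    for comparison in comparisons:
        pathways.update(isoform_p[comparison])
        pathways.update(gene_p.get(comparison, {}))
    selected = []
    for pathway in sorted(pathways):
        sig_isoform = all(isoform_p[c].get(pathway, 1.0) < alpha
                          for c in comparisons)
        nonsig_gene = all(gene_p.get(c, {}).get(pathway, 1.0) >= alpha
                          for c in comparisons)
        if sig_isoform and nonsig_gene:
            selected.append(pathway)
    return selected


# Placeholder pathways and P values, for illustration only.
iso_p = {"all": {"pw1": 0.01, "pw2": 0.01},
         "breast": {"pw1": 0.02, "pw2": 0.20},
         "melanoma": {"pw1": 0.03, "pw2": 0.01}}
gene_p = {"all": {"pw1": 0.30}, "breast": {"pw1": 0.40}, "melanoma": {"pw1": 0.50}}
print(isoform_specific_pathways(iso_p, gene_p))  # ['pw1']
```

Swapping the two arguments gives the converse selection, that is, pathways enriched only in the gene level gene sets.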

Discussion
Human cancer is a complex disease. It is known that most of the genes in mammalian genomes generate different transcript variants and protein isoforms, which often function in a cell/tissue type-specific and developmental stage-specific manner in non-oncogenic cells. Cancer results from the sequential acquisition of a number of genetic and epigenetic alterations, and these alterations may change the expression of specific isoforms of a gene but not of others. Despite this growing knowledge, most biomarker and drug-discovery studies still evaluate expression differences and study gene regulatory mechanisms at gene level rather than at isoform level. In this study, we have shown that oncogenic cell lines could be more accurately discriminated from non-oncogenic cell lines, regardless of their cells of origin, by gene-expression profiling at isoform level compared with gene level. In spite of the differences in tissues of origin, the cell lines were broadly clustered into two groups, non-oncogenic and oncogenic, by isoform level gene-expression profiles. We noted that numerous genes showed differential expression at isoform level but not at gene level. For some of these genes, the differential expression of alternative transcripts occurred in the opposite direction; while some of the isoforms of the same gene were upregulated, others were downregulated, resulting in them cancelling each other out and producing insignificant expression differences at gene level between cancer and non-oncogenic groups. Our findings are in agreement with a previous study on prostate cancer that investigated the expression of 1532 splice forms for 364 prostate cancer-related genes, using data from a customized exon junction array [20].
The authors found that many genes were differentially expressed at splice-form level but not at gene level, and this increase in the number of differentially expressed variables at splice-form level contributed to a 92% accuracy for a 128 splice-form-based classifier for normal and cancerous prostate tissue, whereas the accuracy was 87% using a classifier based on 32 genes. That study profiled 1532 mRNA splice forms from 364 potential prostate cancer-related genes, whereas in the current study, we used genome-wide exon-array data that identified the expression of 114,930 transcripts/isoforms corresponding to 35,612 different genes, including all known non-coding genes in the Ensembl database. In addition, our study focused on discriminating oncogenic and non-oncogenic cells in general, irrespective of their tissue of origin. Using genome-wide isoform level versus gene level expression information, we found that oncogenic and non-oncogenic cells could be segregated using isoform level information with 97.5% accuracy versus 75% accuracy for gene level information, and even a smaller signature of 18 isoforms was effective in separating the two groups, with equal accuracy. These subtle differences at isoform level in discriminating non-oncogenic and oncogenic cell lines reflect the fact that gene level expression measurements, whose estimates are generally the summation of all the isoform level expression estimates of individual genes, are less accurate in characterizing cancer and non-oncogenic cells.
The pathway-enrichment analysis of genes that are differentially expressed in cancer cell lines at isoform level but not at gene level produced three interesting pathways that have been implicated in various cancers. It is well known that protein phosphorylation and protein ubiquitination regulate most aspects of cell life, and defects in these control mechanisms cause cancer and many other diseases [50]. Similarly, abnormalities in purine metabolism and over-expression of Stathmin 1 (STMN1) are characteristic features of many human tumors [51,52]. The key genes of these pathways (for example, STMN1, PNP, RPS27A and UBA52) transcribe different transcript variants, some of which encode different protein isoforms. It is therefore necessary to evaluate the gene-expression differences and to study gene regulatory mechanisms at isoform level rather than at gene level between non-oncogenic and disease conditions, such as cancer. Recent advances in cancer genomics have shown that gene-expression signatures are useful for biomarker identification and drug discovery [53]. In this regard, the present study highlights the importance of studying gene-expression signatures at isoform level rather than at gene level, and makes a strong case for isoform level gene/protein-expression profiling methods for improved cancer biomarker and therapeutic discovery.

Conclusions
In conclusion, we have identified a common, isoform level signature that can be used to discriminate effectively between cancer and non-cancer cell lines. We found numerous genes for which the differential expression of alternative transcripts occurred in opposing directions, with some of the isoforms of the same gene being upregulated while others were downregulated, resulting in insignificant expression differences at gene level between cancer and non-oncogenic groups. This is supported by our experimental validation of opposing expression patterns for TPM4 gene isoforms in non-oncogenic and oncogenic tissue samples from breast cancer patients. The present study highlights the importance of studying gene-expression signatures at isoform level rather than at gene level in characterizing the cancer transcriptome.