The prognostic potential of alternative transcript isoforms across human tumors
© The Author(s). 2016
Received: 18 May 2016
Accepted: 27 July 2016
Published: 17 August 2016
Phenotypic changes during cancer progression are associated with alterations in gene expression, which can be exploited to build molecular signatures for tumor stage identification and prognosis. However, it is not yet known whether the relative abundance of transcript isoforms may be informative for clinical stage and survival.
Using information theory and machine learning methods, we integrated RNA sequencing and clinical data from The Cancer Genome Atlas project to perform the first systematic analysis of the prognostic potential of transcript isoforms in 12 solid tumors to build new signatures for stage and prognosis. This study was also performed in breast tumors according to estrogen receptor (ER) status and melanoma tumors with proliferative and invasive phenotypes.
Transcript isoform signatures accurately separate early from late-stage groups and metastatic from non-metastatic tumors, and are predictive of the survival of patients with undetermined lymph node invasion or metastatic status. These signatures show similar, and sometimes better, accuracies compared with known gene expression signatures in retrospective data and are largely independent of gene expression changes. Furthermore, we show frequent transcript isoform changes in breast tumors according to ER status, and in melanoma tumors according to the invasive or proliferative phenotype, and derive accurate predictive models of stage and survival within each patient subgroup.
Our analyses reveal new signatures based on transcript isoform abundances that characterize tumor phenotypes and their progression independently of gene expression. Transcript isoform signatures appear especially relevant to determine lymph node invasion and metastasis and may potentially contribute towards current strategies of precision cancer medicine.
Tumors advance through stages that are generally characterized by their size and spread to lymph nodes and other parts of the body . Establishing the stage of a tumor is critical to determine patient prognosis and to select the appropriate therapeutic strategy . Even though stage is generally defined from a number of tests carried out on a patient, this information may sometimes be incomplete or inconclusive. Advances in the molecular characterization of tumors have led to improvements in stage classification and clinical management of patients . Although tumors originate primarily from genetic lesions, their progression involves other molecular transformations, which are related to the activation of specific aggressive phenotypes, like tumor spread and metastasis, and are often reflected in gene expression changes [4, 5]. Accordingly, the development of gene expression signatures has been instrumental to complement and improve stage identification and prognosis [6–9]. On the other hand, gene expression summarizes the output of RNA transcripts from a gene locus, which is mostly explained by one transcript isoform . Furthermore, we described before how solid tumors present frequent changes in the relative abundances of isoforms in comparison to normal tissues . This prompts the question of whether transcript isoform changes, which remain largely unexplored as predictive signatures of tumor stage and survival, could hold relevant novel mechanisms of tumor progression. We investigated the potential of the relative abundances of transcript isoforms to determine tumor staging and clinical outcome in 12 different tumor types, integrating RNA sequencing (RNA-seq) and clinical annotation data for 12 tumor types from The Cancer Genome Atlas (TCGA) project. Our analyses revealed new signatures that characterize tumor phenotypes and their progression that are largely independent of gene expression. Knowledge about the relative abundance of transcript isoforms in tumors can potentially help predicting stage and clinical outcome and contribute towards current molecular strategies in precision cancer medicine.
Relative abundances of transcript isoforms are predictive of tumor stage
Number of samples analyzed for each tumor type and stage
Breast invasive carcinoma
147 (T3, T4)
171 (N2, N3)
45 (T1, T2)
Head and neck squamous cell carcinoma
166 (N2, N3)
77 (S1, S2)
19 (T3, T4)
19 (S3, S4)
Kidney renal clear cell carcinoma
186 (T3, T4)
Kidney renal papillary carcinoma
38 (T3, T4)
16 (N1, N2)
38 (S3, S4)
Lung squamous cell carcinoma
59 (T3, T4)
37 (N2, N3)
76 (S3, S4)
57 (T3, T4)
70 (N2, N3)
242 (S3, S4)
Ovarian serous cystadenocarcinoma
93 (T3, T4)
Skin cutaneous melanoma
179 (T3, T4)
To characterize the functional involvement of the found discriminant isoforms, we performed an enrichment analysis of cancer hallmarks (see “Methods”) (Additional file 4). Testing discriminant isoforms for each stage class and tumor type independently yielded frequent enrichment of MYC targets, oxidative phosphorylation, mTORC signaling, DNA repair, and Interferon response (Additional file 2: Figure S1b). Notably, aggregating all tumor types for each clinical class, the discriminant transcripts show enrichment in MYC targets and genes involved in oxidative phosphorylation (Fig. 1b). On the other hand, combining discriminant isoforms from different clinical classes in the same tumor type, only five of the 12 tumor types tested show enriched hallmarks (Fig. 1c), which include the enrichment of MYC targets in skin cutaneous melanoma (SKCM) and kidney papillary carcinoma (KIRP). These results indicate that there are frequent transcripts isoform changes in cancer-relevant pathways during tumor progression, many of which may be driven by MYC activity. To test some of our findings, we compared the ΔPSI values of the discriminant transcripts for metastasis in SKCM with the ΔPSI values measured between metastatic (SKMel147)  and non-metastatic (Mel505)  melanoma cells (see “Methods”). Of the 958 discriminant isoforms in SKCM, 817 had expression in the cell lines. From these, 504 (61.7 %) show a change in PSI in the same direction and 253 of them have |ΔPSI| > 0.1 in both comparisons (Additional file 2: Figure S1c).
To build signatures of tumor stage based on transcript isoforms, we applied a multivariate feature selection method on the discriminant isoforms selected before to obtain a non-redundant subset of predictive transcripts, which we used to build logistic model trees (LMT) for each tumor type and stage class (Fig. 1a) (models given in Additional file 5). Each one of these models represents a transcript signature for each stage class and each tumor type. Using cross-validation on the annotated TCGA samples, the mean accuracy of the models in terms of the area under the ROC curve (AUC) is 0.783 (Fig. 1d), with similar average precision-recall values (Additional file 2: Figure S1d). T-models show the best accuracies (mean AUC = 0.824), with the models for KIRP, kidney chromophobe (KICH), colon adenocarcinoma (COAD), and neck squamous cell carcinoma (HNSC) being the most accurate (mean AUC >0.87).
The KIRP T-model includes an isoform for PAX6. Increased inclusion of exon 5 of this gene has been related to neuronal differentiation , which we see associated with the late T stage (Additional file 2: Figure S2a). The best N-models correspond to KIRP and prostate adenocarcinoma (PRAD) (mean AUC >0.89). The KIRP N-model includes an isoform in the MAP kinase MKNK1 (Additional file 2: Figure S2a), suggesting a similar involvement in cancer as MKNK2 . The PRAD N-model (mean AUC = 0.986) includes an isoform of IDO1 (Fig. 1e), a gene related to anti-tumor defense . The best M-model corresponds to SKCM (mean AUC = 0.93) and includes an isoform change in the transmembrane gene TM6SF1 (Additional file 2: Figure S2a) and the tyrosine kinase SYK (Fig. 1e). In metastatic melanoma samples, SYK shows an increase in the abundance of the long form and a decrease of the short form, as previously observed in breast tumors . Finally, the best S-models correspond to COAD, breast invasive carcinoma (BRCA), KICH, and ovarian serous cystadenocarcinoma (OV) (AUC >0.9). Interestingly, OV S-model includes an isoform in the cancer driver GAS7 (Fig. 1e). In general, we found no overlap between the different stage models. A notable exception is an isoform of NSUN7 that appears in all models for kidney renal clear cell carcinoma (KIRC) with high PSI values at late-stage and an isoform of SKA3 that appears in the N-, T-, and S-models for KIRP, with low PSI values at late stages. The low general overlap is consistent with pathological transformations being associated with multiple molecular alterations.
Transcript isoform changes are predictive of survival in patients with unknown stage
Survival analysis between early and late-stage patient groups
p = 0.375
p = 0.00012
p = 0.008
p = 0.0007
p = 0.0011
p = 0.011
p = 1.48e-05
p = 0.012
p = 0.051
p = 0.0137
p = 2.49e-07
p = 0.00896
p = 0.00896
p = 2e-15
p = 0.0125
p = 0
p = 0
p = 0.0043
p = 0.005
p = 8.86e-007
p = 0.029
p = 0.071
p = 0.025
p = 7.02e-09
p = 3.26e-06
p = 0.165
p = 7.02e-09
p = 0.0537
p = 0.456
p = 1
p = 0.418
p = 0.324
p = 0.597
p = 2.49e-07
No relation of isoform signatures with stromal and immune cell content
To assess whether the purity of the samples could be a potential confounding factor of the derived signatures, we tested the correlation between the transcript PSI values of our models against signatures of stromal and immune cell content  (see “Methods”). Overall, all signatures showed low correlation with stromal content (mean Pearson |R| <0.4), and all except the N-model in BRCA (Pearson R = 0.433) had mean |R| <0.4 with immune cell content (Additional file 6). From the 547 transcript isoforms tested, 95 % show a correlation |R| <0.4 (Pearson) for both stromal and immune scores. Among the few cases with |R| >0.5 there is an isoform of ENAH (Additional file 2: Figure S2d), which is present in the T-models in KIRP and COAD and that was previously linked to an invasive phenotype . Recent analyses have shown that clinical stage does not correlate with tumor purity in the TCGA samples . Our analysis further supports those results and indicates that isoform-based signatures of stage do not reflect stromal or immune cell content.
No universal transcript isoform signature for tumor staging
Our results prompt the question of whether there might be a universal signature of stage and survival based on transcript isoform changes. To test this, we grouped all annotated samples from the different tumor types according to the stage class and applied the same analyses as before. We could only build M- and S-models due to the lack of common isoforms with discriminant power for the other classes (Additional file 7). The average AUC values for M- and S-models were lower than before, with mean AUC of 0.5 and 0.685, respectively. Aggregating samples from BRCA, COAD, and LUAD, we observed a slight increase in accuracy (mean AUC = 0.702). Similarly, analyzing KIRC, KIRP, and KICH samples together, the S-model achieves mean AUC = 0.809. In this case, approximately half of the isoforms were present in the previous models. Finally, analyzing the squamous tumors together (HNSC and LUSC), we derived N- and S-models with mean AUC = 0.72. For other combinations, we could not find accuracies greater than AUC = 0.5. This indicates that despite some overlapping features across tumor types, there is no common signature for all the tested tumor types.
Transcript signatures provide better predictions than event-based signatures in retrospective datasets
We tested whether local alternative splicing events, as opposed to transcript isoform changes, could also determine stage. We applied our analysis pipeline using PSI values for all events in the same tumor samples used before. For most of the stage classes we observed similar or smaller accuracy values for events compared to transcript models (average AUC 0.617 versus 0.778, respectively) (Fig. 2d and Additional file 8). Only 23.5 % of the isoforms in models overlap with at least one alternative splicing event from the event-based models: 16.51 % overlap with alternative 5′/3′ splice-sites, mutually exclusive exons, retained introns, or cassette exon events, and 6.54 % overlap with alternative first or last exon events. Moreover, 82.39 % of isoforms in models overlap with at least one of the pre-calculated alternative splicing events. This indicates that a considerable number of changes in exon-intron structures described by the isoform models that are predictive of tumor stage cannot be captured in terms of simple alternative splicing events.
Transcript signatures provide relevant information about tumor metastasis and lymph node invasion independently of gene expression
Previously proposed molecular classifiers of stage were based on gene expression [7, 8]. We thus tested the relation of our transcript signatures with gene expression. We observed that the proportion of genes with differential expression (DE) vary markedly between transcript signatures (Additional file 5 and “Methods”). For M-models, nine (18 %) genes in the SKCM and four (18 %) genes in KIRC showed DE. For N-models, we only found three (14 %) in PRAD and 13 (68 %) in THCA. In contrast, T-models presented frequent changes across the different tumor types, with 17 (46 %) in KIRP, seven (27 %) in KIRC, six (33 %) in LUAD, four (50 %) in THCA, and one in HNSC (5 %). Similarly, S-models also showed frequent DE: 16 (52 %) in KIRC, 12 (46 %) in KIRP, one (25 %) in LUAD, and one (7 %) in BRCA.
Next, we compared the discriminative power of transcript and gene expression signatures. We thus applied our pipeline to gene expression values to derive gene-based signatures of stage (see “Methods”). The overall accuracy for gene-based signatures was similar to isoform-based models (average AUC values 0.783 and 0.781 for isoforms and genes, respectively) (Fig. 2d and Additional file 9). Interestingly, isoforms had better mean accuracies for the M-model in LUAD (0.883 versus 0.535) (Fig. 2d, upper left panel) and for the N-model in PRAD (0.986 versus 0.839) (Fig. 2d, lower left panel), compared to gene models. In contrast, the gene-based S-model for THCA showed higher accuracy (0.529 versus 0.836) (Fig. 2d, upper right panel). Gene and isoform based models generally involved different genes with only few exceptions, including CD72 in SKCM M-models, PTGS2 and VIPR1 in the THCA T-models, SLC14A1 in COAD S-models, and DNASE1L3 KICH S-models. Interestingly, gene-based S-models were predictive of survival for samples lacking stage S annotation (p value = 0.0024) (Fig. 2e), whereas no significant difference in survival was found with the gene-based M- and N-models (p values = 0.983 and 0.161, respectively).
The results described above suggest that genes and transcripts provide independent information and may yield better signatures when combined together. We thus built mixed models of gene expression and transcript relative abundance. We started with all gene and transcript discriminant features and selected a non-redundant set of features to build logistic-model trees. The accuracy of these mixed models was on average better (mean AUC = 0.831) than using only transcripts or genes (Additional file 2: Figure S3a). Notably, the transcript signatures performed better than the mixed signatures for the LUAD M-model (AUC = 0.883 versus 0.814) or the PRAD N-model (AUC = 0.986 versus 0.938). In contrast, the mixed model performed better in the COAD M-model (AUC = 0.831 versus 0.864) and the HNSC S-model (AUC = 0.676 versus 0.778). Additionally, mixed models are able to predict survival differences between early and late stages for the N and S stage classes (p value = 0.041 and 0.033, respectively) (Additional file 3: Figure S3b and S3c).
Finally, we compared our transcript signatures with an expression signature of 44 genes built to differentiate metastatic and late-stage samples in colon cancer  (see “Methods”). The mean AUC values obtained for the metastatic annotation (M) and the overall stage (S) were 0.612 and 0.649, respectively, for the gene expression signature and 0.82 and 0.94 for our transcript signatures. Notably, none of the genes involved in our transcript models for COAD presented DE. Our analyses indicate that changes in the relative abundance of transcripts hold relevant information about tumor transformation independently of gene expression changes.
Transcript relative abundances as prognostic markers in ER-negative breast tumors
ER-negative (ER–) and ER-positive (ER+) breast tumor subgroups
48 (T3, T4)
37 (N2, N3)
55 (S3, S4)
29 (T3, T4)
36 (N2, N3)
43 (S3, S4)
We further compared our transcript signatures with known gene expression signatures for breast tumors: OncotypeDX ; MammaPrint ; and PAM50  (see “Methods”). Although these signatures were not originally designed to identify tumor stage, they bear predictive value for this purpose . Their accuracies to separate the ER+ and ER– subgroups were very similar to our transcript signatures (Fig. 3e). This is expected for PAM50 and OncotypeDX, as they include ESR1. We then tested how well the gene signatures differentiate stage within each subset, ER+ or ER–, independently. In general, PAM50 performed better than the two other signatures, except for S in ER– and for N in ER+, where MammaPrint performs better, and for T in ER–, where OncotypeDX performs better (Fig. 3e). Notably, in all cases the transcript signature had better accuracies. We conclude that transcript isoform models can provide relevant information to determine stage and hence complement current clinical signatures.
Transcript relative abundances characterize an invasive phenotype and survival in melanoma
To test whether the melanoma phenotypes are associated with different transcript transformations during tumor progression, we studied the MITF+ and MITF– sets independently to derive signatures of survival. We selected samples in the top and bottom 40 % according to days of survival (36 samples per group) and used our pipeline to calculate the isoforms that best separate these groups within each phenotype. The discriminant isoforms in the invasive phenotype (MITF–) were enriched for multiple cancer hallmarks, whereas the proliferative phenotype (MITF+) presented enrichment only for activation of KRAS signaling, which does not appear in the invasive phenotype (Additional file 2: Figure S5b). We then built models of survival for each subset independently using LMTs (Additional file 9). Cross-validation yielded for MITF+ (34 isoforms) and MITF– (46 isoforms) accuracies of AUC = 0.854 and 0.896, respectively (Fig. 4d and Additional file 2: Figure S5e). Notably, the MITF– model includes a transcript isoform for the MAP Kinase-Activating Death Domain gene MADD (Fig. 4e), which does not change expression at the gene level. MADD is a cancer driver and it was shown before that expression of isoforms that skip exon 16 has anti-apoptotic effects . Interestingly, the PSI of the MADD isoform that skips exon 16 is higher in the group with worse prognosis, suggesting that the anti-apoptotic function of MADD is related to worse prognosis in invasive melanoma. Taken together, our results provide evidence of distinct transcript abundance patterns linked to melanoma phenotypes and survival.
We described the first systematic analysis of the potential of transcript relative abundances to determine stage and clinical outcome in multiple solid tumors. We derived novel molecular signatures for 12 different tumor types that can separate tumors according to clinical stage or metastatic status in retrospective datasets. Importantly, a blind test on patients with unknown stage or metastatic status can separate patients according to survival. Moreover, transcript isoforms provide better accuracies than local alternative splicing events and can describe more complex changes in exon-intron structures. Although a multi-cancer signature of clinical outcome based on gene expression has been proposed , our results argue against a generic transcript-based signature for all tumors types. Rather, transcript isoform changes appear linked to tumor-type specific processes, with several of them related to MYC activity, in concordance with recent findings .
We observed a widespread association between transcript isoform changes and expression changes in T- and S-models across tumor types and around 60 % of the genes with DE in all models correspond to KIRC and KIRP, indicating that transcript and gene changes are tightly coupled during progression of these tumors. In contrast, this association is low or absent for most tumor types for M- and N-models. The blind test showing that patients with predicted metastasis or late N-stage have a worse prognosis was for tumor types for which none of these models show gene expression differences. The value of the transcript signatures is further highlighted when compared to known and newly derived gene expression signatures or with mixed models combining gene expression and transcript abundances. These results indicate that transcripts signatures provide information independent from gene expression to describe tumor progression and especially in metastasis and lymph node invasion.
We also extracted signatures for specific tumor subtypes in breast cancer and melanoma. We reported many significant transcript isoform changes between breast tumors according to ER expression and between melanoma samples according to MITF expression. Additionally, we observed a widespread association between transcript isoform and expression changes in relation to ER expression and according to MITF expression. An interesting possibility is that the activity of these two transcription factor genes could trigger expression and transcript isoform changes in the same genes in these tumors, pointing to new mechanisms of gene regulation worth investigating further. We further derived transcript signatures of stage independently in each sample subset that involved different genes, thereby highlighting the relevance of determining the transcriptome repertoire in tumor samples to derive accurate molecular signatures of tumor progression.
We observed partial reproducibility of the discriminant isoforms in experiments using cell lines. Transcriptional differences between cell lines and tumor tissues are thought to stem from the loss of the stromal and immune components by cells in culture . Our analyses discard an association between the transcript signatures and the composition of stromal and immune cells in the tissue samples. It could be possible that part of the signatures reflects the interaction of tumor cells with their environment in tissues samples, which would be undetectable in cell lines. Our results support the notion that phenotypic states of tumor cells, like invasiveness, may be reflected on the relative abundance of transcript isoforms and may be partly triggered by external cues, such as inflammation or metabolic stress . On the other hand, the observed commonalities between tumor cells and tissues suggest that some of these alterations could be investigated further using cell lines.
The clinical validity of our findings remains to be tested. Our cross-fold validation in retrospective datasets shows good accuracies in general for transcript signatures and, in particular, isoform-based M- and N-models are generally more accurate compared to gene expression models on the same datasets. Additionally, we have shown that predicted metastatic and late lymph-node involvement in non-annotated patients is associated with worse prognosis. These results thus suggest that isoform models may be potentially useful to indicate metastasis or lymph node invasion. To conclusively establish the validity of these signatures and their value to improve current methods of stage determination, it would be necessary to perform more validations on independent cohorts and to verify that they provide stage information before it is visible by other means in prospective studies. However, further studies are currently hampered by the scarcity of large enough RNA-seq datasets with clinical annotation comparable to TCGA .
Our analyses reveal that transcript isoforms hold useful information to build signatures of stage and clinical outcome independently of gene expression. Transcript isoform signatures appear especially relevant to determine lymph node invasion and metastasis. We anticipate that these and similar models may become relevant to understand the progression of tumors beyond DNA and gene expression alterations, thereby complementing current molecular approaches in precision cancer medicine.
Processed RNA-seq data from TCGA (https://gdc.nci.nih.gov/) was compiled for 12 different tumor types: BRCA, COAD, HNSC, KICH, KIRC, KIRP, LUAD, LUSC, PRAD, SKCM, THCA, and OV. The abundance of every transcript per sample was calculated in transcripts per million (TPM) from the transcript-estimated read counts and the isoform lengths. Genes were defined to be a set of transcripts that overlap in the same genomic locus and strand and share at least one splice-site (Additional file 1). A gene TPM was defined as the sum of TPMs for all transcripts in the gene. The relative abundance of each isoform (PSI), was calculated by normalizing the isoform TPM to the gene TPM. Only genes with a minimum TPM of 0.1 were considered. Additionally, we used RNA-seq data from the knockdown of ESR1 and controls in MCF7 cells (GSE53533) , from metastatic melanoma cells (SKMel147) and melanocytes (GSE68221) , and from the knockdown of MITF and controls using non-metastatic melanoma cells (Mel505) (GSE61967) . For each sample, transcript abundances were calculated with Sailfish . Relative abundances (PSI) of transcripts were calculated as above and the ΔPSI values between conditions were calculated as the difference between conditions of the mean values from the replicates. Alternative splicing events and their PSI values were obtained from .
Clinical stage and survival information for patients was obtained from TGCA. We used the available annotation for the TNM staging system (www.cancerstaging.org/), where T followed by a number (1–4) describes the size of the tumor; N followed by a number (1–3) describes spread to lymph nodes according to number and distance; and M followed by 1 or 0 indicates whether the tumor has metastasized or not, respectively. We also considered the numbered stage annotation (S), which goes from 0 to 4, with each number corresponding approximately to a combination of the TNM numbers. When any of the stages were subdivided, only the label of the common class was included (e.g. T1a, T1b, and T1c were considered as T1). Only patients with defined stage were used to build the predictive models.
Selection of relevant features
Only isoforms and events with a difference in mean relative abundance (PSI) of at least 0.1 in absolute value between the compared patient subgroups were considered to calculate discriminant isoforms. To obtain discriminant genes, those with log-fold change of the mean gene TPM values between the two groups greater than 2 were considered. Next, a subsampling approach was used to compare two patient groups through 100 iterations, by extracting the same number of samples from each group randomly from the input dataset, using a minimum of 10 samples per group. For pooled tumor types, the same number of samples per tumor type was selected at each iteration step. At each iteration step, three different univariate discriminant measures were applied (see below), and a permutation of the group labels was performed and the univariate measures re-calculated. After 100 iterations, and for each univariate measure, two distributions of 100 points each are produced for each transcript, corresponding to the observed and expected values. Transcripts with a positive difference of the means of the two distributions for all three measures were considered discriminant and were kept for further analysis.
We applied the following information-based measures in the subsampling: information gain (IG), gain ratio (GR), and symmetrical uncertainty (SU). IG is defined as the mutual information between the group labels of the training set S and the values of a feature (or attribute) A, e.g. an isoform: IG(S,A) = MI(S,A) = H(S) - H(S|A), where H(S) is Shannon’s entropy according to the two sample classes and H(S|A) is the conditional entropy of S with respect to the attribute A. GR is the mutual information of the group labels and the attribute, normalized by the entropy contribution from the proportions of the samples according to the partitioning by the attribute: GR(S,A) = MI(S,A)/H(A). Finally, SU provides a symmetric measurement of feature correlation with the labels and it compensates possible biases from the other two measures: SU(S,A) = 2 · MI(S,A)/(H(S) + H(A)) . The group labels are the clinical stages (early, late), survival groups (low, high), or phenotype group (invasive, proliferative); and the attribute values are the PSI values for transcript isoforms or alternative splicing events or the gene TPM values for gene expression analyses. The continuous PSI or TPM values were discretized as previously described .
Cancer hallmarks and drivers
Enrichment analysis of the 50 cancer hallmarks from the Molecular Signatures Database v4.0  was performed with the discriminant isoforms. For each hallmark, Fisher’s exact test was performed with the genes with selected isoforms using as controls genes expressed (TPM >0.1) and with multiple transcripts. A Benjamini–Hochberg correction was applied and only cases with a false discovery rate <0.05 were kept. Known and predicted cancer drivers were obtained as described in .
Transcript isoforms that showed a positive difference between the means of the 100 observed and the 100 randomized values for all three univariate measures (IG, GR, SU) were analyzed with a Correlation Feature Selection (CFS) . This selects transcripts with similar discriminating power but lower redundancy among them , thereby mitigating the problem of overfitting. This was repeated for each comparison between clinical stages, survival groups, or tumor subtypes. Using the selected transcript isoforms, an LMT was built with Rweka . LMTs are classification trees with logistic regression functions at the leaves. The accuracy of the classifiers was evaluated using the AUC. Additionally, we considered the area under the precision-recall curve (PRC). AUC and PRC take values between 0 (worst prediction) and 1 (best prediction). These values were estimated for each classifier through a tenfold cross-validation, repeated 100 times. The same approach was used for gene, event, and mixed models. To apply known gene expression signatures to our sample groups we used robust Z-scores per gene and per sample as described before . These values were then used for the genes in various signatures [21, 26–28]. As before, accuracies were estimated using a tenfold cross-validation to calculate AUC and PRC values.
For samples without stage annotation, which were not used to build the models, we predicted the missing stage (early/late) or metastatic state, using the corresponding model for the same tumor type. These newly predicted samples were then aggregated per clinical class according to early and late, or metastatic and non-metastatic, to test the survival differences between groups. The blind test was performed using only those tumor types that already showed significant differences in the survival between early and late stages for the annotated samples (Table 2). This analysis was not performed for T-models, as all samples had a T annotation.
Differential expression analysis
We performed DE analysis for all genes between the different groups considered in this analysis, using the same method as described previously . Genes were considered differentially expressed if the absolute value of the log2-fold change was greater than 0.5 and corrected p value < 0.05. Results can be found in Additional files 5 and 9.
Survival curves were calculated with the Kaplan–Meier method and compared between patient subsets using a Cox proportional hazards regression model . Survival was measured as date of death minus collection date for deceased patients and as last contact date minus collection date for the other patients.
Stromal and immune cell content analysis
To estimate a stromal and immune signature for a set of samples from a tumor type, we collected a list of stromal and immune signature genes based on . We transformed the RNA-Seq by Expectation-Maximization (RSEM) read counts of these two gene lists into a gene set score using gene set variation analysis (GSVA)  for each sample. Using the resulting scores per sample, we then calculated the Pearson correlations of the stromal and immune GSVA scores with the transcript isoform PSIs using all tumor samples, including intermediate stages.
AUC, area under the ROC curve; CFS, correlation feature selection; ER, estrogen receptor; GR, gain ration; IG, information gain; LMT, logistic model tree; PRC, area under the precision-recall curve; PSI, percent/proportion spliced in; ROC, receiver operating characteristic; SU, symmetrical uncertainty
We would like to thank Victor Moreno, Jun Yokota, Rubén Pio, Angel Rubio, and Luis Montuenga for useful discussions and their comments on earlier versions of the manuscript. This work was supported by grants BIO2014-52566-R and Consolider RNAREG (CSD2009-00080) from the MINECO (Spanish Government) and FEDER, by AGAUR (2014-SGR1121) and by the Sandra Ibarra Foundation for Cancer (FSI2013).
EE proposed and supervised the study, JLT carried out work. AP and ES contributed with some of the software components and processed datasets. JLT and EE wrote the paper with inputs from ES. All authors read and approved the manuscript.
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Sobin LH. TNM: evolution and relation to other prognostic factors. Semin Surg Oncol. 2003;21:3–7.View ArticlePubMedGoogle Scholar
- Dancey JE, Bedard PL, Onetto N, Hudson TJ. The genetic basis for cancer treatment decisions. Cell. 2012;148(3):409–20.View ArticlePubMedGoogle Scholar
- Teschendorff AE, Miremadi A, Pinder SE, Ellis IO, Caldas C. An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. Genome Biol. 2007;8(8):R157.View ArticlePubMedPubMed CentralGoogle Scholar
- Ottewell PD, O’Donnell L, Holen I. Molecular alterations that drive breast cancer metastasis to bone. Bonekey Rep. 2015;4:643.View ArticlePubMedPubMed CentralGoogle Scholar
- Suvà ML, Riggi N, Bernstein BE. Epigenetic reprogramming in cancer. Science. 2013;339(6127):1567–70.View ArticlePubMedGoogle Scholar
- Eschrich S, Yang I, Bloom G, Kwong KY, Boulware D, Cantor A, Coppola D, Kruhøffer M, Aaltonen L, Orntoft TF, Quackenbush J, Yeatman TJ. Molecular staging for survival prediction of colorectal cancer patients. J Clin Oncol. 2005;23(15):3526–35.Google Scholar
- Taherian-Fard A, Srihari S, Ragan MA. Breast cancer classification: linking molecular mechanisms to disease prognosis. Brief Bioinform. 2015;16(3):461–74.View ArticlePubMedGoogle Scholar
- Okayama H, Schetter AJ, Ishigame T, Robles AI, Kohno T, Yokota J, Takenoshita S, Harris CC. The expression of four genes as a prognostic classifier for stage I lung adenocarcinoma in 12 independent cohorts. Cancer Epidemiol Biomarkers Prev. 2014;23(12):2884–94.Google Scholar
- Sadanandam A, Lyssiotis CA, Homicsko K, Collisson EA, Gibb WJ, Wullschleger S, Ostos LC, Lannon WA, Grotzinger C, Del Rio M, Lhermitte B, Olshen AB, Wiedenmann B, Cantley LC, Gray JW, Hanahan D. A colorectal cancer classification system that associates cellular phenotype and responses to therapy. Nat Med. 2013;19(5):619–25.Google Scholar
- Gonzàlez-Porta M, Frankish A, Rung J, Harrow J, Brazma A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 2013;14(7):R70.View ArticlePubMedPubMed CentralGoogle Scholar
- Sebestyén E, Zawisza M, Eyras E. Detection of recurrent alternative splicing switches in tumor samples reveals novel signatures of cancer. Nucleic Acids Res. 2015;43(3):1345–56.View ArticlePubMedPubMed CentralGoogle Scholar
- Vardabasso C, Gaspar-Maia A, Hasson D, Pünzeler S, Valle-Garcia D, Straub T, et al. Histone variant H2A.Z.2 mediates proliferation and drug sensitivity of malignant melanoma. Mol Cell. 2015;59(1):75–88.View ArticlePubMedPubMed CentralGoogle Scholar
- Laurette P, Strub T, Koludrovic D, Keime C, Le Gras S, Seberg H, Van Otterloo E, Imrichova H, Siddaway R, Aerts S, Cornell RA, Mengus G, Davidson I. Transcription factor MITF and remodeller BRG1 define chromatin organisation at regulatory elements in melanoma cells. Elife. 2015;24:4.Google Scholar
- Azuma N, Tadokoro K, Asaka A, Yamada M, Yamaguchi Y, Handa H, Matsushima S, Watanabe T, Kohsaka S, Kida Y, Shiraishi T, Ogura T, Shimamura K, Nakafuku M. The Pax6 isoform bearing an alternative spliced exon promotes the development of the neural retinal structure. Hum Mol Genet. 2005;14(6):735–45.Google Scholar
- Adesso L, Calabretta S, Barbagallo F, Capurso G, Pilozzi E, Geremia R, Delle Fave G, Sette C. Gemcitabine triggers a pro-survival response in pancreatic cancer cells through activation of the MNK2/eIF4E pathway. Oncogene. 2013;32(23):2848–57.Google Scholar
- Metz R, Smith C, DuHadaway JB, Chandler P, Baban B, Merlo LM, Pigott E, Keough MP, Rust S, Mellor AL, Mandik-Nayak L, Muller AJ, Prendergast GC. IDO2 is critical for IDO1-mediated T-cell regulation and exerts a non-redundant function in inflammation. Int Immunol. 2014;26(7):357–67.Google Scholar
- Prinos P, Garneau D, Lucier J-F, Gendron D, Couture S, Boivin M, Elela SA. Alternative splicing of SYK regulates mitosis and cell survival. Nat Struct Mol Biol. 2011;18(6):673–9.Google Scholar
- Yoshihara K, Shahmoradgoli M, Martínez E, Vegesna R, Kim H, Torres-Garcia W, Treviño V, Shen H, Laird PW, Levine DA, Carter SL, Getz G, Stemke-Hale K, Mills GB, Verhaak RG. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612.Google Scholar
- Pino MS, Balsamo M, Di Modugno F, Mottolese M, Alessio M, Melucci E, et al. Human Mena + 11a isoform serves as a marker of epithelial phenotype and sensitivity to epidermal growth factor receptor inhibition in human pancreatic cancer cell lines. Clin Cancer Res. 2008;14:4943–50.View ArticlePubMedPubMed CentralGoogle Scholar
- Aran D, Sirota M, Butte AJ. Systematic pan-cancer analysis of tumour purity. Nat Commun. 2015;6:8971. doi:10.1038/ncomms9971.View ArticlePubMedPubMed CentralGoogle Scholar
- Schell MJ, Yang M, Missiaglia E, Delorenzi M, Soneson C, Yue B, Nebozhyn MV, Loboda A, Bloom G, Yeatman TJ. A composite gene expression signature optimizes prediction of colorectal cancer metastasis and outcome. Clin Cancer Res. 2016;22(3):734–45.Google Scholar
- De Sousa E, Melo F, Wang X, Jansen M, Fessler E, Trinh A, de Rooij LP, de Jong JH, de Boer OJ, van Leersum R, Bijlsma MF, Rodermond H, van der Heijden M, van Noesel CJ, Tuynman JB, Dekker E, Markowetz F, Medema JP, Vermeulen L. Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nat Med. 2013;19(5):614–8.Google Scholar
- Choi W, Porten S, Kim S, Willis D, Plimack ER, Hoffman-Censits J, Roth B, Cheng T, Tran M, Lee IL, Melquist J, Bondaruk J, Majewski T, Zhang S, Pretzsch S, Baggerly K, Siefker-Radtke A, Czerniak B, Dinney CP, McConkey DJ. Identification of distinct basal and luminal subtypes of muscle-invasive bladder cancer with different sensitivities to frontline chemotherapy. Cancer Cell. 2014;10;25(2):152–65.Google Scholar
- Caizzi L, Ferrero G, Cutrupi S, Cordero F, Ballaré C, Miano V, Reineri S, Ricci L, Friard O, Testori A, Corà D, Caselle M, Di Croce L, De Bortoli M. Genome-wide activity of unliganded estrogen receptor-α in breast cancer cells. PNAS. 2014;111(13):4892–7.Google Scholar
- Venables JP, Klinck R, Koh C, Gervais-Bird J, Bramard A, Inkel L, Durand M, Couture S, Froehlich U, Lapointe E, Lucier JF, Thibault P, Rancourt C, Tremblay K, Prinos P, Chabot B, Elela SA. Cancer-associated regulation of alternative splicing. Nat Struct Mol Biol. 2009;16(6):670–6.Google Scholar
- Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004;351(27):2817–26.Google Scholar
- van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–6.Google Scholar
- Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, Quackenbush JF, Stijleman IJ, Palazzo J, Marron JS, Nobel AB, Mardis E, Nielsen TO, Ellis MJ, Perou CM, Bernard PS. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27(8):1160–7.Google Scholar
- TCGA. Cancer Genome Atlas Network. Genomic Classification of Cutaneous Melanoma. Cell. 2015;161(7):1681–96.View ArticleGoogle Scholar
- Hoek KS, Goding CR. Cancer stem cells versus phenotype-switching in melanoma. Pigment Cell Melanoma Res. 2010;23:746–59.View ArticlePubMedGoogle Scholar
- Li FZ, Dhillon AS, Anderson RL, McArthur G, Ferrao PT. Phenotype switching in melanoma: implications for progression and therapy. Front Oncol. 2015;5:31.View ArticlePubMedPubMed CentralGoogle Scholar
- Hoek KS, Eichhoff OM, Schlegel NC, Döbbeling U, Kobert N, Schaerer L, Hemmi S, Dummer R. In vivo switching of human melanoma cells between proliferative and invasive states. Cancer Res. 2008;68(3):650–6.Google Scholar
- Babaei-Jadidi R, Li N, Saadeddin A, Spencer-Dene B, Jandke A, Muhammad B, Ibrahim EE, Muraleedharan R, Abuzinadah M, Davis H, Lewis A, Watson S, Behrens A, Tomlinson I, Nateri AS. FBXW7 influences murine intestinal homeostasis and cancer, targeting Notch, Jun, and DEK for degradation. J Exp Med. 2011;208(2):295–312.Google Scholar
- Chiaverini C, Beuret L, Flori E, Busca R, Abbe P, Bille K, Bahadoran P, Ortonne JP, Bertolotto C, Ballotti R. Microphthalmia-associated transcription factor regulates RAB27A gene expression and controls melanosome transport. J Biol Chem. 2008;283:12635–42.Google Scholar
- Lefave CV, Squatrito M, Vorlova S, Rocco GL, Brennan CW, Holland EC, et al. Splicing factor hnRNPH drives an oncogenic splicing switch in gliomas. EMBO J. 2011;30:4084–97.View ArticlePubMedPubMed CentralGoogle Scholar
- Martinez-Ledesma E, Verhaak RG, Treviño V. Identification of a multi-cancer gene expression biomarker for cancer clinical outcomes using a network-based algorithm. Sci Rep. 2015;5:11966.View ArticlePubMedGoogle Scholar
- Koh CM, Bezzi M, Low DH, Ang WX, Teo SX, Gay FP, Al-Haddawi M, Tan SY, Osato M, Sabò A, Amati B, Wee KB, Guccione E. MYC regulates the core pre-mRNA splicing machinery as an essential step in lymphomagenesis. Nature. 2015;523(7558):96–100.Google Scholar
- Vincent KM, Findlay SD, Postovit LM. Assessing breast cancer cell lines as tumour models by comparison of mRNA expression profiles. Breast Cancer Res. 2015;17:114.View ArticlePubMedPubMed CentralGoogle Scholar
- Gillies RJ, Verduzco D, Gatenby RA. Evolutionary dynamics of carcinogenesis and why targeted therapy does not work. Nat Rev Cancer. 2012;12(7):487–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Rubin MA. Health: Make precision medicine work for cancer care. Nature. 2015;520(7547):290–1.View ArticlePubMedGoogle Scholar
- Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 2014;32(5):462–4.View ArticlePubMedPubMed CentralGoogle Scholar
- Sebestyén E, Singh B, Miñana B, Pagès A, Mateo M, Pujana MA, Valcarcel J, Eyras E. Large-scale analysis of genome and transcriptome alterations in multiple tumors unveils novel cancer-relevant splicing networks. Genome Res. 2016;26(6):732–44.Google Scholar
- Hall M. Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. In: Langley P, editor. Proceedings of the Seventeenth International Conference on Machine Learning. Stanford, CA. San Francisco: Morgan Kaufmann; 2000. p. 359-366.Google Scholar
- Fayyad UM, Irani KB. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the thirteen joint conference of Artificial Intelligence, Chambery, France. San Francisco: Morgan Kaufmann; 1993. p. 1022-27.Google Scholar
- Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015;1(6):417–25.View ArticlePubMedGoogle Scholar
- Landwehr N, Hall M, Frank E. Logistic model trees. Mach Learn. 2005;59(1-2):161–205.View ArticleGoogle Scholar
- Cox DR. Regression models and life-tables. J R Stat Soc Ser B. 1972;34(2):187–220.Google Scholar
- Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:7.View ArticlePubMedPubMed CentralGoogle Scholar