Sample-level enrichment analysis unravels shared stress phenotypes among multiple cancer types
© Gundem and Lopez-Bigas; licensee BioMed Central Ltd. 2012
Received: 10 June 2011
Accepted: 29 March 2012
Published: 29 March 2012
Skip to main content
© Gundem and Lopez-Bigas; licensee BioMed Central Ltd. 2012
Received: 10 June 2011
Accepted: 29 March 2012
Published: 29 March 2012
Adaptation to stress signals in the tumor microenvironment is a crucial step towards carcinogenic phenotype. The adaptive alterations attained by cells to withstand different types of insults are collectively referred to as the stress phenotypes of cancers. In this manuscript we explore the interrelation of different stress phenotypes in multiple cancer types and ask if these phenotypes could be used to explain prognostic differences among tumor samples.
We propose a new approach based on enrichment analysis at the level of samples (sample-level enrichment analysis - SLEA) in expression profiling datasets. Without using a priori phenotypic information about samples, SLEA calculates an enrichment score per sample per gene set using z-test. This score is used to determine the relative importance of the corresponding pathway or module in different patient groups.
Our analysis shows that tumors significantly upregulating genes related to chromosome instability strongly correlate with worse prognosis in breast cancer. Moreover, in multiple tumor types, these tumors upregulate a senescence-bypass transcriptional program and exhibit similar stress phenotypes.
Using SLEA we are able to find relationships between stress phenotype pathways across multiple cancer types. Moreover we show that SLEA enables the identification of gene sets in correlation with clinical characteristics such as survival, as well as the identification of biological pathways/processes that underlie the pathology of different cancer subgroups.
Complex genetic diseases such as cancer are characterized by phenotypic heterogeneity reflected at the molecular level in the form of variations in the activity of certain signaling pathways. In support of this notion, recent cancer genome studies point to the idea that distinct types of alterations in different genes tend to accumulate in pathways central to the control of cell growth and cell fate determination [1–4]. It has been proposed that expression signatures indicative of activity status of pathways can be used to define specific molecular phenotypes that characterize individual tumors . A number of methods have been developed to analyze the transcriptomic changes specific to tumor samples and identify patterns of pathway deregulation that differentiate distinct patient subgroups [6–12]. These methodologies are based on the idea that analysis of pathway-level differences among samples could have an advantage of reflecting the true oncogenic phenotypes achieved through consistent expression of a set of genes compared with the acute expression of a single gene. However, each of these methods has been designed to address specific questions and, thus, have limited use for a more general application. For instance, that of Xia and Wishart is specific to metabolomic data , and that of Bild et al.  requires cell line perturbation data in a platform comparable to that of the tumor data. The methodologies developed by Edelman et al. , Verhaak et al.  and Yi et al.  require a priori information of phenotypic classification of the samples. In this manuscript, we propose a new methodology, sample-level enrichment analysis (SLEA), that overcomes these limitations and has a more general use for enrichment analysis (EA) at the level of samples. The pathways or modules are represented as lists of genes, which can be obtained from literature or online repositories such as Gene Ontology, as well as determined through other high-throughput assays. Without using a priori phenotypic information about the samples, SLEA calculates an enrichment score per sample per gene set using z-test. This score is used to determine the relative importance of the corresponding module or pathway in different patient groups. We use this approach to test the hypothesis described in the following paragraph.
It has been proposed that, during the progression of cancer, the capacity of cancer cells to survive in the hypoxic and nutrient-deprived tumor microenvironment is a crucial step towards malignancy . Adaptation to survival under these stress signals can override normal cellular stress responses, leading to the persistence and progression of the carcinogenic phenotype. Different types of stress insults, such as senescence-induced, metabolic, and oxidative, represent a common set of oncogenesis-associated cellular barriers that cancer cells must tolerate through stress support pathways . For example, to overcome the senescence barrier, malignant cells have been proposed to deregulate proteins in senescence-mediating pathways such as Rb signaling. These alterations are collectively referred to as the stress phenotypes of cancers . In this study, we asked if stress phenotypes of tumor samples could be used to explain their prognostic differences. To this end, we used publicly available gene expression profiles of patient cohorts of different types of cancers and gene signatures related to different stress phenotypes. We performed EA in each tumor sample in each patient cohort in order to detect differentially enriched modules. We show that EA with a chromosomal instability (CIN)-related gene signature has prognostic power in some cancer types but not in others. In all cancer types, however, patient sup-groups positively enriched for the same gene set shared key properties related to their stress phenotypes, indicating dependence of these tumors in certain stress support pathways.
Tumor profiling data sets used in the study
Ivshina et al. 2006 
Pawitan et al. 2005 
Wang et al. 2005 
Kim et al. 2010 
TCGA 2008 
Tothill et al. 2008 
Crijns et al. 2009 
TCGA, 2011 
TCGA: ovarian serous
Bild et al. 2006 
Raponi et al. 2006 
Smith et al. 2010 
List of modules extracted from expression data
Number of genes
A signature of genes upregulated in chromosomal instability and predictive of clinical outcome
Rb-E2F interaction network built computationally using protein interaction databases
Down in senescence bypass
Genes downregulated in fibroblasts that bypass RAS-induced senescence
Up in senescence bypass
Genes upregulated in fibroblasts that bypass RAS-induced senescence
Down in senescence
Genes downregulated in fibroblasts in replicative senescence
Up in senescence
Genes upregulated in fibroblasts in replicative senescence
Pujana ATM network
Computational network around Atm built using expression profiling and functional and genomic data
Pujana BRCA1 network
Computational network around Brca1 built using expression profiling and functional and genomic data
Pujana BRCA2 network
Computational network around Brca2 built using expression profiling and functional and genomic data
Pujana CHEK2 network
Computational network around Chek2 built using expression profiling and functional and genomic data
Pujana XPRSS network
Computational network around Xprss built using expression profiling and functional and genomic data
Bortezomib treatment DOWN
Genes downregulated in cancer cells treated with bortezomib
Bortezomib treatment UP
Genes upregulated in cancer cells treated with bortezomib
Eeyarestatin treatment DOWN
Genes downregulated in cancer cells treated with eeyarestatin
Eeyarestatin treatment UP
Genes upregulated in cancer cells treated with eeyarestatin
Downreg in PI3K-hyper
Genes downregulated in Rb-deficient breast cancer cell line treated with rapamycin
Upreg in Pi3K-hyper
Gene upregulated in hormone therapy-resistant breast cancer
PTEN mutation signature
PTEN mutation signature upregulated in PTEN-mutant breast cancer
Up in TSC1 mTORC1
Genes upregulated in Tsc1-/- mutant versus WT MEFs
Down in TSC1 mTORC1
Genes downregulated in Tsc1-/- mutant versus WT MEFs
We used the 'coxph' function from the 'survival' package of R  (Figure 1c). In survival analysis with the CIN signature, the survival data of the samples with positive enrichment for the signature (positive z-scores with corrected P-value < 0.05) are compared to all the rest of the samples (non-enriched) in the dataset. For the survival analysis related to upregulation of the two-gene signature (CDKN2A and MKI67) , we compare the samples with an expression value greater than the standard deviation of the row for both genes to all the rest of the samples in the dataset.
To facilitate the representation and interpretation of the results generated by our analyses, we created a web service using Onexus  that allows navigation of all the heat maps and details of the statistical results for each of the dataset and modules analyzed along with the datasets included in the analysis .
Some considerations of the SLEA approach as presented here are important to take into account. First, the z-test requires normality on data. Since SLEA uses the distribution of means of random sets of genes, due to the central limit theorem, even if the expression data do not follow normal distribution, the distribution of the sample mean is normal provided that the number of permutations is large (we use 10, 000 permutations). The distribution of the sample median, on the other hand, may not be normal, although for large numbers of permutations it is usually close to it. However, the median is a measure more robust to outliers; hence, we performed the same EAs with sample mean and median separately and compared the results. The z-scores obtained with the different test statistics are almost identical (r = 0.99) (Figure S2 in Additional file 1). We use the median for all the plots and results of EA shown in the manuscript.
The second important consideration is the robustness of SLEA with regard to changes in the cohort and how it is affected by the sizes of the datasets (that is, the number of samples included). To assess how this influences the results obtained with SLEA and to identify the number of samples under which our methodology works best, we devised a random sampling procedure (Figure S3 in Additional file 1). Using three datasets (Table 1), GSE4922 ([GEO:GSE4922]; breast cancer dataset with 289 tumor samples), TCGA-OV (ovarian cancer dataset with 521 samples) and GSE4573 ([GEO:GSE4573]; lung cancer dataset with 131 samples), we generated different populations of random datasets with the same number of samples. The sample size ranged from 11 to 201 with an increment of 10 for GSE4922 [GEO:GSE4922] and TCGA-OV datasets. For the smallest dataset, it was from 11 to 111 with an increment of 10. Each population contained 100 datasets producing a total of 2, 000 datasets for GSE4922 [GEO:GSE4922] and TCGA-OV and 1, 100 datasets for GSE4573 [GEO:GSE4573] (Figure S3 in Additional file 1). For each of those random datasets we performed median centering followed by the median z-test EA for the CIN signature. Next we performed correlations of the obtained z-scores for each pair of random datasets in each population and plotted box-and-whisker plots of correlation coefficients for each of the dataset sizes (Figure S3 in Additional file 1). This analysis shows that, for datasets with more than 71 samples, the correlations are always higher than 0.99 (Figure S4 in Additional file 1). We also did a t-test comparing the z-scores of all the samples in a population to the z-scores the same sample has in the population with the greatest number of samples (201 samples for GSE4922 [GEO:GSE4922] and TCGA-OV and 111 for GSE4573 [GEO:GSE4573]). This analysis shows that the proportion of samples that are significantly different (t-test corrected P-value < 0.05) is less than 0.05 for sample sizes greater than 81. In summary, we can conclude that SLEA results are highly robust for datasets with 81 or more samples.
In this study, we aim to demonstrate the use of the SLEA approach by detecting the biological processes underlying the differences between clinically distinct patient subgroups. To do this, we performed SLEA using Gitools  for 11 cancer datasets with various relevant gene sets (Tables 1 and 2). Gitools provides two main advantages for this type of analysis, i) one single run of Gitools is enough to perform EA for a large number of samples and modules, and ii) the results are shown in the form of an interactive heat map, which facilitates the comparison between samples and gene sets, and the interpretation of the results. For the sake of clarity and space considerations, we focus on the results for one breast cancer dataset (GSE4922 [GEO:GSE4922]; Table 1) and we point to similarities with and differences from the rest of the datasets, for both breast and other cancer types. The results of the 11 datasets along with the statistical details are accessible at the web service  and some results are shown as supplementary figures in Additional file 1.
Senescence is an important tumor suppressive barrier to the progression of cancer [37–41]. Molecular markers of senescence are observed in pre-malignant lesions while they are lost in the malignant counterparts [37–41]. Prompted by this idea, we set out to compare the CIN-positive tumors to the non-enriched tumors in terms of their expression of senescence-related transcriptional programs. We performed EA with genes that are differentially regulated in fibroblasts undergoing replicative senescence (with the modules named 'down and up in senescence')  and in fibroblasts that bypass RAS-induced senescence (with the modules named 'down and up in senescence-bypass') ). Indeed, in all breast cancer datasets, the primary tumors with the CIN signature were enriched for the senescence-bypass-related transcriptional program while they exhibited expression patterns opposite to that observed during senescence (Figure 2; online supporting material ). Furthermore, we checked the expression level of the genes CDKN2A and MKI67, biomarkers indicative of an abrogated response to senescence-inducing stimulus . These markers were previously shown to indicate compromised Rb signaling and predict subsequent tumor events in breast cancer patients diagnosed with ductal carcinoma in situ . Indeed, some of the CIN-positive tumors displayed concomitant over-expression of CDKN2A and MKI67 together with Rb targets CCNE1 and E2F3 (Figure 2; online supporting material ), indicating deregulation of the Rb pathway. As a better measure of Rb signaling status, we used a set of genes repressed by Rb-E2F (with the module name 'Rb-E2F genes') when Rb signaling is functional . EA with this gene signature confirmed that, although the overlap between the two signatures is low (Jaccard index = 0.22), CIN-positive breast tumors have positive enrichment for Rb-E2F targets, and thus have signs of compromised Rb signaling (Figure 2; online supporting material ). All these results indicate that CIN-positive tumors have activated transcriptional programs indicative of an abrogated response to senescence.
Finally, we compared the prognostic power of the CIN signature to that of concomitant overexpression of CDKN2A and MKI67 (positive normalized expression values for both genes in the same sample) . As seen in Figure 2 (Figure S5 in Additional file 1; online supporting material ), the CIN signature is more informative than the two-gene signature (smaller P-values). As many samples with upregulation of the CIN signature have p53 mutations, we sought to determine if the prognostic power of the CIN signature is independent of p53 mutation status. We performed survival analysis in the datasets with p53 mutation status information excluding the tumors with p53 mutations. Of 289 tumors, 189 had wild-type p53 in the GSE4922 dataset [GEO:GSE4922]. In breast cancer, enrichment with the CIN signature is strongly related to bad prognosis even among samples with wild-type p53, indicating that indeed the predictive power of this signature is independent of p53 mutation (Figure S6 in Additional file 1).
EA is an effective way to analyze the statistically significant gene sets obtained using high-throughput functional genomics data. In this work, we propose an alternative approach for the analysis of tumor genomics data to detect clinically relevant patient subgroups. Instead of finding genes differentially expressed between two groups, we identify differentially enriched modules by performing sample-level EA (SLEA). Our method does not require information related to phenotypic classification of samples and can directly take gene sets as input. Moreover, by comparing enrichment results with available clinical information, SLEA enables the understanding of pathways/processes that underlie the clinical phenotypes such as survival. We applied our methodology to test the prognostic power of a gene signature related to chromosomal instability and to study the prevalence of stress phenotypes in different patient subgroups defined by the expression of this gene signature. The tumors significantly upregulating this signature were strongly correlated with worse prognosis in the three breast cancer datasets studied, but not in other tumor types. In all cancer types, however, the tumors with positive enrichment for this gene signature displayed a transcriptional program pointing to evasion of the senescence barrier and particular stress phenotypes, indicating strong interdependencies between these different pathways and therapeutic vulnerabilities for the tumor.
Gene Expression Omnibus
mammalian target of rapamycin
sample-level enrichment analysis
We acknowledge funding from the Spanish Ministry of Science and Technology (grant number SAF2009-06954) and the Spanish National Institute of Bioinformatics (INB). GG is supported by a fellowship from AGAUR of the Catalonian Government.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.