Sample-level enrichment analysis unravels shared stress phenotypes among multiple cancer types

Background Adaptation to stress signals in the tumor microenvironment is a crucial step towards carcinogenic phenotype. The adaptive alterations attained by cells to withstand different types of insults are collectively referred to as the stress phenotypes of cancers. In this manuscript we explore the interrelation of different stress phenotypes in multiple cancer types and ask if these phenotypes could be used to explain prognostic differences among tumor samples. Methods We propose a new approach based on enrichment analysis at the level of samples (sample-level enrichment analysis - SLEA) in expression profiling datasets. Without using a priori phenotypic information about samples, SLEA calculates an enrichment score per sample per gene set using z-test. This score is used to determine the relative importance of the corresponding pathway or module in different patient groups. Results Our analysis shows that tumors significantly upregulating genes related to chromosome instability strongly correlate with worse prognosis in breast cancer. Moreover, in multiple tumor types, these tumors upregulate a senescence-bypass transcriptional program and exhibit similar stress phenotypes. Conclusions Using SLEA we are able to find relationships between stress phenotype pathways across multiple cancer types. Moreover we show that SLEA enables the identification of gene sets in correlation with clinical characteristics such as survival, as well as the identification of biological pathways/processes that underlie the pathology of different cancer subgroups.


Background
Complex genetic diseases such as cancer are characterized by phenotypic heterogeneity reflected at the molecular level in the form of variations in the activity of certain signaling pathways. In support of this notion, recent cancer genome studies point to the idea that distinct types of alterations in different genes tend to accumulate in pathways central to the control of cell growth and cell fate determination [1][2][3][4]. It has been proposed that expression signatures indicative of activity status of pathways can be used to define specific molecular phenotypes that characterize individual tumors [5]. A number of methods have been developed to analyze the transcriptomic changes specific to tumor samples and identify patterns of pathway deregulation that differentiate distinct patient subgroups [6][7][8][9][10][11][12]. These methodologies are based on the idea that analysis of pathway-level differences among samples could have an advantage of reflecting the true oncogenic phenotypes achieved through consistent expression of a set of genes compared with the acute expression of a single gene. However, each of these methods has been designed to address specific questions and, thus, have limited use for a more general application. For instance, that of Xia and Wishart is specific to metabolomic data [9], and that of Bild et al. [6] requires cell line perturbation data in a platform comparable to that of the tumor data. The methodologies developed by Edelman et al. [7], Verhaak et al. [8] and Yi et al. [10] require a priori information of phenotypic classification of the samples. In this manuscript, we propose a new methodology, samplelevel enrichment analysis (SLEA), that overcomes these limitations and has a more general use for enrichment analysis (EA) at the level of samples. The pathways or modules are represented as lists of genes, which can be obtained from literature or online repositories such as Gene Ontology, as well as determined through other high-throughput assays. Without using a priori phenotypic information about the samples, SLEA calculates an enrichment score per sample per gene set using z-test. This score is used to determine the relative importance of the corresponding module or pathway in different patient groups. We use this approach to test the hypothesis described in the following paragraph.
It has been proposed that, during the progression of cancer, the capacity of cancer cells to survive in the hypoxic and nutrient-deprived tumor microenvironment is a crucial step towards malignancy [13]. Adaptation to survival under these stress signals can override normal cellular stress responses, leading to the persistence and progression of the carcinogenic phenotype. Different types of stress insults, such as senescence-induced, metabolic, and oxidative, represent a common set of oncogenesis-associated cellular barriers that cancer cells must tolerate through stress support pathways [14]. For example, to overcome the senescence barrier, malignant cells have been proposed to deregulate proteins in senescence-mediating pathways such as Rb signaling. These alterations are collectively referred to as the stress phenotypes of cancers [14]. In this study, we asked if stress phenotypes of tumor samples could be used to explain their prognostic differences. To this end, we used publicly available gene expression profiles of patient cohorts of different types of cancers and gene signatures related to different stress phenotypes. We performed EA in each tumor sample in each patient cohort in order to detect differentially enriched modules. We show that EA with a chromosomal instability (CIN)-related gene signature has prognostic power in some cancer types but not in others. In all cancer types, however, patient sup-groups positively enriched for the same gene set shared key properties related to their stress phenotypes, indicating dependence of these tumors in certain stress support pathways.

Transcriptomic data
We collected 11 publicly available expression profiling datasets from the Gene Expression Omnibus (GEO) and TCGA data portal [1,6,[15][16][17][18][19][20][21][22][23] (Table 1). Each dataset consists of microarray expression data for primary tumors. We selected as datasets to include those that are on a single-channel platform, have survival information and contain more than 81 patients (see 'Robustness analysis' section below). The sample number varies from 111 to 766 across all datasets. Before EA, the data were pre-processed as follows (raw data were downloaded for all datasets). For Affymetrix data (9/11 datasets), CEL files were processed and normalized using the rma function in the 'affy' package [24] from R Bioconductor [25]. The result of normalization is log2-transformed absolute readings. For non-Affy experiments (2/11), expression data were normalized using the vsn normalization method from R Bioconductor [25]. After normalization, the input data were obtained by median-centering the expression value of each gene across all the samples (row median) and dividing the value by the standard deviation (row standard deviation). The expression value obtained in this step is a measure of how much a gene is expressed in a sample compared to all the other samples in the dataset. Hence, the heterogeneity and number of the tumor samples in the dataset affect the relative expression values. The stratification of the samples based on their enrichment patterns and the interpretation of this stratification, therefore, is sensitive to the clinical characteristics of the samples in the dataset. For example, the meaning of the median-centered expression value is different if the dataset includes normals in addition to cancer samples compared to if it includes tumor samples only. The selection of datasets should be done taking into account the type of question to be addressed. With this in mind, in our study, we include datasets that contain primary tumor samples only in order to answer the question of which modules/ pathways are differentially enriched among different groups of samples of the same tumor type. All datasets used are provided on the SLEA website [26].

Gene modules
Gene modules (gene sets) were collected from Gene Ontology [27], MSigDB [28] and the supplementary datasets of the indicated publications (Table 2). Using Gitools [29], we performed overlap analysis between the modules used. Some modules from Gene Ontology and MsigDB have high overlap (Jaccard index > 0.25) ( Figure  S1 in Additional file 1). We interpreted the results taking this into consideration. All modules used are provided on the SLEA website [26].

Sample-level enrichment analysis
EA for each sample in each dataset was performed using Gitools [29,30] (Figure 1b). Gitools is a java application for genomic data analysis and visualization the main distinctive feature of which is that data and results are represented using interactive heat maps. Among other tests, Gitools provides different statistical methods to assess the enrichment of gene modules in high-throughput genome-wide profiling data. The main advantage of Gitools for the type of analysis presented in this manuscript is that it can perform many EAs (one per sample and module in this case) in one single run and the results are provided in the form of interactive heat maps, which are useful to compare the results between different samples and different modules. Modules can be literature-based as well as consist of sets of genes obtained through analysis of other types of genomewide studies. In this study, we used the z-score method as described previously [31]. This method compares the mean (or median) expression value of genes in each module to a distribution of mean (or median) of 10, 000 random modules of the same size drawn from the expression values for the same sample. The result of this EA is a z-score, which is a measure of the difference between the observed and expected mean (or median) expression values for a gene set. The P-value related to each z-score is automatically corrected for multiple testing using the Benjamini-Hochberg method [32]. We define modules as 'positively enriched' in a sample if they have a positive z-score and a corrected P-value < 0.05, and 'non-enriched' otherwise. The results are visualized as heat maps of z-scores in Gitools, which is useful for the identification and interpretation of enrichment patterns among samples.

Survival analysis
We used the 'coxph' function from the 'survival' package of R [25] (Figure 1c). In survival analysis with the CIN signature, the survival data of the samples with positive enrichment for the signature (positive z-scores with corrected P-value < 0.05) are compared to all the rest of the samples (non-enriched) in the dataset. For the survival analysis related to upregulation of the two-gene signature (CDKN2A and MKI67) [33], we compare the samples with an expression value greater than the standard deviation of the row for both genes to all the rest of the samples in the dataset.

Web server
To facilitate the representation and interpretation of the results generated by our analyses, we created a web service using Onexus [34] that allows navigation of all the heat maps and details of the statistical results for each EP Figure 1 General schema of the approach. First, module and expression data repositories are created. (a) Left: gene modules were obtained from a number of sources such as Gene Ontology and MSigDB as well as from expression datasets listed in Table 2 of the dataset and modules analyzed along with the datasets included in the analysis [26].

Technical consideration of SLEA and robustness analysis
Some considerations of the SLEA approach as presented here are important to take into account. First, the z-test requires normality on data. Since SLEA uses the distribution of means of random sets of genes, due to the central limit theorem, even if the expression data do not follow normal distribution, the distribution of the sample mean is normal provided that the number of permutations is large (we use 10, 000 permutations). The distribution of the sample median, on the other hand, may not be normal, although for large numbers of permutations it is usually close to it. However, the median is a measure more robust to outliers; hence, we performed the same EAs with sample mean and median separately and compared the results. The z-scores obtained with the different test statistics are almost identical (r = 0.99) ( Figure S2 in Additional file 1). We use the median for all the plots and results of EA shown in the manuscript. The second important consideration is the robustness of SLEA with regard to changes in the cohort and how it is affected by the sizes of the datasets (that is, the number of samples included). To assess how this influences the results obtained with SLEA and to identify the number of samples under which our methodology works best, we devised a random sampling procedure ( Figure S3 in Additional file 1). Using three datasets (Table 1) Figure S3 in Additional file 1). For each of those random datasets we performed median centering followed by the median z-test EA for the CIN signature. Next we performed correlations of the obtained z-scores for each pair of random datasets in each population and plotted box-and-whisker plots of correlation coefficients for each of the dataset sizes ( Figure S3 in Additional file 1). This analysis shows that, for datasets with more than 71 samples, the correlations are always higher than 0.99 ( Figure S4 in Additional file 1). We also did a t-test comparing the z-scores of all the samples in a population to the z-scores the same sample has in the population with the greatest number of samples (201 samples for GSE4922 [GEO:GSE4922] and TCGA-OV and 111 for GSE4573 [GEO:GSE4573]). This analysis shows that the proportion of samples that are significantly different (t-test corrected P-value < 0.05) is less than 0.05 for sample sizes greater than 81. In summary, we can conclude that SLEA results are highly robust for datasets with 81 or more samples.

Results and discussion
In this study, we aim to demonstrate the use of the SLEA approach by detecting the biological processes underlying the differences between clinically distinct patient subgroups. To do this, we performed SLEA using Gitools [29] for 11 cancer datasets with various relevant gene sets (Tables 1 and 2). Gitools provides two main advantages for this type of analysis, i) one single run of Gitools is enough to perform EA for a large number of samples and modules, and ii) the results are shown in the form of an interactive heat map, which facilitates the comparison between samples and gene sets, and the interpretation of the results. For the sake of clarity and space considerations, we focus on the results for one breast cancer dataset (GSE4922 [GEO: GSE4922]; Table 1) and we point to similarities with and differences from the rest of the datasets, for both breast and other cancer types. The results of the 11 datasets along with the statistical details are accessible at the web service [26] and some results are shown as supplementary figures in Additional file 1.

Stratification of patient cohorts in breast cancer
Focusing on the three breast cancer datasets, we first aimed to stratify the tumors in each cohort by performing EAs with a CIN-related gene signature previously shown to predict clinical outcome in multiple tumor types [35]. In all the datasets, based on the EA results, we separated the tumors into two groups: positively enriched (positive z-scores with corrected P-value < 0.05) and non-enriched (all the rest of the samples that did not satisfy the criteria) (Figure 2; Figure S5 in Additional file 1; online supporting material [26]). Subsequent survival analysis showed that the first group had worse survival than the second group in all the breast cancer datasets analyzed (Figure 2b; Figure S5 in Additional file 1). Moreover, the tumors in the first group coincided with more aggressive subtypes of breast cancer (luminal B and basal-like) [36] (Figure S5 in Additional file 1) and p53 mutation carriers [36] (Figure 2a). These results show that our EA approach can be used to stratify patients with respect to a clinical property, in this case survival. We refer to the tumors with significant upregulation of the CIN signature as 'CIN-positive' in the rest of the manuscript.

CIN-positive tumors activate a senescence-bypass transcriptional program
Senescence is an important tumor suppressive barrier to the progression of cancer [37][38][39][40][41]. Molecular markers of senescence are observed in pre-malignant lesions while they are lost in the malignant counterparts [37][38][39][40][41]. Prompted by this idea, we set out to compare the CINpositive tumors to the non-enriched tumors in terms of their expression of senescence-related transcriptional programs. We performed EA with genes that are differentially regulated in fibroblasts undergoing replicative senescence (with the modules named 'down and up in senescence') [37] and in fibroblasts that bypass RASinduced senescence (with the modules named 'down and up in senescence-bypass') [42]). Indeed, in all breast cancer datasets, the primary tumors with the CIN signature were enriched for the senescence-bypass-related transcriptional program while they exhibited expression patterns opposite to that observed during senescence (Figure 2; online supporting material [26]). Furthermore, we checked the expression level of the genes CDKN2A and MKI67, biomarkers indicative of an abrogated response to senescence-inducing stimulus [33]. These markers were previously shown to indicate compromised Rb signaling and predict subsequent tumor events in breast cancer patients diagnosed with ductal carcinoma in situ [33]. Indeed, some of the CIN-positive tumors displayed concomitant over-expression of CDKN2A and MKI67 together with Rb targets CCNE1 and E2F3 (Figure 2; online supporting material [26]), indicating deregulation of the Rb pathway. As a better measure of Rb signaling status, we used a set of genes repressed by Rb-E2F (with the module name 'Rb-E2F genes') when Rb signaling is functional [43]. EA with this gene signature confirmed that, although the overlap between the two signatures is low (Jaccard index =  Finally, we compared the prognostic power of the CIN signature to that of concomitant overexpression of CDKN2A and MKI67 (positive normalized expression values for both genes in the same sample) [33]. As seen in Figure 2 ( Figure S5 in Additional file 1; online supporting material [26]), the CIN signature is more informative than the two-gene signature (smaller P-values). As many samples with upregulation of the CIN signature have p53 mutations, we sought to determine if the prognostic power of the CIN signature is independent of p53 mutation status. We performed survival analysis in the datasets with p53 mutation status information excluding the tumors with p53 mutations. Of 289 tumors, 189 had wild-type p53 in the GSE4922 dataset [GEO:GSE4922]. In breast cancer, enrichment with the CIN signature is strongly related to bad prognosis even among samples with wild-type p53, indicating that indeed the predictive power of this signature is independent of p53 mutation ( Figure S6 in Additional file 1).

Stress phenotypes of the CIN-positive tumors
Next we performed EA with all Gene Ontology biological process terms in order to identify the biological properties characterizing CIN-positive tumors. These tumors significantly downregulate genes related to processes such as 'cell communication' and 'wound healing' (Figure 3; online supporting material [26]). This is in agreement with previous observations showing that the upregulation of a wound response signature is inversely correlated with good prognosis [44].
On the other hand, some categories such as 'cellular response to DNA damage', 'protein folding' and   'translation' were significantly upregulated. We argue that this transcriptional program can be explained by non-oncogene addiction, which is defined as the dependence of cancer cells on stress support pathways that are not themselves tumorigenic [14]. Most of the differentially enriched Gene Ontology terms can be attributed to one of these stress support pathways: 'DNA damage and replicative stress', 'mitotic stress', 'proteotoxic stress' and 'metabolic stress' (Figure 4; Higher expression of these genes indicates that these tumors are dependent on the DNA damage response as explained by non-oncogene addiction. This observation also points to ideas for specialized therapeutic strategies for these aggressive tumors, which are mainly basal-like and luminal B, based on the possible addiction of these tumors to DNA repair pathways. Indeed, very recently, it was shown that combination therapy of iniparib (a poly (ADP-ribose) polymerase (PARP) inhibitor) and chemotherapy, without significant increased toxic effects, improved the clinical benefit and survival of patients with metastatic triple-negative breast cancer, a majority of which are also basal-like [46].

Dependence on proteotoxic stress mechanisms
We assessed the prevalence of proteotoxic stress mechanisms by performing an EA with sets of genes deregulated in cancer cell lines treated with bortezomib and eeyarestatin [47]. CIN-positive tumors significantly upregulated genes that increase in expression in response to both bortezomib, a proteasome inhibitor, and eeyarestatin, an inhibitor of endoplasmic reticulumassociated protein degradation (Figure 6; online supporting material [26]). At the gene level, these samples upregulated genes that are members of the chaperonincontaining complex and heat shock proteins. Of these genes, HSP90 complex is already a molecular target in cancer [48].
Dependence on phosphoinositide 3-kinase/Akt signaling CIN-positive tumors were also positively enriched for metabolism-related categories such as 'nucleotide metabolism', 'generation of precursor metabolites and energy', 'electron transport chain', 'ribosome biogenesis', and so on. Hence, we focused on a specific pathway that plays a crucial role in the regulation of cellular metabolism and its coupling to proliferation. We collected gene sets related to the phosphoinositide 3-kinase (PI3K)/Akt pathway and its downstream mammalian target of rapamycin (mTOR) signaling: 'genes deregulated in PI3Khyper-activated, hormone resistant cells' [49] (modules named 'upreg and downred in PI3K-hyper'), 'PTEN mutation signature' [50] and genes deregulated in TSC1 knockout cells ('upreg in downreg in TSC1-ko') [51]. Figure 7 shows that the transcriptional program of tumors with the CIN signature is enriched for hyperactivated PI3K signaling as well as for genes upregulated in PTEN mutant cells. mTOR signaling activates the expression of genes encoding nearly every step of glycolysis and the pentose phosphate pathway, as well as critical enzymes in the de novo synthesis of sterols, isoprenoids, and fatty acids [51]. We used modules of genes regulated by mTORC1, a molecular complex that contains mTOR [51], to check if indeed the CIN-positive tumors also have activation of processes downstream of mTOR. As expected, the genes upregulated by mTORC1 are also upregulated in these samples ( Figure  7; online supporting material [26]). mTORC1 promotes the expression of HIF1A [51]. In agreement with this, CIN-positive tumors overexpress HIF1A along with its target vascular endothelial growth factor (Figure 7;  online supporting material [26]). As mTORC1 has been shown to induce the transcription of genes involved in important metabolic pathways [51], we checked the mRNA levels of enzymes from the glycolysis and pentose phosphate pathway. Indeed, most of these enzymes are upregulated in CIN-positive tumor samples ( Figure  7; online supporting material [26]). Together these observations indicate that the CIN-positive tumors have activated signaling through mTOR. These results suggest two things. First, these tumors might be addicted to pathways related to metabolic stress in addition to DNA damage stress. If this is indeed the case, then, secondly, inhibitors of mTOR, such as rapamycin, might be useful for the treatment of these cancers. The observations in this and the previous section show that sample-level EA can help pinpoint pathway dependencies in different subgroups of tumors, which can be used to design rational therapeutic approaches specific to each group of patients.

Conclusions
EA is an effective way to analyze the statistically significant gene sets obtained using high-throughput functional genomics data. In this work, we propose an  alternative approach for the analysis of tumor genomics data to detect clinically relevant patient subgroups. Instead of finding genes differentially expressed between two groups, we identify differentially enriched modules by performing sample-level EA (SLEA). Our method does not require information related to phenotypic classification of samples and can directly take gene sets as input. Moreover, by comparing enrichment results with available clinical information, SLEA enables the understanding of pathways/processes that underlie the clinical phenotypes such as survival. We applied our methodology to test the prognostic power of a gene signature related to chromosomal instability and to study the prevalence of stress phenotypes in different patient subgroups defined by the expression of this gene signature. The tumors significantly upregulating this signature were strongly correlated with worse prognosis in the three breast cancer datasets studied, but not in other tumor types. In all cancer types, however, the tumors with positive enrichment for this gene signature displayed a transcriptional program pointing to evasion of the senescence barrier and particular stress phenotypes, indicating strong interdependencies between these different pathways and therapeutic vulnerabilities for the tumor. The panels are for brain, ovarian, lung and bladder cancers. For each cancer type, six properties are shown. The color code for enrichment analysis results (red to blue) is the same as in Figure 2. The properties are 1) upregulation of chromosomal instability genes, 2) senescence-bypass program, 3) DNA and replicative stress response genes, 4) metabolic stress response genes, 5) mitotic stress response genes, and 6) proteotoxic stress response genes.