A robust prognostic signature for hormone-positive node-negative breast cancer
© Griffith et al.; licensee BioMed Central Ltd. 2013
Received: 15 March 2013
Accepted: 26 September 2013
Published: 11 October 2013
Skip to main content
© Griffith et al.; licensee BioMed Central Ltd. 2013
Received: 15 March 2013
Accepted: 26 September 2013
Published: 11 October 2013
Systemic chemotherapy in the adjuvant setting can cure breast cancer in some patients that would otherwise recur with incurable, metastatic disease. However, since only a fraction of patients would have recurrence after surgery alone, the challenge is to stratify high-risk patients (who stand to benefit from systemic chemotherapy) from low-risk patients (who can safely be spared treatment related toxicities and costs).
We focus here on risk stratification in node-negative, ER-positive, HER2-negative breast cancer. We use a large database of publicly available microarray datasets to build a random forests classifier and develop a robust multi-gene mRNA transcription-based predictor of relapse free survival at 10 years, which we call the Random Forests Relapse Score (RFRS). Performance was assessed by internal cross-validation, multiple independent data sets, and comparison to existing algorithms using receiver-operating characteristic and Kaplan-Meier survival analysis. Internal redundancy of features was determined using k-means clustering to define optimal signatures with smaller numbers of primary genes, each with multiple alternates.
Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704, which was comparable to or better than those reported previously or obtained by applying existing methods to our dataset. Three risk groups with probability cutoffs for low, intermediate, and high-risk were defined. Survival analysis determined a highly significant difference in relapse rate between these risk groups. Validation of the models against independent test datasets showed highly similar results. Smaller 17-gene and 8-gene optimized models were also developed with minimal reduction in performance. Furthermore, the signature was shown to be almost equally effective on both hormone-treated and untreated patients.
RFRS allows flexibility in both the number and identity of genes utilized from thousands to as few as 17 or eight genes, each with multiple alternatives. The RFRS reports a probability score strongly correlated with risk of relapse. This score could therefore be used to assign systemic chemotherapy specifically to those high-risk patients most likely to benefit from further treatment.
Large randomized trials have shown that chemotherapy administered in the perioperative setting (for example, adjuvant chemotherapy) can cure patients otherwise destined to recur with systemic, incurable cancer . Once this recurrence has happened, the same chemotherapy is never curative. Therefore, the adjuvant window is a privileged period of time, when the decision to administer additional therapy or not, as well as the type, duration, and intensity of such therapy takes center stage. Node-negative, ER-positive, HER2-negative breast cancer patients have a favorable prognosis when treated with adjuvant hormonal therapy only, but a fraction of such patients recur locally or systemically. Most patients are currently treated not only with hormonal therapy but also cytotoxic chemotherapy, even though it is probably unnecessary for most [1, 2]. Our goal was to stratify these patients with respect to the likelihood of recurrence within 10 years after surgery. Earlier approaches to this problem were developed using relatively small datasets. Here we have explored the extent to which larger and more current meta-datasets can be used to improve predictor performance. Our approach was to develop a multi-gene transcription-level-based classifier of 10-year-relapse (disease recurrence within 10 years) using a large database of existing, publicly available microarray datasets. Existing solutions have been implemented on costly platforms that are slow to return results to the patient. However, new assay technologies exist which could make breast cancer prognostics much more accessible and timely. Some of these potential assay technologies can only measure transcription of a relatively small number of genes while still optimizing cost and efficiency. To this end, we developed a robust prognostic panel offering flexibility in both the number and identity of genes utilized from thousands to as few as eight genes, each with multiple alternatives to maximize the chance of successful migration to other assay technologies.
Studies included in analysis
10-year no relapse
Desmedt, 2007 
Ivshina, 2006 
Loi, 2007 
Miller, 2005 
Schmidt, 2008 
Sotiriou, 2006 
Symmans, 2010 
Wang, 2005 
Zhang, 2009 
All data processing and analyses were completed with open source R/Bioconductor packages. Raw data (Cel files) were downloaded from GEO. Duplicate samples were identified and removed if they had the same database identifier (for example, GSM accession), same sample/patient ID, or showed a high correlation (r >0.99) compared to any other sample in the dataset. Raw data were normalized and summarized using the affy  and gcrma  packages in R/Bioconductor. Probe sets were mapped to Entrez gene symbols using both standard and custom annotation files . ER and HER2 expression status was determined using standard probe sets. For the Affymetrix U133A array we and others have found the probe set '205225_at’ to be most effective for determining ER status . The rank sum of the best probe sets for ERBB2 (216835_s_at), GRB7 (210761_s_at), STARD3 (202991_at), and PGAP3 (55616_at) was used to determine HER2 amplification status. This was calculated by taking the sum of individual expression level ranks of each of the four probes. Cutoff values for ER and HER2 status were chosen by mixed model clustering (mclust [17, 18] package). Mclust performs normal mixture modeling, fitted via an expectation maximization algorithm. A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population. In this case we hypothesized that two populations exist in the dataset (ER + and ER- or HER2+ and HER2-) and would be represented by two distinct distributions of expression values. This is clearly visible in Additional file 1: Figure S1 where both ER and HER2 expression levels apparently manifest as two distinct distributions. In both cases, mclust successfully identified two distributions and we used the maximum value of the first distribution as an unbiased, objective cutoff to separate the two distributions. Unsupervised clustering was performed to assess the extent of batch effects. Once all prefiltering was complete, data were randomly split into training (two-thirds) and test (one-thirds) datasets while balancing for study of origin and number of relapses with 10-year follow-up. The test dataset was renormalized, put aside, left untouched, and only used for final validation, once each for the full-gene, 17-gene, and eight-gene classifiers. Training data was renormalized together as above. Probe sets were then filtered for a minimum of 20% samples with expression above background threshold (raw value >100) and coefficient of variation (COV) between 0.7 and 10. A total of 3,048 probe sets/genes passed this filtering and formed the basis for the 'full-gene-set’ model described below. The clinical annotations and preprocessed data for both training and test datasets (before filtering for percent expression and COV) are provided as Additional files 2, 3, 4, and 5.
Classification was performed on only training samples with either a relapse or no relapse after 10-year follow-up using the randomForest  package. Forests were created with at least 100,001 trees (odd number ensures fully deterministic model) and otherwise default settings. Performance was assessed by area under the curve (AUC) of a receiver-operating characteristic (ROC) curve, calculated with the 'ROCR’ package, from Random Forests internal out-of-bag (OOB) testing results. In Random Forests, this OOB testing takes the place of cross-validation to get an unbiased estimate of the test set error. Each tree is constructed using a different random bootstrap sample of about two-thirds of cases from the original data. Each case left out in the construction of the tree is run through that tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, the software takes j to be the class that got the most votes every time case n was OOB. The proportion of times that j is not equal to the true class of n averaged over all cases is the OOB error estimate. By default, RF performs a binary classification (for example, relapse versus no relapse). However it also reports a probability (proportion of 'votes’) for relapse, which we call the Random Forests Relapse Score (RFRS). Risk group thresholds were determined from the distribution of relapse probabilities using mixed model clustering to set cutoffs for low, intermediate, and high-risk groups.
The 17-gene RFRS signature
The eight-gene RFRS signature
Survival analysis on all training data, now also including those patients with <10 years of follow-up, was performed with risk group as a factor, for the full-gene, 17-gene, and eight-gene models, using the 'survival’ package. Note, the risk scores and groups for samples used in training were assigned from internal OOB cross-validation. Only those patients not used in initial training (without 10-year follow-up) were assigned a risk score and group by de novo classification. Significance between risk groups was determined by Kaplan-Meier log rank test (with test for linear trend). To directly compare relapse rates at 10 years for each risk group to that reported in the OncotypeDX publication , the overall relapse rates in our patient cohort were randomly down-sampled to the same rate (15%) as in their cohort  and results averaged >1,000 iterations. To illustrate, the training dataset includes 572 samples with 143 relapse events (that is, 25.0% relapse rate). Samples with relapse events were randomly eliminated from the cohort until only 15% of remaining samples had relapse events (76/505 = 15%). This 'down-sampled’ dataset was then classified using the RFRS model to assign each sample to a risk group and the rates of relapse determined for each group. The entire down-sampling procedure was then repeated 1,000 times to obtain average estimated rates of relapse for each risk group given the overall rate of relapse of 15%. Setting the overall relapse rate to 15% is also useful because this more closely mirrors the general population rate of relapse. Without this down-sampling, expected relapse rates in each risk group would appear unrealistically high. See Figure 2 for explanation of the breakdown of samples into training and test sets used for classifier building and survival analysis.
Next, the full-gene, 17-gene, and eight-gene RF models along with risk group cutoffs were applied to the independent test data. The same performance metrics, survival analysis, and estimates of 10-year relapse rates were performed as above. The 17-gene model was also tested on the independent test data, stratified by treatment (untreated vs. hormone therapy treated), to evaluate whether performance of the signature was biased towards one patient subpopulation or the other. Finally, for direct performance comparison on the same datasets, the Oncotype DX algorithm  was implemented in R and applied to both training and test datasets.
The independent test data were not used in any way during the training phase. However, these samples represent a random subset of the same overall patient populations that were used in training. Therefore, they are not as fully independent as recommended by the Institute of Medicine 'committee on the review of omics-based tests for predicting patient outcomes in clinical trials’ . Therefore, additional independent validations were performed against the NKI dataset  obtained from the Netherlands Cancer Institute  and the METABRIC dataset  accessed through Synapse . The NKI data represent a set of 295 consecutive patients with primary stage I or II breast carcinomas. The dataset was filtered down to the 89 patients who were node-negative, ER-positive, HER2-negative, and not treated by systemic chemotherapy. Relapse times and events were defined by any of distant metastasis, regional recurrence or local recurrence. The METABRIC data represent a set of 1,992 primary breast tumors. The dataset was filtered down to the 315 patients who were node-negative, ER-positive, HER2-negative, and not treated by systemic chemotherapy. Relapse times and events were used as provided. Expression values from the NKI Agilent array data and METABRIC (Illumina HT 12v3) array were rescaled to the same distribution as that used in training using the 'preprocessCore’ package. Values for the eight-gene and 17-gene-set RFRS models were extracted for further analysis. If more than one Agilent or Illumina probe set could be mapped to an RFRS gene then the probe set with best performance was used. The full-gene-set model was not applied because not all Affymetrix-defined genes (probe sets) in the full-gene-set could be mapped to Agilent or Illumina-defined genes (probe sets). However, the 17-gene and eight-gene RFRS models were applied to NKI and METABRIC data to calculate predicted probabilities of relapse. Patients were divided into low-, intermediate-, and high-risk groups by ranking according to probability of relapse and then dividing so that the proportions in each risk group were identical to that observed in training. ROC AUC, survival P values, and estimated rates of relapse were then calculated as above. It should be noted that while the NKI clinical data described here (n = 89) had an average follow-up time of 9.55 years (excluding relapse events), 34 patients had a follow-up time of <10 years without an event (range, 1.78-9.83 years). Similarly, the METABRIC clinical data (n = 315) had an average follow-up time of 9.85 years (excluding relapse events) but 118 patients had a follow-up time of <10 years without an event (range, 0.05-9.80 years). These patients with <10 years of follow-up but no events would not have met our criteria for inclusion in the training dataset and likely include some events which have not occurred yet. If anything, this is likely to reduce the AUC estimate and underestimate P value significance in survival analysis.
While not necessary for Affymetrix, migration to other assay technologies (for example, RT-PCR approaches) may require highly and invariantly expressed genes to act as a reference for determining accurate gene expression level estimates. To this end, we developed two sets of reference genes. The first was chosen by the following criteria: (1) filtered if not expressed above background threshold (raw value >100) in 99% of samples; (2) filtered if not in top fifth percentile (overall) for mean expression; (3) filtered if not in top 10th percentile (remaining genes) for standard deviation; (4) ranked by coefficient of variation. The top 25 reference genes are listed in Additional file 1: Table S3 along with reference genes used by OncotypeDX. The second set of reference genes were chosen to represent three ranges of mean expression levels encompassed by genes in the 17-gene signature (low, 0-400; medium, 500-900; high, 1,200-1,600). For each mean expression range, genes were: (1) filtered if not expressed above background threshold (raw value >100) in 99% of samples; and (2) ranked by coefficient of variation. The top five genes from each range are listed in Additional file 1: Table S4. Reference genes underwent the same manual checks for sequence correctness by alignment to the reference genome as above (see Additional file 8). Five genes were marked for exclusion in the first set and three genes excluded from the second set. Actual probe sequences for all reference probe sets are provided in Additional file 9.
The RFRS algorithm is implemented in the R programming language and can be applied to independent patient data. Input data are tab-delimited text files of normalized expression values with 17 or eight transcripts/genes as columns and patient(s) as rows. A sample patient data file (patient_data.txt) is presented in Additional file 10. A sample R program (RFRS_sample_code.R) for running the algorithm is presented in Additional file 10. The RFRS algorithm consists of a Random Forest of 100,001 decision trees. This is precomputed, provided as an R data object (see Additional file 11) and must be included in the working directory. Each node (branch) in each tree represents a binary decision based on transcript levels for transcripts described above. Based on these decisions, the patient is assigned to a terminal leaf of each decision tree, representing a vote for either 'relapse’ or 'no relapse’. The fraction of votes for 'relapse’ to votes for 'no relapse’ represents the RFRS - a measure of the probability of relapse. If RFRS is ≥0.606 the patient is assigned to the 'high-risk’ group, if ≥0.333 and <0.606 the patient is assigned to the ’intermediate-risk’ group, and if <0.333 the patient is assigned to the 'low-risk’ group. These cutoffs were chosen by mixed-model clustering during the training phase. The patient’s RFRS value is also used to determine a likelihood of relapse by comparison to a loess fit of RFRS versus likelihood of relapse for the training dataset. Precomputed R data objects for the loess fit (RelapseProbabilityFit.Rdata) and summary plot (RelapseProbabilityPlot.Rdata) are loaded from file (see Additional file 12). The patient’s estimated likelihood of relapse is determined, added to the summary plot, and output as a new report (see Additional file 13). A complete list of GEO sample identifiers, used for training and testing (Additional file 14), and additional sample R code (Additional file 15), are provided to help others reproduce the RFRS implementation.
Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704. This was comparable or better than those reported by Johannes et al. (2010) who tested a number of different classifiers on a smaller subset of the same data and found AUCs of 0.559 to 0.671 . It also compares favorably to the AUC value of 0.688 when the OncotypeDX algorithm was applied to this same training dataset.
Our goal was to identify a potential signature for migration to a more affordable and faster platform. We can expect a certain amount of attrition when migrating to such platforms. Therefore we developed a rational approach in which approximately 100 candidate signature genes were identified but organized into signatures of eight- and 17-gene sets with each gene having multiple alternates in case of platform migration failures. The signature sizes were chosen for practical reasons, with the assumption that high-throughput, low-cost platforms may have limitations to approximately 10 or 20 features and need space for one or more control/reference genes.
Comparison of validation results in independent test data for full-gene-set, 17-gene, and eight-gene RFRS models
KM ( P )
Validation against the additional, independent, NKI dataset also had very similar results. The 17-gene and eight-gene models had AUC values of 0.688 and 0.699, respectively, nearly identical to the results for the previous independent dataset. Differences between risk groups in survival analysis were also significant for both 17-gene (P = 0.023) and eight-gene (P = 0.004, Figure 5C) models. Similar results were also obtained from the validation against METABRIC data with 17-gene (AUC = 0.628, P <1.99E-04, Figure 5D) and eight-gene models (AUC = 0.642, P <0.04). It should be noted that any level of performance observed on independent test datasets is somewhat disadvantaged if they make use of different expression platforms (as NKI and METABRIC datasets do). Even with renormalizing or rescaling, the cutoffs and inter-variable relationships will likely be sensitive to changes in the inherent distribution and dynamic range that accompany another technology. When migrating to a new platform, an additional training phase should be performed. Our signatures and their alternates provide a convenient candidate list for development on such a new platform.
In order to maximize the total size of our training dataset we allowed samples to be included from both untreated patients and those who received adjuvant hormonal therapy such as tamoxifen. Since outcomes likely differ between these two groups, and they may represent fundamentally different subpopulations, it is possible that performance of our predictive signatures is biased towards one group or the other. To assess this issue we performed validation against the independent test dataset, stratified by treatment status, using the 17-gene model. Both groups were found to have comparable AUC values with the slightly better value of 0.740 for hormone-treated versus 0.709 for untreated. Survival curves were also highly similar and significant with P value of 0.004 and 3.76E-07 for treated and untreated respectively (Additional file 1: Figure S5A and S5B). The difference in P value appears more likely due to differences in the respective sample sizes than actual difference in survival curves. This is important because it predicts equal or better performance of the signature in hormonal-therapy treated patients.
While most breast cancers are diagnosed at a resectable and thus potentially curable stage, all carry some risk of relapse. As such, most patients are in theory candidates for adjuvant maneuvers to lessen the risk of relapse. We present a meta-analysis of gene expression data in 858 lymph-node-negative, ER-positive, HER2-negative, chemotherapy-naive breast tumors from published datasets. This dataset supports a method of predicting the likelihood of long-term survival without relapse for such breast cancers. The method involves assaying the expression level of one or more RNA transcripts or their expression products in a breast cancer sample, from a set of eight or 17 genes each with one or more alternate genes. We determine a RFRS and risk group by applying the RFRS algorithm to tumor derived gene expression. Furthermore, by comparing the RFRS value to outcomes in training data we can estimate a likelihood of long-term survival without breast cancer relapse in a patient. In theory, those breast cancer patients with tumors at high risk of relapse could be treated more aggressively whereas those at low risk of relapse could more safely avoid the risks and side effects of systemic chemotherapy. The benefits of this type of approach are expected to be significant and are currently being evaluated in the TailorRX (USA) and MINDACT (Europe) trials . Our hope is that this method, together with a rapid assay platform could provide rapid (<1 h) and useful information to inform clinical decision-making. A test developed from this method could be applied to resected breast tumor tissue (either frozen or fresh), formalin-fixed, paraffin-embedded breast cancer tissue or tissue isolated from a core biopsy or fine needle aspirate. We also provide a list of reference genes for use in normalizing gene expression measurements if necessary.
We tested the hypothesis that a larger dataset with more current clinical data would allow development of a signature with better performance than existing methods. Our signature performed comparably but not significantly better, perhaps suggesting that we have reached the upper limits of accuracy for array-based transcriptional signatures for recurrence in ER+/LN- breast cancer. However, RFRS compares favorably to previously described, clinically available products that predict outcome in breast cancer (for example, OncotypeDX and Mammaprint) in several respects: (1) the signature was built from the largest training dataset available to date; (2) patients with HER2+ tumors were excluded, thus focusing only on patients without an existing clear treatment course; (3) the gene signature predicts relapse with equal success for both patients that went on to receive adjuvant hormonal therapy and those who did not; (4) the gene signature was designed for robustness with (in most cases) several alternate genes available for each primary gene; (5) probe set sequences have been validated by alignment and manual assessment. These features, particularly the latter two, make this signature an especially strong candidate for efficient migration to low-cost platforms for use in a clinical setting. Development of a panel for use in the clinic could take advantage of not only primary genes but also some number of alternate genes to increase the chance of a successful migration. Given the small but significant number of discrepancies observed between clinical and array-based determination of ER status we also recommend inclusion of standard biomarkers such as ER, PR, and HER2 on any design. We provide a list of consistently expressed genes, specific to breast tumor tissue, for use as control genes for those platforms that require them. Finally, we make available a large, carefully curated gene expression dataset for ER+, LN- breast cancer along with clinical annotations for use in the research community.
OLG and FP were postdoctoral fellows and LMH was a scientist at Lawrence Berkeley National Laboratory (LBNL), all under the supervision of PTS and JWG, during the production of this work. OLG is currently a genome fellow at the Genome Institute and research assistant professor at Washington University Medical School. OME is a student at University of California, Berkeley. LMH is currently a research assistant professor at Oregon Health & Science University (OHSU). EAC is a medical oncologist and an assistant professor at University of California, San Francisco School of Medicine. PTS was a staff scientist at LBNL and is currently an associate professor at OHSU. JWG was Life Sciences Division director and associate laboratory director for Biosciences at LBNL, program leader of Breast Oncology and Cancer Genetics at UCSF Helen Diller Family Comprehensive Cancer Center and a professor at UCSF. JWG is currently co-director of the Bay Area Breast Cancer Specialized Program of Research Excellence (SPORE), visiting faculty at LBNL, emeritus professor at UCSF, director of the OHSU Center for Spatial Systems Biomedicine (OCSSB), Gordon Moore endowed chair and the department of Biomedical Engineering chair at OHSU.
Area under curve
Gene Expression Omnibus
Human epidermal growth factor receptor 2
Out of bag
Random forest relapse score
OLG was supported by a Fellowship from the Canadian Institutes of Health Research. EAC was supported by K08 CA137153. OME was supported by the NCI ICBP Summer Cancer Research Fellowship program. This work was supported by the Director, Office of Science, Office of Biological & Environmental Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; U54 CA 112970 and SU2C-AACR-DT0409 to JWG. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.