Pan-cancer identification of clinically relevant genomic subtypes using outcome-weighted integrative clustering

Background Comprehensive molecular profiling has revealed somatic variations in cancer at genomic, epigenomic, transcriptomic, and proteomic levels. The accumulating data has shown clearly that molecular phenotypes of cancer are complex and influenced by a multitude of factors. Conventional unsupervised clustering applied to a large patient population is inevitably driven by the dominant variation from major factors such as cell-of-origin or histology. Translation of these data into clinical relevance requires more effective extraction of information directly associated with patient outcome. Methods Drawing from ideas in supervised text classification, we developed survClust, an outcome-weighted clustering algorithm for integrative molecular stratification focusing on patient survival. survClust was performed on 18 cancer types across multiple data modalities including somatic mutation, DNA copy number, DNA methylation, and mRNA, miRNA, and protein expression from the Cancer Genome Atlas study to identify novel prognostic subtypes. Results Our analysis identified the prognostic role of high tumor mutation burden with concurrently high CD8 T cell immune marker expression and the aggressive clinical behavior associated with CDKN2A deletion across cancer types. Visualization of somatic alterations, at a genome-wide scale (total mutation burden, mutational signature, fraction genome altered) and at the individual gene level, using circomap further revealed indolent versus aggressive subgroups in a pan-cancer setting. Conclusions Our analysis has revealed prognostic molecular subtypes not previously identified by unsupervised clustering. The algorithm and tools we developed have direct utility toward patient stratification based on tumor genomics to inform clinical decision-making. The survClust software tool is available at https://github.com/arorarshi/survClust.


Table S1
Summarizing input data of 18 cancer types across number of samples analyzed and total number of features that went into clustering for mutation, Copy Number (CN), Methylation, mRNA expression, miRNA and Protein 2

Table S2
Comparison of survClust integrated solution versus unsupervised clustering results from published TCGA studies 3 Table S3 summary of survClust Copy Number solution vs unsupervised TCGA solutions on logrank 4 Table S4 summary of survClust Methylation expression solution vs available TCGA solutions on logrank 5 Table S5 summary of survClust mRNA expression solution vs available TCGA solutions on logrank 6 Table S6 summary of survClust miRNA expression solution vs available TCGA solutions on logrank 7 Table S7 summary of survClust Protein expression solution vs available TCGA solutions on logrank 8

Table S8
Cross tabulation of OV copy number survClust labels vs Copy number change in AKT gene 9

Fig S1
Understanding survClust simulation and analysis 10          Plot on the right is the weighted distance matrix of data type 1. Data type 2 -weak clusters, strong cluster association. Plot on the right is the weighted distance matrix of data type 2.

Fig. S5: CD8 T-cell distribution of survClust mutation classes of various cancer types
Boxplot summarizing CD8 T-cell expression (y-axis) across survClust class labels (x-axis). Red line depicts the median, and top and bottom black bars represent 25 th and 75 th percentile respectively. Significance from association test is shown at the bottom of each plot.    Supplementary Notes

Data Pre-Processing
Data types may consist of continuous (gene expression, copy number log-ratio, DNA methylation, miRNA, protein expression) or binary (mutation status) data. Each data type is preprocessed by normalization and standardized as follows - where X is a data type.
Copy Number data was segmented using CBS 1 and reduced to non-redundant regions of alterations using the CNregion function in iClusterPlus 2 with default epsilon of 0.001, keeping in mind that the total numbers of features don't exceed 10,000. For DNA methylation, mRNA expression and miRNA expression, if a certain feature had more than 20% missing data, that feature was removed and remaining were used for analysis. For mRNA expression, we further removed genes having a mean expression lower than the threshold of mean expression of lower 10% quantile. Similarly, methylation probes with mean beta values < 0.1 and > 0.9 were discarded. Genes harboring mutants in less than 1% of the samples were removed. Missing data was imputed using KNN imputation 3 . Data for each cancer type and assay was extracted from pan-cancer study 4 . A summary of features and samples across all cancer types analyzed is shown in Additional file 1: Table S1. Mutation signature analysis was run on the resulting maf file using (https://github.com/mskcc/mutation-signatures) 5 . Blood biomarker data was derived from CIBERSORT 6 .

Running survClust
survClust was run on all cancer types reported in Additional file 1: Table S1. Results are summarized according to each cancer type. Each cancer type was run for each of the 6 platforms -somatic mutation, DNA copy number, DNA methylation, mRNA expression, miRNA expression and RPPA data, and integrating all six.
Final results were compiled after running cv.survclust for 5 folds for 50 rounds of cross validation. Optimal was chosen after assessing cross-validated logrank statistic and standardized pooled within cluster sum of squares. Overfitting was avoided by omitting cluster solution with less than 5 samples. Note that, one can considerably reduce the time to run survClust by reducing the feature space.
We present here two examples of simulations to go over the conceptualization of survClust.

Simulation 1
We simulated a data matrix with 300 samples and 300 features. Out of which 15 features , , were coded for survival relevant information for 100 samples each in cluster1,2 and 3 respectively -  We created two such data types to showcase integration capabilities of survClust (Additional file 1: Fig. S1). We ran survClust and cross-validated runs for k=2-7 for 50 rounds. Optimum is chosen where logrank is maximized and standardized pooled within-cluster sum of squares is minimized. Results are shown in Figure 1.

Simulation 2
We show another simulation example, where we simulate two data types where the truth is a 4class solution shown in Additional file 1: Fig. S2, such that: Data type 1 shows strong molecular association, whereas data type 2 shows strong survival association and weak molecular association.
Survival correlation was imposed by simulating C1, C2, C3 and C4 with median survival time of 5, 4, 3 and 2 years respectively, see (2). The goal here was to see if survClust can identify 4 distinct survival groups when a data type has a strong 3-class molecular structure (as shown in Data type1). Results shown in Additional file 1: Fig. S2.

Centroid Re-labelling
We performed cross validation to avoid over fitting and to arrive at coherent survClust results. Each fold predicts test labels according to its training set, and at the end of cross validation we have prediction of each sample to a class. However, the class labels are meaningless across folds and one needs to be careful when aggregating labels to get a full solution. We define, centroid relabeling method to solve this problem.
Let be total number of samples to classify. We perform a fold cross validation, where each fold predicts a test set, such that, Assume we have the following class label prediction for test set for -class solutions across folds.

Where
= ℎ sample belonging to ℎ class in a test set fold . One can clearly see that labels across folds are meaningless and simply group alike samples together and unlike samples separately in different clusters. To classify all samples, we need to consolidate the labels across folds.
Lets consider the following example where we know the true class of each sample -
test set prediction- Fold 1  Fold 2  Fold 3  sample4  class1  sample5  class1  sample6  class1  sample2  class2  sample3  class2  sample1  class2  sample9  class3  sample7  class3  sample8  class3 As seen, these class labels are meaningless, but they group alike samples together and unlike samples separately. For example, Sample4, Sample5 and Sample6 are grouped in class1 across folds, but we know that their true label is class1, class2 and class3 respectively. We arrive at consolidated labels as follows- Step 1-Calculate centroids of each ( = 3) class in each fold ( = 3). ; centroid calculation for ℎ fold and ℎ class and samples belonging to that class.
Step 2-Run kmeans on the centroids vectors ( 11 , 12 , 13 , 21 , 22 , 23 , 31 , 32 , 33 ) Step 3 -Get kmeans class labels on the centroids, and relabel each sample accordingly. This same strategy is used to relabel class membership of samples across rounds of cross validation, where each round has predicted class labels after each round of cross validation.
Consolidate labels across all rounds to arrive at the final solution as follows.
Compute centroids for each class in each round (same as Step1).