Clinical-grade whole-genome sequencing and 3′ transcriptome analysis of colorectal cancer patients

Background Clinical-grade whole-genome sequencing (cWGS) has the potential to become the standard of care within the clinic because of its breadth of coverage and lack of bias towards certain regions of the genome. Colorectal cancer presents a difficult treatment paradigm, with over 40% of patients presenting at diagnosis with metastatic disease. We hypothesised that cWGS coupled with 3′ transcriptome analysis would give new insights into colorectal cancer. Methods Patients underwent PCR-free whole-genome sequencing and alignment and variant calling using a standardised pipeline to output SNVs, indels, SVs and CNAs. Additional insights into the mutational signatures and tumour biology were gained by the use of 3′ RNA-seq. Results Fifty-four patients were studied in total. Driver analysis identified the Wnt pathway gene APC as the only consistently mutated driver in colorectal cancer. Alterations in the PI3K/mTOR pathways were seen as previously observed in CRC. Multiple private CNAs, SVs and gene fusions were unique to individual tumours. Approximately 30% of patients had a tumour mutational burden of > 10 mutations/Mb of DNA, suggesting suitability for immunotherapy. Conclusions Clinical whole-genome sequencing offers a potential avenue for the identification of private genomic variation that may confer sensitivity to targeted agents and offer patients new options for targeted therapies. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-021-00852-8.


Background
Colorectal cancer (CRC) is one of the most common malignancies, with over 30,000 cases reported in the UK in 2015-2016 and a 5-year survival rate of approximately 60% [1]. CRC is typically initiated by a mutation in the Wnt signalling pathway gene APC [2] (adenomatous polyposis coli) or associated genes (CTTNB1, RNF143, RSPO2/3) that lead to the formation of a polyp [3] that then progresses via mutations in a number of oncogenes and tumour suppressors into an invasive cancer. In parallel with the expansion of our knowledge of the biology of colorectal cancer, the field of targeted oncology is rapidly advancing, with targeted agents available [4] for a high percentage of driver and modifier mutations across a wide range of cancers.
The Cancer Genome Atlas (TCGA) project set out to characterise mutations in colorectal cancer by exome sequencing of a cohort of 600 patients using the Agilent SureSelect panel via tumour-normal subtraction [5]. It confirmed recurrent mutations in APC, TP53, SMAD4, PIK3CA and KRAS as well as identifying recurrent mutations in ARID1A, SOX9 and AMER1 (FAM123B). Giannakis et al. [6] carried out exome sequencing on a clinically annotated cohort of 619 patients, finding further recurrent mutations in BCL9L, RBM10, CTCF and KLF5, and also showing that neoantigen load as determined by exome sequencing was associated both with tumour associated lymphocyte infiltration and overall survival. However, a key weakness of the TCGA and other studies has been the use of exome sequencing to demonstrate key oncogenic drivers. Exome sequencing, whether by the amplicon or hybridisation approach, may miss key oncogenic drivers due to allelic dropout or the biases inherent to targeting approaches [7], and therefore, the alterations seen in the publications resulting from these datasets may miss key insights that would be seen in a more expansive technology such as wholegenome sequencing.
Whole-genome sequencing has a number of potential advantages. Firstly, it can increase the overall variant calling accuracy as exome sequencing techniques can suffer from probe dropout and poor coverage, especially at splice junctions and in "difficult" to sequence regions where probe drop out is common [8]. Secondly, it can natively call fusions [9] and other structural variants [10] (by detection of split reads), and finally, it can identify copy number variants [11] to a higher accuracy than alternative techniques. Given the recent attention to tumour mutational burden (TMB) in selecting patients for anti-PD1 therapies such as pembroluzimab, whole-genome sequencing can accurately call mutation burden [12].
However, until very recently, studies of colorectal cancer using whole-genome sequencing have been limited in number or scope. Shanmugan et al. [13] carried out whole-genome sequencing in order to identify therapeutic targets in four patients with metastatic disease, finding several known mutations of interest as potentially targetable. Ishaque et al. [14] carried out paired metastasis-primary tumour whole-genome sequencing in colorectal cancer, finding novel non-coding oncogenic drivers and an elevated level of "BRCAness". The Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [15] presented 52 colorectal (37 colon, 15 rectal) whole-genome sequenced tumours as part of the larger consortium effort, although at the time of writing, no specific examination of the landscape of these had been carried out, presumably because of the previous TCGA colorectal cancer paper which examined the exomes of 276 colorectal cancers [5]. Druliner et al. [16] reported WGS results from 10 CRC arising from the polyp of origin from that cancer, a relatively rare phenomenon where residual dysplastic adenoma can be histology seen in the resection specimen. The study found that polypof-origin cancers were genetically indistinguishable from non-polyp of origin cancers, meaning that they could be used as a biological model of the adenoma-carcinoma sequence. A further paper by Druliner et al. [17] examined the whole genomes, transcriptomes and methylomes of cancer-associated and cancer-free polyps from 31 patients, finding significant genomic, transcriptomic and epigenetic differences in patients who had cancerassociated polyps. In a further paper [18], the group identified recurrent loss of heterozygosity in 18q that was preferentially enriched in patients with sporadic colorectal cancer.
The United Kingdom 100,000 Genomes Project has set out to sequence tens of thousands of cancer genomes [19], across multiple tumour types, using a clinical-grade sequencing pipeline and variant calling algorithm. Our study has carried out whole-genome sequencing of 54 paired colorectal tumour-normal samples, utilising a the Genomics England clinical-grade sequencing, alignment, variant calling and annotation pipeline in order to understand the utility of WGS in colorectal cancer. We defined clinical grade as an integrated set of procedures standardising tissue collection, processing, sequencing and downstream analysis as a demonstrator for future use in the UK 100,000 Genomes Project.

Patients
Sequential patients undergoing elective colorectal surgery at the Queen Elizabeth Hospital Birmingham were recruited for the study. Patients were selected who had sporadic colorectal cancer and did not have an Amsterdam positive history of colorectal cancer or an age of onset less than 45 years. Consent for the study was taken, and the study was fully ethically approved by

Samples
Immediately after resection, resected specimens were conveyed to a histopathologist who facilitated direct biopsy of tumour material and associated normal bowel (defined as the distal resection margin) by frozen section. Samples were immediately snap-frozen on liquid nitrogen and stored at − 80°C until needed. Tumour content was verified by frozen section, with at least 60% tumour being needed for inclusion in the study. DNA was extracted using a Qiagen DNEasy kit and RNA with a Qiagen RNEasy kit. Nucleic acid quantity and quality were assessed using a Qubit2 fluorimeter and TapeStation assay. A standardised operation procedure (SOP) was used for the clinical-grade handling of this material to ensure quality, purity and utility for whole-genome sequencing.

Library preparation
Sequencing libraries of 500 ng DNA extracted from the fresh-frozen tumour or normal tissue were prepared using the TruSeq® DNA PCR-free method (Illumina). Sequencing (100 base-paired reads) was performed on the HiSeq2500 platform to a mean depth of > 30× for the normal genome and > 60× for the tumour genome, after the removal of duplicate read-pairs.

RNA
Libraries were prepared using 50 ng of RNA using a Lexogen QuantSeq 3′ RNA kit from tumour and matched normal samples. Polyadenylated mRNA was pulled down then cDNA synthesis and 3′ library preparation carried out. Samples were indexed and pooled across an Illumina NextSeq v2 flow cell and sequenced using a 75-base single-ended sequencing strategy.

Bioinformatics WGS
Raw reads were converted to FASTQ using bcl2fastq, quality trimmed then mapped to the GRCh37 (hg19) Human Reference Genome using the Isaac3 [20] aligner (Illumina). Single nucleotide and indel variants were mapped using the Strelka2 [21] variant caller (for the germline calls using germline-only mode and for somatic calls using joint tumour/normal mode), somatic structural variants using the Manta [22] structural variant caller and copy number aberrations using the Canvas [23] copy number caller. Annotation of the variants was performed using Illumina's annotation engine Nirvana (https://github.com/ Illumina/Nirvana/wiki) using Ensembl 73 as database reference. Novel driver analysis was generated using MutSigCV2 [24], Intogen [25] and dNdScv [26] with and without hyper-mutated samples. Non-coding driver analysis was performed with FunSeq2 [27]. Mutational signatures were generated using the MutationalPatterns R/ Bioconductor package [28]. All variants were stored in VCF files. Telomere length from whole-genome sequencing data was derived using TelomereCat [29].
Copy number calls were pooled across individuals with bedtools and overlapped with bedIntersect to identify the regions that were recurrently gained/lost. Structural variants were pooled using bedtools and overlapped with intersectBed to identify common regions of structural variation.
In samples requiring mutational confirmation, Sanger sequencing was performed (primer sequences available on request).

RNA-seq
FASTQ files were quality trimmed, adapters were removed and reads were aligned to the hg19 reference genome using the STAR aligner [30] (version 2.6.1). Genes were annotated using the Ensembl v74 database and gene-centric read counts generated using Partek Flow GSA algorithm [31]. Hierarchical clustering and PCA plots were also generated. CMS and CRIS signatures were called using the CMSCaller R package [32]. For the calculation of the CIRC score, the methodology of Lal et al. was used [33]. For immune infiltration scores via CIBERSORT, the methodology described by Chen et al. was used [34]. For the signature derivation, the BioSigner module of Bioconductor was used [35].

Data availability
All data are available in the Sequence Read Archive (accession number PRJNA681391) [36].
Fifteen patients underwent adjuvant therapy consisting of capecitabine (1 patient), or capecitabine and oxaliplatin (14 patients). Seventeen patients had disease recurrence, with a median time to recurrence of 639 days (IQR 276-2501 days). Fourteen (25.9%) patients died whilst within the study, with a median time to death of 598 days (IQR 398-1231 days).

Germline mutations
The germline genome of all patients was studied for mutations in genes associated with familial colorectal cancer syndromes (APC, MYH, MLH1, MSH2, MSH6, PMS2, POLE, POLD1, SMAD4 and BMPRA1). We found no SNVs or indel germline mutations in this cohort of patients.

Hypermutator phenotype
In total, 17/54 patients (Table 1) had greater than 10 somatic mutations per megabase, suggesting that they may be suitable for anti-PD1 immunotherapy. Of these patients, five had somatic mutations that have previously been demonstrated as responsible for hypermutated tumours (

Most frequently mutated genes and identification of new drivers
A generic analysis of the ten most frequently mutated genes (both SNV and indel, not normalised by transcript length) demonstrated that these were (from most to least recurrent): TTN, APC, MUC4, FAT2, TP53, FRG1, KRAS, LRP2, CSMD3 and MT-ND4 (Fig. 1). Mutations in KRAS and BRAF were validated with pre-existing Sanger sequencing performed as standard of care.
Mutational frequency of cancer genes was compared to known cancer drivers in (Fig. 1). The most frequently mutated gene was APC (38/54 samples), followed by TP53 (23/54 samples), KRAS (19/54 samples) and FBXW7 (12/54 samples). Less frequent mutations were seen in genes that are typically considered "druggable" but not seen previously in colorectal cancer including KIT, ERBB2 and ALK.
For all driver analyses, samples were analysed in hypermutated and non-hypermutated groups. For the hypermutated analysis, MutSigCV analysis (in order to identify genes significantly mutated compared to background) of driver mutations demonstrated 1235 potentially significant mutations (p < 0.05, q < 0.05) in the dataset. Only APC was highlighted as significant from the typical colorectal driver mutations (Additional file 1: Table S1). For the non-hypermutated analysis, Mut-SigCV analysis demonstrated 97 potentially significant mutations, with APC, TP53, KRAS, SOX9 and FBXW7 being highlighted as significant driver genes (Additional file 1: Table S1).

Recurrent non-coding mutations
An analysis of non-coding drivers using FunSeq2 [37] revealed multiple regions with statistically significant increased mutation rates as compared to background (Additional file 4: Table S4). In the hypermutated set, the top-ranked region (Chr2:133021792-133036207) was identified as having recurrent mutations and is predicted in silico to bind the BRCA1, CHD2, IRF3, MAFK, MXI1, NFKB1, RFX5 and SMC3 transcription factors. The long non-coding RNA ENSG00000232274.1 (chr1:143,189, 434-143,211,070) was also recurrently mutated. In the non-hypermutated sample set, the AP-1 transcription factor complex member JUND was recurrently mutated in 46/47 samples in non-coding regions. An enhancer region adjacent to TEKT4P2 a pseudogene of the Tektin pathway (involved in PI3K/AKT signalling) was also frequently mutated in 27/54 samples. A second enhancer region adjacent to CDH10 (Cadherin 10, chr5:24276200-24285200, implicated in colorectal cancer) was identified as recurrently mutated in 15/54 samples. In total, over 700 non-coding regions (either of transcription factor binding sites, enhancers or promoters) were recurrently mutated in the FunSeq2 dataset. FunSeq2 analysis also ranked APC as the top-ranked coding driver mutation in 27/54 samples.
An overlapping analysis of potential drivers using Venny from all four algorithms only demonstrated APC as being a potential driver in the dataset across all four sets of calls (Fig. 2). When the MutSigCV calls were removed, 12 genes were enriched (KRAS, TP53, FBXW7, PIK3CA, NPEPPS, CTNND1, FLII, MGA, SETPB1, BCL9, MSH3 and ANXA6). When the Intogen calls only were removed, 4 genes were enriched (ZNF517, CROCC, TPO and FSHR). When dNdScV was removed, RIPK4 only was enriched and when FunSeq2 calls only were removed there were no significant genes.
A pathway analysis of these pooled drivers across the four algorithms using G Profiler [38] revealed enrichment in a number of transcription factor associated enriched pathways, KEGG pathways and GO terms (Additional file 5: Table S5).

Copy number aberrations
A pooled analysis of copy number variation across the cohort was performed (Fig. 3). A consistent low-level pattern of both copy number gain and loss was observed. When filtered by exonic regions across all samples, 6/ 354 losses and 2/30 gains were observed to be exonic. Gains were seen in all samples in the FOXI2 (chr10: 129534543-129537433) and REX1BD genes (chr19: 18654566-18746304). FOXI2 is a forkhead binding gene associated with transcriptional activation which has been seen to be consistently hypomethylated in colorectal cancer [39] and REX1BD (required for excision 1 binding domain) is a putative DNA repair gene [40].

Structural variants
Structural variants were filtered on the basis that the most functionally relevant ones were likely to be those involving known cancer driver genes. In total, 29 potential oncogenic gene fusions, detected by WGS, were seen  -axis); the x-axis shows missense (green), nonsense (red), frameshift deletion (blue), splice site (yellow), frameshift insertion (purple), in-frame deletion (brown), in-frame insertion (dark red), non-stop mutation (light blue) and transcription start site mutation (orange). f Top ten mutated genes by frequency-genes on the y-axis, numbers of mutations on the x-axis; colours are the same as in e. g Oncoprint of colorectal driver genes (left y-axis) by sample (x-axis) with the variant type shown in the key. Percentages across the whole cohort are seen in percentages down the right y-axis. h TCGA style log [10] (Table 2). Of the 29 potential gene fusions, no recurrent gene fusions were seen. However, fusions involving IDH1-PTH2R, CDK6-CDK14, KAT6B-RBMS3, ERBB2-HAP1, CCDC6-TMEM212AS1 and BRAF-DLG1 were seen. Sanger sequencing across the breakpoint of the BRAF-DLG1 and ERBB2-HAP1 fusions were performed in order to confirm these clinically relevant findings.

Kataegis
The phenomenon of kataegis (localised somatic hypermutation) has been previously demonstrated in breast cancer [46]. In our study, we found that it occurred in all 54 samples significantly to one extent or another (Additional file 1: Table S1). Kataegis occurred particularly frequently at a per-sample level between chr20: 31050000-31080000 (Additional file 1: Figure S1) which corresponds to the region of NOL4L/C20orf112 (chr20: 31,030,862-31,071,385) a known fusion partner of RUNX1 and PAX5 in leukaemia [47].

Telomere length
Because of the well-observed phenomenon of shorter telomere length in cancer, we studied the lengths of telomeres as measured by whole-genome sequencing, which have previously been shown to correlate well to older methods such as Southern blotting [29]. Median telomere length in cancer was 5028 bp, and in normal germline, blood was 6294 bp (Mann-Whitney p < 0.0001).

RNA-seq Differential expression profiles
In order to understand if there were any de novo transcriptional subgroups within the dataset, a cut-off of the top 250 genes by variance was extracted from the dataset. When comparing tumour/normal expression and using clustering analysis, the number of groups found to have the lowest Davis-Bouldin index (5 clusters, 1.17) was used to set a threshold for K-means clustering (Fig. 5). Hierarchical clustering of 5 separate groups' revealed separation between the five groups and KEGG pathway analysis of each subgroup was performed (Additional file 6: Table S6). In three of the clusters, there were either only one or two samples found. There was no distinction between these clusters in terms of anatomical location, stage or tumour mutational burden.
For subgroup one, an over-representation of pathways concerning inflammation and DNA repair was seen. For subgroups 2 and 3, no significant pathway overrepresentation was seen, possibly because these groups only had one sample within them. For subgroup 4, multiple separate inflammatory pathways (mostly IL-17, Th1 and Th2 centric) were over-represented. Subgroup 5 had a number of interesting over-represented pathways, including reduced MHC presentation, Wnt/BMP signalling, TGFbeta signalling (via upregulated SMAD) and upregulated Hedgehog signalling.

Pathway analysis
Single sample gene expression differences do not explain much of the context of disease processes, so we carried out a pathway gene expression analysis using the KEGG pathways of over-expressed genes to normal counts across the whole dataset. From this, we found a number of pathways of interest that were differentially expressed in colorectal cancer: the p53 signalling pathway (hsa41105, p = 2.24 × 10-53, FDR p = 1.06 × 10-51), NFkappa-B signalling pathway (hsa040605, p = 1.75 × 10-47, FDR p = 4.95 × 10-46) and the "colorectal cancer" A number of other pathways of interest (but not of direct relevance to colorectal cancer) were overexpressed, including platinum drug resistance (hsa01524), the cytosolic DNA-sensing pathway (a.k.a. cGAS-STING, hsa04623) and several involved with DNA repair (FA pathway hsa03460, DNA replication hsa03030, NER, hsa03420).

CMS/CRIS
Two classifiers for transcriptional subtypes in colorectal cancer have been identified (the consensus molecular subtype (CMS) and the CRC intrinsic subtypes (CRIS) subtype [48,49]), which reflect the disease biology of the tumour and have been linked with prognosis. These subtypes are derived from pre-existing molecular data by various computational methods to discover transcriptionally distinct groups within colorectal cancer. CMS and CRIS classifiers were generated for all tumours (Fig. 6)

CIRC
We have previously demonstrated the utility of the Coordinate Immune Response Cluster (CIRC) [33] as a Th1-centric RNA based signature in predicting class I and II MHC immunovisibility (beyond TMB) in order to target with immunomodulatory drugs. The average of expression Z-score for the 28 genes in the CIRC was calculated for each tumour sample, with the lowest CIRC score being − 0.56 and the maximum 3.17. In total, 12/ 54 samples had CIRC > 0 suggesting immunovisibility.

Cell deconvolution using RNA-seq
Immune infiltration estimation using cell type deconvolution by CIBERSORT [34] (Table 3) was performed on 3′ RNA-seq data. This demonstrated a rich and varied immune infiltration within the colorectal cancers studied. The predominant cell type was CD4+ memory (resting) T cell, followed by M2 macrophages, CD8+ T cells, M0 macrophages then activated mast cells. There did not seem to be any correlation with purity estimates of the samples as determined by WGS.

RNA signature for hypermutation
In order to see whether a RNA-based signature for hypermutation could be developed from RNA-seq data, gene-centric gene expression was processed using BioSigner (Bioconductor) using a threshold of > 20 mutations/Mb in the WGS data (in order to develop a clear signature as > 50% of hypermutant samples were near to the classical 10 mutations/Mb cut-off). Using 250 iterations of the algorithm, we attempted to generate random forest (RF), partial least squares discriminant analysis (PLSDA) and support vector machine (SVM) models of gene expression for hypermutant samples. We found that no stable model could be generated; however, this could be a consequence of the relatively few numbers of hypermutant samples.

Correlation between drug mutations database and druggable mutations
In order to ascertain the possibility of actionable targets from the mutations observed in the dataset, we entered a list of protein-coding mutations found in at least one sample to the Drug Interaction Database (http://www. dgidb.org). Potential drug targets were observed for the genes-APC, TP53, KRAS, FBXW7, ATM, PIK3CA, ARID1A, KMT2A, PTEN, SMARCA4, IDH1 and RRM2B (Additional file 7: Table S7). Also, 17/54 (32%) of patients exceeded the 10 mutations/Mb threshold for the potential benefit for treatment with PD-1/PD-L1 therapy.
Utilising the OpenTarget platform (http://www. opentargets.org), which takes lists of mutations and functionally characterises them into drug targets, the 50 top genes from each tool for driver ranking (MutSigCV2, Intogen, dNdScv, Funseq2) were aggregated and input into the system (due to a limit of 200); after duplicate filtering, this left 123 genes of interest. OpenTargets demonstrated significant enrichment for GI and epithelial tract cancers of all subtypes (Additional file 7: Table S7). Also, significantly enriched pathways were seen in classical cancer pathways but also Interferon signalling, phagocytosis and class I MHC signalling. Of the identified druggable genes, for small molecule agents, 8/123 had clinical precedence, 50/123 discovery precedence, and 49/123 were predicted to be tractable. Amongst antibody-based agents, 3/123 had clinical precedence, 68/123 had high tractable confidence and 83/123 had mid-low tractable confidence.

Conclusions
The use of clinical-grade whole-genome sequencing in this study has allowed us to identify known and novel driver mutations that are potentially druggable based on the current state of knowledge. Our study demonstrated the known driver mutations seen in colorectal cancer such as APC, KRAS, BRAF and PIK3CA [5], but also more novel mutations that would potentially be targetable by molecular agents. For instance, we detected KIT mutations that would potentially be targeted by the tyrosine kinase inhibitor imatinib [50], offering a therapeutic option not available to these patients.
We also identified and validated several interesting potential driver mutations by frequency within our cohort. Recurrent mutations were seen in KMT2C, which codes for lysine methyltransferase-2C. These mutations have typically been seen in leukaemia and other blood Hierarchical clustering plot of 100 most variably expressed genes in RNA-seq data, demonstrating five separate clusters. Red = overexpressed, green = under-expressed within cohort by Z-score malignancies but other more recent studies have demonstrated that these mutations occur amongst a wide variety of other cancers [51] and are targetable by inhibitors of KMT2C function. Mutations were also seen in ATM (targetable with ATM kinase inhibitors [52]), IDH1 (targetable with the small molecular inhibitor of IDH1, Ivosidenib [53]) and SMARCA4 (targetable with CDK4/6 inhibitors) [54]. We attempted to identify new driver mutations as well as validate existing drivers using validated calling algorithms; however, only APC was consistently enriched across all four callers in our study, once again emphasising the predominant Wnt signalling driven nature of colorectal cancer. The recurrent nature of HLA-A mutations (which were not validated by Sanger sequencing) in our cohort is interesting, as it is seen infrequently across all cancers [55] and could potentially represent a mechanism of immune invasion in a subset of cancers.
We identified many significantly mutated non-coding regions, such as enhancers, transcription factor binding sites and promoters which may play a significant role in the pathogenesis of colorectal cancer. These regions have been relatively unexplored up to this point and may represent a hitherto unexplored area of colorectal cancer biology.
Recurrent alterations in genome structure, in the form of structural variants, copy number aberrations or gene fusions have also been highlighted as a potential target for therapy. For instance, the FGFR2/3 fusion seen in approximately 40% of cholangiocarcinoma is a target for the drug pemagatinib [56]. Our study has shown several recurrent copy number variations or structural variations but also a number of unique "private" variations that may be targetable. For instance, we observed potential fusions between BRAF and DLG1 (which may be targetable by BRAF kinase inhibition [57]) and between ERBB2 and HAP1 (which may be targetable by lapatinib [58]). It is conceivable, however, that our structural variants observed may be as a result of radiotherapyinduced damage as a subset of these patients underwent neoadjuvant radiotherapy. However, the patients in which these fusions were seen did not undergo neoadjuvant chemoradiotherapy.
Tumour immunotherapy, using a combination of anti-PD1 and/or anti-CTLA4 therapy has been shown to have a survival benefit across multiple tumour types [59], especially when stratified to patients with high tumour mutational burden (TMB). TMB correlates directly with neoepitope production and thus immunovisibility of the tumour. A threshold of 10 mutations per megabase of sequence has been suggested as a cut-off threshold sufficient for the benefit for immunotherapy [60]. Our study has shown that up to 30% of patients with colorectal cancer reach this threshold, which is higher (16%) than previously reported [5] with 10% of We have carried out a variety of analyses of the RNA data derived from our samples. Surprisingly, the pathway analysis demonstrated findings of potential clinical utility, for instance, the presence of KEGG pathway hsa01524 (Platinum resistance). Oxaliplatin is commonly given in adjuvant chemotherapeutic treatment in colorectal cancer and resistance remains a problem [62], especially on the background of toxicity that leads to peripheral neuropathy. Interestingly, we have shown that the most frequent transcriptomic subtype within our dataset is CMS4, which is associated [48] with a worse prognosis (also seen in our dataset) and a more aggressive phenotype mainly due to the presence of fibroblasts which act as "malignant stroma". The low numbers of accurate classification of our samples may represent a weakness of 3′ RNA-seq (although we have previously used this technique without issue) or inherent weaknesses in the CMS classifier when a low tumour content heterogenous tumour sample undergoes sequencing [63]. We have also demonstrated by cell deconvolution a rich and varied immune infiltration with the predominant cell types being CD4+ memory and CD8+ cells; however, M2 macrophages are seen in most tumours. M2 macrophages are known as "repair" macrophages that decrease inflammation and promote tissue repair [64]. If this is indeed the case, it highlights an intriguing future path of research in colorectal cancer. The CIRC classifier, which we have previously used to highlight immunovisibility [33] in cancer, demonstrates that a proportion of samples have immunovisibility beyond those expected by high TMB.
In an era of personalised medicine, we have attempted to utilise current drug databases (DGIDb [65] and Open-Target [66]) in order to identify targets for personalised medicine therapy. All patients had mutations within their tumour that were potentially "druggable" allowing their recruitment into a current or planned clinical trial. This is an exciting finding, as it gives a potential route of treatment for patients with metastatic disease; however, the majority of these trials are phase 1 in nature and thus are not conclusively demonstrated to be active in colorectal cancer, or indeed in the targeted genomic alteration outside of pre-clinical models.
In conclusion, we have demonstrated the utility of standardised clinical-grade WGS at detecting both new biological insights into colorectal cancer and targets for therapy. WGS has the advantage of breadth and depth of coverage but comes at the cost of expense; this is likely to drop significantly as technologies improve. A particular disadvantage in the clinical setting is the need for access to fresh-frozen tumour material in order to perform whole-genome sequencing to the highest quality. Current experiences of FFPE WGS have demonstrated poor quality in comparison with fresh-frozen deriver material [67], and so, this tissue type remains inaccessible to routine WGS. The use of 3′ RNA-seq allows a cost-effective way to further enrich the data returned by these assays and may be useful for future studies, and has the additional advantage of having equivalent performance [68,69] between FFPE and fresh frozen materials. The UK government has recently recommissioned Genomics England to sequence five million genomes over the next decade, and we suggest based on our results that whole-genome sequencing should be considered standard of care for colorectal cancer. We additionally suggest that RNA sequencing should be utilised as the standard of care due to the additional insights it gives into tumour biology.