A global cancer data integrator reveals principles of synthetic lethality, sex disparity and immunotherapy

Background Advances in cancer biology are increasingly dependent on integration of heterogeneous datasets. Large-scale efforts have systematically mapped many aspects of cancer cell biology; however, it remains challenging for individual scientists to effectively integrate and understand this data. Results We have developed a new data retrieval and indexing framework that allows us to integrate publicly available data from different sources and to combine publicly available data with new or bespoke datasets. Our approach, which we have named the cancer data integrator (CanDI), is straightforward to implement, is well documented, and is continuously updated which should enable individual users to take full advantage of efforts to map cancer cell biology. We show that CanDI empowered testable hypotheses of new synthetic lethal gene pairs, genes associated with sex disparity, and immunotherapy targets in cancer. Conclusions CanDI provides a flexible approach for large-scale data integration in cancer research enabling rapid generation of hypotheses. The CanDI data integrator is available at https://github.com/GilbertLabUCSF/CanDI. Supplementary Information The online version contains supplementary material available at 10.1186/s13073-021-00987-8.


Background
Large-scale but often independent efforts have mapped phenotypic characteristics of more than one thousand human cancer cell lines. Despite this, static lists of univariate data generally cannot identify the underlying molecular mechanisms driving a complex phenotype.
We hypothesized that a global cancer data integrator that could incorporate many types of publicly available data including functional genomics, whole genome sequencing, exome sequencing, RNA expression data, protein mass spectrometry, DNA methylation profiling, chromatin immunoprecipitation sequencing (ChIP-seq), assay for transposase-accessible chromatin sequencing (ATAC-seq), and metabolomics data would enable us to link disease features to gene products [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. We set out to build a resource that enables cross-platform correlation analysis of multi-omic data, as this analysis is in and of itself a high-resolution phenotype. Multi-omic analysis of functional genomics data with genomic, metabolomic or transcriptomic profiling can link cell state or specific signaling pathways to gene function [2,3,13,15,16,17,18]. Lastly, co-essentiality profiling across large panels of cell lines has revealed protein complexes and co-essential modules that can assign function to uncharacterized genes [19].
Problematically, in many cases publicly available data are poorly integrated when considering information on all genes across different types of data and the existing data portals are inflexible. For example, lists of genes cannot be queried against groups of cell lines stratified by mutation status or disease subtype. Furthermore, one cannot integrate new data derived from individual labs or other consortia.
We created the Cancer Data Integrator (CanDI), which is a series of Python modules designed to seamlessly integrate genomic, functional genomic, RNA, protein, and metabolomic data into one ecosystem [20]. Our Python framework operates like a relational database without the overhead of running MySQL or Postgres and enables individual users to easily query this vast dataset and add new data in flexible ways. This was achieved by unifying the indices of these datasets via index tables that are automatically accessed through CanDI's biologically relevant Python classes. We highlight the utility of CanDI through four types of analysis to demonstrate how complex queries can reveal previously unknown molecular mechanisms in synthetic lethality, sex disparity, and immunotherapy. These data nominate new small molecule and immunotherapy anti-cancer strategies in KRAS-mutant colon, lung, and pancreatic cancers.

CanDI module structure
CanDI is a Python library built on top of Pandas specializing in retrieving, formatting, and integrating the publicly available data from The Cancer Dependency Map (DepMap) [12,20], The Cancer Cell Line Encyclopedia (CCLE) [1], The Pooled In-Vitro CRISPR Knockout Essentiality Screens Database (PICKLES) [21], The Comprehensive Resource of Mammalian Protein Complexes (CORUM) [8] and protein localization data from The Cell Atlas [4], The Map of the Cell [11], and The In Silico Surfaceome [7,22] (Additional file 1: Fig. S1). The data we present is sourced from the 2021Q2 releases of DepMap and CCLE data and the Avana 2018Q3 release of PICKLES data [23]. CanDI is designed to retrieve data from the most current release of DepMap and CCLE data. While some data we present is from analysis of Bayes factors taken from PICKLES, CanDI's current version no longer makes use of these data as a measure of gene essentiality. At its core, CanDI is software that allows for automated retrieval and formatting of publicly available datasets, and it provides computational tools for quick dataset sub-setting, integration, and cross-referencing. Data retrieval is not tied to the specific data release around which the CanDI classes are built.
Access to all datasets is controlled via a Python class called Data. Upon import, the Data class reads the config file established during installation, defines unique paths to each dataset, and automatically loads the cell line index table and the gene index table. Installation of CanDI, configuration, and data retrieval are handled by a manager class that is accessed indirectly through installation scripts and the Data class. Interactions with this data are controlled through a parent Entity class and several handlers. The biologically relevant abstraction classes (Gene, CellLine, Cancer, Organelle, GeneCluster, CellLineCluster) inherit their methods from Entity. Entity methods are wrappers for hidden data handler classes that perform specific transformations, such as data indexing and high-throughput filtering.

Differential expression
In all cases where it is mentioned, differential expression was evaluated using the DESeq2 R package (Release 3.13) [24]. Significance was considered to be an adjusted p value of less than 0.01.

Differential essentiality
Essentiality scores (CERES gene effect scores [12]) are taken from the DepMap database (2021Q2). To reduce the number of hypotheses posed during this analysis, the mutual information of gene essentiality was calculated using the mutual information metric from the Python package scikit-learn (Version 0.22.0). Genes with mutual information scores greater than one standard deviation above the median were removed from consideration. Differential essentiality was evaluated by performing a Mann-Whitney U test between two groups on every gene that passed the mutual information filter. Significance was considered to be a p value of less than 0.01. The magnitude of differential essentiality of a given gene was shown as the difference in mean CERES scores between the two groups of cell lines.
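The per-gene test described above can be sketched as follows. This is a minimal illustration, not the CanDI implementation: the function name `differential_essentiality`, the table layout (genes as rows, cell lines as columns), and the toy data are assumptions made for the example, and the mutual information pre-filter is omitted because its exact inputs are not specified here.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def differential_essentiality(scores, group_a, group_b, alpha=0.01):
    """Run a two-sided Mann-Whitney U test per gene between two cell line groups.

    scores  : DataFrame of gene effect scores (rows = genes, columns = cell lines)
    group_a : list of cell line names in the first group
    group_b : list of cell line names in the second group
    alpha   : p value threshold for significance (0.01 in the text above)
    """
    rows = []
    for gene, values in scores.iterrows():
        a = values[group_a].dropna().to_numpy()
        b = values[group_b].dropna().to_numpy()
        _, p_value = mannwhitneyu(a, b, alternative="two-sided")
        # Magnitude is reported as the difference in group means.
        rows.append({"gene": gene, "delta_mean": a.mean() - b.mean(), "p_value": p_value})
    result = pd.DataFrame(rows).set_index("gene")
    result["significant"] = result["p_value"] < alpha
    return result


# Toy data (hypothetical): GENE1 is made more essential in group A.
rng = np.random.default_rng(0)
lines_a = [f"A{i}" for i in range(12)]
lines_b = [f"B{i}" for i in range(12)]
toy = pd.DataFrame(
    rng.normal(0.0, 0.1, size=(2, 24)),
    index=["GENE1", "GENE2"],
    columns=lines_a + lines_b,
)
toy.loc["GENE1", lines_a] -= 1.0
table = differential_essentiality(toy, lines_a, lines_b)
print(table)
```

A more negative `delta_mean` indicates that the gene is more essential in the first group, since more negative gene effect scores correspond to stronger dependency.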

Protein localization confidence
Protein localization data was assembled from The Cell Atlas [4], The Map of the Cell [11], and The In Silico Surfaceome [7,22]. Confidence annotations were taken from the supplemental data of each paper, placed on a numeric scale from 0 to 4, and summed across all three databases to give a total confidence score for each localization annotation of every gene. The analysis shown in Fig. 4 represents a gene list that was further manually curated to remove genes that are localized to the intracellular space at the cell membrane, revealing cell surface protein targets that are highly expressed in non-small cell lung cancer (NSCLC) models over normal lung bronchial epithelial cells [4,7,11,22].
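The summation of confidence scores can be illustrated with a short Pandas sketch. The table layout, column names, source labels, and gene names below are hypothetical, and the example assumes that each source's confidence annotation has already been mapped onto the 0 to 4 scale.

```python
import pandas as pd

# Hypothetical annotations: one row per (gene, localization, source),
# with confidence already converted to the 0-4 numeric scale.
annotations = pd.DataFrame(
    [
        ("GENE_A", "plasma membrane", "cell_atlas", 3),
        ("GENE_A", "plasma membrane", "map_of_the_cell", 2),
        ("GENE_A", "plasma membrane", "surfaceome", 4),
        ("GENE_A", "cytosol", "cell_atlas", 1),
        ("GENE_B", "plasma membrane", "surfaceome", 2),
    ],
    columns=["gene", "localization", "source", "confidence"],
)

# Sum confidence across sources for each gene/localization pair.
total = (
    annotations.groupby(["gene", "localization"])["confidence"]
    .sum()
    .rename("total_confidence")
    .reset_index()
)
print(total)
```

With three sources scored from 0 to 4, the total for any one localization annotation ranges from 0 to 12.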

DepMap Creative Commons license
When an individual uses CanDI they are downloading DepMap data and thus are agreeing to a CC Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/).

Synthetic lethality of Fanconi anemia genes in ovarian and breast cancer models
Using CanDI, the essentiality scores of 50 top hits identified by a CRISPR screen in HeLa cells that confer sensitivity to PARP inhibition [25] were visualized across all ovarian cancer cell models in DepMap (2021Q2) (Fig. 1). FANCA and FANCE showed selective essentiality in the BRCA1-mutant ovarian cancer cell lines. Following this observation, CanDI was used to gather the gene essentiality for all FANC genes in the Fanconi anemia pathway. CanDI was then used to visualize these data across all ovarian and breast cancer cell lines, sorting by BRCA1 mutation status.

Synthetic lethality in KRAS- and EGFR-mutant cell lines
CanDI was leveraged to bin NSCLC cell lines present in both CCLE (Release: 2021Q2) and DepMap (Release 2021Q2) into 8 groups: KRAS-mutant and KRAS-wildtype cell lines with and without EGFR mutants removed, as well as EGFR-mutant and EGFR-wildtype cell lines with and without KRAS mutants removed. We present genes that are synthetic lethal with KRAS and EGFR mutations by plotting the mean gene essentiality for all genes in the genome in mutant cell lines against that in wild-type cell lines (Fig. 2, Additional file 1: Fig. S2, S3). Synthetic lethality can be interpreted as a gene's shift off of the y = x line of these figures.
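The comparison of mean essentiality between mutant and wild-type groups can be sketched as below. This is an illustration under stated assumptions rather than the code used for the figures: the function name, the toy scores, and the choice to place wild-type means on the x axis and mutant means on the y axis are all assumptions.

```python
import numpy as np
import pandas as pd


def mean_essentiality_shift(scores, mutant_lines, wildtype_lines):
    """Compare mean gene essentiality between mutant and wild-type cell lines.

    scores : DataFrame of gene effect scores (rows = genes, columns = cell lines)
    Returns one row per gene with both group means and the signed perpendicular
    distance from the y = x line (x = wild-type mean, y = mutant mean).
    """
    out = pd.DataFrame(
        {
            "mutant_mean": scores[mutant_lines].mean(axis=1),
            "wildtype_mean": scores[wildtype_lines].mean(axis=1),
        }
    )
    # Distance of the point (x, y) from the line y = x is (y - x) / sqrt(2).
    out["shift"] = (out["mutant_mean"] - out["wildtype_mean"]) / np.sqrt(2)
    return out.sort_values("shift")


# Toy scores (hypothetical): KRAS is essential only in the mutant lines.
toy = pd.DataFrame(
    {
        "M1": [-1.2, -0.1, -1.0],
        "M2": [-1.0, 0.0, -1.1],
        "W1": [-0.1, -0.1, -1.0],
        "W2": [0.1, 0.0, -1.1],
    },
    index=["KRAS", "GENE_X", "CORE"],
)
shifts = mean_essentiality_shift(toy, ["M1", "M2"], ["W1", "W2"])
print(shifts)
```

Genes that are equally essential or equally dispensable in both groups sit on the diagonal, so only genes with a large negative shift are candidates for mutation-specific dependency.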

Pan-Cancer synthetic lethality analysis
A set of core oncogenes and tumor suppressor driver mutations was chosen for analysis [26]. To test the effect of these genes' mutations on gene essentiality, CanDI was leveraged to split mutations into two groups: a nonsense mutation group containing genes annotated as tumor suppressors (N = 153) and a missense mutation group containing genes annotated as oncogenes with specific driver protein changes (N = 53). CanDI was then used to collect a core set of genes with highly variable essentiality. To do this, the Bayes factors from the PICKLES database (Avana 2018Q4) were converted to binary numeric variables. Bayes factors over 5 were assigned a 1 (essential) and Bayes factors under 5 were assigned a 0 (non-essential). Genes were then sorted by their variance across cell lines, and genes between the 85th and 95th percentile were used for this analysis (N = 2340). To determine a short list of genes for follow-up, chi-square tests were applied to the 95,940 gene pairs in the missense group and the 603,720 gene pairs in the tumor suppressor group. Three new groups were formed for further analysis: the first consisted of the significant gene/mutation pairs from the oncogenic group, the second consisted of the significant gene/mutation pairs from the tumor suppressor group, and the third was a combination of the significant pairs from both groups with no discrimination on the type of mutations considered. These groups were further analyzed for differential essentiality via the Mann-Whitney method described above, and Cohen's d effect size was calculated to measure the extent of the phenotype.
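The steps of this screen (binarizing Bayes factors, selecting a variance band, testing association, and measuring effect size) can be sketched as follows. All function names are hypothetical, and several details are assumptions because the text does not specify them: a Bayes factor of exactly 5 is treated as non-essential, the chi-square test uses SciPy's default continuity correction for 2 x 2 tables, and Cohen's d uses the pooled standard deviation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def binarize_bayes_factors(bf, threshold=5.0):
    """Return 1 where the Bayes factor exceeds the threshold (essential), else 0."""
    return (bf > threshold).astype(int)


def variable_genes(values, low=0.85, high=0.95):
    """Select genes (rows) whose variance across cell lines lies between two quantiles."""
    variance = values.var(axis=1)
    lower, upper = variance.quantile(low), variance.quantile(high)
    return variance[(variance >= lower) & (variance <= upper)].index


def chi_square_pair(essential, mutated):
    """Chi-square test of association between two binary Series on the same cell lines."""
    table = pd.crosstab(essential, mutated)
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Toy example (hypothetical): a gene essential only in the 10 mutated cell lines.
cell_lines = [f"CL{i}" for i in range(20)]
mutated = pd.Series([1] * 10 + [0] * 10, index=cell_lines, name="mutated")
essential = pd.Series([1] * 10 + [0] * 10, index=cell_lines, name="essential")
chi2, p_value = chi_square_pair(essential, mutated)
print(chi2, p_value)
```

Pairs that pass the chi-square screen would then go to the Mann-Whitney test on the continuous scores, with `cohens_d` summarizing the size of the difference between mutated and non-mutated cell lines.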

Differential expression and essentiality of male and female KRAS-driven cancers
We used CanDI to gather all cell lines that are present in both DepMap (2021Q2) and CCLE (Release 2021Q2). CanDI was then leveraged to put these cell lines into the following tissue groups: KRAS-mutant colon/colorectal (CRC), pancreatic ductal adenocarcinoma (PDAC), and NSCLC. Each tissue group was then split into male and female subgroups. We chose to analyze KRAS-mutant colon/colorectal, PDAC, and NSCLC cell lines because DepMap contains a relatively large number of KRAS-mutant male and female cell lines representative of these cancer types, giving us increased statistical power for subsequent analysis. Differential expression was analyzed by applying the methods described above to raw RNA-seq counts data from CCLE (Release: 2021Q2). Genes with adjusted p values less than 0.01 were considered significantly differentially expressed. Differential essentiality was analyzed using the methods described above on the previously described sex subgroups for each tissue type. Genes with p values less than 0.05 were considered significantly differentially essential between male and female cell models. For each tissue type, the distributions of the top 7 significantly differentially essential genes were highlighted in comparison with the bottom 3 as a negative control (Fig. 3).

Differential expression of benign and malignant cancer cell lines
We downloaded human bronchial epithelial (HBE) RNA-seq data from Gillen et al. via the European Nucleotide Archive to use as a benign lung tissue model [27]. This data set contains gene expression data for primary HBE cells cultured from three different donors and also normal human bronchial epithelial (NHBE) cells (a mixture of HBE and human tracheal epithelial cells purchased from Lonza Bioscience (CC-2541)). We then used CanDI to put NSCLC models into three different groups: KRAS-mutant, EGFR-mutant, and all cell lines (Fig. 4). For our benign model, raw counts were quantified via kallisto [28]. Raw counts for our malignant cell lines were queried via CanDI. DESeq2 was then applied to evaluate the differential expression between our normal lung tissue model and our three malignant lung tissue groups. The results from DESeq2 were then filtered by significance (adjusted p value < 0.01). To filter based on potential immunotherapy targets, we removed all genes not annotated as being localized to the plasma membrane, and genes with localization confidence scores lower than 6. Genes that were obviously misannotated as surface proteins were also manually removed.
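The filtering steps applied to the DESeq2 output can be sketched in Pandas as below. The function name, the localization table layout, and the toy values are hypothetical; `padj` and `log2FoldChange` are the standard DESeq2 result column names, and the sketch assumes the results were exported to a table with a `gene` column. Manual curation of misannotated surface proteins is not represented.

```python
import pandas as pd


def candidate_surface_targets(deseq_results, localization, padj_cutoff=0.01, min_confidence=6):
    """Keep significant genes annotated to the plasma membrane with sufficient confidence.

    deseq_results : DataFrame with columns gene, log2FoldChange, padj
    localization  : DataFrame with columns gene, localization, total_confidence
    """
    merged = deseq_results.merge(localization, on="gene", how="inner")
    keep = (
        (merged["padj"] < padj_cutoff)
        & (merged["localization"] == "plasma membrane")
        & (merged["total_confidence"] >= min_confidence)
    )
    return merged.loc[keep].sort_values("log2FoldChange", ascending=False)


# Toy inputs (hypothetical values).
deseq_results = pd.DataFrame(
    {
        "gene": ["G1", "G2", "G3", "G4"],
        "log2FoldChange": [3.0, 2.0, 1.5, 4.0],
        "padj": [0.001, 0.2, 0.0001, 0.002],
    }
)
localization = pd.DataFrame(
    {
        "gene": ["G1", "G2", "G3", "G4"],
        "localization": ["plasma membrane", "plasma membrane", "plasma membrane", "nucleus"],
        "total_confidence": [9, 8, 3, 7],
    }
)
hits = candidate_surface_targets(deseq_results, localization)
print(hits)
```

In this toy case only G1 survives: G2 fails the significance cutoff, G3 has a confidence score below 6, and G4 is not annotated to the plasma membrane.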

Results
CanDI is a global cancer data integrator.
We set out to integrate multiple types of data by creating programmatic and biologically relevant abstractions that allow for flexible cross-referencing across all datasets [20]. Data from the Cancer Cell Line Encyclopedia (CCLE) for RNA expression, DNA mutation, DNA copy number, and chromosome fusions across more than 1000 cancer cell lines was integrated into our database with the functional genomics data from the Cancer Dependency Map (DepMap) (Fig. 1a, b and Additional file 1: Fig. S1) [1,12,21]. We also integrated protein-protein interaction data from the CORUM database along with three additional distinct protein localization databases [4,7,11,22]. CanDI by default will access the most recent release of data from DepMap, although users can also specify both the release and data type that is accessed [20]. The key advantage to this approach is that CanDI enables one to easily input user-defined queries with multi-tiered conditional logic into this large integrated dataset to analyze gene function, gene expression, protein localization, and protein-protein interactions [20].
CanDI identifies genes that are conditionally essential in BRCA-mutant ovarian cancer
The concept that loss-of-function tumor suppressor gene mutations can render cancer cells critically reliant on the function of a second gene is known as synthetic lethality. Despite the promise of synthetic lethality, it has been challenging to predict or identify genes that are synthetically lethal with commonly mutated tumor suppressor genes. While there are many underlying reasons for this challenge, we reasoned that data integration through CanDI could identify synthetic lethal interactions missed by others.
A paradigmatic example of synthetic lethality emerged from the study of DNA damage repair (DDR) [29]. Somatic mutations in the DNA double-strand break (DSB) repair gene BRCA1 create an increased dependence on DNA single-strand break (SSB) repair. This dependence can be exploited through small molecule inhibition of PARP1-mediated SSB repair. Inhibition of PARP provides significant clinical responses in advanced breast, prostate and ovarian cancer patients with DDR mutations, but these patients ultimately progress [29]. Thus, new synthetic lethal associations with BRCA1 are a potential path towards therapeutic development for PARP inhibitor-refractory patients.
To illustrate the flexibility of CanDI to mine context-specific synthetic sick lethal (SSL) genetic relationships, we hypothesized that the genes that modulate response to a PARP1 inhibitor might be enriched for genes that are selectively essential for proliferation or survival of BRCA1-mutant cancer cells. To test this hypothesis, we integrated the results of an existing CRISPR screen that identified genes that modulate response to the PARP inhibitor olaparib [25]. We then tested whether any of these genes are differentially essential for cell proliferation or survival in ovarian cancer and in breast cancer cell models that are either BRCA1 proficient or deficient (Fig. 1c, d). This query revealed that the Fanconi anemia pathway is selectively essential in BRCA1-mutated ovarian cancer models but not in BRCA1-wildtype ovarian cancer, BRCA1-mutated breast cancer, or BRCA1-wildtype breast cancer models (Fig. 1e, Additional file 2: Table S1, and Additional file 3: Table S2). To our knowledge, an SSL phenotype between FANCM and BRCA1 has not been previously reported in human cancer models, although our hypothesis is supported by a recent paper characterizing an SSL phenotype between FANCM and BRCA1 in mouse embryonic stem cells [30]. A second recent paper has nominated a role for FANCM and BRCA1 in telomere maintenance [31]. Importantly, FANCM is a helicase/translocase and thus considered to be a druggable target for cancer therapy [32]. Clinical genomics data support this SSL hypothesis, although this remains to be tested in ovarian cancer patient samples [33]. Because the DepMap currently only allows single genes to be queried and does not enable users to easily stratify cell lines by mutation, such analysis would normally take a user several days to complete manually. Our approach enabled this analysis to be completed using a desktop computer in less than 2 hours, which includes the visualization of data presented here (Fig. 1e).

Conditional genetic essentiality in KRAS- and EGFR-mutant NSCLC cells
Beyond tumor suppressor genes (TSGs), many common driver oncogenes such as KRAS G12D are currently undruggable, which motivates the search for oncogene-specific conditional genetic dependencies. We reasoned that CanDI enables us to rapidly search functional genomics data for genes that are conditionally essential in lung cancer cells driven by KRAS and EGFR mutations. We stratified non-small cell lung cancer (NSCLC) cell models by EGFR and KRAS mutations and then looked at the average gene essentiality for all genes within each of these 4 subtypes of NSCLC. We chose to analyze KRAS- and EGFR-mutant cell lines as these driver oncogene mutations are well characterized and amongst the most common mutations present in DepMap cell lines, giving us increased statistical power for subsequent analysis. We observed that KRAS is conditionally self-essential in KRAS-mutant cell models but that no other genes are conditionally essential in KRAS-mutant, EGFR-mutant, KRAS-wildtype, or EGFR-wildtype cell models (Fig. 2a, b, Additional file 1: Fig. S2, S3 and Additional file 4: Table S3). This finding demonstrates that very few, if any, genes are synthetic lethal with KRAS or EGFR in KRAS- and EGFR-mutant lung cancer cell lines. It may be that these experiments are underpowered, or it may be that when the genetic dependencies of diverse cell lines representing a disease subtype are averaged across a single variable (e.g., a KRAS mutation) very few common synthetic lethal phenotypes are observed [34]. CanDI provides potential solutions for both of these hypotheses.

CanDI enables a global analysis of conditional essentiality in cancer.
It is thought that data aggregation across vast landscapes of unknown covariates does not necessarily increase the statistical power to identify rare associations [34]. Thus, the value of global analyses of aggregated cancer data sometimes lies in systematically sub-setting the data on key covariates after aggregation. This has been observed in driver gene identification [35]. Inspired by our analysis of TSG and oncogene conditional essentiality above, we next used CanDI to identify genes that are conditionally essential in the context of several hundred cancer driver mutations. We first grouped driver mutations (e.g., nonsense or missense) for each driver gene. For this analysis, we selected several thousand genes that are in the 85-90th percentile of essentiality within the DepMap data and therefore conditionally essential, meaning these genes are required for cell growth or survival in a subset of cell lines. Importantly, it is not known why these several thousand genes are conditionally essential. We then tested whether each of these conditionally essential genes has a significant association with individual driver mutations. Our analytic approach does not weight the number of cell models representing each driver mutation, nor does it give information on phenotype effect sizes. Our analysis nominates a large number of conditionally dependent genetic relationships with both TSGs and oncogenes (Fig. 2c, d and Additional file 5: Table S4). A number of the conditional genetic dependencies identified in our independent variable analysis above are represented by a limited number of cell models, so further investigation is needed to systematically validate these conditional dependencies. Nevertheless, these data further suggest that averaging genetic dependencies across diverse cell lines with un-modeled covariates obscures conditional SSL relationships.
For example, this analysis does not weight biological variables such as other mutations, chromosomal amplifications or deletions, aneuploidy, cancer subtype, biological sex, and more. It is currently impossible to account for all variables with the relatively limited number of human cancer cell lines that exist; however, it is predicted that if sufficient biological models representative of each variable were analyzed then a predictable pattern of SSL relationships would emerge.
To further investigate this hypothesis, we analyzed these same conditional genetic relationships with a second analytic approach that weights the number of cell models representing each driver mutation. We observed a limited number of conditional genetic dependencies that largely consist of oncogene self-essential dependencies, as previously highlighted for KRAS-mutant cell lines (Fig. 2e-g and Additional file 6: Table S5) [13,36]. Our analysis of mutant oncogene self-essentiality suggests that mutant oncogenes that drove tumor initiation or progression generally continue to support the proliferation and survival of human cancer cell lines in vitro. Thus, analysis that averages each conditional phenotype across diverse panels of cell lines with unknown covariates masks interesting conditional genetic dependencies.
CanDI reveals female and male context-specific essential genes in colon, lung, and pancreatic cancer
Cancer functional genomics data is often analyzed without consideration for fundamental biological properties such as the sex of the tumor from which each cell line is derived. It is well established that biological sex influences cancer predisposition, cancer progression, and response to therapy [37]. We hypothesized that individual genes may be differentially essential across male and female cell lines. This hypothesis to our knowledge has never been tested in an unbiased large-scale manner. To maximize our statistical power to identify such differences, we chose to test this hypothesis in a disease setting with a large number of relatively homogeneous cell lines and fewer unknown covariates. Using CanDI, we stratified all KRAS-mutant NSCLC, PDAC, and CRC cell lines by sex and then tested for conditional gene essentiality. This analysis identified a number of genes that are differentially essential in male or female KRAS-mutant NSCLC, PDAC, and CRC models (Fig. 3a-f and Additional file 7: Table S6, Additional file 8: Table S7, Additional file 9: Table S8, Additional file 10: Table S9, Additional file 11: Table S10, Additional file 12: Table S11). The genes that we identify are not common across all three disease types, suggesting, as one might expect, that the biology of the tumor in part also determines gene essentiality. To test whether any association between differentially essential genes could be identified from expression data (e.g., essential genes encoded on the Y chromosome), we first used CanDI to identify genes that are differentially expressed between male and female cell lines within each disease [24]. We then plotted the set of differentially essential genes against the differentially expressed genes in KRAS-mutant NSCLC, PDAC, and CRC models (Fig. 3a, c, e and Additional file 7: Table S6, Additional file 8: Table S7, Additional file 9: Table S8, Additional file 10: Table S9, Additional file 11: Table S10, Additional file 12: Table S11) and found little overlap between these gene lists. Notably, we observed that VHL and ELOB, which are thought to form a protein complex, are top hits that are more essential in KRAS-mutant male colon cancer cells [38,39]. Our analysis demonstrates that stratifying groups of heterogeneous cancer models by three variables, in this case tumor type, KRAS mutation status and sex, reveals differentially essential genes. CanDI enables biologically principled stratification of data in the CCLE and DepMap by any feature associated with a group of cell models. This stratification allows us to identify genes associated with sex, which is not possible with other covariates included.
CanDI enables rapid integration of external datasets to reveal immunotherapy targets.
An emerging challenge in cancer biology is how to robustly integrate larger "resource" datasets like CCLE with the vast amount of published data from individual laboratories. For example, a big challenge in antibody discovery is identifying specific surface markers on cancer cells. To approach these big questions, we utilized CanDI's ability to rapidly take new datasets, such as raw RNA-seq counts data from a disparate study of interest, then normalize and integrate this data into the CCLE, DepMap, and protein localization databases previously described. Specifically, we rapidly integrated an RNA-seq expression dataset that measured the set of transcribed genes in primary lung bronchial epithelial cells from 4 donors [27]. Classes within CanDI enable rapid application of DESeq2 to assess the differential expression between outside datasets and the CCLE. We used this feature to identify genes that are differentially expressed between primary lung bronchial epithelial cells and KRAS-mutant NSCLC, EGFR-mutant NSCLC, or all NSCLC models in CCLE. We then used CanDI to identify genes that are upregulated in cancer cells over normal lung bronchial epithelial cells with protein products that are localized to the cell membrane. This analysis of KRAS-mutant, EGFR-mutant, and pan-NSCLC generated highly similar lists of differentially expressed surface proteins (Fig. 4a-f and Additional file 13: Table S12). Notably, overexpression of several of these genes, such as CD151 and CD44, has been observed in lung cancer and is associated with poor prognosis [40][41][42]. These proteins represent potential new immunotherapy targets in KRAS-driven NSCLC.

Discussion
Data integration is a critical requirement in biology research in the era of genomics and functional genomics. Large-scale efforts such as the CCLE have revealed genomic features of more than 1000 cell line models. This data has not to our knowledge previously been integrated with functional genomics data in a manner that lets individual users enter batched queries that are stratified by disease subtype or mutation status. This is not just a small improvement in functionality, but rather it is an enabling format that makes possible the types of conditional genomics analyses that drive discovery. Moreover, it fills a fundamental gap in the cancer research community by integrating large-scale projects with investigator-initiated studies. Our data framework enables biologists without specialized expertise in bioinformatics to use the full spectrum of data in the CCLE and DepMap in a higher-throughput and more precise manner. Using CanDI, we identified genes that are selectively essential in male versus female KRAS-mutant NSCLC, PDAC, and CRC models. To our knowledge, such analysis has never been performed to begin to query the biologic basis of sex disparity in cancer or cancer therapy. We illustrate another feature of our framework by analyzing a list of hit genes nominated by a bespoke CRISPR drug screen for gene essentiality in BRCA1-wildtype and BRCA1-mutated breast and ovarian cancer. In a third application, we analyzed the principle of synthetic lethality for 17,427 genes in 33 KRAS-mutant and 17 EGFR-mutant NSCLC models. We then used CanDI to globally identify genes that are conditionally essential in the context of common cancer driver mutations. Finally, we nominated 12 potential new immunotherapy targets in KRAS-mutant, EGFR-mutant, and pan-NSCLC models by using CanDI to identify genes that are differentially expressed in normal bronchial epithelial cells versus NSCLC models and whose protein products are localized at the plasma membrane.

Conclusions
Our use of CanDI reveals a wealth of new hypotheses that can be rapidly generated from private and publicly available cancer data. By sharing data flows and use cases with a CanDI community, we illustrate the ways in which individual research groups can interact with massive cancer genomics projects without reinventing tools or relying upon DepMap tool releases. We anticipate that CanDI will be widely used in cell biology, immunology, and cancer research.