Web portal
Using our pipeline (Fig. 1), we processed RNA-Seq data from 1,082 human cancer cell lines, generating HLA types and HLA expression quantification, virus identification, and gene expression values, and retrieved cell line mutations [4, 5]. The results of this pipeline are freely accessible in the TRON Cell Line Portal at http://celllines.tron-mainz.de.
The user web interface offers two main views, the sample information page (Fig. 2a) and the advanced search functionality (Fig. 2b). The sample information page provides information about the selected cell line. Through a tab-based interface, tables display tissue and disease type, all linked mutations, gene expression values, detected HLA types, and virus expression. The second view provides advanced search functionality, allowing one to search by combining and excluding criteria. For example, the portal can easily execute the following query: ‘Show me all melanoma cell lines that are (i) HLA-A*02:01 positive, (ii) express EGFR, (iii) have a BRAF p.V600E mutation, and (iv) are annotated as female’. To translate this into the search form, we specify HLA type ‘A’ with allele ‘02:01’, the mutated gene ‘BRAF_p.V600E’, and the expressed gene ‘EGFR’ with an RPKM range from 1 to 100, leave the virus name field empty, and run an ‘ALL and fuzzy’ search on the properties to find cell lines that are annotated as ‘Female’ and have the keyword ‘Melanoma’ in their disease description (Fig. 3a). The cell lines A375, RPMI7951, and WM115 are returned (Fig. 3b). Search criteria can also be logically negated, for example, to search for all female melanoma samples that do not have the HLA type A*02:01.
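The logic of such a combined query, including negation, can be illustrated with a minimal Python sketch. This is not the portal's implementation: the `CellLine` record, its field names, the `matches` function, and the toy records are all hypothetical, and the ‘fuzzy’ property search is assumed here to be a case-insensitive substring match.

```python
from dataclasses import dataclass, field


@dataclass
class CellLine:
    """Hypothetical in-memory record; field names are illustrative only."""
    name: str
    disease: str
    sex: str
    hla_alleles: set = field(default_factory=set)  # e.g. {"A*02:01"}
    mutations: set = field(default_factory=set)    # e.g. {"BRAF_p.V600E"}
    rpkm: dict = field(default_factory=dict)       # gene name -> RPKM


def matches(line, *, hla=None, not_hla=None, mutation=None, gene=None,
            rpkm_min=0.0, rpkm_max=float("inf"), sex=None,
            disease_keyword=None):
    """Return True if the cell line satisfies every criterion that is set."""
    if hla is not None and hla not in line.hla_alleles:
        return False
    if not_hla is not None and not_hla in line.hla_alleles:  # negated criterion
        return False
    if mutation is not None and mutation not in line.mutations:
        return False
    if gene is not None:
        value = line.rpkm.get(gene)
        if value is None or not (rpkm_min <= value <= rpkm_max):
            return False
    if sex is not None and line.sex.lower() != sex.lower():
        return False
    # Assumption: "fuzzy" means a case-insensitive substring match.
    if (disease_keyword is not None
            and disease_keyword.lower() not in line.disease.lower()):
        return False
    return True


# Invented toy records, not values taken from the portal.
TOY_LINES = [
    CellLine("toy_1", "Malignant melanoma", "Female",
             {"A*02:01", "B*07:02"}, {"BRAF_p.V600E"}, {"EGFR": 12.5}),
    CellLine("toy_2", "Malignant melanoma", "Female",
             {"A*01:01"}, {"BRAF_p.V600E"}, {"EGFR": 3.0}),
    CellLine("toy_3", "Colon carcinoma", "Male",
             {"A*02:01"}, set(), {"EGFR": 40.0}),
]

positive = [c.name for c in TOY_LINES
            if matches(c, hla="A*02:01", mutation="BRAF_p.V600E", gene="EGFR",
                       rpkm_min=1, rpkm_max=100, sex="Female",
                       disease_keyword="Melanoma")]
negated = [c.name for c in TOY_LINES
           if matches(c, not_hla="A*02:01", sex="Female",
                      disease_keyword="Melanoma")]
```

Every criterion that is left unset is simply skipped, which mirrors leaving a field of the search form empty.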
In addition to the user interface, we provide an API based on the Django REST Framework (http://www.django-rest-framework.org/). The API gives users direct access to the underlying data models and supports bulk data retrieval. The user interface itself relies on and interacts with this API; advanced users can thus discover the available entry points or alternatively browse the API page at http://celllines.tron-mainz.de/api. Additional file 1 shows an example Python script to retrieve data using this API.
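As a rough illustration of programmatic access, the following sketch (which is not the script from Additional file 1) composes a request URL and iterates over the returned records. Only the API root is taken from the text. The resource and parameter names passed to `build_url` are hypothetical and should be replaced by entry points found on the API page. The sketch further assumes that a paginated response has the `results`/`next` shape of the Django REST Framework's page-number pagination, which may not match how the portal is configured.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "http://celllines.tron-mainz.de/api"  # API root given in the text


def build_url(resource, **params):
    """Compose a request URL; `resource` and `params` are hypothetical names."""
    url = f"{API_ROOT}/{resource}/"
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url


def fetch_json(url):
    """Download and decode one JSON document."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_results(url, fetch=fetch_json):
    """Yield records, following `next` links if the response is paginated."""
    while url:
        page = fetch(url)
        if isinstance(page, list):  # unpaginated list response
            yield from page
            return
        yield from page.get("results", [])
        url = page.get("next")  # None ends the loop
```

Passing the `fetch` function as a parameter keeps the pagination logic testable without network access, for example by supplying a function that returns prepared dictionaries.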
HLA type and expression
Knowledge of a cell line's HLA type and HLA expression is critical for immunologic and cancer research and for therapeutic development. As an example, in cancer immunotherapy, when developing a vaccine targeting specific mutations presented on a patient's HLA allele [19], one might want to use a cancer cell line expressing HLA-A*02:01 to identify mutation-bearing neo-epitopes presented on HLA [6] and to test T-cell activity [20]. In addition, the HLA type of a cell line can be regarded as a molecular identifier [21], and thus HLA typing can be used as a sample barcode to detect mislabeled or contaminated samples [6].
To our knowledge, this is the largest catalog of cancer cell lines annotated with HLA type and expression. Using paired-end RNA-Seq samples from 1,082 cancer cell lines, we determined the 4-digit HLA Class I and Class II type and HLA expression using the tool seq2HLA [6, 15]. When available, HLA typing data from the literature are integrated. Figure 2a shows results for the prostate adenocarcinoma cell line PC-3. The HLA Class I type is HLA-A*24:01, HLA-A*01:01, HLA-B*13:02, HLA-B*55:01, HLA-C*01:02, and HLA-C*06:02, consistent with the sequence-based typing (SBT) from Adams et al. [16]. For HLA-C, the latter only provides 2-digit types, whereas seq2HLA provides the 4-digit HLA type, which is necessary for applications such as HLA binding predictions [17]. Among the HLA Class I alleles in PC-3 cells, HLA-A shows the highest (109 RPKM) and HLA-B the lowest expression (16 RPKM). PC-3 expresses HLA Class II alleles at very low levels: HLA-DRB1*13:01 could be correctly identified despite the very small number of mapped reads (0.04 RPKM), while no reads were associated with other HLA Class II alleles.
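The relationship between 4-digit and 2-digit types can be made concrete with a small Python sketch. It is independent of seq2HLA; the function names are hypothetical, and the comparison assumes that two typings are considered consistent when they agree at 2-digit resolution.

```python
import re
from collections import Counter

# Optional "HLA-" prefix, locus, allele group, optional protein field.
_ALLELE = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d{2,3})(?::(\d{2,3}))?$")


def parse_allele(text):
    """Split an allele such as 'HLA-A*24:01' into (locus, group, protein)."""
    match = _ALLELE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised HLA allele: {text!r}")
    return match.groups()  # protein is None for a 2-digit type


def two_digit(text):
    """Reduce an allele to 2-digit resolution, e.g. 'C*06'."""
    locus, group, _ = parse_allele(text)
    return f"{locus}*{group}"


def concordant(called, reference):
    """True if both allele lists agree at 2-digit resolution."""
    return Counter(map(two_digit, called)) == Counter(map(two_digit, reference))
```

For instance, the 4-digit HLA-C calls for PC-3 given above, `["HLA-C*01:02", "HLA-C*06:02"]`, are concordant with a 2-digit typing written as `["C*01", "C*06"]`, although only the former carries the protein-level information needed for binding predictions.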
Detected viruses
Infections or contaminations of cell lines by viruses can be determined by the presence of viral sequences. As an example, Additional file 2: Figure S1 shows the report for the liver carcinoma cell line PLC/PRF/5, including the determined HLA type and the detected viruses. Here, concordant with the information from the American Type Culture Collection (ATCC), the Hepatitis B virus (HBV) genome is reported. The genome coverage of more than 90 % shows that most of the HBV genome is expressed as mRNA. HBV infection is related to the onset of hepatocellular carcinoma [22], and thus this cell line may act as a model for this cancer entity in terms of HBV infection. Additionally, the human endogenous retrovirus K113 (HERV-K113) is reported, the only human endogenous retrovirus (HERV) genome present in this database. HERV-K113 is present in many human genomes and is known to express mRNA and even proteins [23, 24].
In addition to identifying new or already known cancer-related virus infections, contaminations can be detected. We find evidence (90 % genome coverage) of murine type C retrovirus in the transcriptome of the bladder urothelial carcinoma cell line 253JBV, which might have confounding effects on experiments [25].
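How the pipeline computes genome coverage is not detailed here. One common definition, the fraction of genome positions covered by at least one aligned read, can be sketched in Python as follows; the function name and the 0-based, half-open interval convention are assumptions made for this illustration.

```python
def genome_coverage(intervals, genome_length):
    """Fraction of genome positions covered by at least one interval.

    `intervals` are 0-based, half-open (start, end) alignment spans.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        # Clip each span to the genome boundaries.
        start, end = max(start, 0), min(end, genome_length)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # Gap before this span: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent span: extend the merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length
```

Merging overlapping spans before counting ensures that positions hit by many reads are counted once, so the result reflects breadth rather than depth of coverage.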
Mutations
The portal integrates mutation information for the analyzed cell lines from CCLE [4] and Klijn et al. [5]. For each mutation, annotations are displayed, such as the affected gene, the position in the genome, the type (for example, substitution), the effect (for example, missense or intron), and the influence on the protein sequence (for example, p.Y58F means that the tyrosine residue at position 58 is substituted by a phenylalanine). In addition, we provide links to the webpage of this entry at the respective source, CCLE or Genentech, and a link to the ‘Drug Gene Interaction Database’, which identifies relationships between mutated genes and drugs [26].
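The protein-level notation can be decoded mechanically. The following Python sketch handles only single-residue substitutions such as p.Y58F; the function names are hypothetical, and other mutation classes (for example, frameshifts or deletions) are deliberately rejected rather than guessed.

```python
import re

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

_SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def parse_substitution(notation):
    """Parse 'p.Y58F' into (reference residue, position, mutant residue)."""
    match = _SUBSTITUTION.match(notation)
    if match is None:
        raise ValueError(f"not a single-residue substitution: {notation!r}")
    ref, pos, alt = match.groups()
    if ref not in THREE_LETTER or alt not in THREE_LETTER:
        raise ValueError(f"unknown amino acid code in {notation!r}")
    return ref, int(pos), alt


def describe(notation):
    """Render a substitution as a short sentence fragment."""
    ref, pos, alt = parse_substitution(notation)
    return (f"{THREE_LETTER[ref]} at position {pos} "
            f"is substituted by {THREE_LETTER[alt]}")
```

Here, `describe("p.Y58F")` yields the reading given in the text, with tyrosine (Tyr) at position 58 replaced by phenylalanine (Phe).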
Neo-epitope catalog
Using the determined HLA Class I and Class II types in conjunction with the mutations enabled us to define a catalog of HLA Class I and Class II neo-epitope candidates. Figure 4 shows the neo-epitope catalog for the colon carcinoma cell line HCT116, sorted from strong to weak binding. Columns 1 to 3 describe the mutation, and columns 4 to 7 show the HLA allele, the percentile rank, the sequence, and the IC50 of the predicted strongest binding neo-epitope, respectively. Columns 8 to 11 show the same information for the corresponding wild-type sequence.
Such a list can serve as input for experiments searching for tumor HLA ligands. As an example, Bassani-Sternberg et al. [27] recently eluted HLA ligands from HCT116 cells, followed by mass spectrometry profiling, and found several mutation-containing ligands that are listed in the neo-epitope catalog, such as QTDQMVFNTY with a predicted strong binding affinity (rank: 0.01, IC50: 8 nM, marked row in Fig. 4).
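How candidate peptides are enumerated is not described in this section. A common first step, sketched below in Python under stated assumptions, is to list every peptide window that contains the mutated residue together with its wild-type counterpart; binding prediction for each window is a separate step that is not shown. The function name is hypothetical, and the default lengths of 8 to 11 residues are an assumption typical for HLA Class I ligands (Class II ligands are usually longer).

```python
def mutant_windows(protein, position, alt, lengths=(8, 9, 10, 11)):
    """Yield (mutant, wild-type) peptide pairs covering a substitution.

    protein  -- wild-type protein sequence
    position -- 1-based position of the substituted residue
    alt      -- mutant residue (one-letter code)
    """
    index = position - 1
    if not 0 <= index < len(protein):
        raise ValueError("position lies outside the protein")
    mutant = protein[:index] + alt + protein[index + 1:]
    for k in lengths:
        # Start offsets of all k-mers that include the mutated index.
        first = max(0, index - k + 1)
        last = min(index, len(protein) - k)
        for start in range(first, last + 1):
            yield mutant[start:start + k], protein[start:start + k]
```

Keeping the wild-type window next to each mutant window mirrors the catalog layout, in which the columns for the neo-epitope are followed by those for the corresponding wild-type sequence.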
Gene expression
The TCLP allows searching for and listing gene expression values from a selected cell line. The table enables the user to filter by gene name or to define an RPKM value range. The table dynamically updates its content to display only the data fulfilling the given criteria. The gene name is linked to the NCBI platform for additional gene information. All expression data of the current cell line can be downloaded via a download button at the top of the table or through the corresponding API.
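The same kind of filtering can be applied offline to downloaded expression data. The Python sketch below first restates the standard definition of RPKM (reads per kilobase of transcript per million mapped reads) and then filters a table by gene name and RPKM range. The function names and the `(gene, rpkm)` row layout are hypothetical and do not reflect the format of the portal's download.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def filter_expression(rows, name_query="", rpkm_min=None, rpkm_max=None):
    """Keep (gene, rpkm) rows matching the name query and RPKM range.

    The name match is a case-insensitive substring test (an assumption).
    """
    query = name_query.lower()
    selected = []
    for gene, value in rows:
        if query and query not in gene.lower():
            continue
        if rpkm_min is not None and value < rpkm_min:
            continue
        if rpkm_max is not None and value > rpkm_max:
            continue
        selected.append((gene, value))
    return selected
```

For example, 500 reads mapped to a 2,000 bp transcript in a library of 25 million mapped reads correspond to 10 RPKM, which a filter with `rpkm_min=1` and `rpkm_max=100` would retain.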