Sequence-based prediction of vaccine targets for inducing T cell responses to SARS-CoV-2 utilizing the bioinformatics predictor RECON

Background The ongoing COVID-19 pandemic has created an urgency to identify novel vaccine targets for protective immunity against SARS-CoV-2. Consistent with observations for SARS-CoV, a closely related coronavirus responsible for the 2003 SARS outbreak, early reports identify a protective role for both humoral and cell-mediated immunity for SARS-CoV-2.
Methods In this study, we leveraged HLA-I and HLA-II T cell epitope prediction tools from RECON® (Real-time Epitope Computation for ONcology), our bioinformatic pipeline that was developed using proteomic profiling of individual HLA-I and HLA-II alleles to predict rules for peptide binding to a diverse set of such alleles. We applied these binding predictors to viral genomes from the Coronaviridae family, and specifically to identify SARS-CoV-2 T cell epitopes.
Results To test the suitability of these tools to identify viral T cell epitopes, we first validated HLA-I and HLA-II predictions on Coronaviridae family epitopes deposited in the Virus Pathogen Database and Analysis Resource (ViPR). We then used our HLA-I and HLA-II predictors to identify 11,776 HLA-I and 7,991 HLA-II candidate binding peptides across all 12 open reading frames (ORFs) of SARS-CoV-2. This extensive list of candidate peptides is driven by the length of the ORFs and the large number of HLA-I and HLA-II alleles that we are able to predict (74 and 83, respectively), providing over 99% coverage for the US, European and Asian populations, for both HLA-I and HLA-II. From our SARS-CoV-2 predicted peptide-HLA-I allele pairs, 368 pairs identically matched previously reported pairs in the ViPR database, originating from other coronaviruses. Of these pairs, 328 (89.1%) had a positive MHC-binding assay result. This analysis reinforces the validity of our predictions.
Conclusions Using this bioinformatic platform, we identify multiple putative epitopes for CD4+ and CD8+ T cells whose HLA binding properties cover nearly the entire population and thus may be effective when included in prophylactic vaccines against SARS-CoV-2 to induce broad cellular immunity.

high degree of sequence similarity allows us to leverage the previous research on protective immune responses to SARS-CoV to aid in vaccine development for SARS-CoV-2. Both humoral and cellular immune responses have been shown to be important in host responses to SARS-CoV (12). Antibody responses generated against the S and the N proteins have been shown to protect from SARS-CoV infection in mice and have been detected in SARS-CoV-infected patients (13-16). However, these antibody responses detected against the S protein were short-lived and undetectable in patients six years post-recovery, suggesting that T cell responses may be involved in the long-term control of this virus (17). Indeed, significant changes in the total lymphocyte counts and T cell subset composition have been observed in patients with SARS-CoV; namely, levels of B cells and of CD4+ and CD8+ T cells were significantly reduced in these patients (18,19). Similarly, in mice infected with SARS-CoV, the severity of SARS correlated with the ability to develop a virus-specific T cell response (20,21).

Both CD4+ and CD8+ T cell responses have been detected in SARS-CoV-infected patients (12,22) as well as in SARS-CoV-2 (23). Notably, SARS-CoV-specific memory CD8+ T cells persisted up to 11 years post-infection in patients who recovered from SARS (24). Studies in mice have shown that SARS-CoV-specific memory CD8+ T cells provided protection against a lethal SARS-CoV infection in aged mice (21). In addition, adoptive transfer of effector CD4+ and CD8+ T cells to immunodeficient or young mice expedited virus clearance and improved clinical results (20). Immunization with SARS-CoV peptide-pulsed dendritic cells invigorated a T cell response, increasing the number of virus-specific CD8+ T cells and enhancing both virus clearance and overall survival (25).
These studies indicate an important role for T cell responses in controlling disease severity, clearing virus, and conferring protective immunity to SARS-CoV infections. Given the homology between SARS-CoV and SARS-CoV-2, as well as emerging data on SARS-CoV-2 (23), cellular immune mechanisms might play a critical role in providing protection against SARS-CoV-2.

Here, we used T cell epitope prediction tools from the bioinformatic pipeline RECON® (Real-time Epitope Computation for ONcology) (26,27) to identify SARS-CoV-2 epitopes recognized by CD4+ and CD8+ T cells. RECON was trained on high-quality mono-allelic major histocompatibility complex (MHC) immunopeptidome data generated via mass spectrometry. The use of mass spectrometry allows for the high-throughput, and relatively unbiased, collection of MHC binding data compared to traditional binding affinity assays, as well as the inclusion of important chaperone molecules. Additionally, the use of engineered mono-allelic cell lines avoids dependence on in-silico deconvolution techniques and allows for allele coverage to be expanded in a targeted manner.

With this approach, we generated data for 74 human leukocyte antigen (HLA)-I and 83 HLA-II alleles (Supplementary Tables 1 and 2). These mass spectrometry data enabled us to train neural network-based binding predictors that outperform the leading affinity-based predictors for both HLA-I (26) and HLA-II (27). Furthermore, we demonstrated in (27) that this improved binding prediction leads to improved immunogenicity prediction by validating on a data set of tetramer responses to a diverse collection of pathogens and allergens (28,29). Although RECON was originally developed to prioritize neoantigens for immunotherapy applications, it is agnostic to the source of peptide sequences evaluated and can be easily applied to peptides derived from pathogens as well.
As validation to that end, the binding predictors from RECON were used to score Coronaviridae family peptides from the Virus Pathogen Database and Analysis Resource (ViPR) database (30) that had been assayed for T cell reactivity or MHC binding. The ViPR database integrates viral pathogen data from internally curated data, researcher submissions and data from various external sources. Our approach provides a significant improvement in both the breadth of predictions, and their validity, compared with a recent study that had a similar aim (31). We used the HLA-I and HLA-II binding predictors from RECON to predict the binding potential of peptide sequences from across the entire SARS-CoV-2 genome for a broad set of HLA-I and HLA-II alleles, covering the vast majority of USA, European, and Asian populations (Supplementary Table 3

if a given peptide-allele pair was assayed multiple times by a specific assay type and was determined to be positive in any single one of the assays, the peptide-allele pair was classified as positive. Specifically, the priority was given by the following order: Positive-High > Positive-Intermediate > Positive-Low > Positive > Negative (e.g., a peptide allele pairing that was

Binding prediction for ViPR Coronaviridae family T cell epitopes
Peptide-HLA-I allele pairs in the ViPR validation dataset were scored using RECON's HLA-I binding predictor, a neural network-based model trained on mass spectrometry data (26). Similarly, peptide-HLA-II allele pairs in the ViPR validation dataset were scored using RECON's HLA-II binding predictor, a recently published convolutional neural network-based model trained on mono-allelic mass spectrometry data (27). When applying the HLA-II binding predictor, we used the highest score across all 12-20mers within a given assay peptide. This is meant to account for the fact that the predictor is trained on ligands observed via mass spectrometry and may learn processing rules that are irrelevant for assays that do not incorporate processing and presentation.
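The "highest score across all 12-20mers" rule can be illustrated with a minimal Python sketch. It is not the RECON implementation: `toy_score` is a hypothetical stand-in for the HLA-II binding predictor, and the function names are assumptions introduced only for this example.

```python
from typing import Callable, Iterator


def windows(peptide: str, min_len: int = 12, max_len: int = 20) -> Iterator[str]:
    """Yield every contiguous subsequence whose length is in [min_len, max_len]."""
    for k in range(min_len, max_len + 1):
        for start in range(len(peptide) - k + 1):
            yield peptide[start:start + k]


def best_window_score(peptide: str, score: Callable[[str], float]) -> float:
    """Return the highest predictor score over all 12-20mer windows of a peptide."""
    candidates = list(windows(peptide))
    if not candidates:
        raise ValueError("peptide is shorter than the minimum window length")
    return max(score(w) for w in candidates)


def toy_score(seq: str) -> float:
    """Hypothetical scorer (NOT the real predictor): fraction of hydrophobic residues."""
    return sum(seq.count(aa) for aa in "AVILMFWY") / len(seq)
```

Taking the maximum over windows means a long assay peptide is credited with its best-scoring sub-peptide, whatever flanking sequence surrounds it.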

Retrieval of SARS-CoV-2 sequence
The GenBank reference sequence for SARS-CoV-2 (accession: NC_045512.2) was used for this study. All twelve annotated open-reading frames (orf1a, orf1b, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF10) were considered as sources of potential epitopes.

To identify candidate HLA-I epitopes, all possible 8-12mer peptide sequences from SARS-CoV-2 were scored with RECON's HLA-I binding predictor. The HLA-I binding predictor was used to score SARS-CoV-2 peptides against 74 alleles, including 21 HLA-A alleles, 35 HLA-B alleles, and 18 HLA-C alleles. Peptide-allele pairs were assigned a percent rank by comparing their binding scores to those of 1,000,000 reference peptides for the same respective allele. Peptide-allele pairs that scored in the top 1% of the scores of these reference peptides were considered strong potential binders.

These top-ranking peptides were then prioritized based on expected USA population coverage (allele frequencies obtained from (32) - USA frequencies calculated as follows:

The cumulative product itself represents the chance that an individual in the USA does not express any one of the contained alleles; hence, the complement describes the probability that at least one is present. The aim of using USA, European, and Asian Pacific Islander (API) allele frequencies is to cover a diverse population where allele frequency estimates are relatively reliable.

We then constructed two ranked lists of HLA-I epitopes by coverage. The first ranks the epitopes by their absolute coverage, such that sequences predicted to bind similar collections of alleles would be ranked similarly (Supplementary Table 4).
The second list, referred to as the "disjoint" list, is constructed iteratively: the sequence with the greatest coverage is selected first, and the coverage for the remaining epitopes is then updated to nullify contributions from any alleles that have already been selected (Supplementary Table 5). This second list was used to generate Figure 3A.

To identify HLA-II epitopes, we used RECON's HLA-II binding predictor to score all 12-20mer sequences in the SARS-CoV-2 proteome to predict both binding potential and the likely binding

Table 2). Peptide-allele pairs were assigned a percent rank by comparing their binding scores to those of 100,000 reference peptides. Pairs scoring in the top 1% were deemed likely to bind. Additionally, we define the "epitope" of a 12-20mer to be the predicted binding core within the sequence. As such, overlapping 12-20mers with the same predicted binding core for a given allele would constitute a single epitope. Table 1 shows counts of these epitopes.
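The percent-rank thresholding and the collapsing of overlapping peptides into binding-core epitopes described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the percent rank is taken to be the percentage of reference scores strictly greater than the query score (the source does not spell out tie handling), and the function names and the `(peptide, allele, core)` record layout are hypothetical.

```python
from bisect import bisect_right
from typing import Iterable, List, Set, Tuple


def kmers(protein: str, min_len: int = 8, max_len: int = 12) -> Set[str]:
    """All distinct subsequences of a protein with length in [min_len, max_len]."""
    return {
        protein[i:i + k]
        for k in range(min_len, max_len + 1)
        for i in range(len(protein) - k + 1)
    }


def percent_rank(score: float, sorted_reference: List[float]) -> float:
    """Percent of reference scores strictly above `score` (lower = stronger binder).

    `sorted_reference` must be sorted in ascending order.
    """
    n_above = len(sorted_reference) - bisect_right(sorted_reference, score)
    return 100.0 * n_above / len(sorted_reference)


def is_likely_binder(score: float, sorted_reference: List[float], cutoff: float = 1.0) -> bool:
    """Apply the top-1% rule: keep pairs whose percent rank is at or below the cutoff."""
    return percent_rank(score, sorted_reference) <= cutoff


def unique_epitopes(hits: Iterable[Tuple[str, str, str]]) -> Set[Tuple[str, str]]:
    """Collapse (peptide, allele, core) hits into distinct (allele, core) epitopes."""
    return {(allele, core) for _peptide, allele, core in hits}
```

Sorting the reference scores once per allele keeps each lookup logarithmic, which matters when millions of peptide-allele pairs are ranked against 100,000 to 1,000,000 reference peptides.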

Additionally, we generated two lists of 25mers contained in SARS-CoV-2 protein sequences ranked by population coverage. To do this, we associated each 25mer with all subsequences that were likely binders and calculated the population coverage of the corresponding HLA-II alleles. Given a collection of alleles, we calculated the coverage as described in the previous section, the only difference being that the cumulative product is taken across the following four HLA-II loci:

As with HLA-I, two sorted lists of predicted binding sequences were generated: one sorted on absolute coverage (Supplementary Table 6), and one sorted on disjoint coverage (Supplementary Table 7). The latter was used to generate Figure 3B and supports the observation that only four 25mers are required to have predicted binders for >99.9% of the USA, European, and API populations.
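The coverage calculation and the "disjoint" ranking can be sketched in Python as below. The exact coverage formula is not reproduced in this text, so the sketch assumes a common form: one minus the product, over loci, of the squared probability that a chromosome carries none of the listed alleles. The squaring (two chromosomes per individual), the toy frequencies, the locus labels, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Allele = Tuple[str, str]  # (locus, allele name)

# Hypothetical allele frequencies, NOT values from the paper or its reference (32).
TOY_FREQS: Dict[Allele, float] = {
    ("A", "A*02:01"): 0.26, ("A", "A*01:01"): 0.15,
    ("B", "B*07:02"): 0.12, ("B", "B*08:01"): 0.10,
}


def coverage(alleles: Iterable[Allele], freqs: Dict[Allele, float]) -> float:
    """Probability that an individual carries at least one of `alleles`.

    Assumed form: 1 - prod_over_loci (1 - sum of allele frequencies at locus)^2.
    """
    per_locus: Dict[str, float] = defaultdict(float)
    for locus, name in set(alleles):
        per_locus[locus] += freqs[(locus, name)]
    p_none = 1.0
    for total in per_locus.values():
        p_none *= (1.0 - total) ** 2
    return 1.0 - p_none


def disjoint_ranking(
    binders: Dict[str, Set[Allele]], freqs: Dict[Allele, float]
) -> List[Tuple[str, float]]:
    """Greedy list: pick the sequence with the greatest coverage, then ignore
    already-selected alleles when scoring the remaining sequences."""
    covered: Set[Allele] = set()
    remaining = dict(binders)
    ranked: List[Tuple[str, float]] = []
    while remaining:
        # Sort keys so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: coverage(remaining[s] - covered, freqs))
        if coverage(remaining[best] - covered, freqs) <= 0.0:
            break  # nothing left adds a new allele
        covered |= remaining.pop(best)
        ranked.append((best, coverage(covered, freqs)))
    return ranked
```

Each entry in the returned list carries the cumulative coverage of all sequences selected so far, which is the quantity a cumulative coverage plot would show. The HLA-II variant would differ only in which loci appear in the keys.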

Comparison of predicted epitopes to the human proteome
8-12mer sequences (corresponding to predicted HLA-I epitopes), 9mer sequences (corresponding to predicted HLA-II binding cores), and 25mer sequences (corresponding to predicted HLA-II sequences that bound multiple alleles) from SARS-CoV-2 were compared against sub-sequences of the same length from the human proteome, using UCSC Genome

Validating RECON prediction for viral epitopes using ViPR
We first sought to validate the ability of our predictors to identify epitopes from genomes of the Coronaviridae family. Since SARS-CoV-2 only emerged recently, specific data on SARS-CoV-2 peptide MHC-binding and immunogenic epitopes are currently limited. However, other viruses from the Coronaviridae family have been studied thoroughly, specifically MERS-CoV and SARS-CoV. The latter has significant sequence homology to SARS-CoV-2 (35). We therefore sought to leverage previously tested epitopes from across the Coronaviridae family to validate our predictions of viral peptides, with special interest in peptide sequences that incidentally overlapped the novel SARS-CoV-2 virus. To that end, we used the publicly available ViPR database, which lists the results of T cell immunogenicity and MHC peptide-binding assays for both HLA-I and HLA-II alleles for viral pathogen epitopes. We used all assays of Coronaviridae family viruses with human hosts from ViPR as our validation dataset. Assays that did not have an associated four-digit HLA allele or were associated with an allele our models did not support were omitted (see Supplementary Tables 1 and 2 for a list of supported alleles).

For HLA-I, within the validation dataset there were a total of 4,445 unique peptide-HLA allele pairs that were assayed for MHC-binding, using variations of: 1) cellular MHC or purified MHC; 2) a direct or competitive assay; and 3) measurement by fluorescence or radioactivity. Two additional peptide-MHC allele pairs were confirmed via X-ray crystallography. Depending on the study from which the data were collected, peptide-MHC allele pairs were either defined in ViPR in binary terms as "Negative" or "Positive" for binding, or with a more granular scale of positivity: Low, Intermediate, and High. For peptide-MHC allele pairs with multiple measurements, we assigned the highest MHC-binding result detected across the replicates (see Methods).

We then applied our HLA-I binding predictor from RECON to the peptide-MHC allele pairs in the validation dataset and compared the computed HLA-I percent ranks of these pairs with the reported MHC-binding assay results (Supplementary Table 8). A low percent rank value corresponds to a high likelihood of binding (e.g., a peptide with a percent rank of 1% scores amongst the top 1% of the reference peptides). The percent ranks of peptide-MHC allele pairs that had a binary "Positive" result in the MHC-binding assay were significantly lower than those of pairs with a "Negative" result. Further, in the more granular positive results, stronger assay results (low < intermediate < high) were associated with increasingly lower percent ranks (Figure 1A). In addition, the two peptide-MHC allele pairs that were confirmed by X-ray crystallography were predicted as very likely binders, with low percent rank scores of 0.07% and 0.30%. These results demonstrate that our HLA-I binding predictor from RECON can reliably predict the HLA-I binding of peptides from proteins of the Coronaviridae family, to which SARS-CoV-2 belongs.
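The rule of keeping the strongest reported result across replicate measurements can be sketched in Python as below. The label order follows the priority stated in the Methods (Positive-High > Positive-Intermediate > Positive-Low > Positive > Negative); the record layout and function name are hypothetical, the example peptides are made up, and for brevity the sketch does not separate measurements by assay type.

```python
from typing import Dict, Iterable, Tuple

# Weakest to strongest, following the priority order given in the Methods.
PRIORITY = ["Negative", "Positive", "Positive-Low", "Positive-Intermediate", "Positive-High"]
RANK = {label: i for i, label in enumerate(PRIORITY)}


def collapse_replicates(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], str]:
    """Map each (peptide, allele) pair to its strongest reported assay label."""
    best: Dict[Tuple[str, str], str] = {}
    for peptide, allele, label in records:
        key = (peptide, allele)
        if key not in best or RANK[label] > RANK[best[key]]:
            best[key] = label
    return best


# Made-up records for illustration only.
example = [
    ("PEPTIDEAA", "HLA-A*02:01", "Negative"),
    ("PEPTIDEAA", "HLA-A*02:01", "Positive-Intermediate"),
    ("PEPTIDEAA", "HLA-A*02:01", "Positive-Low"),
    ("PEPTIDECC", "HLA-B*07:02", "Negative"),
]
```

With this rule a single positive replicate is enough to label a pair as positive, so the negative class contains only pairs that were never observed to bind.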
Assays of T cell reactivity (e.g., interferon-gamma ELISpots, tetramers), which are stricter measures for T cell immunogenicity to epitopes, were performed in significantly lower numbers compared with MHC-binding assays. For HLA-I, the overlap between peptide-MHC allele pairs for which we had a prediction (supported alleles) and pairs with a reported T cell assay consisted of only 32 pairs, of which 23 had a positive result. We did not detect differences in the percent ranks across the positive and negative groups; however, sample sizes are extremely small (data not shown). In addition, for HLA-I epitopes, the validation dataset only contained T cell assay results for peptide-MHC allele pairs that had a positive result in a binding assay, suggesting a biased pool of epitopes selected for testing.

Table 9). There were 259 unique peptide-MHC allele pairs assayed by MHC-binding assays in the ViPR validation dataset for HLA-II. As before, we compared their percent rank with their reported 'best' (in the case of multiple measurements) MHC-binding assay result. This comparison could not be performed with the "Negative" pairs as an independent group since there was only one negative result in the validation dataset for HLA-II. The low negative counts may be due to under-reporting of negative assay results or biased selection of the peptides to be assayed. Therefore, we merged the "Negative" and "Positive-Low" groups into one group and compared their percent ranks with either the "Positive-Intermediate" or the "Positive-High" groups (Figure 1B). This analysis revealed a trend similar to that observed with HLA-I predictions, indicating that stronger MHC-binding assay results are associated with a lower predicted percent rank for HLA-II binders, as we expect for a robust predictor.
Similar to the HLA-I T cell assays, there were too few recorded HLA-II T cell assays (31) in our validation dataset to determine percent rank differences between

We detected a total of 11,776 unique SARS-CoV-2 peptides that were predicted to bind at least one HLA-I allele with a percent rank score of 1% or lower (Supplementary Table 4). Fourteen of these peptides overlapped with a subsequence of the human proteome (see Methods, Supplementary Table 4).
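The check for overlap with the human proteome amounts to exact matching against same-length subsequences of human proteins. A minimal Python sketch, assuming an in-memory index and using made-up protein strings in place of a real proteome download, could look like this (function names are hypothetical):

```python
from typing import Iterable, Set


def build_kmer_index(proteins: Iterable[str], lengths: Iterable[int]) -> Set[str]:
    """Collect every subsequence of the requested lengths from a set of proteins."""
    wanted = sorted(set(lengths))
    index: Set[str] = set()
    for protein in proteins:
        for k in wanted:
            for i in range(len(protein) - k + 1):
                index.add(protein[i:i + k])
    return index


def overlaps_human(peptide: str, index: Set[str]) -> bool:
    """True if the peptide is identical to a same-length human subsequence."""
    return peptide in index


# Made-up sequences standing in for human proteins (illustration only).
toy_human = ["MKTAYIAKQRQISFVKSHFSRQ", "GSHMLEDPVDAFKLAAALEHHHHHH"]
toy_index = build_kmer_index(toy_human, lengths=range(8, 13))
```

For a full proteome, and especially for 25mers, the set can become large; streaming the viral peptides against each protein with substring search is a memory-friendlier alternative under the same matching rule.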

Unlike HLA-I, which has a closed binding groove that constrains bound peptide lengths to approximately 8 to 12 amino acids, peptides binding HLA-II have a wider length distribution (up to 30 amino acids or even longer) since the HLA-II binding groove is open at both ends. Peptides bind with a 9 amino acid subsequence (termed the binding core) occupying the HLA-II binding groove, with any flanking sequence overhanging the edges of the molecule. We consider a group of peptides that differ in the flanking regions but share a common binding core as a single epitope. Using the HLA-II predictor we identified 7,207 unique binding cores that are predicted to bind at least one HLA-II allele with a percent rank score of 1% or lower. The number of high-quality peptide-MHC allele pairs we identify per SARS-CoV-2 gene is listed in Table 1. The majority of predicted peptide-MHC allele pairs are from orf1a and orf1ab, primarily driven by the length of these ORFs. In addition, orf1a and orf1ab have very similar sequences, with over 18,000 identical binding peptide-HLA-I allele pairs predicted for both ORFs. We therefore opted to exclude redundant predictions and only reported unique pairs (see * in Table 1). Similarly, all HLA-II predicted epitopes from orf1a were covered by those reported for orf1ab.

To test the validity of the SARS-CoV-2 predicted peptide-HLA pairs, we looked for identical peptide sequences in the Coronaviridae portion of the ViPR database (Figure 2D). A total of 368 HLA-I peptide-MHC allele pairs from SARS-CoV-2 both had a percent rank lower than 1% by our predictor and were found in the HLA-I MHC-binding validation dataset. Strikingly, of these HLA-I peptide-MHC allele pairs, 328 (89.1%) had a positive assay result. As a comparison, we also tested for overlap between epitopes predicted to have low likelihood of MHC-binding (percent rank 50% or higher) and the validation dataset.
Thirty-seven peptide-MHC allele pairs overlapped between these sets, of which 36 (97.2%) had a negative assay result, as predicted. Further, we sought to determine whether our highly predicted SARS-CoV-2 peptide-HLA-I allele pairs (percent rank lower than 1%) would be validated by reported T cell assay results. Despite the significantly smaller number of peptide-MHC allele pairs that were tested for T cell reactivity in the validation dataset, 10 assayed pairs were also highly predicted by our HLA-I binding predictor. Nine out of these 10 (90%) predicted pairs had a positive result in the T cell assay. No low-scoring pairs (percent rank of 50% or above) were reported in the validation dataset. These findings demonstrate

CoV-2 that had previously been assayed in ViPR were confirmed as non-binding. We therefore concluded that using RECON's HLA binding predictors to predict T cell epitopes from the ORFs of SARS-CoV-2 provides a significantly expanded, novel set of high-quality vaccine targets for the virus. These sequences can be exploited for vaccines in various formats, including RNA or peptides.
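The overlap analysis above compares two things per peptide-allele pair: the predicted percent rank and the reported assay label. A minimal Python sketch of that bookkeeping is given below, assuming both are available as dictionaries keyed by `(peptide, allele)`; the names, the data layout, and the toy values are hypothetical, and the thresholds mirror the text (percent rank below 1% for likely binders, 50% or above for unlikely binders).

```python
from typing import Callable, Dict, Tuple

Pair = Tuple[str, str]  # (peptide, allele)


def overlap_summary(
    ranks: Dict[Pair, float],
    assays: Dict[Pair, str],
    keep: Callable[[float], bool],
) -> Tuple[int, int]:
    """Return (n_overlap, n_positive) for predicted pairs that pass `keep`
    and also appear in the assay dataset."""
    selected = [pair for pair, rank in ranks.items() if keep(rank) and pair in assays]
    n_positive = sum(1 for pair in selected if assays[pair].startswith("Positive"))
    return len(selected), n_positive


# Toy data for illustration only (not values from the study).
toy_ranks = {("PEPA", "A*02:01"): 0.2, ("PEPB", "A*02:01"): 0.8,
             ("PEPC", "B*07:02"): 75.0, ("PEPD", "B*07:02"): 0.5}
toy_assays = {("PEPA", "A*02:01"): "Positive-High", ("PEPB", "A*02:01"): "Negative",
              ("PEPC", "B*07:02"): "Negative"}

strong = overlap_summary(toy_ranks, toy_assays, keep=lambda r: r < 1.0)
weak = overlap_summary(toy_ranks, toy_assays, keep=lambda r: r >= 50.0)
```

The fraction `n_positive / n_overlap` for the strong set, and its complement for the weak set, correspond to the two percentages reported in the text.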

This application of our prediction algorithms has clearly identified many candidate epitopes that can be included in a vaccine to induce cellular responses against this novel virus. Immune cells. With respect to humoral immunity, antibodies are seen primarily to the S and N proteins.

Although short-lived, antibody responses are essential to control the persistent phase of CoV infection by preventing subsequent viral entry. We thus propose that a combination of B and T cell epitopes could provide long-lasting immunity from SARS-CoV-2 or mitigate the severity of disease when protection is partial.

The strength of our prediction is two-fold. First, we have validated predictors for both HLA-I and HLA-II binders, which potentially could be leveraged to induce both long-term CD4+ and CD8+ T cell immunity against the virus. Specifically, our HLA-II predictor, which has also been trained on a large set of mono-allelic mass spectrometry data, has been shown to significantly outperform previously published tools and is used here to identify high-quality CD4+ epitopes that may contribute to both cellular and humoral immunity (27) (Supplementary Table 6). Second, our expansive database of supported HLA-I and HLA-II alleles provides us with the ability not only to identify many peptide-MHC allele pairs, but also to generate a narrow list of peptides with many potential HLA pairings that could be presented across nearly the entire USA, European and Asian Pacific Islander populations. These advantages significantly improve upon previously published findings (31).

Our algorithms predict peptide-MHC binding, which is necessary but not sufficient to induce a T cell response. Therefore, further experimental work would be needed to refine the list of peptides to strictly immunogenic ones. However, with the breadth of the list we are able to provide, the likelihood of identifying many such epitopes is high. In addition, while the availability of confirmed T cell reactions to SARS-CoV-2 epitopes is limited, nine out of 10 highly ranking peptide-HLA-I allele pairs that were previously assayed had a positive result in a T cell assay.

Asian Pacific Islander populations and could complement B cell epitopes that have been shown to be effective but provide short-lived immunity. This expansive data set allows us to identify peptides predicted to bind many alleles and to propose a small set of peptides that are predicted to cover over 99.9% of USA, European, and Asian populations and induce broad CD8+ and CD4+ immunity.

The 10 HLA-I predicted binders with the broadest cumulative allele coverage. The table provides the peptide sequence, its rank, the SARS-CoV-2 protein it is derived from, the alleles the peptide

The 10 SARS-CoV-2-derived 25mers with the broadest cumulative predicted HLA-II allele coverage. For each 25mer, the table provides the rank, the peptide sequence, the SARS-CoV-2 protein it is derived from, the cumulative alleles that are covered by all 25mers up to this rank, and the associated USA, European (EUR), and Asian Pacific Islander (API) population coverage. Note that none of these 25mers, or their binding subsequences, are found as subsequences within the human proteome.
Filename: SuppTable_7_covid_25mers_best_cumulative_coverage_AVG.csv

Supplementary
The peptide-HLA allele pairs from the ViPR database that belong to the Coronaviridae family and have a human host were scored using RECON's HLA-I binding predictor. Alleles not reported in a four-digit format or not supported by RECON were excluded.

Filename: SuppTable_8_ViPR_classi_percent-rank.csv

The peptide-HLA allele pairs from the ViPR database that belong to the Coronaviridae family and have a human host were scored using RECON's HLA-II binding predictor.