Footprints of antigen processing boost MHC class II natural ligand binding predictions

Major Histocompatibility complex class II (MHC-II) molecules present peptide fragments to T cells for immune recognition. Current predictors for peptide:MHC-II binding are trained on binding affinity data, generated in-vitro and therefore lacking information about antigen processing. For the first time, we here describe prediction models of peptide:MHC-II binding trained directly on naturally eluted peptides, and show that these, in addition to peptide binding to the MHC, incorporate identifiable rules of antigen processing. In fact, we observed detectable signals of protease cleavage at defined positions of the peptides. We also hypothesize a role of the length of the terminal ligand protrusions for trimming the peptide to the epitope presented. The results of integrating binding affinity and eluted ligand data in a combined model demonstrate improved performance for the prediction of MHC-II ligands, and foreshadow a new generation of improved peptide:MHC-II prediction tools of considerable importance for understanding and manipulating immune responses.


Introduction
Major Histocompatibility Complex Class II (MHC-II) molecules play a central role in the immune system of vertebrates. MHC-II present exogenous, digested peptide fragments on the surface of antigen presenting cells, forming peptide-MHC-II complexes (pMHCII). On the cell surface, these pMHCII complexes are scrutinized, and if certain stimulatory conditions are met, a T helper lymphocyte may recognize the pMHC-II and initiate an immune response (Rudolph, Stanfield, & Wilson, 2006).
The precise rules of MHC class II antigen presentation are influenced by many factors including internalization and digestion of extracellular proteins, the peptide binding motif specific for each MHC class II molecule, and the transport and surface half-life of the pMHCIIs. The MHC-II binding groove, unlike MHC class I, is open at both ends. This attribute facilitates peptide protrusion out of the groove, thereby allowing longer peptides (and potentially whole proteins) to be loaded onto MHC-II molecules (Kim et al., 2014;Sette, Adorini, Colon, Buus, & Grey, 1989).
Peptide binding to MHC-II is mainly determined by interactions within the peptide binding groove, which most commonly encompass a peptide with a consecutive stretch of 9 amino acids . Ligand residues protruding from either side of the MHC binding groove are commonly known as Peptide Flanking Regions (PFRs). The PFRs are variable in length and composition, and affect both the peptide MHC-II binding (Lovitch, Pu, & Unanue, 2006), and the subsequent interaction with T cells (Arnold et al., 2002;Carson, Vignali, Woodland, & Vignali, 1997;Godkin et al., 2001). The open characteristic of the MHC-II binding groove does not constrain the peptides to a certain length thereby increasing the diversity of sequences that a given MHC-II molecule can present. Also, MHC-II molecules are highly polymorphic, and their binding motifs have appeared to be more degenerate than MHC-I motifs (Andreatta, Schafer-Nielsen, Lund, Buus, & Nielsen, 2011;Hammer et al., 1993;Sturniolo et al., 1999).
Considering all the aspects mentioned above, MHC-II motif characterization and rational identification of MHC-II ligands and epitopes is a highly challenging and costly endeavor.
Because MHC-II is a crucial player in the exogenous antigen presentation pathway, considerable efforts have been dedicated in the past to develop efficient experimental techniques for MHC-II peptide binding quantification. The traditional approach to quantify peptide-MHC-II binding relies on measuring binding affinity, either as the dissociation constant Kd of the complex (Roche & Cresswell, 1990;Hall et al., 2002), or in terms of IC50 (concentration of the query peptide which displaces 50% of a bound reference peptide) (Buus, Sette, Colon, Miles, & Grey, 1987). To date, data repositories such as the Immune Epitope Database (IEDB) (Vita et al., 2015) have collected more than 150,000 measurements of peptide-MHC-II binding interactions. Such data have been used during the last decades to develop several prediction methods with the ability to predict binding affinities to the different alleles of MHC class II. While the accuracy of these predictors has increased substantially over the last decades due the development of novel machine learning frameworks and a growing amount of peptide binding data being available for training (Andreatta, Trolle, et al., 2017), state-of-the-art methods still fail to accurately predict accurately MHC class II ligands and T cell epitopes (Gowthaman & Agrewala, 2008;Wang et al., 2008).
Recent technological advances in the field of mass spectrometry (MS) have enabled the development of high-throughput assays, which in a single experiment can identify several thousands of peptides eluted of MHC molecules (reviewed in Caron et al., 2015). Large data sets of such naturally presented peptides have been beneficial to define more accurately the rules of peptide-MHC binding (Abelin et al., 2017;Bergseng et al., 2015;Chong et al., 2017;Clement et al., 2016;Jurtz et al., 2017;Ooi et al., 2017). For several reasons, analyses and interpretation of MS eluted ligand data is not a trivial task. Firstly, because any given individual constitutively expresses multiple allelic variants of MHC molecules; thus, the ligands detected by MS are normally a mixture of specificities, each corresponding to a different MHC molecule. Secondly, as mentioned above, MHC-II ligands can vary widely in length, and identification of the binding motifs requires a sequence alignment over a minimal binding core. Finally, data sets of MS ligands often contain contaminants and false spectrum-peptide identifications, which add a component of noise to the data. We have earlier proposed a GibbsCluster method capable of dealing with all these issues, allowing the characterization of binding motifs and the assignment of probable MHC restrictions to individual peptides in such MS ligand data sets (Andreatta, Alvarez, & Nielsen, 2017;Andreatta, Lund, & Nielsen, 2013).
Because naturally eluted ligands incorporate information about properties of antigen presentation beyond what is obtained from in-vitro binding affinity measurements, large MSderived sets of peptides can be used to generate more accurate prediction models of MHC antigen presentation (Abelin et al., 2017;Jurtz et al., 2017).
As shown recently, generic machine learning tools, such as NNAlign (Andreatta et al., 2011;, can be readily applied to individual MS data sets, which in turn can be employed for further downstream analyses of the immunopeptidome (Alvarez, Barra, Nielsen, & Andreatta, 2018). The amount of MHC molecules characterized by MS eluted ligand data is, however, still limited. This has led us to suggest a machine learning framework where peptide binding data of both MS and in-vitro binding assays are merged in the training of the prediction method . This approach has proven highly powerful for MHC class I, but has not, to the best of our knowledge, been applied to MHC class II.
Undoubtedly, antigen processing plays a critical role in generating CD4+ T cell epitopes presented by MHC class II molecules. It is assumed that endo-and exo-peptidase activities, both before and after binding to the MHC-II molecule, play a key role in the generation and trimming of MHC class II ligands (Blum, Wearsch, & Cresswell, 2013;Lippolis et al., 2002).
However, the precise rules of MHC class II antigen processing are poorly understood.
Kropshofer et al. identified patterns of protein cleavage e.g. the presence of proline at the penultimate N and C terminal position of HLA-DR ligands (Kropshofer et al., 1993). In contrast, Bird et al. suggested that endolysosomal proteases have a minor and redundant role in peptide selection leading to the conclusion that the effect of processing on the generation of antigenic peptides is "relatively non-specific" (Bird, Trapani, & Villadangos, 2009). Given this context, it is perhaps not surprising that limited work has been aimed at integrating processing signals into a prediction framework for MHC-II ligands.
In this work, we have analysed large data sets of MS MHC-II eluted ligands obtained from different research laboratories covering two HLA-DR molecules with the purpose of investigating the consistency in the data; quantifying the differences in binding motifs contained with such MS eluted data compared to traditional in-vitro binding data; defining a machine-learning framework capable of integrating information from MS eluted ligand and in-vitro binding data into a prediction model for MHC-II peptide interaction prediction; and finally evaluating if inclusion of potential signals from antigen processing are consistent between different data sets and can be used to boost performances of the developed prediction models.

Data filtering and motif deconvolution
We first set out to analyse the different MS data sets of eluted ligands. Data were obtained from two recent publications; Ooi et al., 2017 (termed P) and Clement et al., 2016 (termed S) covering the HLA-DRB1*01:01 and HLA-DRB1*15:01 MHC class II molecules. Data were obtained from either human (termed h) or HLA-DR transfected mice (termed m) cell lines. Using this syntax, DRB1 Ph corresponds to the HLA-DRB1*01:01 data from the human cell in the study by Ooi et al. (for more details see Materials and Methods). Here, we applied the GibbsCluster method with default parameters for MHC class II to both filter out potential noise and to identify the binding motif(s) contained in each data set. The result of this analysis is shown in figure 1, and confirms the high quality of the different ligand data sets. In all data sets less than 7% of the peptides were identified as noise (assigned to the trash cluster), and in all cases does GibbsCluster find a solution with the number of clusters matching the number of distinct MHC specificities present in a given data set. In this context, the DR15-Ph is of particular interest, since this data set was obtained from a heterozygous cell line expressing two HLA-DR molecules, HLA-DRB1*15:01 and HLA-DRB5*01:01 (shortened here as DR15/51).
Consequently, this data set contains a mixture of peptides eluted from both of these HLA-DR molecules. As apparent from figure 1, the GibbsCluster method was able to handle this mixed data set, and correctly identified two clusters with distinct amino acid preferences at the anchor positions P1, P4, P6 and P9. Further, a comparison of the motifs identified from the different data sets sharing the exact same HLA-DR molecules revealed a very high degree of overlap, again supporting the high accuracy of both the MS eluted ligand data and of the GibbsCluster analysis tool.

Training prediction models on MHC class II ligand data
After filtering and deconvolution with GibbsCluster, MHC peptide binding prediction models were constructed for each of the 5 data sets corresponding to the majority clusters in figure 1.
Since only one data set was available for HLA-DR51, this molecule was not included in the subsequent analyses. Models were trained using the NNAlign framework as described in materials and methods. The eluted ligand data sets (EL) were enriched with random natural peptides labeled as negatives, as described in Material and Methods. Models were likewise trained on relevant and existing data sets of peptide binding affinities (BA) obtained from the IEDB. Prior to training, both the eluted ligand and binding affinity data sets were filtered to include only peptides of length 11 to 19 amino acids, each data set was split into 5 partitions using a 9mer common-motif clustering, and models were trained and evaluated using 5 fold cross-validation as described in Material and Methods. These analyses revealed a consistent and high performance for the models trained on the different eluted ligand data sets (Table 1).
In accordance with what has been observed earlier for MHC class I , the overall cross-validated performance of models trained on binding affinity data is lower than that of models trained on eluted ligand data. Table 1: Cross-validated performance of NNAlign models generated on binding affinity (BA) and ligand elution (EL) data sets.
The binding motifs captured by the different models are shown in figure 2. As evidenced by identical anchor positions (P1, P4, P6 and P9) and virtually identical anchor residues, highly consistent motifs were obtained from the same HLA-DR molecules irrespective of the source of the peptide (i.e. whether they were obtained from human or mouse cells, or from one laboratory or another). This even extended to the motifs obtained when using binding data, although we do observe subtle, but consistent, differences between the binding motifs derived from eluted ligand and peptide binding affinity data, exemplified by the preference for E at P4 and for D at P6 in the eluted ligand motifs for DR1 and DR15, respectively. Such preferences are absent from the motifs derived from the peptide binding affinity data. Figure 2. Logos for models trained using the NNAlign method  with eluted ligand or binding affinity data for the molecules DRB1*01:01 and DRB1*15:01. Logos were constructed from the predicted binding cores in the top 1% scoring predictions of 900.000 random natural peptides.
3. Training a combined prediction model on MHC-II binding affinity and ligand elution data Earlier work on MHC class I has demonstrated that the information contained within eluted ligand and peptide binding affinity data is complementary, and that a prediction model can benefit from being trained in an integral way using both data types . Here we investigate if a similar observation could be made for MHC class II. As proposed by Jurtz et al, we extended an artificial neural network architecture with two output neurons to the NNAlign framework. In short, the model architecture takes as input a mixed set of two data types -in this case, eluted ligands and peptide binding affinity data. All model parameters except the parameters connected to the output prediction value are shared between the two data types. This allows the model to leverage shared information between the two data types, potentially leading to improved predictive performance (for details refer to Jurtz et al., 2017). As the datasets of eluted ligands are much larger than the available binding affinity datasets, the training was performed in a balanced way so that on average the same number of data points of each data type are used for training in each training iteration. Models were trained with the same model parameters as used for the single data type model, in a 5 fold cross-validation manner, and performance was reported in terms of AUC and AUC0.1 (Table 2). Comparing the performance of the single data type to the multiple data type models for the different data sets, a consistent improvement in predictive performance was obtained when the two data types where combined. This improvement is similar to what we have previously observed for MHC class I predictions ). Table 2. Cross-validated performance measured in terms of AUC and AUC0.1 for models trained on combined binding affinity and ligand elution data.
Combined _BA and Combined _EL refer to output neuron for "binding affinity" or "eluted ligand" prediction, respectively.
Constructing the binding motif captured by the different models confirms the findings from the single data types model, with clearly defined and consistent binding motifs in all cases, and with subtle differences in the preferred amino acids at the anchor positions between motifs derived from the binding affinity and eluted ligand output value of the models. We next turned to the issue of accurately predicting the preferred length of peptides bound to the different HLA-DR molecules. The MS eluted ligand data demonstrates a length preference for the two MHC class II molecules centered on a length around 14-16. Earlier prediction models have not been able to capture this length preference, and have in general had the bias of assigning higher prediction values to longer peptides (data not shown). We have earlier demonstrated that including information about the peptide length in a framework integrating MS eluted ligand and peptide binding affinity data, allows the model to capture the length preference of the two data sets . Applying a similar approach to the MHC class II data, we obtain the results shown in figure 4, confirming that also for class II the model is capable of approximating the preferred length preference of each molecule. Figure 4. Length preference learned by combined models trained with both binding affinity and eluted ligand data. Histograms represent the top 0.1 and 1% scoring predictions for _EL and _BA respectively, of 1 million random peptides of lengths 11 to 19 amino acids extracted from natural occurring proteins. Green and red lines represent predictions obtained using the "binding affinity" output neuron and the "eluted ligand" output neuron, respectively, in the five combined models. Black lines represent the observed length distribution of the ligand training set.
Lastly, we performed an evaluation across data sets to confirm the robustness of the results obtained, and to limit any unforeseen signal of performance overfitting. For each data set, we used the two-output model trained above to predict the other ligand data sets of the same allotype. Prior to evaluation, all data with a 9mer overlap between training and evaluation sets were removed. We observed that, in all cases, models trained on a specific data set retain a high predictive performance for the prediction of ligands of the same allotype derived from a different experiment (Table 3). These results confirm the high reproducibility of the motifs across different cell lines, as well as the robustness of the prediction models derived from individual data sets.

Signals of Ligand Processing
Having developed improved models for prediction of MHC class II ligand binding, we next analysed whether the models could be used to identify signals of antigen processing in the MS eluted ligand data sets. We hypothesized that information concerning antigen processing should be present in the regions around the N and C terminus of the ligand. These regions comprise residues that flank the MHC binding core called Peptide Flanking Regions (PFRs), and residues from the ligand source protein sequence located outside the ligand (see figure 5 for a schematic overview).
We speculate that the signals of antigen processing will depend, to some degree, on the length of the PFRs on each side of the binding core. MHC-II ligands are cut and trimmed by exopeptidases, which operate according to specific motifs in prioritizing cleavage sites.
However, in the case of short PFRs, the MHC hinders access to the protease and prevents trimming of the residues in close proximity to the MHC (Larsen, Pedersen, Buus, & Stryhn, 1996;Mouritsen, Meldal, Werdelin, Hansen, & Buus, 1992) For this reason, we expect to observe cleavage motifs only in peptides with sufficiently large PFRs, where the end-of-thetrimming signal is given by the peptide sequence rather than by MHC hindrance. To validate this hypothesis, we identified the PFRs of the ligands in the DR15 Pm EL dataset, as well as three "context" residues found immediately upstream and downstream of the ligand in its source protein. To avoid over-estimation of the performance, the binding core was identified from the cross-validated eluted ligand predictions of the two output model. The ligands were split into groups depending on the length of the C and N terminal PFRs, and sequence logos were generated for each ligand subset using Seq2Logo (Fig. 5). Flanking Regions (PFRs) length vary from one to six amino acids (left to right). All logos of the same terminal region, upstream and downstream were estimated from the same number (down-sampled is necessary) of peptides, 111 and 44 sequences respectively. Amino acid background frequencies were calculated using the antigenic source protein of all the ligands present in the dataset. Motifs were generated as using Seq2logo as described in Materials and Methods.
The results displayed in figure 5 clearly confirm the important role of the MHC in shaping the processing signal. For both the N and C terminal data sets, we observe a clear enrichment of proline (P) at the second position from the ligand terminals only for data sets where the PFR is longer than 2 amino acids. This observation is confirmed from the reanalysis of a structural data set of peptide:HLA-DR complexes from the Protein Data Bank (PDB) previously assembled for benchmarking the accuracy for MHC-II binding core identification (Andreatta et al., 2015). On this PDB dataset, 29% of the entries with a N-terminal PFR longer than 2 amino acids contain a proline at the second position from the N terminal, and 38% of the entries with a C terminal PFR longer than 2 amino acids contain a proline at the second position from the C terminal (data not shown). Given this, we next combined all ligands with PFR length larger than 2, together with the three source protein context residues at either C or N terminal side of the ligand. Then, we used these to construct a motif of the processing signal (see Figure 6). The processing motif confirms the strong preference for prolines at the second but last position in the ligand at both N and C terminal sides, as well as a clear signal of depletion of other hydrophobic amino acid types towards the terminals of the ligand. Note that the cysteine depletion in the PFR is likely to be a technological artifact, as cysteines have previously been shown to be underrepresented in MS-derived peptide data sets (Abelin et al., 2017;Bassani-Sternberg et al., 2017). From this figure it is also clear that processing signals contained outside the ligand are very weak.
Next, we investigated to what degree the signal observed in the HLA-DRB1*15:01 Pm data set could be consistently identified in the other four data sets. To do this, we repeated the above analysis for each data set and computed the corresponding processing Position Specific Scoring Matrix (PSSM). A similarity matrix was constructed by calculating the similarity between all pairs of PSSMs in terms of the Pearson's correlation coefficient between the two vectors of the 6*20 elements (6 positions and 20 amino acid propensity scores at each position) (see Figure 7, the PSSMs from each data are included in supplementary figure S1).  the correlation between the two human and mouse data sets is as high as the correlation between two data sets within the same species. These results strongly suggest that the observed signal is related to antigen processing.

Incorporating ligand processing into a combined predictor
Having identified a consistent signal associated with antigen processing, we next investigated whether this signal could be integrated in a prediction model to boost predictive performance.
To conduct this analysis, we used the data sets and data partitions described earlier. The processing signal was integrated into the machine-learning framework by complementing the encoding of each ligand with the 3 N terminal context, 3 N terminal peptide, 3 C terminal context, and 3 C terminal peptide residues (see figure 5). For peptide binding affinity data, the context information was presented with three wildcard amino acids "XXX". The result in terms of AUC0.1 of the 5 fold cross-validation evaluation is shown in table 4, and the results of the crossdataset evaluation in table 5. In both cases, we observe a substantial and consistent improved performance of the eluted ligand prediction of the model when including the context information.
As expected, this gain is not observed for the binding affinity predictions (data not shown).
Because our analysis supports that all the eluted peptides of the same allele, irrespective of their source, have similar peptide signals, we merged all eluted ligands from the same allotype and trained a merged model. Evaluating the cross-validation performance of the global model on the data from the individual sets showed consistently higher performance for all datasets (column "With context merged datasets" in Table 4). Further, as control for performance overestimation, we trained a control model with shuffled context residues. As expected, we for this model due to the loss of the antigen processing signal observed a performance comparable to the merged model trained without context (Table 4, "With shuffled context" column). Table 4. Cross-validation performance of prediction models trained with vs without context information.
Models trained on both binding affinity and ligand elution data, with and without context information, and using the eluted ligand output neuron. "With context" columns indicate models trained with 3 amino acids from the source protein and 3 amino acids from PFR sequences for the N-and C-terminus of the ligand. Merged datasets refer to a model using all datasets combined for the same allele. Shuffled context refers to a model trained with context but where the order of the amino acids was randomized. Performance values are expressed in terms of AUC 0.1. Table 5. Cross-dataset performance of prediction models trained with vs without context information Performance values are expressed in terms of AUC 0.1.
As a final illustration of the signal captured by the model trained including the context information, we constructed sequence motifs of the top 1% highest scoring peptides from a list of 1 million random natural peptides of length 10-25 and their context (supplementary figure 2).
As expected, the motif contained within the N and C terminal peptide flanks and context resembles strongly the motif described in figure 6.

Discussion
Peptide binding to MHC II is arguably the most selective step in antigen presentation to CD4+ T cells. The ability to measure (and predict) specific CD4+ responses is crucial for the understanding of pathological events, such as infection by pathogens or cancerous transformations. Recent studies have also highlighted a potential role for CD4+ T cells for the development of cancer immunotherapies (Kreiter et al., 2015;Tran et al., 2014;Zanetti, 2015).
Characterizing peptide:MHC-II binding events has been a focal point of research over the last decades. Large efforts have been dedicated in conducting high-throughput, in-vitro measurements of peptide MHC II interactions (Justesen, Harndahl, Lamberth, Nielsen, & Buus, 2009;J. Sidney et al., 2001;John Sidney et al., 2013), and these data have been used to develop methods capable of accurately predicting the interaction to MHC II from the peptide sequence alone (Andreatta et al., 2015;Nielsen, Lundegaard, & Lund, 2007;Singh & Raghava, 2001). While these approaches have proven highly successful as guides in the search for CD4 epitopes (Gfeller, Bassani-Sternberg, Schmidt, & Luescher, 2016;Nielsen, Lund, Buus, & Lundegaard, 2010), a general conclusion from these studies is that MHC II in-vitro binding affinity (whether measured or predicted) is a relatively poor correlate of immunogenicity (Backert & Kohlbacher, 2015). In other words, peptide binding to MHC II is a necessary but not sufficient criterion for peptide immunogenicity. The same situation holds for MHC class I presented epitopes. Here, however, peptide binding to MHC I is a very strong correlate to peptide immunogenicity, and can be used to discard the vast majority (99%) of the irrelevant peptide space while maintaining an extremely high (>95%) sensitivity for epitope identification . For MHC II, recent studies suggest the corresponding numbers to fall in the range 80% (specificity), 50% sensitivity (Jensen et al., 2018). For these reasons, we suggest that features other than MHC II in-vitro binding affinity may be critical for MHC II antigen presentation. Based on five MS MHC II eluted ligand data sets, we have here attempted to address and quantify this statement.
Firstly, we have demonstrated that MS MHC II eluted ligand data sets generated by state-of-theart technologies and laboratories are of very high quality, comprising low noise levels and allowing very precise determination of MHC II binding motifs. Overall, the obtained binding motifs show overlap with the motifs identified from in-vitro binding affinity data, with subtle differences at well-defined anchor positions.
Secondly, we demonstrated that high accuracy prediction models for peptide MHC II interaction can be constructed from the MS derived MHC II eluted ligand data; that the accuracy of these models can be improved by training models integrating information from both binding affinity and eluted ligand data sets; and that these improved models can be used to identify eluted ligands in independent data sets at an unprecedented level of accuracy.
Thirdly, we analyzed signals potentially associated with antigen processing. We postulate that if a processing signal exists, it should be influenced by the relative location of the peptide binding core compared to the N and C terminal of the given ligand. This is because the MHC II molecule may hinder the access of the protease, thus preventing trimming of the residues in close proximity to the MHC (Stryhn et al, PMID 8691132). Investigating the data confirmed this hypothesis, and a strong processing signal (with a preference for Prolines at the second amino acid position from the N and C terminal of the ligand), was observed only for ligands where the length of the region flanking the binding core was three amino acids or more. This observation was found consistently in all 5 data sets independent of MHC II restriction and host-species (human or mouse).
Lastly, we integrated this information associated with antigen processing into a machinelearning framework, and demonstrated a consistently improved predictive performance not only in terms of cross-validation but also when applied to independent evaluation data sets.
In conclusion, we have demonstrated how integrating MHC class II in-vitro binding and MS eluted ligand data can boost the predictive performance for both binding affinity and eluted ligand likelihood predictions. To the best of our knowledge, we have also demonstrated for the first time how MHC II eluted ligand data can be used to extract signals of antigen processing, and how these signals can be integrated into a model with improved predictive performance.
The work here has been limited to two HLA-DR molecules, but the framework is readily extendable to additional molecules and eventually, once sufficient data becomes available, a pan-specific predictor as has been shown earlier for MHC class I ) may be achievable.
Here, the data sets with subscript h correspond to data obtained from human cell lines and data sets with the subscript m to data obtained from human MHC-II molecules transfected into MHC-II deficient mice cell lines. Note, that DR15-Ph data set was obtained from a heterozygous EBV transformed B lymphoblastoid cell line, IHW09013 (also known as SCHU), which express two HLA-DR molecules, HLA-DRB1*15:01 and HLA-DRB5*01:01 (shortened here with the name DR15/51).
The ligand datasets were analysed using the GibbsCluster-2.0 method as described earlier (Alvarez et al., 2018) using default settings to remove potential noise and biases imposed by some data sharing multiple binding specificities. The details of the two data sets are described in table 6. Table 6. Summary of peptide binding and eluted ligand datasets references and its corresponding data points used for training the different models.
Number of points after filtering with Gibbs clustering (GC), amino acid length (L11-19) and number of Random peptides (Random) added to each dataset used for training.

NNAlign Modeling
Models predicting peptide-MHC interactions were trained as described earlier using NNAlign (Alvarez et al., 2018;. Only ligands of length 11-19 amino acids were included in the training data. Random peptides of variable lengths derived from the nonredundant UniProt database were used as negatives. The same amount of random negatives was used for each length (11 to 19), and consisted of five times the amount of peptides for the most represented length in the positive ligand dataset. Positive instances were labeled with a target value of 1, and negatives with a target value of 0. Prior to training, the datasets were clustered using the common motif Hobohm algorithm 1 described earlier (Hobohm, Scharf, Schneider, & Sander, 1992) with a motif length of 9 amino acids to generate five partitions for cross validation. Two types of model were trained; one with single data type (eluted ligand or binding affinity) input, and one with a mixed input of two data types. Single models per each dataset and allele were trained as described earlier with either binding affinity or eluted ligand data as input (Alvarez et al., 2018). All models were built as an ensemble of 250 individual networks generated with 10 different seeds, 2, 10, 20, 40, and 60 hidden neurons and 5 partitions.
Models were trained for 400 iterations, without use of early stopping. Additional settings in the architecture of the network were used as previously described for MHC class II defaults in (Alvarez et al., 2018). Some modifications were introduced to NNAlign to better account for specific challenges associated with MHC class II ligand data. For the network to be able to learn peptide length preferences, a "binned" encoding of the peptide length was introduced, consisting of a one-hot input vector of size nine (one neuron for each of the lengths 11 to 19). In order to guide binding core identification, a burn-in period was introduced with a limited search space for the P1 binding core position. During the burn-in period, consisting of a single learning iteration, only hydrophobic residues were allowed at the P1 anchor position. Starting from the second iteration, all amino acids were allowed at the P1 position (Supplementary figure 3).

Sequence logos
Sequence logos for binding motifs were constructed using Seg2Logo tool with default parameters (Thomsen & Nielsen, 2012). Logos for context information were generated as weighted Kulback-Leibler logos using Seq2Logo excluding sequence weighting. Amino acids were grouped by negatively charged (red), positively charged (blue), polar (green) or hydrophobic (black).