Introduction

T-cells play a central role in the host adaptive immune defense by the recognition of foreign peptide antigens bound to cell-membrane expressed major histocompatibility complexes (MHC) via their T-cell receptors (TCR) (reviewed in Garcia et al. 1999; Margulies 1997; Wang and Reinherz 2001). Antigen presenting MHC molecules are extremely polymorphic (Reche and Reinherz 2003), and fall into two major classes, termed class I and class II (reviewed in Maenaka and Jones 1999; Stern and Wiley 1994). Antigens presented by MHC class I (MHCI) and MHC class II (MHCII) are recognized by two distinct sets of T-cells, CD8 and CD4 T-cells, respectively (reviewed in Wang and Reinherz 2001). Engaging both T-cell subsets is desirable for mounting a strong defensive immune response against cancer cells and pathogens.

T-cell immune responses are driven by antigenic epitopes whose identification is important for understanding disease pathogenesis and etiology as well as for vaccine design. Bona fide experimental identification of T-cell epitopes is costly and time consuming, requiring the synthesis of overlapping peptides spanning the entire length of a protein and necessitating complicated in vitro cellular assays for each peptide synthesized (Draenert et al. 2003). As a result, alternative computational approaches have been developed for the prediction of antigenic peptides. Since T-cells recognize antigenic peptides only in the context of MHC molecules (Zinkernagel and Doherty 1974), computer-aided methods for the anticipation of T-cell epitopes rely mostly on the prediction of peptide-MHC binding (Hammer 1995). Peptides bound to the same MHC are related by sequence similarity (Falk et al. 1991; Rammensee et al. 1995) so that peptide binding patterns (Sette et al. 1989) as well as motif matrices (De Groot et al. 1997; Rammensee et al. 1999) have also been used for the prediction of peptide-MHC binding. In this regard, we have recently introduced a more sophisticated matrix method involving the use of position-specific scoring matrices (PSSMs) or profiles (Gribskov et al. 1987) for the prediction of peptide-MHCI binding (Reche et al. 2002). This resource is available online at our RANKPEP web server (http://www.mifoundation.org/Tools/rankpep.html). Profiles are derived from a set of aligned peptides known to bind to a given MHC. Correct alignment of peptides by structural or sequence similarity is essential for the proper prediction of peptide-MHC binding (Gribskov et al. 1987). Profiles of aligned peptides known to bind to MHCII molecules could also be used for the prediction of peptide-MHCII binding, and consequent anticipation of CD4 T-cell epitopes. However, alignment of MHCII ligands is quite challenging since peptides binding to a single MHCII molecule are extremely variable in length and share very limited sequence similarity (Barber and Parham 1993; Madden 1995; Stern and Wiley 1994). In response to this complexity, in this paper we describe the use of the motif discovery program MEME (Bailey and Elkan 1995) to generate alignment and profiles that are consistent with the binding mode of peptides to MHCII molecules. Fifty MHCII-specific profiles have been created, allowing the extension of the RANKPEP resource to the anticipation of CD4 T-cell epitopes. On average, the sensitivity of these MHCII-specific profiles is such that ~60% of known MHCII-restricted T-cell epitopes are found among the top 2% scoring peptides from their protein sources.

Anticipation of T-cell epitopes is heavily predicated on the prediction of peptide-MHC binding, yet prior to MHC binding, correct peptide processing must occur to liberate a peptide from its protein source. Processing of MHCII-restricted epitopes occurs in the endosomal compartment, being mediated by several endopeptidases in combination with amino-peptidases and carboxy-peptidases (Pieters 2000; Watts 2001). This complexity makes the identification of any pattern related with processing of class II restricted peptides difficult. In contrast, there is experimental evidence that the C-terminus of MHCI-restricted epitopes results from the selective proteolysis of cytosolic proteins mediated by the proteasome (Craiu et al. 1997). The proteasome thus plays a vital role in determining cytotoxic T-cell (CTL) epitopes, and consequently we have modeled the proteasomal cleavage site from MHCI-restricted peptides using statistical language models. Language models for proteasomal cleavage prediction (LMPCP) could properly recognize up to ~90% of the C-termini from a set of 554 known MHCI-restricted epitopes, independent of the training set. Prediction of those MHCI-peptide binders containing a C-terminus that is likely to be the result of proteasomal cleavage is now implemented by the RANKPEP web server, leading to refined CTL epitope predictions. Finally, because amino-acid sequence mutation offers a means for immune evasion exploited by some pathogenic organisms such us HIV, we implemented the RANKPEP web server with a feature to mask the sequence variability from a multiple sequence alignment (MSA) using the Shannon entropy measure (Shannon 1948) as a variability metric (Reche and Reinherz 2003; Stewart et al. 1997).

Materials and methods

Peptide and protein sequences

Sequences of peptides that bind to MHC molecules were collected from the MHCPEP database (Brusic et al. 1998b). All peptides in the MHCPEP database are binders, but their binding strength is reported as unknown, low, moderate, or high. In this work we have excluded MHC ligands that were labeled as low binders. Sequences of naturally restricted T-cell epitopes were collected from the SYFPEITHI database (Rammensee et al. 1999). Full-length sequences of proteins containing T-cell epitopes were isolated from the non-redundant database of GenBank (Benson et al. 2003) following a blast search (Altschul et al. 1997) with the relevant T-cell epitopes as the query.

Alignments and PSSMs of MHCI and MHCII-specific ligands

Block alignments of peptides binding to specific MHCI molecules were obtained as indicated elsewhere (Reche et al. 2002). Briefly, peptides were collected from MHCPEP according to their MHCI binding specificity, and subsequently grouped by their sequence length to create block alignments. Likewise, peptides binding to MHCII molecules were collected from the MHCPEP database, and divided into subsets according to their MHCII binding specificity. Peptides shorter than nine residues were not considered. Subsequently, motif block alignments of peptides binding to specific MHCII molecules were obtained using the motif discovery program MEME (Bailey and Elkan 1995), using the command meme file.fasta -protein -mod oops -nmotifs 1-minsites 4-maxsites 300 -minw 9 -maxw 9 -evt 10,000, where file.fasta corresponds to each of the MHCII-specific subsets of protein ligands in FASTA format; -mod oops, indicates that each sequence has a binding site; -minsites 4 -maxsites 500, indicates that the motif should contain between four and 500 sequences; -min 9 -maxw 9, indicates that the size of the motif is exactly nine; and finally -evt 10000 is the expected threshold value for a sequence to be included in the motif. Collections of peptides binding to the same MHCII molecule usually contain overlapping peptides, and therefore block motifs yielded by MEME frequently contain repeated sequences corresponding to the overlapping regions of different peptides. Consequently, alignments were parsed to eliminate sequence redundancy, yielding block alignments of unique sequences.

Profiles were obtained from peptide alignments containing a minimum of five sequences using PROFILEWEIGHT (Thompson et al. 1994b) and the BLK2PSSM utility included in the BLIMPS package (Henikoff and Henikoff 1996; Henikoff et al. 1999). PROFILEWEIGHT uses a branch-proportional weighting method, whereas a position-based weighting method (Henikoff and Henikoff 1994) was applied to the PSSMs obtained with BLK2PSSM.

Scoring peptide-MHC binding using PSSMs and cross-validation tests

Peptide scores indicate the similarity (and hence binding potential) of the peptides to the set of aligned peptides known to bind to a given MHC molecule. Scores are obtained by aligning the PSSM with the protein segments, and adding up the appropriate profile coefficients matching the residue type and position in the protein segment. Scoring starts at the beginning of each sequence, and the PSSM is moved over the entire sequence one residue at a time.

Cross-validation tests to address the predictive performance of profile matrices were carried out using a ROC analysis (Swets 1988). The ROC curves were generated plotting the function SE versus 1-SP for various thresholds of top scoring peptides (0.5, 1, 2, 3, 5, 10, 20%), where SE, and SP represent the sensitivity and specificity of the predictions, respectively. The area under the ROC curve (AUC) provides a measure of overall prediction accuracy. SE and SP are calculated from Eqs. 1, 2

$${\text{SE}} = {\text{TP}}/{\left( {{\text{TP}} + {\text{FN}}} \right)}$$
(1)
$${\text{SP}} = {\text{TN}}/{\left( {{\text{TN}} + {\text{FP}}} \right)}$$
(2)

where TP are true positives (binders predicted as binders); FN are false negative (binders predicted as non-binders); TN are true negatives (non-binders predicted as non-binders) and FP are false positives (non-binders predicted as binders). For each of the MHC molecules in Table 1, known binders were divided into two distinct sets, a binding training set and a binding test set, with comparable numbers of peptides. PSSMs were derived from the training set using both PROFILEWEIGHT and BLK2PSSM as indicated elsewhere and then used to test whether the peptides in the binding test set were TP or FP at the mention thresholds. At any given threshold a test binding peptide was considered to be a TP if it was found among the predicted peptides at that threshold from a random protein of 1,000 (amino-acid composition after frequencies in the swissprot database) incorporating the tested peptide. The peptide was a FN if it was not among the predicted peptides. Calculation of SP requires having a set of experimentally determined non-binders. Unfortunately, because there are very few experimentally verified examples of peptides that do not bind to a particular MHC, we have followed an approach similar to that of Donnes and Elofsson (2002) to obtain a set of non-binder peptides. In any given protein, most of the peptides do not bind to the MHC molecule (90–98%), and consequently a randomly generated peptide could be considered as a non-binder. Thus, we have calculated the SP of the predictions from a set of randomly generated peptides (not identical to the binders). The number of non-binder peptides was double that of the peptides in the binding set. At each threshold a non-binder was considered a TN if it was not among the predicted peptides from a random protein of 1,000 amino acids incorporating the tested non-binder peptide. The non-binder peptide was considered an FP, if it was found among the predicted peptides.

Table 1 MHC molecules targeted for peptide binding predictions

Epitope prediction tests using PSSMs

Peptide-MHC binding prediction tests were carried out by determining the relative ranking of known MHC-restricted epitopes from their protein sources using the relevant profiles. Several thresholds of top scoring peptides were checked (0.5, 1, 2, 3, 5, 10, 20%). Peptides were considered to be predicted if they were among the top scoring peptides at the set threshold. MHC molecules targeted for peptide predictions are shown (Table 1). These MHC molecules were selected on the basis of alignments of known ligands that include at least ten sequences of MHCI-restricted or MHCII-restricted peptides, CD8 and CD4 T-cell epitopes, respectively. Moreover, the binding specificity of the ligands was known at the allelic level. MHCI-restricted peptides considered in these tests were all nonamers (9mers), and annotations about known CD8 and CD4 T-cell epitopes were taken from the SYFPEITHI database (Rammensee et al. 1999). Binding predictions of these epitopes to their MHC molecules were tested using PSSM that were derived from alignments with and without the epitope to be predicted. PSSMs for prediction tests were obtained using both PROFILEWEIGHT and BLK2PSSM as indicated elsewhere.

Prediction of proteasomal cleavage using statistical language models

The proteasomal cleavage site was modeled from a database consisting of 332 naturally MHCI-restricted epitope fragments and their C-terminal flanking regions using the SRI language modeling toolkit (SRILM) (Stolcke 2002). Selected epitopes were all restricted by human MHCI molecules (HLA I). To prevent biases towards any given HLA I allele, selected peptides included all peptides restricted by HLA-C (38) and HLA-G alleles (12), and a similar number of epitopes restricted by HLA-A and HLA-B binding peptides, 135 and 147, respectively. Moreover, no more than 5% of the selected epitopes were restricted by the same HLA-A or HLA-B allele. LMPCP were created using the SRILM NGRAM-COUNT utility over training sets derived from the above database. From this database different training sets of fragment size 10, 8, 6, and 4 were generated (length was fixed in all fragments of the same training set). Cutpoints in fragments were indicated by a vertical line (“|”) after the C-terminal end of the epitope (P1 cleavage site), with fragments having an even number of residues on either side of the cleavage site. Representative fragments of training sets of ten and four residues will be EPRKL|VTQDL and KL|VT, respectively. For each training set of a given fragment size (N), N−2 different LMPCPs were generated by changing the window size or order (i) of the model from i=2 to i=N−1. LMPCPs then varied with the length of the fragments in the training file and with the order i chosen to generate them, and for clarity will note them as LMPCP N i. Each LMPCP N i was tested using the SRLIM HIDDEN-NGRAM utility over test files containing peptide fragments of the same size (N) as the training file and using the same window size (i) as that of the LMPCP. HIDDEN-NGRAM is a word boundary tagger based on n-gram models (Stolcke 2002) which at a selected probability threshold indicates the cutpoints in the fragments of a testing file by inserting the cleavage marker “|” into a position determined by the LMPCP. Thirteen probability thresholds were tested for each LMPCP (0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70). Test files were derived from 554 human MHCI-restricted epitopes different from those in the training test but of the same size and nature (even number of residues around the cutpoint) as those used in the training sets and without the cleavage site indicated. Evaluation of the results was carried out by determining the percentage of correctly predicted cleavage sites (PCS) in the test files. In addition, a percentage of expected cleavage sites (ECS) was calculated for each model according to the equation

$${\text{ECS}} = 100 \times C/{\left( {N - 1} \right)}$$
(3)

where C is the number of cutpoints per fragment yielded by a given model LMPCP N i when tested in a file of fragments size N. Conceptually, ECS would indicate the number of correctly PCS, if cleavage resulting from a given model was random. Thus, the above 23 different LMPCP N i [N=10, 6, 4; i=1–(N−2)] were tested for each probability threshold, and ranked by the difference between the PCS and ECS. In addition, each LMPCP was tested on files containing the full-length protein source of the 554 human MHCI-restricted peptide fragments used in the testing files, and the mean length of the fragments also was obtained.

Consensus sequence and sequence variability masking

Sequence variability is calculated from multiple amino acid sequence alignments as indicated by Reche and Reinherz (Reche and Reinherz 2003), using a variability metric (V) formally identical to the Shannon entropy equation (Shannon 1948). Briefly, V per site is given by

$$V = - {\sum\limits_{i = 1}^M {P_{i} \log _{2} \,P_{i} } }$$
(4)

where P i is the fraction of residues of amino acid type i, and M=20, the number of amino acid types. V ranges from 0 (total conservation, only one amino-acid type is present at that position) to 4.322 (all 20 amino acids are equally represented in that position). Note that in order to achieve the maximum value V=4.3, at least 20 sequences are required. Gap symbols (–) are considered for deriving the consensus sequence but are not computed for the variability calculations. Given a sequence variability threshold V t , a consensus sequence is generated from the sequence alignment as the most common amino acid for those positions with a VV t , whereas variable positions (V>V t ) are masked and represented in the consensus sequence with a dot. Segments with a position masked are not considered in the RANKPEP predictions of peptide-MHC binding.

Sequence logos

Peptide fragments of MHCI-restricted epitopes with their flanking regions were aligned centered around the C-terminal end (cleavage site) of the MHCI-restricted peptide, and sequence information was calculated for each position and displayed using a sequence logo (Schneider and Stephens 1990). In a sequence logo each of the residues present in a position of the sequence alignment is represented with a height that is proportional to its frequency, and the height of the entire stack is proportional to the total information content (R) in that position. Sequence information R per site was given by the following equation

$$R = 4.3 - V{\left( {{\text{bits}}\,{\text{per}}\,{\text{position}}} \right)}$$
(5)

where 4.3 is the upper variability limit for 20 symbols, and V is the variability in that position (Eq. 4). Sequence information is given in bits.

Results

Structure-based alignments of MHCI and MHCII ligands

MHC molecules, also known as human leucocyte antigens (HLAs) in humans, are highly polymorphic molecules imposing distinct chemical and physical constraints on their selective peptide binders which are related to each other by sequence similarity. PSSMs or profiles (Gribskov et al. 1987) created from a set of aligned peptide-MHC binders provide a means for the prediction of peptide binding to a given MHC molecule (Reche et al. 2002). However, for a PSSM to be a good predictor of peptide binding to MHC, peptides must be first aligned by structural and/or sequence similarity. MHCI and MHCII molecules bind peptides in similar yet distinct modes (Barber and Parham 1993; Madden 1995; Stern and Wiley 1994), and consequently PSSMs were derived differently for MHCI and MHCII ligands. MHCI ligands are of short length (8–11), as they are constrained into the MHCI peptide binding groove, with their N-terminal and C-terminal ends connected by a network of hydrogen bonds to conserved residues of the MHCI molecule (Madden 1995; Matsumura et al. 1992; Zhang et al. 1998) (Fig. 1a). While peptides bound to the same MHCI can differ by one or two amino acids in length from each other, proper structural alignment of diverse peptides is best guaranteed if the peptides are of the same length (Reche et al. 2002). Accordingly, we have separated the peptides bound to a given MHCI molecule into subsets containing only peptides of the same length, and created separate profiles from ungapped block alignments. In contrast, the peptide binding groove of MHCII molecules is open, allowing both the N- and C-terminus of a peptide to extend beyond the binding groove (Fig. 1b). Thus, peptides bound to MHCII molecules display a great variability in length (9–22 residues) even though only a peptide core of nine residues fits into the MHCII binding groove per se. The binding mode of this peptide core is conserved among the different peptide-MHCII complexes, providing the energy that anchors the peptide to the MHCII molecule (Wang and Reinherz 2001). An important contribution to the binding energy derives from a set of conserved hydrogen bonds between the backbone of the peptide core and conserved residues in the MHCII molecules (Barber and Parham 1993; Madden 1995; Stern and Wiley 1994). As a result, the peptide binding repertoire of MHCII molecules is broader than that of MHCI molecules, and consequently peptide-MHCII ligands share less sequence similarity than peptide-MHCI ligands. Poor amino acid sequence similarity between MHCII ligands together with their great variability in sequence length makes their alignment difficult and hampers the use of global alignment algorithms such as CLUSTALW (Thompson et al. 1994a). Since alignment of the MHCII ligands requires the identification of their binding core, we have turned to the motif discovery program MEME (Bailey and Elkan 1995). MEME uses an expectation maximization algorithm in combination with a priori information regarding the nature of the motif. The a priori information we used was consistent with described structural information about the binding of peptide to MHCII molecules, namely identification of a single preferred register of peptide binding to a given MHCII molecules whose length is nine residues (see “Materials and methods” for detail).

Fig. 1a, b
figure 1

Binding of peptide ligands to MHCI and MHCII molecules. The figure shows the top of the molecular surface of the antigen-presenting platform of representative human MHCI (a) and MHCII (b) molecules as viewed by the TCR. The MHCI molecule corresponds to HLA-A*0201 in complex with a peptide LLFGYPVYV from HTLV-1 TAX protein [PDB:1HHK (Madden et al. 1993)]. The MHCII molecule corresponds to HLA-DR1 in complex with peptide PKYVKQNTLKLAT from influenza hemagluttinin protein [PDB:1FYT (Hennecke et al. 2000)] The peptide binding platform of the MHCI molecule is composed of two anti-parallel α-helices sitting over a base of eight anti-parallel β-strands shown as worm representation under the molecular surface. Likewise, the peptide-binding platform of the MHCII molecule presents the same secondary features, but resulting from the association of two different polypeptide chains (an α1 chain in blue and a β1 chain in yellow). Peptides bound to these molecules are represented by sticks to highlight the contours of the binding groove. Note how the peptide binding groove of the MHCI molecule is closed, and peptides bind in a manner such that both the N-terminal and C-terminal ends of the peptide (indicated by arrows) are nested into the MHCI binding groove, restricting their lengths to 8–11 residues. In contrast, the peptide binding groove of the MHCII molecule is open, thereby imposing no limitation to the size of ligands, whose N-terminal and C-terminal ends can extend beyond the binding grove. The side chains of N-terminal and C-terminal end of the 9mer peptide core that fit into the MHCII binding groove are indicated. The molecular surface is colored by electrostatic potential (blue positively charged and red negatively charged). The figure was prepared using GRASP (Nicholls et al. 1991)

Performance of peptide-MHC binding predictions using profiles

A single measure of the accuracy of predictive models is provided by the AUC value, which results from plotting SE versus 1-SP at several thresholds (See Materials and methods). We have carried out such analysis for each of the MHC molecules shown in the Table 1, and the results are provided in Table 2. For cross-validation, ROC analysis was carried out using binding test sets containing different peptides than those used for profile generation (training sets). Calculated AUC values varied with the actual peptides in the training and binding sets, and consequently ROC analyses were carried out using ten different training and binding sets (generated by randomly dividing the peptide binders into two different sets), and the mean AUC value (AUCm) and standard deviation is shown the Table 2. Also in Table 2 is given the best AUC value (AUCb) obtained from the above ROC analysis, along with the SE and SP values at a 3% threshold. When using PSSMs we are computing the similarity of the peptide to a set of aligned peptides known to bind to the MHC molecule, and thus variation of the peptide-MHC binding predictions (given by the AUC standard deviation) is linked to the overall amino acid sequence similarity between the peptides in the training and test sets. Thus, AUCb must result from the division of the peptide binders into a training and a test set with the best overall sequence similarity. Note that when all peptide binders are included in the profiles, AUCb is likely to reflect the performance of the relevant profiles for the prediction of peptide-MHC binding.

Table 2 Performance of profiles for the prediction of peptide-MHC binding. Peptide binders (Aln column of Table 1) were randomly divided into a training and binding set and a ROC analysis was carried out to determine the AUC value (performance). Several ROC analyses were carried out with different training and binding sets, and the mean AUC value (AUCm) with its standard deviation is shown here. Also shown is the best AUC value obtained (AUCb). Sensitivities and specificities of the predictions with the AUCb at a 3% threshold are also given

Values of AUC=0.5 indicate random choice, while the accuracy of predictions is poor for values of AUC<0.7, good for values of AUC>0.8% and excellent for values of AUC>0.9 (Swets 1988). Following the above and considering the AUCb values, good peptide-MHCI binding predictions can be provided (PROFILEWEIGHT or BLKWPSSM profiles) to 21 of 25 tested MHCI molecules (Table 2, AUCb>0.8%) and excellent peptide-MHCI binding predictions to 12 MHCI molecules (Table 2, AUCb>90%). Peptide binding predictions to class II MHC molecules were not as good. Nevertheless, adequate peptide-MCHII binding predictions could be provided to 11 of 24 MHCII molecules, and excellent peptide binding predictions to five MHCII molecules (HLA-DQ2, HLA-DP9, DRB1*0901, DRB1*1104, I-Ed). Peptide binding predictions provided to DRB1*0801, DRB1*1101, DRB*0301 were poor (AUCb<0.7) but yet well above that of a random choice (AUC=0.5).

The effect of the size of the training set in the performance of the peptide binding predictions to A*0201 and DRB1*0101 is shown in Fig. 2. The results indicate that good peptide binding prediction (AUC>0.8) can be provided to A*0201 from matrices derived with as few as 20 peptides (Fig. 2a). On the other hand, a larger number of peptides (>40) are required to provide good peptide-binding predictions to the class II MHC molecule DRB1*0101 (Fig. 2b). It is known that MHCI ligands are more related by sequence similarity than MHCII ligands. Apparently, a large collection of MHCII ligands is needed to get a good representation of the class II peptide-binding motif. However, using this ROC analysis, a comparison of the performance of the peptide binding predictions across different MHC molecules with respect to the training set size should be approached with caution. For MHC molecules with only a few known peptide binders that are very similar to each other, the performance of the profiles would appear very good. On the other hand, if a large number of peptides are known to bind to the MHC molecule and their sequence diversity is high, the predictions might then not appear as adequate. Nevertheless, a profile derived from a large and diverse set of peptides is more likely to predict new binding peptides from a query protein than a profile derived from a few related peptides. Therefore, a more rigorous ROC analysis would need to be performed upon experimental determination of the binding of peptides predicted from a protein. A final cautionary note applies to the SP values. Due to the difficulty of finding experimentally determined non-binders, SP values were calculated from random peptides (see Materials and methods), and thus are quite high.

Fig. 2a, b
figure 2

Peptide-MHC binding predictions using PSSMs derived from alignments of different number of peptides. Performance of the peptide binding predictions to A*0201 (a) and DRB*0101 (b) computed as AUC values obtained from a ROC analysis where predictive profiles were derived from sets of 5, 10, 20, 40, 80 and 120 peptide binders and tested on the remaining peptide binders (Table 1). AUC values varied with the peptides included in the alignment (training set). The AUC values plotted in the figure correspond with the best value obtained after repeating the ROC analysis with 100 different training sets. Profiles were obtained using PROFILEWEIGTH (black bars) and BLK2PSSM (white bars)

Prediction of MHCI-restricted and MHCII-restricted T-cell epitopes using profiles

Profiles of MHC ligands can be used in combination with the dynamic search algorithm to score and sort all peptides according to their binding potential to the relevant MHC molecule (see Materials and methods). Only peptides that bind to MHC with an affinity above a necessary threshold are able to elicit a T-cell response. Consequently, if PSSMs are good predictors of peptide-MHC binding, T-cell epitopes should be expected among the high scoring peptides within their protein sources. We checked the validity of this notion for the set MHC molecules shown in Table 1. All prediction tests were carried out using PSSMs that were generated using PROFILEWEIGHT (Thompson et al. 1994b), which uses branch-proportional sequence weights, and BLK2PSSM in combination with position-based based weights (Henikoff and Henikoff 1994; Henikoff et al. 1999). The percentage of correctly predicted MHC-restricted epitopes at several thresholds of top scoring peptides (0.5, 1, 3, 5, 10, 20%) using the relevant PSSMs are shown in Fig. 3. Figure 3a illustrates predictions of MHCI-restricted epitopes (CD8 T-cell epitopes) using PSSMs generated from alignments without (Fig. 3a1) or with (Fig. 3a2) the epitope evaluated as a binder of a given MHC molecule. Likewise, Fig. 3b corresponds to predictions of MHCII-restricted epitopes (CD4 T-cell epitopes) using PSSMs generated from alignments without (Fig. 3b1) or with (Fig. 3b2) the targeted epitope. Results indicate that on average over 80% of CD8 T-cell epitopes were predicted at a 2% threshold (predictions with all peptides included in the alignment), whereas up to a 3–10% threshold was required to predict 80% of CD4 T-cell epitopes (predictions with all peptides included in alignment). Also, as reported elsewhere (Reche et al. 2002), for the prediction of MHCI-restricted epitopes we see no clear differences between results obtained from PSSMs derived with BLK2PSSM and PROFILEWEIGHT. When all peptides are included in the alignment, PSSM generated with BLK2PSMM gave better results (Fig. 3a2), but if the epitopes to be predicted are not included in the alignment, then PSSMSs generated from PROFILEWEIGHT gave slightly better results. On the other hand, for the prediction of MHCII-restricted epitopes PSSMs obtained with BLK2PSSM and a position-based weighting scheme gave better results independently of whether the epitopes to be tested were included in the alignment (Fig. 3b1,2). It is also important to note that there is some variability between the prediction tests obtained for the different MHC molecules targeted in this test (indicated by standard deviation in Fig. 3). Moreover, variability is greater in tests concerning the prediction of MHCII-restricted epitopes. This indicates that each PSSM has a different threshold of top scoring peptides at which known binders appear. Thus, for each PSSM we defined a PSSM-specific binding threshold (PSBT) as the score value that includes 85% of the peptides from which that PSSM was obtained.

Fig. 3a, b
figure 3

Prediction of MHCI-restricted and MHCII-restricted epitopes using PSSMs. Prediction of peptide-MHC binding using PSSMs was evaluated for MHCI (a1,2) and MHCII (b1,2) molecules by scoring known MHC-restricted epitopes from their protein sources. MHC molecules targeted for peptide-MHC binding prediction are those shown in Table 1. Predictions were carried out at different thresholds (abscissa) [(percentage of top scoring peptides)]. A peptide is computed as predicted if it is found among the top scoring peptides (set by the threshold) from its protein source using the relevant PSSM, and the percentage of correctly predicted peptides is plotted in the figure (ordinate). Predictions were carried out using PSSMs derived from alignments that did (a2, b2) or did not (a1, b1) contain the peptide tested, and PSSMs were generated from these alignments using PROFILEWEIGHT which uses a branch-proportional weighting method (gray bars) and BLK2PSSM under position-based weighting method (black bars). Given that several MHC molecules were targeted for peptide binding predictions (Table 1), plotted values correspond to the mean and standard deviation of percentage of properly predicted epitopes for all MHC molecules examined

Prediction of proteasomal cleavage using statistical language modeling tools

Statistical language modeling is the science of building probabilistic models from word strings. Language models including n-gram models are most frequently applied in speech recognition and natural language tagging (Rosenfeld 2000), but have also been applied to the sequence analysis and motif identification (Jimenez-Montano et al. 2002; Wu and Shivakumar 1994; Wu et al. 1996). Cleavage by the proteasome occurs at preferential sites within the protein, and sequence signals from antigenic peptides processed by the proteasome are especially conserved at position P1 of the cleavage site (the C-terminus of antigenic peptide) and its immediate flanking P1’ residue (Altuvia and Margalit 2000). Prediction of proteasomal cleavage resembles the problem of language tagging (modeling the location of grammatic tags such as punctuation signs) and thus we have used the SRILM toolkit (Stolcke 2002) for statistical modeling of proteasomal cleavage sites. Training sets for statistical modeling of proteasomal cleavage were obtained from a database containing the C-terminus and flanking regions of 332 antigens restricted by human MHCI molecules (See Materials and methods for details). Sequence information for this initial data set is shown in Fig. 4. As noted by others, P1 followed by P1′ are the positions with the largest information content (Altuvia and Margalit 2000; Kesmir et al. 2002). Significant sequence information is also found in the P3′ and P2 positions. LMPCP were based on n-gram statistics (Rosenfeld 2000) and created from training sets of peptide fragments of variable fragment length derived from the above database. LMPCP were tested at different cutpoint probabilities (0.1–0.70) using HIDDEN-NGRAM with testing files of peptide fragments not included in the training sets as well as with their entire protein sources. For each probability threshold, 23 different LMPCP N i [N=10, 6, 4; i=1−(N−2)] were tested, where N is the fragment size of fragment in training and testing sets and i is the order of the LMPCP tested (See Materials and methods for details). The LMPCPs were ranked with regard to the percentage of correctly PCS minus the percentage of ECS (Eq. 3 in Materials and methods). The best LMPCP at each threshold is shown in Table 3 along with the indicated mean size of the fragment yield by the model. PCS varied with the relevant LMPCP, and was always under 50% if the cutpoint probability threshold was set above 0.7. Furthermore, LMPCPs from fragments of size four (two residues at each side of the cleavage site) were the best if the threshold cutpoint was above 0.3. PCS were under 50% if the models were generated from training sets containing fragment sizes longer than ten (for example, six residues to each side of the cleavage site) for all the cutpoint probability thresholds tested (0.1–0.7) (data not shown). With the exception of LMPCP66 (0.25), PCS were above 70% only under soft cutting probability (0.1–0.4) (Table 3), indicating that the nature of the proteasome specificity is much less rigid than that of grammar tagging for which these languages models were originally intended. Yet, PCS by these models exceed the ECS by at least 30% (Table 3). The average size of the length of the peptide fragments produced by LMPCPs varied, with those providing a highest PCS yielding smaller fragments (around three residues) (Table 3).

Fig. 4
figure 4

Sequence logo of peptide fragments containing the C-terminal end of MHCI-restricted peptides. The sequence logo was built as indicated from a collection of peptide fragments containing the C-terminus of 332 human MHCI-restricted epitopes and their flanking regions (see Materials and methods). The C-terminus of each epitope corresponds to the P1 proteasomal cleavage site (shown by the vertical bar)

Table 3 Proteasomal cleavage prediction results using representative LMPCPs at different thresholds. LMPCPs were obtained using NGRAM-COUNT and tested using HIDDEN-NGRAM at different probability thresholds (Pro). N−2 LMPCP models were produced by varying the order of the model (i) from 2 to N−1, and the best model at each threshold is shown in this table. N size of the peptides fragments in the training and testing sets; PCS predicted cleavage sites; ECS expected cleavage sites calculated according to Eq. 3 (Materials and methods)

Web implementation

Predictions of peptide-MHCI binding using PSSMs are available on-line from the Molecular Immunology Foundation web server hosted by Dana-Farber Cancer Institute (http://www.mifoundation.org/Tools/rankpep.html). The server consists of a set of python and perl scripts that handle the input, combine the prediction of peptide-MHC binding and proteasomal cleavage and serve the output over the Internet. The interface to the server, shown in Fig. 5a, is divided in six major sections: PSSMs, INPUT, THRESHOLD, PROTEASOME CLEAVAGE, and ADVANCED OPTIONS. The PSSM section includes a selection of 88 MHCI-specific (81 human and seven mouse), and 50 MHCII-specific (38 human and 12 mouse) PSSMs for the prediction of peptide binding. Alternatively, the users can input their own PSSMs for the prediction of peptide-MHC binding. Optional PSSMs included in the server for the prediction of MHCI-peptide binding are those obtained using PROFILEWEIGHT (Thompson et al. 1994b), whereas PSSMs for the prediction of peptide binders to MHCII are those obtained using BLK2PSSM (Henikoff et al. 1999) with position-based weights (Henikoff and Henikoff 1994)(see Results). The INPUT query for the prediction of peptide binders to MHC molecules can be sequence(s) in FASTA format or an MSA. If an MSA is entered, the server creates a consensus sequence in which the variable positions are masked (see Materials and methods), and prediction of peptide-MHC binding is restricted to the conserved regions. Default variability for masking is 1.0 (positions with a variability above 1.0 will be masked). Roughly, a position in the alignment with a variability under 1.0 is either occupied by a prevalent residue (around 90% of the residues are identical) or by two different residues equally represented (∼50% each). The variability threshold can be set to other values in the ADVANCED OPTIONS section. Values must range between 0 and 4.3 to be consistent with Eq. 4 (Materials and methods). Using the THRESHOLD options, peptides can be sorted by the number of top scoring peptides or the percentage of top scoring peptides. Filtering the sorted peptides by molecular weight can also be done using the ADVANCED OPTIONS section. Models for the prediction of proteasomal cleavage are selected in the PROTEASOME CLEAVAGE section. The current models (highlighted in Table 3) include LMPCP102 (0.1) (option 1), LMPCP42 (0.45) (option 2), and LMPCP24 (0.7) (option 3). The RANKPEP result page (Fig. 5b) displays the PSSM selected, the optimum sequence (consensus) for that PSSM, i.e., the sequence that gives the highest score, and a list of peptides whose number is determined in the THRESHOLD section and ordered by score. For every sorted peptide, the server also outputs its molecular weight, and its relative score in percentage to that of optimum score. Peptides whose scores are equal or greater than the PSBT score will be highlighted in red and, when predicting MHCI-binding peptides, those containing a C-terminal end predicted to be the result of proteasomal cleavage are shown in violet (Fig. 5b).

Fig. 5a, b
figure 5

The RANKPEP web server. a RANKPEP input page. The page is divided into several sections: PSSM for the selection of MHCI-specific and MHCII-specific matrices; INPUT; THRESHOLD, and PROTEASOME CLEAVAGE as discussed in Results; ADVANCE OPTIONS include filtering the peptide results by the molecular weight (MW) of peptides, and selection of a variability threshold (0–4.3) to mask sequence variability from inputs in the form of multiple sequence alignment. b RANKPEP result page. The page lists a number of peptides from the query given at a selected threshold. Also indicated in the result page is the PSSM selected and the binding threshold of the PSSM. Peptides whose scores are above the PSBT are shown in red. Peptides shown in violet contain a C-terminal residue that is predicted to be the result of proteasomal cleavage. If the proteasomal cleavage filter is checked ON, only violet peptides will be shown. Proteasomal cleavage options are only applied to the prediction of MHCI-restricted peptides

Discussion

Prediction of peptide-MHC binding using profiles: selection of PSSMs and thresholds

PSSMs or profiles are useful for representing sequence motifs. Indeed, popular databases such as BLOCK (Henikoff et al. 1999), PROSITE (Hofmann et al. 1999), and IMPALA (Schaffer et al. 1999) rely on motif profiles for the functional classification of new sequences via their similarity to these profiles. Similarly, PSSMs of peptides known to bind to MHC can be used for the identification/prediction of peptide-MHC binders. We first applied this idea to the prediction of MHCI-peptide binding (Reche et al. 2002). Here we have extended the method to the prediction of peptide-MHCII binding. PSSMs for the prediction of both MHCI and MHCII binding are now available at the RANKPEP web site (http://www.mifoundation.org/Tools/rankpep.html). PSSMs for the prediction of peptide-MHCI binding have been derived from ungapped alignments of peptides of the same length. Since MHCI molecules can bind peptides between eight and 11 residues in length, several PSSMs might be available at the RANKPEP web server for the independent prediction of 8mer, 9mer, 10mer or 11mer peptide binders to a given MHCI molecule. Most of the known MHCI-restricted peptides are 9mers (∼90%)(data not shown), and therefore, in the absence of a certain preference for a given size, we suggest selecting PSSMs for the prediction of 9mer peptide binders. PSSMs for the peptide-MHCII binders always target the 9mer core of the peptide binders.

Overall, PSSMs for prediction of peptide-MHCI binding and MHCI-restricted epitopes are more sensitive than those used for the prediction of peptide-MHCII binding and MHCII-restricted epitopes. Consequently, a higher threshold is required to predict a similar percentage of epitopes (Table 2, Fig. 3) This result does not necessarily indicate that PSSMs specific for the prediction of MHCII were derived from incorrect alignments, but rather could reflect the greater structurally inherent peptide binding promiscuity of MHCII molecules (see Results). Indeed, from the available crystal structures of peptide-MHCII complexes, we determined that the peptide binding core fitting onto the MHC groove was properly defined in the relevant alignments from which the PSSMs were derived (data not shown). PSSMs in RANKPEP are associated with a specific binding threshold above which sorted peptides are highlighted in the results page (Fig. 5b). Since PSBT is variable with regard to the sequence space diversity of the aligned peptides, both overestimation and underestimation of the number of peptides that are predicted to bind to a given MHCI molecule can occur. Therefore, following the results summarized in Table 2, Fig. 3, we recommend using a 2–3% binding threshold of top scoring peptides for the prediction of MHCI-restricted peptides, and a 4–6% threshold for the prediction of MHCII-restricted peptides.

Prediction of peptide-MHC binding using profiles: comparison with other methods

Prediction of peptide-MHC binding is important for the anticipation of T-cell epitopes and so determination of peptides that can bind to MHC molecules has been approached by a large array of methods including sequence patterns (Sette et al. 1989), motif-matrices (De Groot et al. 1997; Rammensee et al. 1999), quantitative matrices (QM) (Guan et al. 2003; Hammer et al. 1994; Parker et al. 1994; Stryhn et al. 1996; Udaka et al. 2000), virtual quantitative matrices (VQM) (Raddrizzani and Hammer 2000; Sturniolo et al. 1999), artificial neural networks (ANN) (Adams and Koziol 1995; Brusic et al. 1998a; Gulukota et al. 1997; Honeyman et al. 1998); hidden Markov motifs (HMM) (Mamitsuka 1998; Udaka et al. 2002); structural peptide threading (SPT) (Altuvia et al. 1997; Schueler-Furman et al. 2000; Swain et al. 2001), support vector machine (SVM) algorithms (Donnes and Elofsson 2002; Zhao et al. 2003) and stepwise discriminant analysis meta-algorithm (SDA) (Mallios 1999). QM and VQM methods are derived from actual binding experiments, whereas SPT is an entirely computer-based method that relies on the evaluation of peptide fit into the binding groove, and despite its great potential is currently still under development. On the other hand, techniques such as sequence patterns, motif-matrices, ANN, HMM, SVM, and SDA algorithms rely on the analysis of the sequences of peptides that are experimentally known to bind. Prediction of peptide-MHC binding using PSSMs lies within the motif-matrices methods, although in previous methods the matrices coefficients were adjusted either manually (Rammensee et al. 1999) or were not specified (De Groot et al. 1997). In any case, motif-matrices are a more accurate predictor of peptide-MHC binding than simple single sequence patterns (Reche et al. 2002). Most of these methods have been applied to both the prediction of peptide-MCHI and peptide-MHCII binding, and as occurs with the use of PSSMs, the success of the predictions seems to be greater for the prediction of peptide-MHCI binding than peptide-MHCII binding. In terms of accuracy (a balance between sensitivity and specificity) ANN and HMM have been reported to be best predictors of peptide-MHC binding, perhaps because they can model binding interferences, positive or negative, between the side chains of the peptides. Other methods assume independent binding of each side chain. Nevertheless, independent binding is generally the case, as supported by experimental evidence (Parker et al. 1994; Sturniolo et al. 1999). Indeed, in a recent study of the independent binding assumption for binding of peptides epitopes to MHCI molecules, there was only marginal improvement when sidechain pair interactions were introduced into the motif-matrix predictor (Peters et al. 2003). Furthermore, the accuracy of our profiles, given by the AUC value (Table 2), is similar to that reported by ANN and HMM methods. However, an objective comparison between these methods should be done upon experimental determination of the binding of peptides predicted from a protein query. It is in fact revealing that in practicum, only 30–50% of predicted peptides from query proteins turn out to be significant binders, independently of the method used. In the absence of such experimental testing, a rigorous computer-based comparison of the various peptide-MHC binding prediction is not straightforward, as the various methods have been training with different set of data and the results are dictated by the chosen test peptides. Thus, in this paper we have also determined whether known T-cell epitopes can be predicted from their protein sources using realistic thresholds. In this scenario, we find that using PSSMs around 80% of the known MHCI-restricted and MHCII-restricted epitopes appear among the top 3% and 5% scoring peptides, respectively (Fig. 3a,b).

Examples of online web servers for the predictions of peptide-MHC binding are available for most of the methods discussed above (see Table 1 in Guan et al. 2003). However, all these sites are for the prediction of peptide binding to either MHCI or MHCII molecules alone. The exception is the SYFPEITHI web site (Rammensee et al. 1999) that contain matrices for the prediction of peptide-MHCI and peptide-MHCII binding, but therein, the number of MHCII molecules that can be targeted for peptide prediction is very limited. The RANKPEP web site contains the largest set of predictors for the anticipation of peptide-MHCI and peptide-MHCII binding (88 and 50 PSSMs for targeting peptide binding predictions to independent MHCI and MHCII molecules, respectively).

Antigen processing: prediction of proteasomal cleavage using statistical language models

Antigen processing occurs prior to MHC binding, thus determining the pool of peptides that can become T-cell epitopes. CD8 T-cell epitopes, and MHCI-restricted peptides in general, derive from protein fragments generated by the protease activity of the proteasome. Protein fragments thus generated are substrates for amino-peptidases that destroy most of the fragments. Nevertheless, a few peptides ranging between eight and 15 residues in length are translocated to the endoplasmic reticulum (ER) by the TAP transporter, where they can either be destroyed by an additional amino-peptidase or be rescued by binding to MHCI molecules (Pamer and Cresswell 1998; Rammensee 2002; Serwold et al. 2002). Thus, the N-terminus of any class I restricted peptide is shaped by activity of several amino-peptidases with the resulting loss of information, whereas the C-terminus is the result of the original proteasomal cleavage (Craiu et al. 1997). The proteasome is a multi-enzyme complex, whose catalytic subunits, and resulting specificity, change in the presence of IFN-γ. The form in the absence of IFN-γ is referred to as the constitutive proteasome, whereas the form in the presence of IFN-γ is known as the immunoproteasome (Fruh and Yang 1999; Toes et al. 2001). Although subject to debate, it is believed that the immunoproteasome is responsible for the generation of CD8 T-cell eptiopes (Chen et al. 2001; Toes et al. 2001; van Hall et al. 2000). Therefore, to increase the immunological relevance of our study we have modeled the specificity of the proteasome from a set of known CD8 T-cell epitopes and their flanking regions using statistical language models. Three of these models (LMPCP) [(Table 3)] are now implemented in the RANKPEP web site to predict whether the C-terminus of a given peptide might result from proteasome activity. The default model [model one: LMPCP102 (0.1)] predicted about 85% of the cleavage sites (Table 3), providing the largest increase of PCS over the ECS of all tested models (48.4%). In a genome-wide characterization of CD8 T-cell epitopes from influenza virus in mouse, Zhong et al. (Zhong et al. 2003) proved that the combined use of this LMPCP can reduce the list of peptide-MHCI binders by ~30% without compromising the number of peptides that are true T-cell epitopes. The average fragment length yielded by LMPCP102 (0.1) is, however, much smaller (∼3 residues) than that experimentally determined for the proteasome (of 7–9 residues) (Kisselev et al. 1999; Toes et al. 2001), suggesting that the specificity of this particular LMPCP is rather low (many false positives). The smaller fragment size yield by LMPCP102 (0.1) could also reflect the clustering and consequent overlap of epitopes observed within protein regions [(Meister et al. 1995); our own unpublished observations from the HIV CTL database in Los Alamos; url: http://www.hiv.lanl.gov/]. Nevertheless, to anticipate the possibility of rather low specificity of model LMPCP102 (0.1), RANKPEP also provides two additional models, LMPCP42 (0.45) and LMPCP42 (0.7), which although less sensitive (Table 2), produce larger fragments and thereby are expected to be more specific. In particular, LMPCP42 (0.7) yields peptide fragments with an average size (~8 residues) that is consistent with that thought to be generated by the proteasome (Table 3). However, it is important to note that LMPCP are not meant to predict proteasome fragmentation patterns, but to indicate whether the C-terminus of a peptide can result from proteasomal cleavage. Finally, there is an extra benefit of combining LMPCP with peptide-MHCI binding prediction using PSSMs. It is known that the C-terminal position of the peptide is always an anchor residue (Fig. 4). Note that prediction of peptide-MHCI using PSSMs assumes an independent contribution of each residue, and there are occasions in which top ranking peptides may contain a C-terminus that is not likely to be an anchor residue. In this scenario, the coupled usage of LMPCP will help to discard those peptides, thereby improving the MHCI-binding prediction. Moreover, some valuable information about the TAP transport of peptides into the ER may also have been incorporated into our LMPCP. TAP transport is essential for epitope generation, and our LMPCP models were trained using known epitopes.

A similar approach for modeling the proteasome cleavage site from known MHCI-restricted T-cell epitopes using ANN has already been reported (Kesmir et al. 2002). Relative to the language model herein, ANN produced larger fragments (~9 residues) more consistent with those thought to be generated by the proteasome. However, the training set consisted of peptides of up to 18 residues (9 residues at each side of the C-terminus end), and therefore the result would be biased toward the average length of the epitope, as there is significant sequence information with regard to the background along the entire length of the epitope. Other approaches for modeling the proteasome cleavage site include the analysis of the fragmentation patterns of a given protein with purified constitutive proteasome cores (Holzhutter and Kloetzel 2000; Kesmir et al. 2002; Kuttler et al. 2000). The biological relevance of the data generated from these important studies might be limited due to the fact the degradation was mediated by the proteasome rather than the immunoproteasome.

Processing of MHCII-restricted ligands relies mainly on exogenous proteins that are directed to the endosomal compartment, where they are degraded by the action of several endo-peptidases as well as by amino-peptidases and carboxy-peptidases (Pieters 2000; Watts 2001). This processing complexity together with the fact that MHCII molecules bind peptides of different yet overlapping lengths, makes the generation of models for the prediction of MHCII-antigen processing difficult. Nevertheless, recent reports indicate the existence of conserved regions flanking the core CD4 T-cell epitopes that are related to antigen processing rather than peptide-MHC interaction (Sant’Angelo et al. 2002). Moreover, some reports argue that they may contribute to immunogeneicity as well (Carson et al. 1997). Thus, when anticipating MHCII-restricted T-cell epitopes using RANKPEP, we suggest considering peptides consisting of the predicted 9mer binding cores plus the three most proximal amino acids flanking their N-terminal and C-terminal ends.

Conclusions

Engagement of both CD8 and CD4 T-cells is desirable for mounting a strong defensive immune response against cancer cells and pathogens. Since antigen processing and presentation by MHCI and MHCII molecules differ, prediction of T-cell epitopes requires the development of bioinformatics tools that are able to cope with this complexity. To this end, our RANKPEP server represents a powerful tool that allows: (1) the prediction of peptide binding to MHCI and MHCII molecules using motif profiles; (2) greater specificity of CD8 T-cell epitope identification through combined proteasomal cleavage site prediction; and (3) prediction of conserved epitopes from MSAs. Finally, RANKPEP is a versatile and flexible web server, providing many sorting options and the possibility of using custom built matrices for prediction of peptide-MHC binding.