Journal of Molecular Biology
iWRAP: An Interface Threading Approach with Application to Prediction of Cancer-Related Protein–Protein Interactions
Introduction
Protein–protein interactions (PPIs) play a central role in all biological processes. Akin to the complete sequencing of genomes, complete description of interactomes is a fundamental step towards a deeper understanding of biological processes and has a vast potential to impact systems biology, genomics, molecular biology, and therapeutics. Although high-throughput biochemical approaches for discovering PPIs have proven very successful,1, 2, 3, 4 experimentally determined PPI data still have poor coverage (Table S1) and are prone to errors.5, 6 Such low coverage is partly because the set of possible PPIs to be verified is so large (roughly 50 million candidate pairs for a species with 10,000 genes) that any exhaustive experimental verification will take a long time, even with high-throughput techniques. While the rate of PPI discovery has leveled off in recent years (see Supplementary Fig. S1), the number of solved protein structural complexes has rapidly grown: there has been a 40% increase in the number of complex templates in the 14 months between the two versions of the Structural Classification of Proteins database (SCOP, 1.65 and 1.69).7 This growing resource of structural data presents an opportunity to utilize this information for accurate PPI predictions.
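The size of the search space quoted above follows from counting unordered protein pairs. The short sketch below reproduces that arithmetic; the function name `candidate_pairs` is ours, and the count assumes one protein per gene and excludes self-interactions.

```python
import math


def candidate_pairs(n_proteins: int) -> int:
    """Number of unordered pairs of distinct proteins, n * (n - 1) / 2.

    Assumption: one protein per gene, and homodimers (self-pairs) are not counted.
    """
    return math.comb(n_proteins, 2)


if __name__ == "__main__":
    # For 10,000 genes this is on the order of 50 million pairs.
    print(candidate_pairs(10_000))
```

Because the count grows quadratically with genome size, doubling the number of genes roughly quadruples the number of pairs to be tested.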
There have recently been proposals to harness the information provided by structure-based computational approaches as a potentially high-quality, high-coverage data source for large-scale integrative approaches to interactome construction.8, 9, 10, 11, 12 Prieto et al.13 have reviewed publicly available interaction databases of known structural data that facilitate analysis of PPIs.14, 15, 16 In the absence of a solved structure for a pair of protein “query” sequences, structure-based approaches typically rely on aligning the query sequences to either sequence or structure-based “templates” for solved structures in the Protein Data Bank (PDB).17
In one such approach, homology modeling, two protein sequences are assumed to interact based simply on their primary sequence homology to known interacting proteins. Homology modeling has had considerable success at predicting PPIs on a genome scale11, 18, 19, 20 and reconstructing and predicting three-dimensional multi-protein complexes.9 More recently, Fukuhara and Kawabata have described HOMCOS,21, 22 a web-server that performs a similar task to Aloy and Russell's InterPrets,9 again by homology modeling. MODBase is a database of homology models for protein complexes that have sequence similarity to known structures higher than 50%.23 ADAN is a specialized database for prediction of protein–protein interactions mediated by linear motifs and utilizes position-specific matrices to assess putative interactions.24 Other sequence-based methods utilize genetic information and multiple sequence alignments to predict specific protein–protein interactions.25, 26, 27, 28 However, effective use of homology modeling requires relatively high sequence similarity between the query and template protein pairs.8
In another popular approach, threading, the three-dimensional structure for a pair of protein query sequences is predicted by aligning their sequences to templates, based on both sequence and structure profiles, for complexes in the PDB to see if a similar structure can be found. The goodness of a query pair-template alignment is evaluated using a scoring function. The essential computational components of a PPI threading approach are template construction, alignment of query sequences to templates, and interaction scoring. Lu et al. developed Multiprospector,29 a threading algorithm that constructs statistical potential functions to evaluate potential PPIs.30 Singh et al. further proposed a machine-learning-based threading algorithm, DBLRAP, which also performs full complex threading, and demonstrated its superiority in predicting PPIs over homology modeling and Multiprospector.8, 31 Threading identifies compatible structures for proteins that share less sequence similarity with the template, thus typically widening the range of proteins for which predictions can be made over homology modeling.
While homology modeling/threading approaches work well and have good overall accuracy when sequences are somewhat similar to their putative templates, they perform poorly in the “twilight zone” of sequence identities. In particular, they often give inaccurate alignments in the putative interaction regions for sequences with low similarity and therefore are unable to predict interactions accurately in such cases, which we demonstrated previously for the special case of cytokines.32 It has been observed that functional residues such as those at the interface are more conserved than nonfunctional ones, both in sequence33, 34, 35 and structure.36, 37 Furthermore, it has been shown just recently that partial homology models, based only on interface alignments, are good candidates for templates used in docking studies.38 Here, we capitalize on these observations by performing threading only on the protein–protein interface after a suitable complex template is identified.
We introduce the program iWRAP (Interface Weighted RAPtor), which predicts whether two proteins interact by combining a novel linear programming approach for interface alignment with a boosting classifier39 for interaction prediction. iWRAP simultaneously optimizes contacts in query sequences to templates of protein–protein interfaces, after constraining alignments to only those residues likely to be involved in the interaction. This approach is in contrast to existing threading approaches that align each sequence individually to an entire protein structure in the complex. We recently demonstrated the utility of interface threading on two cytokine receptor families by implementing LTHREADER,32 where we manually generated templates specific to this family and aligned each query sequence separately to each template. The driving hypothesis of iWRAP's approach is that a more accurate prediction of protein–protein interfaces improves predictions of protein–protein interactions. We show here for general PPIs that (i) more accurate interface alignments lead to improved interface contact prediction, which in turn (ii) significantly improves PPI prediction. Thus, by optimizing the interface alignments after identifying a suitable template, iWRAP exploits functional conservation at the interface to predict PPIs.
We demonstrate the efficacy of these techniques on two data sets: SCOPPI, a database that classifies protein complexes in the PDB,40 and the yeast genome. First, we use SCOPPI as our gold standard database to confirm hypothesis (i): we show that interface threading (i.e., localized threading) leads to better interface contact prediction than full-complex threaders. For difficult alignment problems and a range of sequence identity values less than 40%, iWRAP outperforms standard threading and sequence-based methods, while for easier problems the methods are comparable. Our results on the full yeast genome scan address hypothesis (ii): we demonstrate that our method, which uses boosting39 in a novel way to classify iWRAP's interface threading scores for PPI prediction, outperforms methods based on whole-sequence alignments. In particular, we perform a full genome scan of yeast to predict interactions and compare iWRAP's performance on experimental data to DBLRAP, which has been shown to have the best performance amongst available structure-based PPI prediction methods.8, 31
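To illustrate how a boosting classifier can turn threading scores into an interaction call, the following is a minimal sketch of AdaBoost over decision stumps. It is not iWRAP's implementation: the choice of AdaBoost, the two-feature score vectors, and all function names here are assumptions made for illustration, and the toy data are invented.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for polarity in (1, -1):
                preds = [polarity if x[j] >= thr else -polarity for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, x):
    """Apply a single stump to one feature vector, returning +1 or -1."""
    _, j, thr, polarity = stump
    return polarity if x[j] >= thr else -polarity


def adaboost(X, y, rounds=5):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so that a perfect stump does not give an infinite weight.
        err = min(max(stump[0], 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote: +1 (interacting) or -1 (non-interacting)."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Hypothetical feature vectors: [interface score, whole-sequence score].
    X = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.2, 0.6], [0.1, 0.3], [0.3, 0.9]]
    y = [1, 1, 1, -1, -1, -1]
    model = adaboost(X, y)
    print(predict(model, [0.85, 0.5]), predict(model, [0.15, 0.5]))
```

In this toy set only the first feature separates the classes, so the ensemble learns to rely on it; with real threading scores the stumps would combine several weakly informative features.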
As an application, through mapping of yeast cancer-related genes and their putative interactions to the human genome, we identify interactions enriched relative to a recent yeast genetic interaction set.41 We find that these interacting genes are involved in chromatin remodeling, ribonuclear complex assembly, and nucleosome organization,42 processes known to be critically involved in cancer. We focus on yeast cancer-related genes and putative interactions, since the function and interactions of yeast genes are much better understood than those of human genes.43 Moreover, the malignant behavior of human cells is often caused by dysregulation of cell cycle, growth, and apoptosis processes that are conserved across eukaryotic organisms at the level of genes and their interactions.44
iWRAP's predictions are made publicly available at its website so that they can be used for further exploration or systems-level integrative approaches.
Overview of the threading algorithm
We develop iWRAP, an algorithm for threading query sequence pairs to only the interface of a suitable complex template. Figure 1 is a schematic of iWRAP, displaying a flowchart of the various stages of the algorithm. In the first stage, template construction, from alignments of multiple protein–protein interfaces,36 we construct specific interface profiles based on amino acid propensities, secondary structure, and solvent accessibilities for discrete environmental classes of the interface.
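As a rough illustration of the amino acid propensity part of an interface profile, the sketch below builds a position-specific log-odds profile from a set of pre-aligned interface segments and scores a query segment against it. It is a simplified stand-in rather than iWRAP's template construction: the uniform background, the pseudocount, the function names, and the toy sequences are all assumptions, and secondary structure and solvent accessibility terms are omitted.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_interfaces, pseudocount=1.0):
    """Per-column log-odds of each amino acid versus a uniform background.

    aligned_interfaces: equal-length strings; '-' marks a gap and is ignored.
    """
    length = len(aligned_interfaces[0])
    if any(len(seq) != length for seq in aligned_interfaces):
        raise ValueError("all aligned interface segments must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_interfaces
                         if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum of column log-odds for an ungapped query segment."""
    return sum(col[aa] for col, aa in zip(profile, segment) if aa in col)


if __name__ == "__main__":
    # Hypothetical aligned interface segments from one family pair.
    profile = build_profile(["LKDE", "LRDE", "LKDQ"])
    print(score_segment(profile, "LKDE"), score_segment(profile, "AAAA"))
```

A query segment that matches the conserved interface residues scores higher than an unrelated one, which is the property a localized scoring scheme relies on.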
In
Discussion
We introduce the program iWRAP and show that integrating interface profiles into a localized scoring scheme aids in interfacial contact prediction. We introduce the use of across-family templates to mitigate the limited number of templates and also capture convergently evolved interface motifs. We apply our approach to predict interacting proteins encoded by the entire yeast genome. Furthermore, by integrating our predictions in a combined functional and enrichment study of cancer-related genes
Stage 1: Template construction
We utilize the SCOPPI classification of protein–protein interfaces to construct interface profiles. SCOPPI classifies interfaces based on sequence and structural similarity of the interface.40 In addition, for each interacting SCOP family pair, SCOPPI provides a sequence alignment of other interfaces in the same SCOP family pair. Here, we use this classification of interfaces to construct our own multiple interface alignments for each SCOP family pair using CMAPi.36 CMAPi employs a contact-map
Acknowledgements
Thanks to Rohit Singh, Vinay Pulim, and Daniel Park for help with data and software. Thanks to Jerome Waldispuhl and anonymous reviewers for critical reading of the manuscript. Funding was provided by National Institutes of Health grant 1R01GM081871.
References (66)
- et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
- et al. Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol. (2002)
- et al. Development of unified statistical potentials describing protein–protein interactions. Biophys. J. (2003)
- et al. Apoptosis in yeast. Curr. Opin. Microbiol. (2004)
- Hidden Markov models. Curr. Opin. Struct. Biol. (1996)
- et al. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia (2004)
- et al. Targeting the human cancer pathway protein interaction network by structural genomics. Mol. Cell Proteomics (2008)
- Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. (1999)
- et al. A map of the interactome network of the metazoan C. elegans. Science (2004)
- et al. Towards a proteome-scale map of human protein–protein interaction network. Nature (2005)