Abstract
Proteins perform many important tasks in living organisms, such as catalysis of biochemical reactions, transport of nutrients, and recognition and transmission of signals. The many aspects of the role of any particular protein are referred to collectively as its “function.” One aspect of protein function that has been the target of intensive research by computational biologists is subcellular localization. Proteins must be localized in the same subcellular compartment to cooperate toward a common physiological function. Aberrant subcellular localization of proteins can result in several diseases, including kidney stones, cancer, and Alzheimer’s disease. To date, sequence homology remains the most widely used method for inferring the function of a protein. However, the application of advanced artificial intelligence (AI)-based techniques in recent years has resulted in significant improvements in our ability to predict the subcellular localization of a protein. The prediction accuracy has risen steadily over the years, in large part due to the application of AI-based methods such as hidden Markov models (HMMs), neural networks (NNs), and support vector machines (SVMs), although the availability of larger experimental datasets has also played a role. Automatic methods that mine textual information from the biological literature and molecular biology databases have considerably sped up the annotation of proteins for which some functional information is available in the literature. State-of-the-art methods based on NNs and HMMs can predict the presence of N-terminal sorting signals extremely accurately. Ab initio methods, which predict subcellular localization for any protein using only its native amino acid sequence and features predicted from that sequence, have shown the most remarkable improvements: their prediction accuracy has increased by over 30% in the past decade. The accuracy of these methods is now on par with high-throughput methods for predicting localization, and they are beginning to play an important role in directing experimental research. In this chapter, we review some of the most important methods for the prediction of subcellular localization.
© 2008 Humana Press, Totowa, NJ
Cite this protocol
Nair, R., Rost, B. (2008). Protein Subcellular Localization Prediction Using Artificial Intelligence Technology. In: Thompson, J.D., Ueffing, M., Schaeffer-Reiss, C. (eds) Functional Proteomics. Methods in Molecular Biology, vol 484. Humana Press. https://doi.org/10.1007/978-1-59745-398-1_27
DOI: https://doi.org/10.1007/978-1-59745-398-1_27
Publisher Name: Humana Press
Print ISBN: 978-1-58829-971-0
Online ISBN: 978-1-59745-398-1
eBook Packages: Springer Protocols