Review
Clustering and analysis of protein families
Introduction
In the past few years, sequencing technology has matured to the point at which sequencing a complete genome is a practical and routine possibility. The complete sequences of more than 55 genomes have been published and at least 100 more are known to be nearing completion. These projects produce large amounts of sequence data for which the biological function of the predicted gene products has not been determined experimentally. One challenge of the genome era is therefore to predict molecular functions and biological roles for these gene products. Most approaches for the tentative assignment of functions to predicted proteins are based on pairwise sequence similarity searches against known proteins using sequence comparison programs such as FASTA [1] and BLAST [2]. However, the methods in current use, especially when automated, have various drawbacks [3]. Many proteins are multifunctional, multidomain proteins, for which the assignment of a single function results in loss of information and outright errors. In addition, as more and more predicted proteins from genome projects are added to the protein sequence databases, the best hit in a pairwise sequence similarity search is frequently a hypothetical protein, a poorly annotated protein or one that simply has a different function; as a result, wrong annotation propagates widely.
To overcome these and other known limitations of functional annotation based on pairwise sequence similarity searches, resources on protein families and domains are becoming increasingly important. These resources allow functions to be assigned to uncharacterised or predicted proteins in three steps: selecting the proteins that belong to the same group as a given uncharacterised protein, extracting the annotation shared by all functionally characterised proteins of this group, and assigning this common annotation to the unannotated protein [4].
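The group-based annotation transfer described above (select the group, extract the annotation shared by its characterised members, assign it to the unannotated protein) can be illustrated with a minimal Python sketch. The function names, family identifiers, protein identifiers and annotation terms below are all hypothetical and chosen only for illustration; this is not the implementation used in [4] or by any of the databases discussed here.

```python
from typing import Dict, Set


def common_annotation(members: Set[str],
                      annotations: Dict[str, Set[str]]) -> Set[str]:
    """Return the terms shared by every characterised member of a group.

    Members with no annotation (missing or empty) are treated as
    uncharacterised and ignored.
    """
    characterised = [annotations[p] for p in members if annotations.get(p)]
    if not characterised:
        return set()
    return set.intersection(*characterised)


def annotate(protein: str,
             families: Dict[str, Set[str]],
             annotations: Dict[str, Set[str]]) -> Set[str]:
    """Collect the common annotation of every group that contains `protein`."""
    transferred: Set[str] = set()
    for members in families.values():
        if protein in members:
            # Exclude the query itself so it cannot influence the result.
            transferred |= common_annotation(members - {protein}, annotations)
    return transferred


# Hypothetical toy data: one family, two characterised members, one query.
families = {"FAM1": {"P1", "P2", "Q9"}}
annotations = {
    "P1": {"kinase", "ATP-binding"},
    "P2": {"kinase", "ATP-binding", "membrane"},
    "Q9": set(),  # uncharacterised protein
}

print(sorted(annotate("Q9", families, annotations)))
```

Taking the intersection rather than the union is the conservative choice: a term such as "membrane", which is carried by only one characterised member, is not transferred to the query protein.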
In recognition of the growing importance of protein family and domain resources, this review focuses on current developments in the clustering and analysis of protein families. We start by considering printed reviews of protein families, move on to manually curated protein family and domain databases, and from there discuss sequence-cluster databases. We also discuss resources that combine sequence alignments with structural information, and point to recent work on phylogenetic profiles, domain fusion events and their role in predicting functional interactions. We end with a discussion of how to use these resources for the assignment of molecular functions to uncharacterised proteins.
Section snippets
Protein profiles
Comprehensive and accessible information on major groups of proteins is provided by the Protein Profile series published by Oxford University Press (OUP) (Table 1). Each printed volume is focused on a single family or subfamily of proteins, and contains a wealth of information, coupled with an extensive bibliography. From a collaboration between the SWISS-PROT group at the European Bioinformatics Institute (EBI) and OUP, SWISS-PROT and TrEMBL protein sequence data [5] and alignments (Fig. 1)
Databases of protein signatures for families, domains and sites
A number of databases derive protein signatures from well-characterised protein families, domains and sites, using different methodologies and varying degrees of biological information; these signatures are then used to characterise new protein sequences. There are two main approaches: sequence-motif methods and sequence-cluster databases.
Structural alignment and cluster databases
Structural alignment databases combine protein sequence alignments with structural information obtained from the Protein Data Bank (PDB) [29]. HSSP (Homology-derived Secondary Structure of Proteins) [30], for example, is a database of the alignments of the sequences of proteins with known structure with sequences of all close homologues. The sequence-pattern-embedded discrete state-space models (pDSMs) [31] combine information about functionally conserved sequence patterns with information
Phylogenetic classifications
With the availability of complete proteomes, clustering in phylogenetic space is attracting considerable interest. Analysis of the phylogenetic profiles of protein families and of domain fusion events helps to predict many functional interactions and to deduce specific functions for numerous proteins.
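As a rough illustration of the phylogenetic-profile idea, the sketch below represents each protein family as a presence/absence vector across a set of genomes and reports pairs of families whose profiles are identical or nearly identical, on the assumption that families inherited together are candidates for a functional link. The family names, the five-genome profiles and the Hamming-distance threshold are hypothetical; published methods use more careful statistics than this.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]  # 1 = family present in genome, 0 = absent


def hamming(a: Profile, b: Profile) -> int:
    """Number of genomes in which two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles: Dict[str, Profile],
                 max_distance: int = 0) -> List[Tuple[str, str]]:
    """Return family pairs whose profiles differ in at most `max_distance` genomes."""
    pairs = []
    for (fam_a, prof_a), (fam_b, prof_b) in combinations(sorted(profiles.items()), 2):
        if hamming(prof_a, prof_b) <= max_distance:
            pairs.append((fam_a, fam_b))
    return pairs


# Hypothetical profiles over five genomes.
profiles = {
    "famA": (1, 0, 1, 1, 0),
    "famB": (1, 0, 1, 1, 0),
    "famC": (0, 1, 1, 0, 1),
    "famD": (1, 0, 1, 0, 0),
}

print(linked_pairs(profiles))                  # identical profiles only
print(linked_pairs(profiles, max_distance=1))  # allow one mismatch
```

In practice, profiles that are present in every genome or absent from almost all of them carry little information and would normally be filtered out before comparison.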
A phylogenetic classification of proteins encoded in more than 34 complete genomes representing 26 major phylogenetic lineages can be found in the Clusters of Orthologous Groups of proteins (COGs) database
Conclusions
Very recently, some major advances in the clustering and analysis of protein families have occurred. InterPro, which integrates various sequence motif and cluster databases (PROSITE, PRINTS, Pfam, and ProDom), and the new algorithms for the analysis of both the phylogenetic profiles of protein families and domain fusion events are very powerful resources for the computational functional classification of newly determined sequences and the comparative analysis of whole genomes. The potential of
Acknowledgements
This work was supported, in part, by grant B104-CT97-2099 from the European Commission.
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
References (48)
- et al. Basic local alignment search tool. J Mol Biol (1990)
- et al. Identification of common molecular subsequences. J Mol Biol (1981)
- et al. Significance of Z-value statistics of Smith-Waterman scores for protein alignments. Comput Chem (1999)
- et al. Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol (1998)
- et al. HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J Mol Biol (2000)
- et al. A systematic comparison of protein structure classifications SCOP, CATH and FSSP. Structure (1999)
- et al. Genome evolution: gene fusion versus gene fission. Trends Genet (2000)
- et al. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA (1988)
- et al. Predicting functions from protein sequences — where are the bottlenecks? Nat Genet (1998)
- et al. A novel method for automatic and reliable functional annotation. Bioinformatics (1999)
- The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res
- The PROSITE database, its status in 1999. Nucleic Acids Res
- PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res
- The Pfam protein families database. Nucleic Acids Res
- SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res
- Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics
- The EMOTIF database. Nucleic Acids Res
- The role of pattern databases in sequence analysis. Briefings in Bioinformatics
- The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res
- Comparative genomics of the eukaryotes. Science
- Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucleic Acids Res
- The protein information resource (PIR). Nucleic Acids Res
- MIPS: a database for genomes and protein sequences. Nucleic Acids Res