Review
Clustering and analysis of protein families

https://doi.org/10.1016/S0959-440X(00)00211-6

Abstract

Various sequence-motif and sequence-cluster databases have been integrated into a new resource known as InterPro. Because the contributing databases have different clustering principles and scoring sensitivities, the combined assignments complement each other for grouping protein families and delineating domains. InterPro and new developments in the analysis of both the phylogenetic profiles of protein families and domain fusion events improve the prediction of specific functions for numerous proteins.

Introduction

In the past few years, the technology of sequencing has developed to the stage at which the sequencing of a complete genome can be contemplated as a practical and routine possibility. The complete sequences of more than 55 genomes have been published and at least 100 more are known to be nearing completion. These projects produce large amounts of sequence data lacking experimental determination of the biological function of the predicted gene products. One challenge of the genome era is to predict molecular functions and biological roles for the predicted gene products. Most approaches for the tentative assignment of functions to predicted proteins are based on pairwise sequence similarity searches against known proteins using sequence comparison programs such as FASTA [1] and BLAST [2]. However, the currently used methods, especially if automated, have various drawbacks [3]. Many proteins are multifunctional multidomain proteins, for which the assignment of a single function results in loss of information and outright errors. Also, with more and more predicted proteins from genome projects being added to the protein sequence databases, the best hit in pairwise sequence similarity searches is frequently a hypothetical protein or one that is poorly annotated or simply has a different function; thus, the propagation of wrong annotation is widespread.

To overcome these and other known limitations of functional annotation based on pairwise sequence similarity searches, resources on protein families and domains are becoming increasingly important. These resources allow functions to be assigned to uncharacterised or predicted proteins in three steps: selecting the proteins that belong to the same group as a given uncharacterised protein, extracting the annotation shared by all functionally characterised proteins of this group, and assigning this common annotation to the unannotated protein [4].
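The annotation-transfer procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `transfer_common_annotation`, the protein identifiers and the annotation terms are all hypothetical, and "shared annotation" is taken to mean a plain set intersection over the characterised members, which is a simplification and not a description of any specific resource's method.

```python
from typing import Dict, List, Set


def transfer_common_annotation(
    family_members: List[str],
    annotations: Dict[str, Set[str]],
) -> Set[str]:
    """Return the annotation terms shared by every characterised family member.

    Members with no annotation (or missing from the mapping) are treated as
    uncharacterised and ignored when computing the common annotation.
    """
    characterised = [annotations[m] for m in family_members if annotations.get(m)]
    if not characterised:
        # Nothing is known about this family, so nothing can be transferred.
        return set()
    return set.intersection(*characterised)


# Hypothetical toy family: three characterised proteins and one unknown.
family = ["P1", "P2", "P3", "Q_unknown"]
annotations = {
    "P1": {"kinase", "ATP-binding", "membrane"},
    "P2": {"kinase", "ATP-binding"},
    "P3": {"kinase", "ATP-binding", "nucleus"},
    "Q_unknown": set(),
}

# Only the terms common to P1, P2 and P3 are proposed for Q_unknown.
predicted = transfer_common_annotation(family, annotations)
print(sorted(predicted))
```

Restricting the transfer to the intersection is deliberately conservative: terms that only some members carry (here "membrane" and "nucleus") are not propagated, which limits the spread of wrong annotation at the cost of less specific predictions.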

In recognition of the growing importance of protein family and domain resources, we will focus in this review on current developments in the clustering and analysis of protein families. We will start by considering printed reviews of protein families, move on to manually curated protein family and domain databases, and from there discuss sequence-cluster databases. We will also discuss resources that combine sequence alignments with structural information and point to recent work on phylogenetic profiles, domain fusion events and their role in predicting functional interactions. We will end this review with a discussion of how to use these valuable resources for the assignment of molecular functions to uncharacterised proteins.

Section snippets

Protein profiles

Comprehensive and accessible information on major groups of proteins is provided by the Protein Profile series published by Oxford University Press (OUP) (Table 1). Each printed volume is focused on a single family or subfamily of proteins, and contains a wealth of information, coupled with an extensive bibliography. From a collaboration between the SWISS-PROT group at the European Bioinformatics Institute (EBI) and OUP, SWISS-PROT and TrEMBL protein sequence data [5] and alignments (Fig. 1)

Databases of protein signatures for families, domains and sites

A number of databases that use different methodologies and a varying degree of biological information on well-characterised protein families, domains and sites to derive protein signatures are available and are used to characterise new protein sequences. There are two main approaches: sequence-motif methods and sequence-cluster databases.

Structural alignment and cluster databases

Structural alignment databases combine protein sequence alignments with structural information obtained from the Protein Data Bank (PDB) [29]. HSSP (Homology-derived Secondary Structure of Proteins) [30], for example, is a database of the alignments of the sequences of proteins with known structure with sequences of all close homologues. The sequence-pattern-embedded discrete state-space models (pDSMs) [31] combine information about functionally conserved sequence patterns with information

Phylogenetic classifications

With the availability of complete proteomes, clustering in phylogenetic space is attracting considerable interest. Analysis of the phylogenetic profiles of protein families and of domain fusion events helps to predict many functional interactions and deduce specific functions for numerous proteins.
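As a rough illustration of the phylogenetic-profile idea, the sketch below encodes each protein family as a presence/absence vector across a set of genomes and reports families whose profiles are identical or nearly so as candidates for a functional link. The family names, the five-genome profiles, the use of Hamming distance and the `max_distance` threshold are all assumptions made for this example; published methods use more careful statistics, so this should be read as a minimal sketch and not as the algorithm of any cited work.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]  # 1 = family present in that genome, 0 = absent


def hamming(a: Profile, b: Profile) -> int:
    """Count the genomes in which two presence/absence profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(
    profiles: Dict[str, Profile], max_distance: int = 0
) -> List[Tuple[str, str]]:
    """Return family pairs whose profiles differ in at most max_distance genomes."""
    pairs = []
    # Sorting makes the output order deterministic.
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        if hamming(prof_a, prof_b) <= max_distance:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical profiles over five genomes.
profiles = {
    "famA": (1, 0, 1, 1, 0),
    "famB": (1, 0, 1, 1, 0),
    "famC": (0, 1, 0, 0, 1),
    "famD": (1, 0, 1, 0, 0),
}

exact = linked_pairs(profiles)                   # identical profiles only
relaxed = linked_pairs(profiles, max_distance=1)  # allow one differing genome
print(exact)
print(relaxed)
```

In practice, families present in every genome (or in almost none) carry little signal, so a real analysis would filter or down-weight such uninformative profiles before comparing them.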

A phylogenetic classification of proteins encoded in more than 34 complete genomes representing 26 major phylogenetic lineages can be found in the Clusters of Orthologous Groups of proteins (COGs) database

Conclusions

Very recently, some major advances in the clustering and analysis of protein families have occurred. InterPro, which integrates various sequence motif and cluster databases (PROSITE, PRINTS, Pfam, and ProDom), and the new algorithms for the analysis of both the phylogenetic profiles of protein families and domain fusion events are very powerful resources for the computational functional classification of newly determined sequences and the comparative analysis of whole genomes. The potential of

Acknowledgements

This work was supported, in part, by grant B104-CT97-2099 from the European Commission.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

References (48)

  • A Bairoch et al.

    The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000

    Nucleic Acids Res

    (2000)
  • K Hofmann et al.

    The PROSITE database, its status in 1999

    Nucleic Acids Res

    (1999)
  • T.K Attwood et al.

    PRINTS-S: the database formerly known as PRINTS

    Nucleic Acids Res

    (2000)
  • A Bateman et al.

    The Pfam protein families database

    Nucleic Acids Res

    (2000)
  • J Schultz et al.

    SMART: a web-based tool for the study of genetically mobile domains

    Nucleic Acids Res

    (2000)
  • S Henikoff et al.

    Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations

    Bioinformatics

    (1999)
  • J.Y Huang et al.

    The EMOTIF database

    Nucleic Acids Res

    (2001)
  • T.K Attwood

    The role of pattern databases in sequence analysis

    Briefings in Bioinformatics

    (2000)
  • R Apweiler et al.

    The InterPro database, an integrated documentation resource for protein families, domains and functional sites

    Nucleic Acids Res

    (2001)
  • G.M Rubin et al.

    Comparative genomics of the eukaryotes

    Science

    (2000)
  • Mulder NJ, Fleischmann W, Apweiler R: InterPro as a new tool for whole genome analysis. A comparative analysis of...
  • R Apweiler et al.

    Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes

    Nucleic Acids Res

    (2001)
  • W.C Barker et al.

    The protein information resource (PIR)

    Nucleic Acids Res

    (2000)
  • H.W Mewes et al.

    MIPS: a database for genomes and protein sequences

    Nucleic Acids Res

    (2000)