Journal of Molecular Biology
Volume 344, Issue 5, 10 December 2004, Pages 1331-1346

A Domain Interaction Map Based on Phylogenetic Profiling

https://doi.org/10.1016/j.jmb.2004.10.019

Phylogenetic profiling is a well-established method for predicting functional relations and physical interactions between proteins. We present a new method for finding such relations based on phylogenetic profiling of conserved domains rather than proteins, avoiding computationally expensive all-versus-all sequence comparisons among genomes. The resulting domain interaction map (DIMA) can be explored directly or mapped to a genome of interest. We demonstrate that the performance of DIMA is comparable to that of classical phylogenetic profiling and that its predictions often yield information that cannot be detected by profiling of entire protein chains. We provide a list of novel domain associations predicted by our method.

Introduction

Similarity-free methods for protein function prediction explore genomic context to establish relations between genes that are not detectable by standard sequence alignment techniques. In very general terms, genomic context can be described as any statistical, physical or biological property of genes that can be observed or measured, such as chromosomal location, expression patterns, and taxonomic distribution. Genes displaying statistically significant resemblance of genomic context usually act together in some cellular process, typically a metabolic or regulatory pathway.1

One of these methods, termed phylogenetic profiling, relies on the correlation of protein occurrence across a set of genomes to predict functional associations.2 For each genome, a protein is assigned a 1 if an ortholog occurs in that genome and a 0 otherwise. Applying this across a set of genomes yields a string of 1s and 0s, the phylogenetic profile of the protein. When two or more proteins have similar patterns of occurrence, this may indicate that the proteins interact with each other directly or share a common functional role.3 The underlying idea is that many pathways or complexes require all their members to be present in order to fulfil their functions. This “all or none” pattern of occurrence tends to be characteristic for many interacting genes.4
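The profile construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the toy presence data, the function names, and the choice of Hamming distance as the similarity measure are all assumptions made for the example.

```python
from itertools import combinations

def build_profiles(genomes, presence):
    """Turn {protein: set of genomes with an ortholog} into 0/1 tuples.

    The order of `genomes` fixes the order of positions in each profile.
    """
    return {
        protein: tuple(1 if g in found else 0 for g in genomes)
        for protein, found in presence.items()
    }

def hamming(a, b):
    """Number of genomes in which two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))

def similar_pairs(profiles, max_distance=0):
    """Return protein pairs whose profiles differ in at most max_distance genomes."""
    return [
        (p, q)
        for p, q in combinations(sorted(profiles), 2)
        if hamming(profiles[p], profiles[q]) <= max_distance
    ]

# Hypothetical toy data: three proteins, four genomes.
genomes = ["G1", "G2", "G3", "G4"]
presence = {
    "A": {"G1", "G3"},
    "B": {"G1", "G3"},
    "C": {"G2", "G3", "G4"},
}

profiles = build_profiles(genomes, presence)
print(profiles["A"])            # (1, 0, 1, 0)
print(similar_pairs(profiles))  # [('A', 'B')]
```

In this toy case proteins A and B show the same "all or none" pattern and would be proposed as functionally linked, whereas C would not. Raising `max_distance` tolerates a few mismatches between profiles.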

One disadvantage of phylogenetic profiling, however, is its high computational cost. In order to assess the presence or absence of proteins across genomes, an all-against-all comparison of entire genomes by similarity searching techniques, such as BLAST,5 is required. With the number of available sequenced genomes in our PEDANT6 genome database approaching 300, and the total number of genes in these genomes exceeding one million, this process requires an astronomical number of pairwise sequence comparisons and enormous disk space to store the resulting alignments. Maintaining such an all-against-all system and updating it to include new genomes represents a major technical challenge. We are aware of only two publicly available web resources offering such a service;7, 8 to our knowledge, no portable compact tools to perform phylogenetic profiling exist.
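To give a rough sense of scale for the figure of one million genes quoted above, the following sketch counts unordered gene pairs. It assumes, as a simplification of our own, that each gene is compared once with every other gene; the function name is illustrative only.

```python
from math import comb

def pairwise_comparisons(n_genes):
    """Number of unordered gene pairs among n_genes, i.e. n * (n - 1) / 2."""
    return comb(n_genes, 2)

n_genes = 1_000_000
print(pairwise_comparisons(n_genes))  # 499999500000
```

Under this assumption, an all-against-all scheme needs on the order of 5 × 10^11 comparisons, and the count grows quadratically as genomes are added.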

Molecular interactions are mediated by a great variety of widespread interaction domains that are frequently combined in proteins in a complicated mosaic fashion.9 Quite often a protein A will use one of its domains to interact with protein B, and another domain to interact with protein C. It is those domains, and not entire protein chains, that often represent major functional entities in cellular interaction networks. Several high-quality sequence domain databases exist (PFAM,10 SMART11), more recently integrated in the InterPro resource.12 Extensive web sites offer rich data mining capabilities and allow the study of domain combinations in protein chains and their taxonomic distribution.

Here, we sought to explore phylogenetic profiling of individual protein domains, rather than entire protein chains, to build a map of domain–domain relations. While implementing our method, we drew inspiration from several innovative computational approaches to protein function prediction and analysis developed in recent years:

  (1) The original method of protein phylogenetic profiling2, 13 and a conceptually related technique exploiting the similarity of phylogenetic trees.14

  (2) Exploration of gene fusion events whereby separate amino acid chains encoded in one (typically prokaryotic) genome are merged into a single gene product in another (eukaryotic) genome.15, 16

  (3) Analysis and representation of the taxonomic distribution of sequence domains available through domain database web sites, such as SMART,11 PFAM,10 or CDART.17

  (4) Investigation of domain combinations in proteins.18

  (5) Analysis of occurrence patterns of structural domains19 as described in the SCOP database.20

  (6) Genome occurrence in Clusters of Orthologous Groups using principal component analysis.21

  (7) Inferring domain interactions from known protein–protein interactions.22

Our method, called domain interaction map (DIMA), represents a synthesis of many of the approaches and ideas listed above. The basic algorithm of phylogenetic profiling is combined with domain detection to delineate clusters of individual domains, rather than complete gene products, that occur in a coordinated fashion. These clusters may be represented in the form of domain–domain interaction networks, yielding novel insights into the complex interplay of protein modules in cellular processes. In particular, DIMA may provide hints about potential interaction domains. In addition to enhanced capabilities for predicting biomolecular interactions, DIMA has several technical advantages over traditional protein profiling. It does not require exhaustive all-against-all comparison of genomic proteins. Detection of sequence domains needs to be conducted only once for each genome added to the system, a task that scales linearly with the number of gene products in the genome. As soon as domain finding in freshly added genomes is finished, phylogenetic profiles and the resulting domain clusters can be re-calculated instantly. Updating such a profiling system is only necessary when new releases of domain databases are made available.
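The workflow described above can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the DIMA implementation: the input format (a per-genome mapping from protein to detected domains), the domain and genome names, and the grouping of domains by identical profiles are all hypothetical choices made for the example.

```python
from collections import defaultdict

def add_genome(domain_genomes, genome, protein_domains):
    """Record which domains occur in a newly added genome.

    protein_domains maps each gene product to its detected domains.
    The cost is one pass over the gene products of that genome only.
    """
    for domains in protein_domains.values():
        for domain in domains:
            domain_genomes[domain].add(genome)

def domain_profiles(domain_genomes, genomes):
    """Build a 0/1 profile per domain over the ordered list of genomes."""
    return {
        domain: tuple(int(g in found) for g in genomes)
        for domain, found in domain_genomes.items()
    }

def cluster_identical(profiles):
    """Group domains sharing exactly the same profile (a simplification)."""
    clusters = defaultdict(list)
    for domain, profile in sorted(profiles.items()):
        clusters[profile].append(domain)
    return [members for members in clusters.values() if len(members) > 1]

# Hypothetical domain detection results for three genomes.
detected = {
    "g1": {"p1": ["D1", "D2"], "p2": ["D3"]},
    "g2": {"p1": ["D3"]},
    "g3": {"p1": ["D1"], "p2": ["D2", "D3"]},
}

domain_genomes = defaultdict(set)
for genome, protein_domains in detected.items():
    add_genome(domain_genomes, genome, protein_domains)

profiles = domain_profiles(domain_genomes, list(detected))
print(profiles["D1"])               # (1, 0, 1)
print(cluster_identical(profiles))  # [['D1', 'D2']]
```

Note that D1 and D2 are grouped even though they sit on the same protein in g1 and on different proteins in g3: the profile is attached to the domain, not to the protein chain. Adding a further genome only requires one more call to `add_genome` before the profiles are rebuilt.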

In Figure 1 we provide a graphical overview of the DIMA technique and highlight the key differences between DIMA and whole-protein phylogenetic profiling. In our example we consider six genomes and eight gene products, consisting of one or more structural domains (Figure 1(a)). Figure 1(b) represents the phylogenetic profiles describing the occurrence of the five individual domains in genomes, and the resulting domain interaction network. Figure 1(c) illustrates the results obtained for the same example using standard phylogenetic profiling. Clearly, the two methods considered produce fundamentally different association networks and are in fact complementary.


Results

Throughout, we explored the properties of DIMA in direct comparison to the classical protein profiling method, which we term CLASSIC. The CLASSIC approach predicts relations between proteins, whereas DIMA predicts relations between domains. To facilitate comparison between the two methods, the predicted domain relations from DIMA were mapped to proteins containing the respective domains (see Methods). Each method could then be evaluated against functional annotation and interaction data of

Discussion

We have developed a new method we call DIMA for studying the associations between proteins and protein domains based on phylogenetic profiling of conserved domains. The process involving domain detection, profile generation and clustering yields protein/domain pairs predicted to be functionally related and/or physically interacting (Figure 8). We have demonstrated that the predictions produced by DIMA are complementary to those produced by CLASSIC profiling and therefore represent a true gain

Software environment and genome data

The basis for the present study was the PEDANT genome analysis system.6, 31 The PEDANT database† contains exhaustive functional and structural annotation of all completely sequenced genomes. In particular, detection of PFAM domains10 is conducted using the HMMER software.32 Gene products are also automatically assigned to yeast functional categories,30 SCOP folds,20 and enzyme classes33 based on similarity searches.

Out of ≈300 finished genomic sequences available at the time

Acknowledgements

We are indebted to Grigory Kolesov and Martin Mokrejs for their assistance with the PEDANT database. Thomas Rattei and Roland Arnold were extremely helpful with the use of SIMAP. This work was funded by a grant from the German Federal Ministry of Education and Research (BMBF) within the BFAM framework (031U112C).

References (41)

  • P. Wong et al. Phylogenetic web profiler. Bioinformatics (2003)
  • T. Pawson et al. Assembly of cell regulatory systems through protein interaction domains. Science (2003)
  • A. Bateman et al. The Pfam protein families database. Nucl. Acids Res. (2002)
  • I. Letunic et al. Recent improvements to the SMART domain-based sequence annotation resource. Nucl. Acids Res. (2002)
  • N.J. Mulder et al. The InterPro Database, 2003 brings increased coverage and new features. Nucl. Acids Res. (2003)
  • T. Gaasterland et al. Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microb. Comp. Genomics (1998)
  • F. Pazos et al. Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng. (2001)
  • E.M. Marcotte et al. Detecting protein function and protein–protein interactions from genome sequences. Science (1999)
  • A.J. Enright et al. Protein interaction maps for complete genomes based on gene fusion events. Nature (1999)
  • L.Y. Geer et al. CDART: protein homology by domain architecture. Genome Res. (2002)