Journal of Molecular Biology
Volume 353, Issue 4, 4 November 2005, Pages 911-923

Domain Rearrangements in Protein Evolution

https://doi.org/10.1016/j.jmb.2005.08.067

Most eukaryotic proteins are multi-domain proteins that are created from fusions of genes, deletions and internal repetitions. An investigation of such evolutionary events requires a method to find the domain architecture from which each protein originates. Therefore, we defined a novel measure, domain distance, which is calculated as the number of domains that differ between two domain architectures. Using this measure, we studied the evolutionary events that distinguish a protein from its closest ancestor and found that indels are more common than internal repetition and that the exchange of a domain is rare. Indels and repetitions are common at both the N and C termini, while they are rare between domains. The evolution of the majority of multi-domain proteins can be explained by the stepwise insertion of single domains, with the exception of repeats, in which several domains are sometimes duplicated in tandem. We show that domain distances agree with sequence similarity and semantic similarity based on gene ontology annotations. In addition, we demonstrate the use of the domain distance measure to build evolutionary trees. Finally, the evolution of multi-domain proteins is exemplified by a closer study of the evolution of two protein families, non-receptor tyrosine kinases and RhoGEFs.

Introduction

Proteins are composed of domains, recurrent protein fragments with distinct structure, function and/or evolutionary history. Protein domains may occur alone, as single-domain proteins, but many are found in combination with other domains in larger polypeptide chains. These multi-domain architectures are more frequent in eukaryotes than prokaryotes.1, 2, 3, 4 During evolution, proteins with new functions or specificities have been invented through domain fusion and recombination as well as differentiation of existing domains. Domain fusion is a mechanism that allows the limited number of functional modules to be reused instead of reinvented. The occurrence of domain families as well as the number of partner families follow a power-law distribution with a few very abundant and/or versatile domains.1, 5 However, the evolution of domain combinations is not purely stochastic, but depends upon selection of certain functions.6 Often two or three domains in tandem have been reused in combination with other domains. These supra-domains may have been selected because the function is dependent on the interface between them or because they are both necessary for proper function.7 It has also been seen that some exon-bordering domains have unexpectedly many combination partners in animals.8

The addition of a domain to a protein is likely to alter its function. For example, it has been estimated that single-domain proteins from the same domain family have a 67% chance of having similar functions, whereas the corresponding number for two-domain proteins with just one of the domains in common is 35%.9 Jensen proposed that ancient enzymes with broad substrate specificities have evolved into more specific enzymes through gene duplication.10 Enzymes often retain their biochemical function while gaining new substrate specificities or regulation mechanisms by the addition of a domain. As a matter of fact, enzymatic function is conserved down to 30% sequence identity for most single-domain enzymes and addition of a second domain rarely affects function.11

Sequence alignment based methods, such as ClustalW,12 are often used to determine the evolutionary or functional relationship between proteins. However, multi-domain proteins may cause problems when creating multiple alignments. The sequences may align poorly for distantly related proteins even if they share the same domain architecture. A tool for finding related proteins based on domain architecture is CDART at NCBI13 and another useful tool is NIFAS,14 which is a domain evolution visualizer that builds trees based on the sequence alignments.

Understanding the underlying mechanisms of protein evolution through domain rearrangements and sequence differentiation is crucial for understanding the development of new functionalities. We have defined a new measure, “domain distance”, in which each domain addition/deletion between two domain architectures is counted. We explore how domain distances correlate with sequence similarity and functional similarity. Using domain distances, we have quantified the frequency of different events such as domain indels, repetitions and exchanges. These results were compared with frequencies obtained using a sequence-based method. In addition, we demonstrate that trees based on domain distance can be used to explore protein evolution. Finally, two protein families, the non-receptor tyrosine kinases and the RhoGEFs, serve as examples of domain rearrangements in protein evolution.

Domain Distance

It is well known that multi-domain proteins are created from fusions of whole or parts of genes and from internal duplications. In an attempt to quantify these events we have defined a novel measure of similarity between domain architectures (DAs), called domain distance (DD). Domain distance is calculated as the number of unmatched domains in an alignment of two architectures and is related to the number of evolutionary events required to evolve from one protein to another (see Figure 1).
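The definition above can be illustrated with a minimal Python sketch. It assumes that an architecture is an ordered list of domain names and that the alignment allows only matches and gaps, so the number of unmatched domains equals the two lengths minus twice the longest common subsequence. The function name `domain_distance` and the example architectures are illustrative only, and the scoring used in the paper (for instance, how repeats or exchanges are treated) may differ.

```python
from typing import Sequence


def domain_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count unmatched domains in a gap-only alignment of two architectures.

    Assumption: only matches and gaps are allowed, so the count is
    len(a) + len(b) - 2 * LCS(a, b). This is a sketch, not the paper's code.
    """
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[:i] and b[:j]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return n + m - 2 * lcs[n][m]


# Hypothetical architectures, written N-terminal to C-terminal.
shorter = ["SH3", "SH2", "Pkinase_Tyr"]
longer = ["PH", "SH3", "SH2", "Pkinase_Tyr"]
print(domain_distance(shorter, longer))
```

Under this gap-only assumption, an exchange of one domain for another counts as two unmatched domains (one in each architecture), whereas a single insertion or deletion counts as one.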

Domain Distance Trees in Evolutionary Studies

To obtain an increased understanding of the evolution of multi-domain proteins, the domain distances can be used to build evolutionary trees. Such trees have been created using standard neighbor-joining methods, where each addition/deletion of a domain results in a new branch. Below, we exemplify how such a tree can aid our understanding of the evolutionary events for two large protein families: SH2/PTK (Src homology 2 domain containing protein tyrosine kinases) and the RhoGEFs (Rho guanine
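The tree-building step mentioned in this section can be sketched as follows. This is a bare-bones neighbor-joining routine that returns only the topology as nested tuples and ignores branch lengths. The architecture labels and the hand-entered distances (counts of unmatched domains between hypothetical architectures) are assumptions made for illustration, not data from the study, and the function name `neighbor_joining` is not taken from any particular library.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a nested-tuple topology from a symmetric distance table.

    `dist[a][b]` is the distance between leaves a and b. Branch lengths
    are not computed; ties are broken by input order.
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in nodes}  # working copy
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all other current nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises the neighbor-joining Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        new = (i, j)
        d[new] = {}
        for k in nodes:
            if k not in (i, j):
                # Distance from the new internal node to every remaining node.
                d[new][k] = d[k][new] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return tuple(nodes)


# Hypothetical architectures (K stands for a kinase domain) and their
# pairwise counts of unmatched domains, entered by hand.
names = ["SH2-K", "SH3-SH2-K", "PH-SH3-SH2-K", "K"]
pairs = {
    ("SH2-K", "SH3-SH2-K"): 1,
    ("SH2-K", "PH-SH3-SH2-K"): 2,
    ("SH2-K", "K"): 1,
    ("SH3-SH2-K", "PH-SH3-SH2-K"): 1,
    ("SH3-SH2-K", "K"): 2,
    ("PH-SH3-SH2-K", "K"): 3,
}
dist = {a: {} for a in names}
for (a, b), value in pairs.items():
    dist[a][b] = dist[b][a] = value

tree = neighbor_joining(names, dist)
print(tree)
```

In this toy input the two shortest architectures group together and the two longest group together, which matches the stepwise addition of single domains along a chain.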

Conclusions

We have studied the evolution of multi-domain proteins in terms of domain fusions and repetitions. For each domain architecture, its evolutionary origin was identified based on our novel measure, domain distance. Using this measure, we have quantified the different evolutionary events leading to complex domain architectures and found that indels are the most common domain events, followed by repetitions. The majority of the events can be explained by the addition of single domains. However, in

Protein set

Two datasets were used for calculation of evolutionary events. The first dataset was SWISS-PROT release 44 (5 July 2004)35 with 153,871 proteins. The Pfam-A36 and Pfam-B domain assignments were found in SwissPfam†.

The other dataset consisted of proteins from seven eukaryotic proteomes (Homo sapiens, Mus musculus, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster, Saccharomyces cerevisiae and Schizosaccharomyces pombe). In the case

Acknowledgements

This work was supported by grants from the Swedish Natural Sciences Research Council, and a STREP grant from European Union FP6 program via the GeneFun project, project number 503567.

A.K.B. and D.E. contributed equally to this work.