Journal of Molecular Biology
Volume 291, Issue 1, 6 September 1999, Pages 177-196

Regular article
Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function¹

https://doi.org/10.1006/jmbi.1999.2911

Abstract

Here, we provide an analysis of molecular evolution of five of the most populated protein folds: immunoglobulin fold, oligonucleotide-binding fold, Rossmann fold, alpha/beta plait, and TIM barrels. In order to distinguish between “historic”, functional and structural reasons for amino acid conservation, we consider proteins that acquire the same fold and have no evident sequence homology. For each fold we identify positions that are conserved within each individual family and coincide when non-homologous proteins are structurally superimposed. As a baseline for statistical assessment we use the conservatism expected based on the solvent accessibility. The analysis is based on a new concept of “conservatism-of-conservatism”. This approach allows us to identify the structural features that are stabilized in all proteins having a given fold, despite the fact that actual interactions that provide such stabilization may vary from protein to protein. Comparison with experimental data on thermodynamics, folding kinetics and function of the proteins reveals that such universally conserved clusters correspond to either: (i) super-sites (common location of active site in proteins having common tertiary structures but not function) or (ii) folding nuclei whose stability is an important determinant of folding rate, or both (in the case of the Rossmann fold). The analysis also helps to clarify the relation between folding and function that is apparent for some folds.

Introduction

The amount of data on protein structure, folding and kinetics is exploding. Progress in genomics (gene sequences) and proteomics (structure, function and expression) studies has created a new realm for bioinformatics in which a qualitatively different amount of biological information needs to be properly rationalized and used. Success in achieving this goal depends entirely on our understanding of the principles that govern protein stability, folding and function.

Such understanding has progressed over the last several years to the point that basic principles of folding begin to emerge from theoretical and experimental studies. Of particular importance are the foldability principle in thermodynamics and the discovery of nucleation in folding kinetics. The foldability principle states that protein-like sequences should have their native conformation as a pronounced energy minimum, separated by a large energy gap from the bulk of structurally unrelated misfolded conformations (Goldstein et al., 1992; Shakhnovich & Gutin, 1993a; Sali et al., 1994; Govindarajan & Goldstein, 1995; Hao & Scheraga, 1994b). Sequences that satisfy this requirement are able to fold fast and have a cooperative folding transition (Shakhnovich & Gutin, 1993a; Shakhnovich, 1994; Hao & Scheraga, 1994a; Hao & Scheraga, 1994b). Their native structures are stable against mutations (Tiana et al., 1998) as well as against variation in solvent conditions and temperature (Pande et al., 1995).

The modern concept of nucleation in protein folding emerged from several theoretical and experimental studies (Bryngelson & Wolynes, 1990; Abkevich et al., 1994; Shakhnovich et al., 1996; Mirny et al., 1998; Guo & Thirumalai, 1995; Pande et al., 1998; Itzhaki et al., 1995; Martinez et al., 1998) as a paradigm to describe the transition state ensemble of protein folding, especially for proteins that fold via simple two-state kinetics (Jackson, 1998). Of particular importance is the discovery of a specific folding nucleus in some proteins. The specific nucleus scenario of folding suggests that a number of obligatory contacts (the specific nucleus) should be formed in order for a protein chain to reach the transition state. The specific nucleus constitutes a spatially contiguous cluster in structure, but not necessarily in sequence: non-local contacts are always present in specific nuclei. After the specific nucleus is formed, the subsequent transition occurs downhill in free energy and is fast (Abkevich et al., 1994; Shakhnovich, 1998a). Further, it was noted (Abkevich et al., 1994; Shakhnovich et al., 1996; Mirny et al., 1998; Shakhnovich, 1998a; Martinez et al., 1998) that the location of a specific nucleus depends on the structure to a greater extent than it does on sequence. The major implication of this finding is the prediction that different (even non-homologous) sequences that fold into the same structure may have similar folding nuclei. In other words, the location of a folding nucleus in a structure may serve as a fingerprint of a protein fold. The specific nucleus model of folding kinetics has direct implications for experimental results, predicting a substantial variance of kinetic effects of mutations at various locations in protein structure. Another important prediction is the robustness of the specific nucleus with respect to variation in solvent conditions, temperature and other mutations. These predictions are consistent with experiment (Itzhaki et al., 1995; Viguera et al., 1997).

Molecular evolution represents an invaluable natural laboratory to test and further develop our understanding of protein folding. Conversely, our understanding of protein folding and function is a key to rational analysis of signals sent by protein evolution. The fusion of theoretical understanding of protein folding with analysis of evolutionary information is the main aim of this study.

Molecular evolution sends us signals in the form of conservation patterns in multiple sequence alignments. However, those signals are hard to decipher because there may be many reasons for conservation: function, stability or maybe “historical” reasons (insufficient evolutionary time to diverge). Finally, there may be some evolutionary pressure towards fast folding (perhaps to exceed some rate threshold beyond which aggregation and/or proteolysis of partly folded species may present a problem). The kinetic factor may give rise to additional conservation in the kinetically important locations related to the folding nucleus.

How can one distinguish between different reasons for amino acid conservation? A possible approach is to use as much evolutionary information as possible. In particular, it is known that besides protein homologs, i.e. proteins that have a clear evolutionary connection and are often (but not always) functionally related, there exist analogs, i.e. structurally similar proteins that have non-homologous sequences, unrelated functions and no evident evolutionary relation (Branden & Tooze, 1998). Since in most cases analogs share a common fold but not function, a proper sequence comparison between them may emphasize positions where conservatism is related to structural stability and folding kinetics rather than function (except in the cases when folds contain functional super-sites (Russell et al., 1998a), see below).

However, comparison of sequences of protein analogs should be made with care: a simple sequence alignment between analogs may not always work due to the possibility of multi-amino acid correlated mutations. The easiest way to understand this is to consider a basic example where a certain element of structure needs to be stabilized. There are several ways to form strong attractive interactions (e.g. by forming hydrophobic contacts or disulfide bridges or in some cases salt bridges). Therefore, if the same element of structure is stabilized in analogs by different forces, the amino acid residues that deliver such stabilization may be of quite different types. This suggests that a simple sequence alignment between analogs may in some cases yield no indication of conservatism. In other words, energetics may be more conserved than the amino acid types that deliver it. On the other hand, within families of homologous proteins one can expect conserved amino acid residues to form stable substructures: the change of amino acid residues in these positions requires compensating mutations in several related positions. Such multi-amino acid correlated mutations are very rare. They can be found only in highly diverged or unrelated proteins rather than within protein families.

This analysis suggests that a factor that may point to a common structure-related property in all analogs may be the intrafamily conservation itself rather than actual amino acid residues at the positions in question. This leads to an important new concept of “conservatism-of-conservatism” (CoC) to analyze evolutionary signals that are specific to a given fold (Mirny et al., 1998). The principle of CoC calls for alignment of intrafamily conservatism profiles between analogs as a method to find and analyze evolutionary signals that reflect features that are characteristic of a particular fold: structural stability, folding kinetics or in some cases common function or common location of active sites between analogs (super-site).

In the following report, we first explain how CoC is computed and what controls and statistical tests we perform. Next, we consider the case of the immunoglobulin fold in detail, and show how CoC analysis helps to identify the evolutionary pressure towards fast folding and distinguish it from evolutionary pressure aimed at protein stabilization and functional pressure. This will help to identify a possible location of the folding nucleus for the immunoglobulin fold, which allows direct comparison with protein engineering experiments.

Next we carry out a similar analysis for all other folds for which sufficient structural and evolutionary data are available (oligonucleotide-binding (OB) fold, Rossmann fold, alpha/beta plait and TIM barrel). Similar to the case of the immunoglobulin fold, the analysis of the observed CoC signal will allow us to identify (in some cases) common nucleation sites characteristic of a given fold and (in some cases) super-sites, i.e. a common location of the active site in proteins with similar structure but possibly different function.

The results of our analysis will be compared with experimental information about the function, thermodynamics and kinetics of studied proteins in cases when such information is available.


Conservatism-of-conservatism (CoC)

As was stated earlier, the analysis of CoC aims to identify positions in a protein structure which are conserved within each family of homologous proteins that acquire this structure. To pursue this goal we need: (i) a large set of analogs - non-homologous proteins sharing the same fold (representative proteins); and (ii) for each representative protein a number of proteins homologous to it (a family).
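To make these two ingredients concrete, the following is a minimal Python sketch of how an intrafamily conservation profile and its average across families might be computed. It is an illustration under stated assumptions, not the authors' implementation: the six-class residue grouping is assumed here (the text only implies six classes through the log(6) upper bound on the entropy), the function names are hypothetical, and the family profiles are assumed to have already been brought into a common frame by structural superposition.

```python
import math
from collections import Counter

# ASSUMPTION: an illustrative six-class reduced alphabet. The source only
# implies that six classes are used; this particular grouping is not from it.
RESIDUE_CLASSES = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}
CLASS_OF = {aa: name for name, members in RESIDUE_CLASSES.items() for aa in members}


def column_entropy(column):
    """Shannon entropy (natural log) of residue classes in one alignment column."""
    # Gaps and unknown symbols are skipped.
    classes = [CLASS_OF[aa] for aa in column if aa in CLASS_OF]
    if not classes:
        return float("nan")
    counts = Counter(classes)
    total = len(classes)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def family_profile(alignment):
    """Per-position entropy profile s_m(l) for one family's sequence alignment."""
    return [column_entropy(col) for col in zip(*alignment)]


def coc_profile(family_profiles):
    """Average intrafamily entropy over the M families at each aligned position."""
    M = len(family_profiles)
    n_positions = len(family_profiles[0])
    return [sum(p[l] for p in family_profiles) / M for l in range(n_positions)]


# Toy data (hypothetical): two unrelated "families" with the same length.
family_a = ["AKDG", "VKEG", "LRDS"]
family_b = ["FWTA", "YHSK", "WFND"]
profiles = [family_profile(family_a), family_profile(family_b)]
print(coc_profile(profiles))
```

In this toy example the first three positions have zero entropy in both families even though the two families use entirely different residue types there, which is the situation the CoC idea is meant to detect: low intrafamily variability at the same structural position across non-homologous families.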

When these data are available, the evaluation of CoC proceeds as follows: (i) make multiple

Discussion

Here, we report a detailed study of molecular evolution of five of the most populated protein folds. Out of ≈2200 domains in known structures without evident sequence homology, 564 belong to the five dominant folds that were analyzed in this study. The data-demanding nature of the method limits it to folds that contain at least 20-30 non-homologous families. As the number of solved protein structures increases, this analysis can be extended to other folds.

For each of folds studied,

Conclusion

Here, we provided a detailed statistical analysis of molecular evolution of the most common protein folds. Our results clearly point out that physical factors related to protein folding, such as stability and folding rate, have undergone considerable evolutionary optimization. In particular, we presented direct evidence for evolutionary pressure towards fast (but not necessarily the fastest) folding for several proteins.

One of the most striking discoveries that emerge from growing data on protein

Control for solvent accessibility

If f(s|a) is the probability density function (pdf) of entropy s given accessibility a (normalized so that $\int_0^{\log 6} f(s|a)\,ds = 1$ for all a), we can compute the pdf of S(l) under H0. Assuming families are independent (see a note below), we can apply the central limit theorem (CLT): S(l) is a sum of a large number of independent random variables $s_m(l)$, so according to the CLT S(l) has a Gaussian distribution with the mean and the variance: $\overline{S}(l) = \frac{1}{M}\sum_{m=1}^{M} \overline{s}(a_m(l)); \quad \sigma_S^2(l) = \frac{1}{M^2}\sum_{m=1}^{M} \sigma$
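As a rough illustration of how such a Gaussian baseline could be used, the Python sketch below compares an observed S(l) with the distribution expected from solvent accessibility alone. It is a sketch under assumptions rather than the procedure of the paper: the callables `mean_entropy` and `var_entropy` (the mean and variance of f(s|a) at a given accessibility) are hypothetical placeholders, the toy numbers are invented, and the variance of the average is taken as the sum of per-family variances divided by M squared, which is the standard result for independent terms.

```python
import math


def null_mean_and_variance(accessibilities, mean_entropy, var_entropy):
    """Mean and variance of S(l) under H0 at one structural position.

    accessibilities: the values a_m(l), one per family (M in total).
    mean_entropy, var_entropy: HYPOTHETICAL callables returning the mean and
    variance of f(s|a) for a given accessibility a.
    """
    M = len(accessibilities)
    mean = sum(mean_entropy(a) for a in accessibilities) / M
    # ASSUMPTION: families are independent, so variances of the terms add
    # and the average carries a factor 1/M**2.
    variance = sum(var_entropy(a) for a in accessibilities) / M**2
    return mean, variance


def z_score(observed, mean, variance):
    """Number of standard deviations between the observed S(l) and its H0 mean."""
    return (observed - mean) / math.sqrt(variance)


def lower_tail_p(z):
    """P(Z <= z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Toy numbers (invented for illustration only).
toy_mean = lambda a: 0.5 + a      # more exposed positions vary more
toy_var = lambda a: 0.04
mu, var = null_mean_and_variance([0.1, 0.3, 0.2, 0.0], toy_mean, toy_var)
z = z_score(0.35, mu, var)
print(mu, var, z, lower_tail_p(z))
```

A strongly negative z-score marks a position whose average intrafamily entropy is lower than its solvent accessibility alone would suggest, that is, a candidate for conservation beyond the burial effect.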

Acknowledgements

This work is supported by NIH grant RO1 GM52126. We are grateful to Fabrizio Chiti, Jane Clarke, Chris Dobson, Alan Fersht, Stephen Hamil, Mikael Oliveberg and Luis Serrano for illuminating discussions of experimental results and making many of them available to us prior to publication.

After this paper had been completed we heard the sad news that Oleg Ptitsyn had passed away. Oleg had been very excited about the emerging understanding of the deep relation between protein folding and evolution, which is the

References (62)

  • E. Merritt et al. Raster3D: photorealistic molecular graphics. Methods Enzymol. (1997)
  • L. Mirny et al. Protein structure prediction by threading. Why it works and why it does not. J. Mol. Biol. (1998)
  • V. Pande et al. Pathways for protein folding: is a “new view” needed? Curr. Opin. Struct. Biol. (1998)
  • O. Ptitsyn. Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochromes? J. Mol. Biol. (1998)
  • R. Russell et al. Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. (1998)
  • A. Sali et al. Kinetics of protein folding. A lattice model study for the requirements for folding to the native state. J. Mol. Biol. (1994)
  • T. Schindler et al. The family of cold shock proteins of Bacillus subtilis. Stability and dynamics in vitro and in vivo. J. Biol. Chem. (1999)
  • E. Shakhnovich. Theoretical studies of protein-folding thermodynamics and kinetics. Curr. Opin. Struct. Biol. (1997)
  • E. Shakhnovich. Folding nucleus: specific or multiple? Insights from simulations and comparison with experiment. Fold. Design (1998)
  • E. Shakhnovich. Protein design: a perspective from simple tractable models. Fold. Design (1998)
  • V. Villegas et al. Structure of the transition state in the folding process of human procarboxypeptidase A2 activation domain. J. Mol. Biol. (1998)
  • V. Abkevich et al. Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry (1994)
  • D. Altschuh et al. Coordinated amino acid changes in homologous protein families. Protein Eng. (1988)
  • C. Branden et al. Introduction to Protein Structure (1998)
  • J. Bryngelson et al. A simple statistical field theory of heteropolymer collapse with application to protein folding. Biopolymers (1990)
  • C. Dodge et al. The HSSP database of protein structure-sequence alignments and family profiles. Nucl. Acids Res. (1998)
  • W. Feller. An Introduction to Probability Theory and its Applications (1970)
  • R. Goldstein et al. Optimal protein-folding codes from spin-glass theory. Proc. Natl Acad. Sci. USA (1992)
  • S. Govindarajan et al. Why are some protein structures so common? Proc. Natl Acad. Sci. USA (1995)
  • Z. Guo et al. Nucleation mechanism for protein folding and theoretical predictions for hydrogen-exchange labelling experiments. Biopolymers (1995)
  • S. Hamill et al. The effect of boundary selection on the stability and folding of the third fibronectin type III domain from human tenascin. Biochemistry (1998)
¹ Edited by A. R. Fersht
