Journal of Molecular Biology
Regular article
Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function¹
Introduction
The amount of data on protein structure, folding and kinetics is exploding. Progress in genomics (gene sequences) and proteomics (structure, function and expression) has created a new realm for bioinformatics, in which a qualitatively different amount of biological information needs to be properly rationalized and used. Success in achieving this goal depends entirely on our understanding of the principles that govern protein stability, folding and function.
Such understanding has progressed over the last several years to the point that basic principles of folding are beginning to emerge from theoretical and experimental studies. Of particular importance are the foldability principle in thermodynamics and the discovery of nucleation in folding kinetics. The foldability principle states that protein-like sequences should have their native conformation as a pronounced energy minimum, separated by a large energy gap from the bulk of structurally unrelated misfolded conformations (Goldstein et al., 1992; Shakhnovich & Gutin, 1993a; Sali et al., 1994; Govindarajan & Goldstein, 1995; Hao & Scheraga, 1994b). Sequences that satisfy this requirement are able to fold fast and have a cooperative folding transition (Shakhnovich & Gutin, 1993a; Shakhnovich, 1994; Hao & Scheraga, 1994a; Hao & Scheraga, 1994b). Their native structures are stable against mutations (Tiana et al., 1998) as well as against variation in solvent conditions and temperature (Pande et al., 1995).
The modern concept of nucleation in protein folding emerged from several theoretical and experimental studies (Bryngelson & Wolynes, 1990; Abkevich et al., 1994; Shakhnovich et al., 1996; Mirny et al., 1998; Guo & Thirumalai, 1995; Pande et al., 1998; Itzhaki et al., 1995; Martinez et al., 1998) as a paradigm to describe the transition state ensemble of protein folding, especially for proteins that fold via simple two-state kinetics (Jackson, 1998). Of particular importance is the discovery of a specific folding nucleus in some proteins. The specific nucleus scenario of folding suggests that a number of obligatory contacts (the specific nucleus) must be formed in order for a protein chain to reach the transition state. The specific nucleus constitutes a cluster that is spatially contiguous in structure, but not necessarily in sequence: non-local contacts are always present in specific nuclei. After the specific nucleus is formed, the subsequent transition occurs downhill in free energy and is fast (Abkevich et al., 1994; Shakhnovich, 1998a). Further, it was noted (Abkevich et al., 1994; Shakhnovich et al., 1996; Mirny et al., 1998; Shakhnovich, 1998a; Martinez et al., 1998) that the location of a specific nucleus depends on the structure to a greater extent than it does on sequence. The major implication of this finding is the prediction that different (even non-homologous) sequences that fold into the same structure may have similar folding nuclei. In other words, the location of a folding nucleus in a structure may serve as a fingerprint of a protein fold. The specific nucleus model of folding kinetics has direct implications for experimental results: it predicts a substantial variance in the kinetic effects of mutations at various locations in a protein structure. Another important prediction is the robustness of the specific nucleus with respect to variation in solvent conditions, temperature and other mutations. These predictions are consistent with experiment (Itzhaki et al., 1995; Viguera et al., 1997).
Molecular evolution represents an invaluable natural laboratory to test and further develop our understanding of protein folding. Conversely, our understanding of protein folding and function is a key to rational analysis of signals sent by protein evolution. The fusion of theoretical understanding of protein folding with analysis of evolutionary information is the main aim of this study.
Molecular evolution sends us signals in the form of conservation patterns in multiple sequence alignments. However, those signals are hard to decipher because there may be many reasons for conservation: function, stability or perhaps “historical” reasons (insufficient evolutionary time to diverge). In addition, there may be some evolutionary pressure towards fast folding (perhaps to exceed some rate threshold beyond which aggregation and/or proteolysis of partly folded species may present a problem). This kinetic factor may give rise to additional conservation at the kinetically important locations related to the folding nucleus.
How can one distinguish between different reasons for amino acid conservation? A possible approach is to use as much evolutionary information as possible. In particular, it is known that besides protein homologs, i.e. proteins that have a clear evolutionary connection and are often (but not always) functionally related, there exist analogs, i.e. structurally similar proteins that have non-homologous sequences, unrelated functions and no evident evolutionary relation (Branden & Tooze, 1998). Since in most cases analogs share a common fold but not a function, a proper sequence comparison between them may emphasize positions where conservatism is related to structural stability and folding kinetics rather than function (except in cases where folds contain functional super-sites (Russell et al., 1998a), see below).
However, comparison of sequences of protein analogs should be made with care: a simple sequence alignment between analogs may not always work because of the possibility of multi-amino acid correlated mutations. The easiest way to understand this is to consider a basic example where a certain element of structure needs to be stabilized. There are several ways to form strong attractive interactions (e.g. by forming hydrophobic contacts, disulfide bridges or, in some cases, salt bridges). Therefore, if the same element of structure is stabilized in analogs by different forces, the amino acid residues that deliver such stabilization may be of quite different types. This suggests that a simple sequence alignment between analogs may in some cases yield no indication of conservatism. In other words, energetics may be more conserved than the amino acid types that deliver it. On the other hand, within families of homologous proteins one can expect conserved amino acid residues to form stable substructures: changing the amino acid residues at these positions requires compensating mutations at several related positions. Such multi-amino acid correlated mutations are very rare. They can be found only in highly diverged or unrelated proteins rather than within protein families.
This analysis suggests that a factor that may point to a common structure-related property in all analogs may be the intrafamily conservation itself rather than actual amino acid residues at the positions in question. This leads to an important new concept of “conservatism-of-conservatism” (CoC) to analyze evolutionary signals that are specific to a given fold (Mirny et al., 1998). The principle of CoC calls for alignment of intrafamily conservatism profiles between analogs as a method to find and analyze evolutionary signals that reflect features that are characteristic of a particular fold: structural stability, folding kinetics or in some cases common function or common location of active sites between analogs (super-site).
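The CoC idea described above can be illustrated with a minimal sketch. It assumes that intrafamily conservatism is measured as the Shannon entropy of residue classes in each alignment column, that residues are grouped into six classes (so that the entropy ranges from 0 to log 6), and that the columns of different families have already been put into structural correspondence. The class grouping, the function names and the toy alignments are illustrative assumptions, not the procedure of the original study.

```python
import math
from collections import Counter

# Assumed six-class grouping of residues (illustrative only); with six
# classes the column entropy lies between 0 and log(6).
CLASSES = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "basic": "KR",
    "acidic": "DE",
    "special": "GP",
}
RESIDUE_TO_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}


def column_entropy(column):
    """Shannon entropy (natural log) of residue classes in one alignment column.

    Characters that are not standard residues (e.g. gaps) are ignored.
    """
    classes = [RESIDUE_TO_CLASS[aa] for aa in column if aa in RESIDUE_TO_CLASS]
    if not classes:
        return float("nan")
    total = len(classes)
    return sum((n / total) * math.log(total / n) for n in Counter(classes).values())


def family_profile(alignment):
    """Entropy profile of one family given as equal-length aligned sequences."""
    return [column_entropy(column) for column in zip(*alignment)]


def coc_profile(family_profiles):
    """Sum the intrafamily entropies over families at each aligned position.

    Assumes position l refers to the same structural position in every family.
    """
    return [sum(values) for values in zip(*family_profiles)]


# Toy families of analogs: positions 0 and 1 are conserved within each family
# (although by different residue types), positions 2 and 3 are variable.
family_a = ["AVKG", "LVDG", "IVSP"]
family_b = ["FCEK", "WCDT", "YCEG"]
profile = coc_profile([family_profile(family_a), family_profile(family_b)])
print(profile)
```

In this toy case the first two positions have low summed entropy in both families even though family_a uses aliphatic residues there and family_b uses aromatic residues and cysteine, which is the situation a direct sequence alignment between analogs would miss.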
In the following report, we first explain how CoC is computed and what controls and statistical tests we perform. Next, we consider the case of the immunoglobulin fold in detail, and show how CoC analysis helps to identify the evolutionary pressure towards fast folding and to distinguish it from evolutionary pressure aimed at protein stabilization and from functional pressure. This will help to identify a possible location of the folding nucleus for the immunoglobulin fold, which allows direct comparison with protein engineering experiments.
Next we carry out a similar analysis for all other folds for which sufficient structural and evolutionary data are available (oligonucleotide-binding (OB) fold, Rossmann fold, alpha/beta plait and TIM barrel). As in the case of the immunoglobulin fold, the analysis of the observed CoC signal will allow us to identify (in some cases) common nucleation sites characteristic of a given fold and (in some cases) super-sites, i.e. a common location of the active site in proteins with similar structure but possibly different function.
The results of our analysis will be compared with experimental information about the function, thermodynamics and kinetics of studied proteins in cases when such information is available.
Conservatism-of-conservatism (CoC)
As was stated earlier, the analysis of CoC aims to identify positions in a protein structure which are conserved within each family of homologous proteins that acquire this structure. To pursue this goal we need: (i) a large set of analogs - non-homologous proteins sharing the same fold (representative proteins); and (ii) for each representative protein a number of proteins homologous to it (a family).
When these data are available, the evaluation of CoC proceeds as follows: (i) make multiple
Discussion
Here, we report a detailed study of the molecular evolution of five of the most populated protein folds. Out of ≈2200 domains in known structures without evident sequence homology, 564 belong to the five dominant folds that were analyzed in this study. The highly data-demanding nature of the method limits it to folds that contain at least 20-30 non-homologous families. As the number of solved protein structures increases, this analysis can be extended to other folds.
For each of folds studied,
Conclusion
Here, we provided a detailed statistical analysis of the molecular evolution of the most common protein folds. Our results indicate that physical factors related to protein folding, such as stability and folding rate, have undergone considerable evolutionary optimization. In particular, we presented direct evidence for evolutionary pressure towards fast (but not necessarily the fastest) folding for several proteins.
One of the most striking discoveries that emerge from growing data on protein
Control for solvent accessibility
If $f(s|a)$ is the probability density function (pdf) of entropy $s$ given accessibility $a$ (normalized so that $\int_0^{\log 6} f(s|a)\,ds = 1$ for all $a$), we can compute the pdf of $S(l)$ under H0. Assuming the families are independent (see a note below), we can apply the central limit theorem (CLT): $S(l)$ is a sum of a large number of independent random variables $s_m(l)$, so according to the CLT $S(l)$ has a Gaussian distribution with the mean and the variance:
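As a rough illustration of this control, the sketch below computes a Gaussian approximation for S(l) at a single position. It assumes that S(l) is a plain (unnormalized) sum of the family entropies, that H0 means the entropy at a position depends only on its accessibility, and that f(s|a) is represented by empirical entropy samples grouped into accessibility bins. The mean and variance are then the sums of the per-family conditional means and variances, which is the standard result for a sum of independent variables. The bin names, sample values and function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def null_mean_variance(accessibilities, entropy_samples_by_bin):
    """Mean and variance of S(l) under H0 for one position.

    accessibilities: accessibility bin of this position in each family m.
    entropy_samples_by_bin: empirical entropy samples standing in for f(s|a),
        keyed by accessibility bin (an assumed representation).
    """
    mu = sum(mean(entropy_samples_by_bin[a]) for a in accessibilities)
    var = sum(pvariance(entropy_samples_by_bin[a]) for a in accessibilities)
    return mu, var


def z_score(observed_sum, mu, var):
    """Standardized deviation of the observed S(l) from its H0 expectation."""
    return (observed_sum - mu) / math.sqrt(var)


def lower_tail_probability(z):
    """P(Z <= z) for a standard normal variable; small values mean the
    position is more conserved than its accessibility alone would explain."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Hypothetical entropy samples for two accessibility bins.
samples = {"buried": [0.0, 0.2, 0.4], "exposed": [1.0, 1.2, 1.4]}
# The same structural position in three families: buried in two, exposed in one.
mu, var = null_mean_variance(["buried", "buried", "exposed"], samples)
z = z_score(0.9, mu, var)
print(mu, var, z, lower_tail_probability(z))
```

A strongly negative z here would flag a position whose summed entropy is lower than expected from solvent accessibility alone.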
Acknowledgements
This work is supported by NIH grant RO1 GM52126. We are grateful to Fabrizio Chiti, Jane Clarke, Chris Dobson, Alan Fersht, Stephen Hamil, Mikael Oliveberg and Luis Serrano for illuminating discussions of experimental results and making many of them available to us prior to publication.
After this paper had been completed we heard the sad news that Oleg Ptitsyn had passed away. Oleg had been very excited about the emerging understanding of the deep relation between protein folding and evolution, which is the
References (62)
- et al. Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. (1997)
- et al. The three-dimensional structure of two mutants of the signal transduction protein CheY suggest its molecular activation mechanism. J. Mol. Biol. (1996)
- et al. The immunoglobulin fold. Structural classification, sequence patterns and common core. J. Mol. Biol. (1994)
- et al. Predicting protein stability changes upon mutation using database-derived potentials: solvent accessibility determines the importance of local versus non-local interactions along the sequence. J. Mol. Biol. (1997)
- et al. A protein engineering analysis of the transition state for protein folding: simulation in the lattice model. Fold. Design (1998)
- et al. Position-based sequence weights. J. Mol. Biol. (1994)
- et al. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. (1993)
- et al. The structure of the transition state for folding of chymotrypsin inhibitor 2 analyzed by protein engineering methods: evidence for a nucleation-condensation mechanism for protein folding. J. Mol. Biol. (1995)
- How do small single-domain proteins fold? Fold. Design (1998)
- et al. Probing residual structure and backbone dynamics on the milli- to picosecond timescale in a urea-denatured fibronectin type III domain. J. Mol. Biol. (1999)
- Raster3D: photorealistic molecular graphics. Methods Enzymol.
- Protein structure prediction by threading. Why it works and why it does not. J. Mol. Biol.
- Pathways for protein folding: is a “new view” needed? Curr. Opin. Struct. Biol.
- Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochromes? J. Mol. Biol.
- Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol.
- Kinetics of protein folding. A lattice model study for the requirements for folding to the native state. J. Mol. Biol.
- The family of cold shock proteins of Bacillus subtilis. Stability and dynamics in vitro and in vivo. J. Biol. Chem.
- Theoretical studies of protein-folding thermodynamics and kinetics. Curr. Opin. Struct. Biol.
- Folding nucleus: specific or multiple? Insights from simulations and comparison with experiment. Fold. Design
- Protein design: a perspective from simple tractable models. Fold. Design
- Structure of the transition state in the folding process of human procarboxypeptidase A2 activation domain. J. Mol. Biol.
- Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry
- Coordinated amino acid changes in homologous protein families. Protein Eng.
- Introduction to Protein Structure
- A simple statistical field theory of heteropolymer collapse with application to protein folding. Biopolymers
- The HSSP database of protein structure-sequence alignments and family profiles. Nucl. Acids Res.
- An Introduction to Probability Theory and its Applications
- Optimal protein-folding codes from spin-glass theory. Proc. Natl Acad. Sci. USA
- Why are some protein structures so common? Proc. Natl Acad. Sci. USA
- Nucleation mechanism for protein folding and theoretical predictions for hydrogen-exchange labelling experiments. Biopolymers
- The effect of boundary selection on the stability and folding of the third fibronectin type III domain from human tenascin. Biochemistry
¹ Edited by A. R. Fersht