Journal of Molecular Biology
Regular article
Effective use of sequence correlation and conservation in fold recognition
Introduction
At the molecular level, the natural process of evolution is reflected in the accumulation of variation between sequences of the same protein in distinct organisms. Zuckerkandl & Pauling (1965) initially proposed that protein families vary at different rates and that sequence divergence accumulates in protein regions related to protein function. Since then, it has generally been accepted that functional and structural constraints on proteins lead to the conservation of the chemical character of amino acid residues in polypeptide chains, as observed in multiple sequence alignments of protein families (Benner, 1989; Benner & Gerloff, 1991; Cooperman et al., 1992; Howell, 1989; Hwang & Fletterick, 1986). However, it is almost impossible to distinguish the structural and functional impacts from one another (Ouzounis et al., 1998).
The information derived from multiple sequence alignments has been used successfully in problems of protein family identification (Bairoch, 1992; Smith & Smith, 1992), in the search for remote homologous sequences (Bork, 1989; Gribskov et al., 1987; Koonin et al., 1996; Thompson et al., 1994) and in the comparison of protein structures (Artymiuk et al., 1993; Bork et al., 1995; Godzik & Sander, 1989; Holm & Sander, 1994; Holm & Sander, 1995; Holm et al., 1994; Mauzy & Hermodson, 1992; Murzin, 1996; Pastore & Lesk, 1990). Progress in secondary structure and accessibility predictions has been attributed largely to the use of simple sequence profiles (Rost & Sander, 1994). Both ab initio folding experiments based on sequence constraints (Taylor, 1991) and popular threading approaches use sequence information to some extent (Defay & Cohen, 1996; Fisher & Eisenberg, 1996; Jones et al., 1992; Ouzounis et al., 1993; Rost, 1995; Rost et al., 1997). Sequence information is also commonly used by biologists in a non-systematic way during the analysis of specific protein families. Intuitive rules are followed; for example, conserved histidine or aspartate residues are often replaced by site-directed mutagenesis in the search for active sites.
Other, more sophisticated definitions of sequence conservation can be derived from the analysis of group-specific sequence information, i.e. residues conserved in particular groups of sequences but not in the entire protein family. These conserved residues are probably related to functional adaptations specific to certain sequence groups. Various strategies have been adopted for the analysis of these residues (Andrade et al., 1997; Casari et al., 1995; Lichtarge et al., 1996; Livingstone & Barton, 1993; Pazos et al., 1997c). It is striking that no proper tools have yet been developed to use this information for prediction of protein structure.
Sequence correlation
Information other than conservation can also be extracted from multiple sequence alignments, namely concerted patterns of variation between different positions in the alignment (Altschuh et al., 1987; Altschuh et al., 1988). Correlated changes are more likely to correspond to compensatory substitutions that occur independently in proteins of the same family, to maintain them within the limits of protein stability. The underlying assumption is that compensation may be favored
Sequence information as a constraint for structure prediction
The protein folding problem remains unsolved, despite the enormous amount of effort spent on it. One of the approaches pursued by a number of groups has been to fold proteins using simulation techniques under defined force-fields of different natures. In this particular incarnation of the protein folding problem, one of the major difficulties is to obtain sufficiently accurate long-range constraints to guide the simulation toward the correct fold. It seems natural to believe that general
Predicting inter-residue contacts
Contact predictions were as accurate as 35 % for small proteins when invariant residues were considered, and as low as 3 % for large proteins (Figure 1). The most representative results were those corresponding to the protein size category of 103 to 166 amino acid residues. For these, contacts were predicted at 13 % accuracy with invariant residues. This was 2.6-fold better than the prediction obtained with variable residues. When the data were analyzed in terms of Xd values, this trend
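The accuracy figures quoted above can be read as the fraction of predicted residue pairs that are true contacts. The following is a minimal sketch under that assumption; the function names and the toy residue pairs are hypothetical and are not taken from the article. Taken at face value, 13 % accuracy and a 2.6-fold improvement imply an accuracy of about 5 % for variable residues.

```python
def contact_accuracy(predicted, observed):
    """Fraction of predicted residue pairs that are observed contacts.

    Pairs are (i, j) residue indices; order within a pair is ignored.
    Assumption: accuracy = correct predictions / all predictions.
    """
    def normalise(pairs):
        return {tuple(sorted(pair)) for pair in pairs}

    predicted_set = normalise(predicted)
    observed_set = normalise(observed)
    if not predicted_set:
        raise ValueError("no predicted contacts")
    return len(predicted_set & observed_set) / len(predicted_set)


def fold_improvement(accuracy, baseline_accuracy):
    """Ratio of two accuracies, e.g. invariant versus variable residues."""
    return accuracy / baseline_accuracy


if __name__ == "__main__":
    # Toy data (hypothetical): four predicted pairs, one of them a real contact.
    predicted = [(1, 30), (5, 60), (10, 80), (12, 90)]
    observed = [(30, 1), (7, 70)]
    print(contact_accuracy(predicted, observed))
    # Accuracy for variable residues implied by the quoted 13 % and 2.6-fold.
    print(0.13 / 2.6)
```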
Discussion
We showed that long-range inter-residue contacts could be predicted at low but significant levels of accuracy using information from sequence conservation and correlation. The tendency of conserved and correlated residues to cluster in space became more obvious when using real values for spatial distances rather than binary contacts: the distance histograms for conserved and correlated residues were shifted toward smaller values than for all other residue pairs.
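The histogram comparison described here can be illustrated with a short sketch. It bins inter-residue distances for a subset of pairs (for example conserved or correlated pairs) and for all pairs, then reports the per-bin difference in frequency. The bin width of 4 Å, the number of bins, the function names and the toy distances are illustrative assumptions, not values from the article.

```python
def distance_histogram(distances, bin_width=4.0, n_bins=15):
    """Normalised histogram of distances; the last bin collects the overflow."""
    if not distances:
        raise ValueError("no distances given")
    counts = [0] * n_bins
    for d in distances:
        index = min(int(d // bin_width), n_bins - 1)
        counts[index] += 1
    return [c / len(distances) for c in counts]


def histogram_difference(subset, all_pairs, bin_width=4.0, n_bins=15):
    """Per-bin frequency of the subset minus per-bin frequency of all pairs.

    Positive values in the low-distance bins indicate that the subset is
    shifted toward smaller distances.
    """
    subset_hist = distance_histogram(subset, bin_width, n_bins)
    all_hist = distance_histogram(all_pairs, bin_width, n_bins)
    return [s - a for s, a in zip(subset_hist, all_hist)]


if __name__ == "__main__":
    # Toy distances in Å (hypothetical).
    conserved = [3.0, 5.0, 6.0, 9.0]
    everything = [3.0, 5.0, 6.0, 9.0, 20.0, 30.0, 41.0, 55.0]
    print(histogram_difference(conserved, everything))
```

A binary contact definition discards everything except whether a pair falls under one threshold, whereas the per-bin differences keep the full shape of the shift.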
Definition of conservation
Sequence conservation is defined in terms of the Variability scale (Sander & Schneider, 1993), which ranges from variability zero (invariant residues) to extremely variable residues with values greater than 50. Our results are presented as four conservation classes, 0, 1-13, 14-18, and more than 21 variability units.
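The four classes listed above can be expressed as a simple binning function. This is only a sketch: the function name is hypothetical, and because the text does not assign values from 19 to 21 to any class, the sketch returns `None` for them rather than guessing.

```python
from typing import Optional


def conservation_class(variability: int) -> Optional[str]:
    """Map a variability value to one of the four classes named in the text.

    Returns None for values 19-21, which the text leaves unassigned.
    """
    if variability < 0:
        raise ValueError("variability cannot be negative")
    if variability == 0:
        return "0"        # invariant residues
    if variability <= 13:
        return "1-13"
    if variability <= 18:
        return "14-18"
    if variability > 21:
        return ">21"
    return None           # 19-21: not covered by the stated classes


if __name__ == "__main__":
    for value in (0, 7, 16, 20, 55):
        print(value, conservation_class(value))
```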
Calculation of correlation
Correlated mutations were calculated as described (Göbel et al., 1994). Each position in the alignment is coded by a distance matrix. This position-specific matrix contains the
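The general idea of comparing position-specific matrices can be sketched as follows. The text above is cut off where it describes the contents of the matrix, so this sketch assumes that each matrix holds a residue similarity score for every pair of sequences at that position, and that two positions are compared by the Pearson correlation of their matrices. The identity-based `toy_similarity` function is a hypothetical stand-in for a real substitution matrix, and no sequence weighting is applied, so this is not the published procedure.

```python
from itertools import combinations
from math import sqrt


def toy_similarity(a, b):
    """Hypothetical stand-in for a substitution matrix: 1 if identical, else 0."""
    return 1.0 if a == b else 0.0


def position_matrix(column, similarity=toy_similarity):
    """Similarity for every pair of sequences at one alignment position.

    Returned as a flat list over the upper triangle of the matrix.
    """
    return [similarity(a, b) for a, b in combinations(column, 2)]


def correlation(column_i, column_j, similarity=toy_similarity):
    """Pearson correlation between the matrices of two alignment positions.

    Returns None when either position shows no variation in similarity
    (for example an invariant column), where the coefficient is undefined.
    """
    x = position_matrix(column_i, similarity)
    y = position_matrix(column_j, similarity)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    if sd_x == 0 or sd_y == 0:
        return None
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return covariance / (sd_x * sd_y)


if __name__ == "__main__":
    # Toy alignment columns (hypothetical), one character per sequence.
    print(correlation("AAVV", "LLII"))  # the two positions change together
    print(correlation("AAVV", "LILI"))  # changes are not concerted
    print(correlation("AAAA", "LLII"))  # invariant column: undefined
```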
Acknowledgements
We are indebted to Chris Sander (Whitehead Institute, Cambridge, MA) for the initial discussions on the use of conserved residues and to C. Sander and Georg Casari (Lion-AG, Heidelberg) for the suggestion of using the set of incorrectly folded proteins during a meeting in Madrid, 1994. This work was supported, in part, by grant BIO94-1067 from CICYT-Spain.
References (89)
- et al. Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J. Mol. Biol. (1987)
- et al. Three-dimensional structural resemblance between the ribonuclease H and connection domain of HIV reverse transcriptase and the ATPase fold revealed using graph theoretical techniques. FEBS Letters (1993)
- Patterns of divergence in homologous proteins as indicators of tertiary and quaternary structure. Advan. Enzyme Regul. (1989)
- et al. Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure: a prediction of the structure of the catalytic domain of protein kinases. Advan. Enzyme Regul. (1991)
- Recognition of functional regions in primary structures using a set of property patterns. FEBS Letters (1989)
- et al. Evolutionary conservation of the active site of soluble inorganic pyrophosphatase. Trends Biochem. Sci. (1992)
- et al. Multiple sequence information for threading algorithms. J. Mol. Biol. (1996)
- et al. Novel method for the rapid evaluation of packing in protein structures. J. Mol. Biol. (1990)
- et al. Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Mol. Biol. (1990)
- et al. Evaluation of protein models by atomic solvation preference. J. Mol. Biol. (1992)
- Protein structure comparison by alignment of distance matrices. J. Mol. Biol.
- Recognizing native folds by the arrangement of hydrophobic and polar residues. J. Mol. Biol.
- Using a hydrophobic contact potential to evaluate native and near-native folds generated by molecular simulations. J. Mol. Biol.
- Protein sequence comparison at genome scale. Methods Enzymol.
- An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol.
- Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol.
- Test for comparing related amino acid sequences. J. Mol. Biol.
- Structural classification of proteins: new superfamilies. Curr. Opin. Struct. Biol.
- An analysis of incorrectly folded models. Implications for structure prediction. J. Mol. Biol.
- Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold. Des.
- Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol.
- Prediction of protein structure by evaluation of sequence-structure fitness: aligning sequences to contact profiles derived from three-dimensional structures. J. Mol. Biol.
- Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol.
- Protein fold recognition by prediction-based threading. J. Mol. Biol.
- Knowledge-based potentials for proteins. Curr. Opin. Struct. Biol.
- Analysis of protein main-chain solvation as a function of secondary structure. J. Mol. Biol.
- WHAT IF: a molecular modelling and drug design program. J. Mol. Graph.
- Generating and testing protein folds. Curr. Opin. Struct. Biol.
- Evolutionary divergence and convergence in proteins.
- Prediction of protein secondary structure and active sites using alignment of homologous sequences. J. Mol. Biol.
- ICM-a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation. J. Comput. Chem.
- Coordinated amino acid changes in homologous protein families. Protein Eng.
- Classification of protein families and detection of the determinant residues with an improved self-organizing map. Biol. Cybern.
- PROSITE: a dictionary of sites and patterns in proteins. Nucl. Acids Res.
- Divergent evolution of a β/α barrel subclass: detection of numerous phosphate-binding sites by motif search. Protein Sci.
- Identification of protein folds: matching hydrophobicity patterns of sequence sets with solvent accessibility patterns of known structures. Proteins: Struct. Funct. Genet.
- Correctly folded proteins make twice as many hydrophobic contacts. Int. J. Pept. Protein Res.
- A method to predict functional residues in proteins. Nature Struct. Biol.
- An analysis of simultaneous variation in protein structures. Protein Eng.
- Protein topology prediction through constraint-based search and the evaluation of topological folding rules. Protein Eng.
- A neural network based predictor of residue contacts in proteins. Protein Eng.
- Protein fold recognition using sequence-derived predictions. Protein Sci.
- On searching for the active sites in proteins and peptide hormones. Comput. Appl. Biosci.
- Properties of intraglobular contacts in proteins. An approach to prediction of tertiary structure.
Edited by J. M. Thornton