Journal of Molecular Biology
Volume 293, Issue 5, 12 November 1999, Pages 1221-1239

Regular article
Effective use of sequence correlation and conservation in fold recognition¹

https://doi.org/10.1006/jmbi.1999.3208

Abstract

Protein families are a rich source of information; sequence conservation and sequence correlation are two of the main properties that can be derived from the analysis of multiple sequence alignments. Sequence conservation is related to the direct evolutionary pressure to retain the chemical characteristics of some positions in order to maintain a given function. Sequence correlation is attributed to the small sequence adjustments needed to maintain protein stability against constant mutational drift. Here, we showed that sequence conservation and correlation were each frequently informative enough to detect incorrectly folded proteins. Furthermore, combining conservation, correlation, and polarity, we achieved an almost perfect discrimination between native and incorrectly folded proteins. Thus, we made use of this information for threading by evaluating the models suggested by a threading method according to the degree of proximity of the corresponding correlated, conserved, and apolar residues. The results showed that the fold recognition capacity of a given threading approach could be improved almost fourfold by selecting the alignments that score best under the three different sequence-based approaches.

Introduction

At the molecular level, the natural process of evolution is reflected in the accumulation of variation between sequences of the same protein in distinct organisms. Zuckerkandl & Pauling (1965) initially proposed that protein families vary at different rates and that sequence divergence accumulates in protein regions related to protein function. Since then, it has generally been accepted that functional and structural constraints on proteins lead to the conservation of the chemical character of amino acid residues in polypeptide chains, as observed in multiple sequence alignments of protein families (Benner, 1989; Benner & Gerloff, 1991; Cooperman et al., 1992; Howell, 1989; Hwang & Fletterick, 1986). However, it is almost impossible to distinguish the structural and functional impacts from one another (Ouzounis et al., 1998).

The information derived from multiple sequence alignments has been used successfully in protein family identification (Bairoch, 1992; Smith & Smith, 1992), in the search for remote homologous sequences (Bork, 1989; Gribskov et al., 1987; Koonin et al., 1996; Thompson et al., 1994) and in the comparison of protein structures (Artymiuk et al., 1993; Bork et al., 1995; Godzik & Sander, 1989; Holm & Sander, 1994; Holm & Sander, 1995; Holm et al., 1994; Mauzy & Hermodson, 1992; Murzin, 1996; Pastore & Lesk, 1990). Progress in secondary structure and accessibility predictions has been attributed largely to the use of simple sequence profiles (Rost & Sander, 1994). Both ab initio folding experiments based on sequence constraints (Taylor, 1991) and popular threading approaches use sequence information to some extent (Defay & Cohen, 1996; Fisher & Eisenberg, 1996; Jones et al., 1992; Ouzounis et al., 1993; Rost, 1995; Rost et al., 1997). Sequence information is also commonly used by biologists in a non-systematic way during the analysis of specific protein families. Intuitive rules are followed; for example, conserved histidine or aspartate residues are often replaced by site-directed mutagenesis in the search for active sites.

Other, more sophisticated definitions of sequence conservation can be derived from the analysis of group-specific sequence information, i.e. residues conserved in particular groups of sequences but not in the entire protein family. These conserved residues are probably related to functional adaptations specific to certain sequence groups. Various strategies have been adopted for the analysis of these residues (Andrade et al., 1997; Casari et al., 1995; Lichtarge et al., 1996; Livingstone & Barton, 1993; Pazos et al., 1997c). It is striking that no proper tools have yet been developed to use this information for prediction of protein structure.

Sequence correlation

Information other than conservation can also be extracted from multiple sequence alignments, namely concerted patterns of variation between different positions in the alignment (Altschuh et al., 1987; Altschuh et al., 1988). Correlated changes are more likely to correspond to compensatory substitutions that occur independently in proteins of the same family, to maintain them within the limits of protein stability. The underlying assumption is that compensation may be favored

Sequence information as a constraint for structure prediction

The protein folding problem remains unsolved, despite the enormous amount of effort spent on it. One of the approaches pursued by a number of groups has been to fold proteins using simulation techniques under defined force-fields of different natures. In this particular incarnation of the protein folding problem, one of the major difficulties is to obtain sufficiently accurate long-range constraints to guide the simulation toward the correct fold. It seems natural to believe that general

Predicting inter-residue contacts

Contact predictions were as accurate as 35 % for small proteins when invariant residues were considered, and as low as 3 % for large proteins (Figure 1). The most representative results were those corresponding to the protein size category of 103 to 166 amino acid residues. For these, contacts were predicted at 13 % accuracy with invariant residues. This was 2.6-fold better than the prediction obtained with variable residues. When the data were analyzed in terms of Xd values, this trend

Discussion

We showed that long-range inter-residue contacts could be predicted at low but significant levels of accuracy using information from sequence conservation and correlation. The tendency of conserved and correlated residues to cluster in space became more obvious when using real values for spatial distances rather than binary contacts: the distance histograms for conserved and correlated residues were shifted toward smaller values than for all other residue pairs.
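The kind of comparison described above can be illustrated with a minimal sketch, which is not the authors' implementation. It measures the fraction of long-range pairs among a chosen residue set (for example, conserved or correlated positions) that lie within a contact cutoff, and divides it by the same fraction computed over all residues. The function names, the 8 Å cutoff, the minimum sequence separation of 6 and the use of one representative atom per residue are assumptions made for illustration only.

```python
import numpy as np

def contact_accuracy(coords, selected, cutoff=8.0, min_separation=6):
    """Fraction of long-range pairs among `selected` residues that are in contact.

    coords: (N, 3) array, one representative atom per residue (assumed C-alpha).
    selected: residue indices flagged as, e.g., conserved or correlated.
    cutoff and min_separation are illustrative values, not taken from the paper.
    """
    coords = np.asarray(coords, dtype=float)
    selected = sorted(set(selected))
    # Keep only pairs separated by at least `min_separation` along the chain.
    pairs = [(i, j) for a, i in enumerate(selected) for j in selected[a + 1:]
             if j - i >= min_separation]
    if not pairs:
        return 0.0
    i_idx, j_idx = map(np.array, zip(*pairs))
    dist = np.linalg.norm(coords[i_idx] - coords[j_idx], axis=1)
    return float(np.mean(dist < cutoff))

def improvement_over_background(coords, selected, **kwargs):
    """Ratio of the accuracy for `selected` to the accuracy over all residues."""
    background = contact_accuracy(coords, range(len(coords)), **kwargs)
    if background == 0.0:
        return float("nan")  # no long-range contacts at all: ratio undefined
    return contact_accuracy(coords, selected, **kwargs) / background
```

Replacing the binary test `dist < cutoff` with the distances themselves would give the distance distributions mentioned above, which can then be compared between the selected pairs and all other pairs.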

Definition of conservation

Sequence conservation is defined in terms of the Variability scale (Sander & Schneider, 1993), which ranges from variability zero (invariant residues) to extremely variable residues with values greater than 50. Our results are presented as four conservation classes, 0, 1-13, 14-18, and more than 21 variability units.

Calculation of correlation

Correlated mutations were calculated as described (Göbel et al., 1994). Each position in the alignment is coded by a distance matrix. This position-specific matrix contains the
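As a rough illustration of a correlated-mutation score of this general kind, the sketch below builds, for each alignment column, a matrix of similarities between every pair of sequences and then takes the Pearson correlation between the matrices of two columns. It is a simplified sketch under stated assumptions, not the published procedure: the identity-based similarity is a toy stand-in for a real substitution matrix, no sequence weighting is applied, and all function names are hypothetical.

```python
import numpy as np

def identity_similarity(a, b):
    # Toy stand-in for a real substitution matrix (an assumption of this sketch).
    return 1.0 if a == b else 0.0

def position_matrix(column, similarity):
    """Similarity between every pair of sequences at one alignment column."""
    n = len(column)
    m = np.empty((n, n))
    for k in range(n):
        for l in range(n):
            m[k, l] = similarity(column[k], column[l])
    return m

def correlation(alignment, i, j, similarity=identity_similarity):
    """Unweighted Pearson correlation between the pair matrices of columns i and j."""
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    upper = np.triu_indices(len(alignment), k=1)  # count each sequence pair once
    s_i = position_matrix(col_i, similarity)[upper]
    s_j = position_matrix(col_j, similarity)[upper]
    if s_i.std() == 0.0 or s_j.std() == 0.0:
        return 0.0  # invariant column: correlation undefined, reported as 0 here
    return float(np.corrcoef(s_i, s_j)[0, 1])
```

For the toy alignment `["AKL", "AKV", "GEL", "GEV"]`, columns 0 and 1 change together in every sequence pair and score close to 1, whereas columns 0 and 2 vary independently of each other and score lower.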

Acknowledgements

We are indebted to Chris Sander (Whitehead Institute Cambridge, MA) for the initial discussions on the use of conserved residues and to C. Sander and Georg Casari (Lion-AG, Heidelberg) for the suggestion of using the set of incorrectly folded proteins during a meeting in Madrid, 1994. This work was supported, in part, by a grant from the CICYT-Spain BIO94-1067.

References (89)

  • L. Holm et al.

    Protein structure comparison by alignment of distance matrices

    J. Mol. Biol.

    (1993)
  • E.S. Huang et al.

    Recognizing native folds by the arrangement of hydrophobic and polar residues

    J. Mol. Biol.

    (1995)
  • E.S. Huang et al.

    Using a hydrophobic contact potential to evaluate native and near-native folds generated by molecular simulations

    J. Mol. Biol.

    (1996)
  • E.V. Koonin et al.

    Protein sequence comparison at genome scale

    Methods Enzymol.

    (1996)
  • O. Lichtarge et al.

    An evolutionary trace method defines binding surfaces common to protein families

    J. Mol. Biol.

    (1996)
  • V.N. Maiorov et al.

    Contact potential that recognizes the correct folding of globular proteins

    J. Mol. Biol.

    (1992)
  • A.D. McLachlan

    Test for comparing related amino acid sequences

    J. Mol. Biol.

    (1971)
  • A.G. Murzin

    Structural classification of proteins: new superfamilies

    Curr. Opin. Struct. Biol.

    (1996)
  • J. Novotny et al.

    An analysis of incorrectly folded models. Implications for structure prediction

    J. Mol. Biol.

    (1984)
  • O. Olmea et al.

    Improving contact predictions by the combination of correlated mutations and other sources of sequence information

    Fold. Des.

    (1997)
  • A.R. Ortiz et al.

    Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments

    J. Mol. Biol.

    (1998)
  • C. Ouzounis et al.

    Prediction of protein structure by evaluation of sequence-structure fitness: aligning sequences to contact profiles derived from three-dimensional structures

    J. Mol. Biol.

    (1993)
  • D.D. Pollock et al.

    Coevolving protein residues: maximum likelihood identification and relationship to structure

    J. Mol. Biol.

    (1999)
  • B. Rost et al.

    Protein fold recognition by prediction-based threading

    J. Mol. Biol.

    (1997)
  • M.J. Sippl

    Knowledge-based potentials for proteins

    Curr. Opin. Struct. Biol.

    (1995)
  • N. Thanki et al.

    Analysis of protein main-chain solvation as a function of secondary structure

    J. Mol. Biol.

    (1991)
  • G. Vriend

    WHAT IF: a molecular modelling and drug design program

    J. Mol. Graph.

    (1990)
  • S.J. Wodak et al.

    Generating and testing protein folds

    Curr. Opin. Struct. Biol.

    (1993)
  • E. Zuckerkandl et al.

    Evolutionary divergence and convergence in proteins

  • M.J. Zvelebil et al.

    Prediction of protein secondary structure and active sites using alignment of homologous sequences

    J. Mol. Biol.

    (1987)
  • R. Abagyan et al.

    ICM-a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation

    J. Comput. Chem.

    (1994)
  • D. Altschuh et al.

    Coordinated amino acid changes in homologous protein families

    Protein Eng.

    (1988)
  • M.A. Andrade et al.

    Classification of protein families and detection of the determinant residues with an improved self-organizing map

    Biol. Cybern.

    (1997)
  • A. Bairoch

    PROSITE: a dictionary of sites and patterns in proteins

    Nucl. Acids Res.

    (1992)
  • P. Bork et al.

    Divergent evolution of a β/α barrel subclass: detection of numerous phosphate-binding sites by motif search

    Protein Sci.

    (1995)
  • D. Bowie et al.

    Identification of protein folds: matching hydrophobicity patterns of sequence sets with solvent accessibility patterns of known structures

    Proteins: Struct. Funct. Genet.

    (1990)
  • S.H. Bryant et al.

    Correctly folded proteins make twice as many hydrophobic contacts

    J. Int. Pept. Protein Res.

    (1987)
  • G. Casari et al.

    A method to predict functional residues in proteins

    Nature Struct. Biol.

    (1995)
  • G. Chelvanayagam et al.

    An analysis of simultaneous variation in protein structures

    Protein Eng.

    (1997)
  • D.A. Clark et al.

    Protein topology prediction through constraint-based search and the evaluation of topological folding rules

    Protein Eng.

    (1991)
  • P. Fariselli et al.

    A neural network based predictor of residue contacts in proteins

    Protein Eng.

    (1999)
  • D. Fisher et al.

    Protein fold recognition using sequence-derived predictions

    Protein Sci.

    (1996)
  • A.E. Gabrielian et al.

    On searching for the active sites in proteins and peptide hormones

    Comput. Appl. Biosci.

    (1990)
  • S.G. Galaktionov et al.

    Properties of intraglobular contacts in proteins: An approach to prediction of tertiary structure

¹ Edited by J. M. Thornton
