Journal of Molecular Biology
Volume 342, Issue 5, 1 October 2004, Pages 1487-1504

Distinguishing Structural and Functional Restraints in Evolution in Order to Identify Interaction Sites

https://doi.org/10.1016/j.jmb.2004.08.022

Structural genomics projects are producing many three-dimensional structures of proteins that have been identified only from their gene sequences. It is therefore important to develop computational methods that will predict sites involved in productive intermolecular interactions that might give clues about functions. Techniques based on evolutionary conservation of amino acids have the advantage over physicochemical methods in that they are more general. However, the majority of techniques neither use all available structural and sequence information, nor are able to distinguish between evolutionary restraints that arise from the need to maintain structure and those that arise from function. Three methods to identify evolutionary restraints on protein sequence and structure are described here. The first identifies those residues that have a higher degree of conservation than expected: this is achieved by comparing, for each amino acid position, the sequence conservation observed in the homologous family of proteins with the degree of conservation predicted on the basis of amino acid type and local environment. The second uses information theory to identify those positions where environment-specific substitution tables make poor predictions of the overall amino acid substitution pattern. The third method identifies those residues that have highly conserved positions when three-dimensional structures of proteins in a homologous family are superposed. The scores derived from these methods are mapped onto the protein three-dimensional structures and contoured, allowing identification of clusters of residues with strong evolutionary restraints that are sites of interaction in proteins involved in a variety of functions.
Our method differs from other published techniques by making use of structural information to identify restraints that arise from the structure of the protein and differentiating these restraints from others that derive from intermolecular interactions that mediate functions in the whole organism.

Introduction

Although structural studies of proteins have traditionally been carried out on systems that have been functionally well-characterized, structural genomics projects are reversing this tendency, in that they explicitly select proteins that are unrelated to others of known structure and function.1, 2, 3 As a consequence, the structures of many proteins that are poorly characterized in terms of function and biochemistry are now being determined. In order to make best use of these structures, it is essential to identify their functions.

The function of a protein may be described on any of several levels. For example, the molecular function of an enzyme depends on its specificity and the reaction it catalyses; the cellular function may depend additionally on its temporal and spatial expression in the cell, and the physiological function on the organ in which the expressing cells are found (for a review see van Helden et al.4). As we move up this hierarchy, additional information is required to understand the function. However, each level of function depends on the level beneath. Thus, assignment of molecular function, which is often defined in terms of the productive interactions the protein makes in the cell, is an essential first step. Localisation of such interactions to so-called “functional sites” or “interaction sites” will allow us to understand how the protein may recognise other molecules, to gain clues about its likely function at the level of the cell and organism, and to identify important binding sites that may serve as useful targets for pharmaceutical design.

There have been a wide variety of approaches to the problem of functional site detection. Functional sites have been assigned on the basis of changed binding or enzymatic properties resulting from chemical modification of specific amino acid residues or site-directed mutagenesis.5, 6 Sequence motif databases, such as PROSITE,7 which identify sequentially conserved residues and also depend on literature searches, record specific residues likely to be involved in function. This information has been used to annotate various sequence alignment databases, such as Pfam8 and HOMSTRAD9.

Three-dimensional descriptors of functional sites have an advantage over linear descriptors, as the sites themselves usually comprise discontinuous regions of protein sequence, giving rise to weak linear sequence motifs but stronger three-dimensional motifs. Kasuya & Thornton10 have correlated one-dimensional PROSITE motifs with tertiary structures to define three-dimensional functional patterns. Skolnick and co-workers have annotated protein structures with experimentally derived functional information to produce what they term a “fuzzy functional form”, which is a three-dimensional descriptor of the functional site of a protein.11, 12, 13, 14, 15 Since these annotations are derived from experimentally determined data, they are likely to be of high quality, but they are laborious to gather. There have been, therefore, several attempts to identify or predict functional/interaction sites computationally. These studies fall into two broad classes: those that use physical features of the protein and those that seek to identify evolutionary conservation.

Several authors have analysed protein structures in search of steric strain or other types of high-energy conformations. Herzberg & Moult16 found that Ramachandran outliers are often correlated with functional sites. Similarly, Heringa & Argos17 found that non-rotameric side-chains forming interacting clusters are often found in functional sites. Elcock18 extended this work by predicting functional sites on the basis that they have a high energy, as calculated from a molecular mechanics force field. Ota et al.19 have used a similar method in combination with identification of conserved residues to predict catalytic residues in enzymes.

Laskowski et al.20 analysed clefts in proteins and found that the largest cleft tends to be involved in ligand binding. The same group has also characterised protein–protein interaction sites by comparing the nature of interface residues with that of other residues on the protein surface,21 finding that in homo-complexes they are significantly more hydrophobic than the rest of the protein surface. Functional sites have been identified on the basis of clusters of charge,22 particularly mixed charge residues,23 and clusters of conserved, exposed, polar residues.24

A marked disadvantage of identifying interaction/functional sites from physical chemistry is that different functional sites have different characteristics. For example, nucleic acid binding sites on proteins can be identified by their positively charged nature,25 but this is not generally applicable to protein–protein interaction sites. Indeed, protein–protein interaction interfaces in homo- and hetero-complexes are significantly different.21 Further differences will occur in systems where assembly and disassembly are critical to function, and the protein–protein interactions are not enduring. In contrast, with the notable exception of immunoglobulins, which have their binding site produced by recombination of VDJ genes, almost all protein functional sites are optimised through mutation and Darwinian selection. Mutations that change or abolish function will usually be directly selected against, and hence functional sites will be the most highly conserved regions of a protein. It has long been known that residue identity is highly conserved in enzyme active sites,26 although there are more exceptions to this than was originally envisaged.27, 28 Attempts to produce a general method for identifying functional sites have centred on conservation of sequence. However, conservation of structure can also be used,29 or, as here, a combination of sequence and structure conservation.

The observation that the regions of an enzyme in which main-chain conformation is most conserved are likely to be the active site was first made by Chothia & Lesk.30 McPhalen et al.31 developed an iterative three-dimensional fitting method to define structurally conserved regions in proteins. This was applied by Irving et al.29 to identify enzyme active sites in homologous families.

The most widely used method based on evolutionary conservation of sequence is “evolutionary trace”.32, 33, 34, 35 In this technique a phylogenetic tree is constructed based on sequences, and the tree is partitioned. Residues that are conserved in different partitions, or in all partitions, are highlighted on the structure, which is then examined visually. The technique has had a number of successes (for a review, see Lichtarge & Sowa36).
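The partitioning idea described above can be illustrated with a short Python sketch. This is not the published evolutionary trace implementation: the function name trace_positions, the toy alignment, and the hand-assigned group labels (standing in for a cut through a phylogenetic tree) are all assumptions made for illustration. The sketch labels each alignment column as invariant across all partitions, conserved within each partition but different between them, or variable.

```python
from collections import defaultdict


def trace_positions(alignment, groups):
    """Classify alignment columns given a partition of the sequences.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    groups:    dict mapping sequence name -> partition label
               (hypothetical stand-in for cutting a phylogenetic tree)
    Returns a dict mapping column index -> 'invariant',
    'class-specific' or 'variable'.
    """
    length = len(next(iter(alignment.values())))
    by_group = defaultdict(list)
    for name, seq in alignment.items():
        by_group[groups[name]].append(seq)

    labels = {}
    for col in range(length):
        # Set of residues observed in each partition at this column.
        per_group = [{seq[col] for seq in seqs} for seqs in by_group.values()]
        if any(len(s) > 1 or "-" in s for s in per_group):
            # Some partition is not conserved (or contains a gap).
            labels[col] = "variable"
        elif len(set.union(*per_group)) == 1:
            # Every partition is conserved on the same residue.
            labels[col] = "invariant"
        else:
            # Each partition is conserved, but on different residues.
            labels[col] = "class-specific"
    return labels


# Toy data, invented for this example only.
alignment = {
    "seqA": "GDSK",
    "seqB": "GDTK",
    "seqC": "GESR",
    "seqD": "GEAR",
}
groups = {"seqA": 1, "seqB": 1, "seqC": 2, "seqD": 2}
labels = trace_positions(alignment, groups)
for col, label in sorted(labels.items()):
    print(col, label)
```

In a real analysis the partition would come from the phylogenetic tree at several cut depths, and the invariant and class-specific positions would then be mapped onto the three-dimensional structure.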

The original method relied on visual identification of clusters, making the technique difficult to apply automatically to the whole database of structures, although subsequent addition of automated clustering has solved this problem.34 Other authors have removed the dependence on absolute conservation, using a substitution table to allow for substitution of physicochemically similar residues,37, 38 and to allow for non-uniform rates of evolution at each site. Landgraf et al.39 have identified sequence conservation in three-dimensional clusters, and found that it adds sensitivity to the evolutionary trace method.

There are, however, outstanding problems. Bork & Koonin40 and Karp41 discuss the problems with assigning function from the literature. Annotation of databases with experimental information suffers because the rapid growth of sequence and structural information is not matched by growth in experimental information, and the gap is widening. Annotation of sequence databases has the additional problem that the linear form of the data representation is not always compatible with the three-dimensional form of the functional site.

Predictions based on description of the physical characteristics of functional sites usually work well for the class of sites studied (usually enzymes) but often cannot be generalised. Predictions based on high-energy or strained conformations will often find active sites, but are unlikely to be able to identify many other classes of functional sites, particularly those involved in protein–protein interactions.

Methods that identify sequence conservation do not distinguish evolutionary restraints that arise from function from those that arise from structure, and the consideration of chemical similarity does little to make the distinction. Because the core of the protein is likely to be conserved for structural reasons, often only surface residues are analysed, simply by viewing a space-filling representation of the protein, or a solvent accessible surface. However, solvent accessible surfaces miss many residues that provide hydrogen bonds or that become accessible after conformational changes. This is a particular problem for catalytic residues of enzymes, which can often be relatively inaccessible to solvent.

The conservation of amino acid residues has been shown to be strongly dependent on the environment in which they occur in the folded protein, and amino acid substitution tables that give the likely substitutions of amino acids in particular local environments have been derived.42, 43 We present here a method for using these environment-specific substitution tables to distinguish those restraints placed on protein structure from additional restraints due to particular functions mediated by interactions with other molecules. We find that the clusters of residues apparently subjected to these additional restraints in evolution correlate well with the functional sites in proteins defined by experimental methods. We also analyse conservation of local structure in homologous families of proteins and develop a term to describe structural conservation that can be used to increase the accuracy of functional site identification. The method relies on the clustering of residues in three-dimensional space.34 We apply it to a set of well-characterised protein families and are able to identify functional sites. The technique is fast, automatic and predicts functional sites with a high degree of accuracy.
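The comparison of observed substitutions with those expected from an environment-specific substitution table can be sketched in Python as follows. This is only an illustration under stated assumptions: the Kullback-Leibler divergence is used here as one common information-theoretic measure and is not claimed to be the scoring function of the paper; the table TOY_TABLE, its probabilities, the environment label "exposed" and the function names are all hypothetical. A position whose observed residue distribution departs strongly from the table's prediction is a candidate for restraints beyond those imposed by local structure.

```python
import math
from collections import Counter

# Hypothetical toy table: for a residue type in a given local environment,
# the probability of observing each residue at that position in homologues.
# Real environment-specific tables cover all 20 amino acids and many
# environments; these numbers are invented for the example.
TOY_TABLE = {
    ("D", "exposed"): {"D": 0.4, "E": 0.3, "N": 0.2, "S": 0.1},
}


def observed_distribution(column):
    """Relative frequency of each residue in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def kl_divergence(observed, expected):
    """D(observed || expected) in bits.

    Assumes expected[aa] > 0 for every residue present in observed.
    """
    return sum(
        p * math.log2(p / expected[aa]) for aa, p in observed.items() if p > 0
    )


expected = TOY_TABLE[("D", "exposed")]

# Column that is far more conserved than the table predicts.
score_conserved = kl_divergence(observed_distribution("DDDDDDDDDD"), expected)

# Column whose substitutions match the table's prediction.
score_typical = kl_divergence(observed_distribution("DDDDEEENNS"), expected)

print(f"conserved column: {score_conserved:.3f} bits")
print(f"typical column:   {score_typical:.3f} bits")
```

A high score marks a position that the table predicts poorly. In this toy case the fully conserved aspartate scores well above the column whose substitutions follow the expected pattern.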

Section snippets

Datasets used

We have selected three sets of families from the HOMSTRAD database. HOMSTRAD was chosen because it provides structure-based sequence alignments of evolutionarily related protein families that can be used as the basis for collecting sequence homologues. The three datasets were: a “jack-knife” set of ten families that were not used to derive the original substitution tables, a set of enzymes that were added to HOMSTRAD after we derived the substitution tables and performed the initial analysis

Discussion

Evolutionary constraints on protein sequence can arise from a variety of sources, and can have differing strengths.55 We have shown that analysis of these restraints can be used to identify functional sites in proteins. Due to the ubiquitous nature of Darwinian evolution in determining protein sequence, these techniques can be used to identify active sites, sugar binding sites and protein–protein binding sites. They can also be used to identify the binding sites of both homo- and

Database of aligned proteins

Environment-specific substitution tables were derived from the HOMSTRAD database9 using the SUBST program (K. Mizuguchi, unpublished results) as described.42, 43 At the time of compilation of the tables, the database consisted of 3022 structures, grouped into 907 families and aligned on the basis of structure using MNYFIT59, 60 and COMPARER.61 Non-environment-specific substitution tables were calculated by taking the mean of all environment-specific

Acknowledgements

We thank Ross Munro for providing information about characterised enzyme active sites and Kenji Mizuguchi for useful suggestions. S.C.L. was supported as a Wellcome Trust Fellow of Mathematical Biology. V.C. is supported by the Nehru Cambridge Trust and Overseas Research Studentship. L.C. was supported by the Cambridge Overseas Trust and an Overseas Research Studentship award.

References (70)

  • A.E. Todd et al. Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. (2001)
  • A.E. Todd et al. Plasticity of enzyme active sites. Trends Biochem. Sci. (2002)
  • O. Lichtarge et al. The evolutionary trace method defines the binding surfaces common to a protein family. J. Mol. Biol. (1996)
  • O. Lichtarge et al. Identification of functional surfaces of the zinc binding domains of intracellular receptors. J. Mol. Biol. (1997)
  • S. Madabushi et al. Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J. Mol. Biol. (2002)
  • H. Yao et al. An accurate, sensitive and scalable method to identify functional sites in protein structures. J. Mol. Biol. (2003)
  • A. Armon et al. ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J. Mol. Biol. (2001)
  • R. Landgraf et al. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol. (2001)
  • L.H. Pearl et al. The active site of aspartic proteinases. FEBS Letters (1984)
  • A.D.M. Chapman et al. Structural basis of substrate specificity in malate dehydrogenases: crystal structure of a ternary complex of porcine cytoplasmic malate dehydrogenase, alpha-ketomalonate and tetrahydroNAD. J. Mol. Biol. (1999)
  • J.M. Conlon. Evolution of the insulin molecule: insights into structure-activity and phylogenetic relationships. Peptides (2001)
  • C. Kristensen et al. Alanine scanning mutagenesis of insulin. J. Biol. Chem. (1997)
  • A. Sali et al. Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol. (1990)
  • J.M. Word et al. Visualizing and quantifying molecular goodness of fit: small-probe contact dots with explicit hydrogen atoms. J. Mol. Biol. (1999)
  • J.M. Word et al. Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J. Mol. Biol. (1999)
  • G. Yona et al. Within the twilight zone: a sensitive profile–profile comparison tool based on information theory. J. Mol. Biol. (2002)
  • L.A. Mirny et al. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol. (1999)
  • T.L. Blundell et al. Knowledge-based prediction of protein structures and the design of novel molecules. Nature (1987)
  • A. Sali. 100,000 protein structures for the biologist. Nature Struct. Biol. (1998)
  • S.K. Burley et al. Structural genomics: beyond the Human Genome Project. Nature Genet. (1999)
  • J. van Helden et al. Representing and analysing molecular and cellular function using the computer. Biol. Chem. (2000)
  • M.A. Andrade et al. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics (1998)
  • M.A. Andrade et al. Automated genome sequence analysis and annotation. Bioinformatics (1999)
  • K. Hofmann et al. The PROSITE database, its status in 1999. Nucl. Acids Res. (1999)
  • A. Bateman et al. The Pfam Protein Families Database. Nucl. Acids Res. (2002)

Present address: S. C. Lovell, School of Biological Sciences, University of Manchester, Smith Building, Oxford Road, Manchester M13 9PT, UK.
