Journal of Molecular Biology
Distinguishing Structural and Functional Restraints in Evolution in Order to Identify Interaction Sites
Introduction
Although structural studies of proteins have been traditionally carried out on systems that have been functionally well-characterized, structural genomic projects are reversing this tendency, in that they explicitly select proteins that are unrelated to others of known structure and function.1, 2, 3 As a consequence, the structures of many proteins that are poorly characterized in terms of function and biochemistry are now being determined. In order to make best use of these structures, it is essential to identify their functions.
The function of a protein may be described on any of several levels. For example, the molecular function of an enzyme depends on its specificity and the reaction it catalyses; the cellular function may depend additionally on its temporal and spatial expression in the cell, and the physiological function on the organ in which the expressing cells are found (for a review see van Helden et al.4). As we move up this hierarchy, additional information is required to understand the function. However, each level of function depends on the level beneath. Thus, assignment of molecular function, which is often defined in terms of the productive interactions the protein makes in the cell, is an essential first step. Localisation of such interactions to so-called “functional sites” or “interaction sites” will allow us to understand how the protein may recognise other molecules, to gain clues about its likely function at the level of the cell and organism and to identify important binding sites that may serve as useful targets for pharmaceutical design.
There has been a wide variety of approaches to the problem of functional site detection. Functional sites have been assigned on the basis of changed binding or enzymatic properties resulting from chemical modification of specific amino acid residues or site-directed mutagenesis.5, 6 Sequence motif databases, such as PROSITE,7 which identify sequentially conserved residues and also depend on literature searches, record specific residues likely to be involved in function. This information has been used to annotate various sequence alignment databases, such as Pfam8 and HOMSTRAD9†.
Three-dimensional descriptors of functional sites have an advantage over linear descriptors as the sites themselves are usually comprised of discontinuous regions of protein sequence, giving rise to weak linear sequence motifs but stronger three-dimensional motifs. Kasuya & Thornton10 have correlated one-dimensional PROSITE motifs with tertiary structures to define three-dimensional functional patterns. Skolnick and co-workers have annotated protein structures with experimentally derived functional information to produce what they term a “fuzzy functional form”, which is a three-dimensional descriptor of the functional site of a protein.11, 12, 13, 14, 15 Since these annotations are derived from experimentally determined data they are likely to be of high quality, but are laboriously gathered. There have been, therefore, several attempts to identify or predict functional/interaction sites computationally. These studies fall into two broad classes: those that use physical features of the protein and those that seek to identify evolutionary conservation.
Several authors have analysed protein structures in search of steric strain or other types of high-energy conformations. Herzberg & Moult16 found that Ramachandran outliers are often correlated with functional sites. Similarly, Heringa & Argos17 found that non-rotameric side-chains forming interacting clusters are often found in functional sites. Elcock18 extended this work by predicting functional sites on the basis that they have a high energy, as calculated from a molecular mechanics force field. Ota et al.19 have used a similar method in combination with identification of conserved residues to predict catalytic residues in enzymes.
Laskowski et al.20 analysed clefts in proteins and found the largest cleft tends to be involved in ligand binding areas. The same group has also characterised protein–protein interaction sites by the analysis of the nature of interface residues compared with other residues on the protein surface,21 finding that in homo-complexes they are significantly more hydrophobic than the rest of the protein surface. Functional sites have been identified on the basis of clusters of charge,22 particularly mixed charge residues23 and clusters of conserved, exposed, polar residues.24
A marked disadvantage of identifying interaction/functional sites from physical chemistry is that different functional sites have different characteristics. For example, nucleic acid binding sites on proteins can be identified by their positively charged nature,25 but this is not generally applicable to protein–protein interaction sites. Indeed, protein–protein interaction interfaces in homo- and hetero-complexes are significantly different.21 Further differences will occur in systems where assembly and disassembly are critical to function, and the protein–protein interactions are not enduring. In contrast, with the notable exception of immunoglobulins, which have their binding site produced by recombination of VDJ genes, almost all protein functional sites are optimised through mutation and Darwinian selection. Mutations that change or abolish function will usually be directly selected against and hence functional sites will be the most highly conserved regions of a protein. It has long been known that residue identity is highly conserved in enzyme active sites,26 although there are more exceptions to this than was originally envisaged.27, 28 Attempts to produce a general method for identifying functional sites have centred on conservation of sequence. However, conservation of structure can also be used,29 or, as here, a combination of sequence and structure conservation.
The observation that the regions of an enzyme in which main-chain conformation is most conserved are likely to form the active site was first made by Chothia & Lesk.30 McPhalen et al.31 developed an iterative three-dimensional fitting method to define structurally conserved regions in proteins. This was applied by Irving et al.29 to identify enzyme active sites in homologous families.
The most widely used method based on evolutionary conservation of sequence is “evolutionary trace”.32, 33, 34, 35 In this technique a phylogenetic tree is constructed based on sequences, and the tree is partitioned. Residues that are conserved in different partitions, or in all partitions, are highlighted on the structure, which is then examined visually. The technique has had a number of successes (for a review, see Lichtarge & Sowa36).
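The partition-and-compare step described above can be illustrated with a minimal sketch. It assumes an ungapped toy alignment and a partition that has already been obtained by cutting a tree; the function name `trace_columns`, the labels and the sequences are hypothetical and are not taken from any published implementation.

```python
from typing import Dict, List


def trace_columns(alignment: Dict[str, str], partitions: List[List[str]]) -> List[str]:
    """Label each alignment column as 'conserved', 'class-specific' or 'neutral'."""
    length = len(next(iter(alignment.values())))
    labels = []
    for col in range(length):
        # Set of residues observed in each partition at this column.
        per_group = [{alignment[name][col] for name in group} for group in partitions]
        if any(len(residues) > 1 for residues in per_group):
            labels.append("neutral")         # varies inside at least one partition
        elif len(set.union(*per_group)) == 1:
            labels.append("conserved")       # same residue in every partition
        else:
            labels.append("class-specific")  # invariant within partitions, different between
    return labels


# Toy data (assumed): four sequences split into two partitions.
alignment = {
    "seqA": "GDSH",
    "seqB": "GDTH",
    "seqC": "GESK",
    "seqD": "GEAK",
}
partitions = [["seqA", "seqB"], ["seqC", "seqD"]]
labels = trace_columns(alignment, partitions)
```

In this toy case the first column is invariant everywhere, the second and fourth are invariant within each partition but differ between them, and the third varies inside a partition. Gaps and tree construction are deliberately left out of the sketch.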
The original method relied on visual identification of clusters, making the technique difficult to apply automatically to the whole database of structures, although subsequent addition of automated clustering has solved this problem.34 Other authors have removed the dependence on absolute conservation, using a substitution table to allow for substitution of physicochemically similar residues,37, 38 and to allow for non-uniform rates of evolution at each site. Landgraf et al.39 have identified sequence conservation in three-dimensional clusters, and found it adds sensitivity to the evolutionary trace method.
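To make the idea of automated clustering in three dimensions concrete, the following is a minimal sketch of single-linkage clustering with a distance cutoff. It is not the algorithm of any of the cited papers; the function name `cluster_residues`, the 5 Å cutoff and the coordinates are assumptions chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def cluster_residues(coords: Dict[int, Coord], cutoff: float) -> List[List[int]]:
    """Single-linkage clusters: a residue joins a cluster if it lies within
    `cutoff` of any residue already in that cluster."""
    unvisited = set(coords)
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [r for r in unvisited
                    if math.dist(coords[current], coords[r]) <= cutoff]
            for r in near:
                unvisited.remove(r)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    # Largest clusters first.
    return sorted(clusters, key=len, reverse=True)


# Hypothetical representative-atom coordinates (in Å) for conserved residues.
coords = {
    10: (0.0, 0.0, 0.0),
    11: (3.0, 0.0, 0.0),
    45: (6.0, 0.0, 0.0),
    80: (30.0, 0.0, 0.0),
    81: (33.0, 0.0, 0.0),
    120: (60.0, 0.0, 0.0),
}
clusters = cluster_residues(coords, cutoff=5.0)
```

Residues 10 and 45 are 6 Å apart but end up in the same cluster because both are within the cutoff of residue 11, which is the defining property of single linkage. Residues distant in sequence (here 11 and 45) can cluster together when they are close in space.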
There are, however, outstanding problems. Bork & Koonin40 and Karp41 discuss the problems with assigning function from the literature. Annotation of databases with experimental information suffers from the rapid growth of sequence and structural information that is not matched by the growth of experimental information, a gap that is growing. Annotation of sequence databases has the additional problem of the linear form of the data representation not always being compatible with the three-dimensional form of the functional site.
Predictions based on description of the physical characteristics of functional sites usually work well for the class of sites studied (usually enzymes) but often cannot be generalised. Predictions based on high-energy or strained conformations will often find active sites, but are unlikely to be able to identify many other classes of functional sites, particularly those involved in protein–protein interactions.
Methods that identify sequence conservation do not distinguish evolutionary restraints that arise from function from those that arise from structure, and the consideration of chemical similarity does little to make the distinction. Because the core of the protein is likely to be conserved for structural reasons, often only surface residues are analysed, simply by viewing a space-filling representation of the protein, or a solvent accessible surface. However, solvent accessible surfaces miss many residues that provide hydrogen bonds or that become accessible after conformational changes. This is a particular problem for catalytic residues of enzymes, which can often be relatively inaccessible to solvent.
The conservation of amino acid residues has been shown to be strongly dependent on the environment in which they occur in the folded protein and amino acid substitution tables that give the likely substitutions of amino acids in particular local environments have been derived.42, 43 We present here a method for using these environment-specific substitution tables to distinguish those restraints placed on protein structure from additional restraints due to particular functions mediated by interactions with other molecules. We find that the clusters of residues apparently subjected to these additional restraints in evolution correlate well with the functional sites in proteins defined by experimental methods. We also analyse conservation of local structure in homologous families of proteins and develop a term to describe structural conservation that can be used to increase the accuracy of functional site identification. The method relies on the clustering of residues in three-dimensional space.34 We apply it to a set of well-characterised protein families and are able to identify functional sites. The technique is fast, automatic and predicts functional sites with a high degree of accuracy.
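The underlying idea, that conservation beyond what the structural environment alone would predict may point to a functional restraint, can be sketched as below. This is an illustrative simplification and not the scoring scheme of this paper: the two environment classes, the expected-identity values in `EXPECTED_IDENTITY`, the threshold and the sequences are all hypothetical, whereas a real analysis would take its expectations from environment-specific substitution tables.

```python
from typing import Dict, List

# Hypothetical expected fraction of identical residues per environment.
# Illustrative numbers only; not derived from any real substitution table.
EXPECTED_IDENTITY: Dict[str, float] = {
    "buried": 0.75,
    "exposed": 0.25,
}


def excess_conservation(reference: str, homologues: List[str],
                        environments: List[str]) -> List[float]:
    """Observed minus expected identity at each position of the reference."""
    scores = []
    for i, (residue, env) in enumerate(zip(reference, environments)):
        observed = sum(seq[i] == residue for seq in homologues) / len(homologues)
        scores.append(observed - EXPECTED_IDENTITY[env])
    return scores


def flag_positions(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Indices whose conservation exceeds the structural expectation by `threshold`."""
    return [i for i, score in enumerate(scores) if score >= threshold]


# Toy ungapped alignment (assumed) with one environment label per position.
reference = "LDKS"
homologues = ["LDRT", "LDKA", "IDQS", "LDEN"]
environments = ["buried", "exposed", "exposed", "exposed"]
scores = excess_conservation(reference, homologues, environments)
flagged = flag_positions(scores)
```

In this toy example the buried leucine is well conserved, but no more than its environment would lead one to expect, so it is not flagged. The fully conserved exposed aspartate exceeds its structural expectation and is the only position flagged.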
Datasets used
We have selected three sets of families from the HOMSTRAD database. HOMSTRAD was chosen because it provides structure-based sequence alignments of evolutionarily related protein families that can be used as the basis for collecting sequence homologues. The three datasets were: a “jack-knife” set of ten families that were not used to derive the original substitution tables, a set of enzymes that were added to HOMSTRAD after we derived the substitution tables and performed the initial analysis
Discussion
Evolutionary constraints on protein sequence can arise from a variety of sources, and can have differing strengths.55 We have shown that analysis of these restraints can be used to identify functional sites in proteins. Due to the ubiquitous nature of Darwinian evolution in determining protein sequence these techniques can be used to identify active sites, sugar binding sites and protein–protein binding sites. They can also be used to identify the binding sites of both homo- and
Database of aligned proteins
Environment-specific substitution tables were derived from the HOMSTRAD database9† using the SUBST program (K. Mizuguchi, unpublished results) as described.42, 43 At the time of compilation of the Tables, the database consisted of 3022 structures, grouped into 907 families and aligned on the basis of structure using MNYFIT59, 60 and COMPARER.61 Non-environment-specific substitution tables were calculated by taking the mean of all environment-specific
Acknowledgements
We thank Ross Munro for providing information about characterised enzyme active sites and Kenji Mizuguchi for useful suggestions. S.C.L. was supported as a Wellcome Trust Fellow of Mathematical Biology. V.C. is supported by the Nehru Cambridge Trust and Overseas Research Studentship. L.C. was supported by the Cambridge Overseas Trust and an Overseas Research Studentship award.
References (70)
- et al. Three-dimensional structure analysis of PROSITE patterns. J. Mol. Biol. (1999)
- et al. Functional analysis of the Escherichia coli genome using the sequence-to-structure-to-function paradigm: identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. J. Mol. Biol. (1998)
- et al. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J. Mol. Biol. (1998)
- et al. Enhanced functional annotation of protein sequences via the use of structural descriptors. J. Struct. Biol. (2001)
- Prediction of functionally important residues based solely on the computed energetics of protein structure. J. Mol. Biol. (2001)
- et al. Prediction of catalytic residues in enzymes based on known tertiary structure, stability profile and sequence conservation. J. Mol. Biol. (2003)
- et al. Analysis of protein–protein interaction sites using surface patches. J. Mol. Biol. (1997)
- et al. Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol. (2001)
- et al. Annotating nucleic acid-binding function based on protein structure. J. Mol. Biol. (2003)
- et al. Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J. Mol. Biol. (1987)
- Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol.
- Plasticity of enzyme active sites. Trends Biochem Sci.
- The evolutionary trace method defines the binding surfaces common to a protein family. J. Mol. Biol.
- Identification of functional surfaces of the zinc binding domains of intracellular receptors. J. Mol. Biol.
- Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J. Mol. Biol.
- An accurate, sensitive and scalable method to identify functional sites in protein structures. J. Mol. Biol.
- ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J. Mol. Biol.
- Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol.
- The active site of aspartic proteinases. FEBS Letters
- Structural basis of substrate specificity in malate dehydrogenases: crystal structure of a ternary complex of porcine cytoplasmic malate dehydrogenase, alpha-ketomalonate and tetrahydroNAD. J. Mol. Biol.
- Evolution of the insulin molecule: insights into structure-activity and phylogenetic relationships. Peptide
- Alanine scanning mutagenesis of insulin. J. Biol. Chem.
- Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol.
- Visualizing and quantifying molecular goodness of fit: small-probe contact dots with explicit hydrogen atoms. J. Mol. Biol.
- Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J. Mol. Biol.
- Within the twilight zone: a sensitive profile–profile comparison tool based on information theory. J. Mol. Biol.
- Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol.
- Knowledge-based prediction of protein structures and the design of novel molecules. Nature
- 100,000 protein structures for the biologist. Nature Struct. Biol.
- Structural genomics: beyond the Human Genome Project. Nature Genet.
- Representing and analysing molecular and cellular function using the computer. Biol. Chem.
- Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics
- Automated genome sequence analysis and annotation. Bioinformatics
- The PROSITE database, its status in 1999. Nucl. Acids Res.
- The Pfam Protein Families Database. Nucl. Acids Res.
† Present address: S. C. Lovell, School of Biological Sciences, University of Manchester, Smith Building, Oxford Road, Manchester M13 9PT, UK.