A Family of Evolution–Entropy Hybrid Methods for Ranking Protein Residues by Importance

https://doi.org/10.1016/j.jmb.2003.12.078

Abstract

To identify the amino acids that determine protein structure and function, it is useful to rank them by their relative importance. Previous approaches belong to two groups: those that rely on statistical inference, and those that focus on phylogenetic analysis. Here, we introduce a class of hybrid methods that combine evolutionary and entropic information from multiple sequence alignments. A detailed analysis of the insulin receptor kinase domain and tests on proteins that are well characterized experimentally show the hybrids' greater robustness with respect to the input choice of sequences, as well as improved sensitivity and specificity of prediction. This is a further step toward proteome-scale analysis of protein structure and function.

Introduction

When doing protein mutation studies, it is helpful to have an estimate of the relative importance of residues as a guide. In this way, priority can be given to the residues more likely to play a critical role in the protein function or structure. Here, we re-examine how to rank residues by importance, starting with a set of homologous sequences.

All of the alignment or evolutionary residue-scoring methods assume that the importance of a residue is reflected in its evolutionary conservation: the more important the residue, the sooner it becomes fixed in different evolutionary branches, and the more divergent are the branches between which it does vary. (As a working definition of “important residue”, we might take “the residue that cannot be mutated without measurably affecting the protein structure or function.”) There are various approaches to turning this observation into a quantitative prediction of relative residue importance: scoring strict conservation, property conservation, or entropy of a position, or, more elaborately, scoring conservation in related families (even if not across the families).

In one of the earliest attempts to quantify the conservation of a residue at a given alignment position, Zvelebil et al. converted the count of residues with a certain property (hydrophobicity, size, etc.) into a quantity called the conservation number, which enabled them to distinguish poorly conserved loop regions from the rest of the protein structure.1 For current progress in this line of thought, see Valdar.2

Casari et al. proposed an interesting method in their 1995 work: consider the whole sequence as a vector in a space whose dimension is the number of residue types times the sequence length, and think of the alignment as a matrix of such vectors.3 Its eigenvectors should then carry the information about residue preference for each subfamily represented in the alignment. In that way, the tree information is recovered from the analysis, rather than being its input.

The information-theoretic school likes to place the root of its genealogical tree straight at the historical work of Shannon & Weaver, where the entropy of a finite state system was reinterpreted as a measure of its information content.4 With the advance of computational methods in genetics, the idea resurfaced in connection with the information content and thermodynamics of DNA-binding sites.5,6 In 1991 Shenkin et al. used entropy as a robust measure of variability of positions in immunoglobulin sequences, and noted that high variability of a position can be a result of its evolutionary neutrality.7 More recently, entropy-based measures of position conservation have been used for systematic computational analysis of the conservation profile in multiple sequence alignments,8,9 and the approach was also elegantly extended to detect correlated mutations in a sequence.10,11 Mirny & Shakhnovich12 as well as Hannenhalli & Russell13 introduced the notion of summing or averaging the site entropy over several related protein groups, but stopped short of iteratively applying this approach to the hierarchical division of sequences into groups induced by the evolutionary tree, the idea that we propose here.
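As an illustration of the entropy measure discussed above, the following minimal sketch computes the Shannon entropy of each column of a toy alignment. The sequences are hypothetical, the natural logarithm is an arbitrary choice of units, and no special treatment of gaps or sequence weighting is attempted; none of these choices is taken from the methods cited.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in nats) of the residue types in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Toy alignment (hypothetical sequences, one string per aligned sequence).
alignment = ["ACDA", "ACEA", "ACDG", "ACEG"]

# Transpose the alignment so that each string holds one column.
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
entropies = [column_entropy(col) for col in columns]
```

An invariant column has zero entropy, while a column split evenly between two residue types has entropy ln 2; ranking positions by increasing entropy puts the least variable positions first.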

In a parallel and independent development, Lichtarge et al.,14 as well as Livingstone & Barton,15 pointed out that low variability is not equivalent to the conservation of a residue within a subgroup (or a subtree), and that the knowledge of such within-group conservation can be used successfully in estimating residue importance. A similar observation was made by Ptitsyn in the context of structurally important residues.16 While Livingstone & Barton incorporated and built on the work of Zvelebil et al., Lichtarge et al. took an all-or-none approach in considering the within-group conservation, but pointed out how to include the tree information in a systematic, iterative way. The method was named evolutionary trace (ET), and has since been shown capable of detecting protein interaction sites and directing protein mutation studies.17,18,19,20 The present work grew out of the effort to make ET more robust against the deviations from the ideal family-tree picture that occur in actual protein evolution (and in database-dependent research).
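The distinction between overall variability and within-group conservation can be made concrete with a small sketch. It is not the published ET implementation: the successive partitions below are written by hand as a stand-in for cutting an evolutionary tree at increasing depth, and the function names and toy sequences are hypothetical.

```python
def conserved_within_groups(alignment, groups, position):
    """True if every group shows a single residue type at the given position."""
    return all(len({alignment[i][position] for i in group}) == 1 for group in groups)

def trace_rank(alignment, partitions, position):
    """1-based index of the first partition at which the position is invariant
    within every group, or None if that never happens (all-or-none criterion)."""
    for rank, groups in enumerate(partitions, start=1):
        if conserved_within_groups(alignment, groups, position):
            return rank
    return None

# Toy alignment (hypothetical); sequences 0-1 and 2-3 form two subfamilies.
alignment = ["KDL", "KDM", "KEV", "KEI"]

# Hand-written nested partitions standing in for successive tree cuts.
partitions = [
    [[0, 1, 2, 3]],
    [[0, 1], [2, 3]],
    [[0], [1], [2, 3]],
    [[0], [1], [2], [3]],
]

ranks = [trace_rank(alignment, partitions, p) for p in range(3)]
```

Here the second position varies across the whole alignment yet is invariant inside each subfamily, so it is ranked ahead of the third position, which only becomes invariant once every sequence is its own group.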

A comparative study of the various methods mentioned, in terms of their capability to rank the residues, has, to the best of our knowledge, never been performed (see del Sol Mesa et al.11 for a comparison of several methods' ability to pick residues physically close to functionally important residues). In good part, the reason is that it is impossible to obtain from experiment equivalent and independent information, an experimental yardstick against which to measure the performance of various theoretical approaches. This would involve mutation of every single residue in the protein, or at least a sizable portion thereof, which is presently not a feasible option. As an alternative, we construct by literature search a tentative key residue set for several well-investigated proteins, a task in which we are greatly assisted by the Protein Mutant Database.21 We then estimate the quality of a method by its capability to rank the members of the key set highly. In other words, taking this set as a “gold standard”, we study the sensitivity–specificity performance of a method. The methods we focus on in this work are entropy, ET, and two hybrid methods. We propose a general way to construct a hybrid, illustrate the use of these methods using insulin receptor kinase domains as an example, and find that combinations of ET and entropic approaches are more robust against small irregularities in the input, and have increased prediction sensitivity and specificity.
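The evaluation against a key residue set can be sketched as follows. This is one plausible reading, assuming that the top-ranked residues up to a chosen cutoff are taken as predicted important and that both the key set and its complement are non-empty; the residue labels, the ranking, and the cutoff are hypothetical and do not come from the paper.

```python
def sensitivity_specificity(ranked_residues, key_set, top_n):
    """Sensitivity and specificity when the top_n ranked residues are
    taken as the predicted important residues."""
    all_residues = set(ranked_residues)
    predicted = set(ranked_residues[:top_n])
    key = set(key_set) & all_residues          # "gold standard" positives
    non_key = all_residues - key               # everything else
    true_pos = len(predicted & key)
    true_neg = len(non_key - predicted)
    return true_pos / len(key), true_neg / len(non_key)

# Hypothetical ranking of ten residues, most important first.
ranked = [5, 2, 7, 1, 9, 3, 4, 6, 8, 10]
key_residues = {2, 5, 9}

sens, spec = sensitivity_specificity(ranked, key_residues, top_n=4)
```

Sweeping `top_n` from 1 to the number of residues traces out a sensitivity–specificity curve for a given ranking method, which allows different methods to be compared on the same key set.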

Theory

In this section we want to lay out the framework for incorporating entropy into a tree-based residue ranking system (or vice versa). To this purpose, we need to define some terminology: node ordering in a binary tree, and hierarchical division of leaves into groups induced by the tree. We express the existing methods (ET and entropy) in these terms, and review their complementary strengths. Next, we propose a straightforward combination of the two, and point out a way to think about this

Case study: insulin receptor kinase

To illustrate the ideas laid out in the previous section, we look more closely at one particular case: insulin receptor kinase domain. This protein is well studied, and in our analysis we rely on the work by Hubbard.25

Insulin receptors are transmembrane proteins that bind the insulin hormone. The binding leads to autophosphorylation of tyrosine residues in the activation loop of the protein. This results in enhancement of catalytic activity and creation of binding sites for downstream signaling

Conclusion

We discussed several methods of ranking protein residues by their importance for the protein as a whole. They are of necessity somewhat schematic. In particular, they do not systematically handle the varying evolution rates across the evolutionary tree (a problem that we attempted to address partially in defining the zoom method), and they ignore some important possibilities such as correlated mutations. We suggested a general way of integrating tree and variability information,

Key residues in the study case

The following residues were taken to be the key residues for the protein function (the numbers refer to the enumeration in the 1irk PDB entry): 1006, 1010, 1030, 1038, 1042, 1045, 1047, 1054, 1061, 1077, 1079, 1082, 1083, 1085, 1089, 1092, 1131, 1136, 1137, 1139, 1150, 1151, 1152, 1155, 1158, 1162, 1163, 1132, 1164, 1166, 1171, 1172, 1173, 1176, 1181, 1215, 1216, 1219. For a full discussion, see Hubbard.25

Proteins in the test set

In constructing the test set we tried to put together as diverse a set of proteins as

Acknowledgements

The authors give many special thanks to the team of creators and curators of the Protein Mutant Database, without which it would be practically impossible to construct our test set. Thanks go also to the members of the Lichtarge laboratory for critically reading the manuscript. This work was supported by NIH GM066099 and NSF DBI-0318415 grants. I.M. gratefully acknowledges partial support from the W.M. Keck Center for Computational and Structural Biology, and I.R. from the March of Dimes

References (34)

  • G. Casari et al. A method to predict functional residues in proteins. Nature Struct. Biol. (1995)
  • C. Shannon et al. The Mathematical Theory of Communication. (1949)
  • O. Berg et al. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. (1987)
  • P. Shenkin et al. Information-theoretical entropy as a measure of sequence variability. Proteins: Struct. Funct. Genet. (1991)
  • S. Sunyaev et al. PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng. (1999)
  • J. Pei et al. AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics (2001)
  • R. Atchley et al. Positional dependence, cliques, and predictive motifs in the bHLH protein domain. J. Mol. Evol. (1999)