Journal of Molecular Biology
Volume 396, Issue 5, 12 March 2010, Pages 1451-1473

Evolutionary Trace Annotation of Protein Function in the Structural Proteome

https://doi.org/10.1016/j.jmb.2009.12.037

Abstract

By design, structural genomics (SG) solves many structures that cannot be assigned function based on homology to known proteins. Alternative function annotation methods are therefore needed and this study focuses on function prediction with three-dimensional (3D) templates: small structural motifs built of just a few functionally critical residues. Although experimentally proven functional residues are scarce, we show here that Evolutionary Trace (ET) rankings of residue importance are sufficient to build 3D templates, match them, and then assign Gene Ontology (GO) functions in enzymes and non-enzymes alike. In a high-specificity mode, this Evolutionary Trace Annotation (ETA) method covered half (53%) of the 2384 annotated SG protein controls. Three-quarters (76%) of predictions were both correct and complete. The positive predictive value for all GO depths (all-depth PPV) was 84%, and it rose to 94% over GO depths 1–3 (depth 3 PPV). In a high-sensitivity mode, coverage rose significantly (84%), while accuracy fell moderately: 68% of predictions were both correct and complete, all-depth PPV was 75%, and depth 3 PPV was 86%. These data concur with prior mutational experiments showing that ET rank information identifies key functional determinants in proteins. In practice, ETA predicted functions in 42% of 3461 unannotated SG proteins. In 529 cases—including 280 non-enzymes and 21 for metal ion ligands—the expected accuracy is 84% at any GO depth and 94% down to GO depth 3, while for the remaining 931 the expected accuracies are 60% and 71%, respectively. Thus, local structural comparisons of evolutionarily important residues can help decipher protein functions to known reliability levels and without prior assumption on functional mechanisms. ETA is available at http://mammoth.bcm.tmc.edu/eta.

Introduction

Most proteins lack known function.1 This is also the case for proteins with known structure, since, as of November 2008, 1784 of the 3433 structures solved by the Protein Structure Initiative (PSI),2, 3, 4, 5 or over half, were labeled “hypothetical” or “unknown function” in the Protein Data Bank (PDB).6 One cause for this lack of knowledge is that detailed functional characterizations of individual proteins are scarce, given their demands on time and resources. High-throughput experimental strategies are one alternative, but large-scale functional screens are only now being developed.7 Moreover, protein function can be context dependent8, 9 or nonspecific, and a battery of positive in vitro assays may overestimate the actual in vivo function. For now, fewer than 5% of annotations are experimental;10, 11 our own analysis of the UniProt database12 suggests that number may be as low as 2.4%. A second reason is that the most widely used annotation method, which transfers annotation among sequence homologs identified by BLAST13 or PSI-BLAST,14 is increasingly error prone as evolutionary distances grow,15 or as variations impinge nearer to functional sites.16 Thus, homology-based transfer of the four Enzyme Commission (EC) digits that describe enzymatic reactions17 is only 90% accurate down to 70% sequence identity.10, 18, 19, 20 Transfers of three-digit EC annotations, the standard that we use to define the correctness of enzyme function predictions, are reliable only down to 50–60% sequence identity.18, 19, 20 Yet many annotations are based on homologs with only 30% sequence identity.21 This has raised repeated concerns that annotation errors occur and then propagate,22, 23, 24, 25, 26, 27 and it justifies the search for new approaches.2, 28

Among these, three-dimensional (3D) templates are of special interest because they probe directly the molecular basis of function through small structural motifs of just a few key amino acids that, ideally, identify functional determinants and their functionally relevant matches in other proteins.29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47 Preliminary studies hint that 3D templates may be accurate when sequence identity is below 40%48 and thus prove complementary to sequence comparison methods. Two limitations decrease the potential impact of 3D templates, however. First, precise and experimentally validated groups of functional residues, such as the well-known catalytic triad,49 that together define a structure–function motif are few.50 Second, when templates are built from residues chosen through heuristics, simple geometric thresholds of root-mean-square deviation (RMSD) between a template and its matches can be nonspecific, failing to reliably discriminate between relevant and random matches.51, 52
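As a rough illustration of why a bare geometric threshold says little about function, the sketch below scores a template against a candidate match with a distance RMSD (dRMSD), that is, the RMSD between the two sets of internal Cα–Cα distances. This is a superposition-free stand-in chosen for brevity, not the matching procedure used by ETA or by the cited template methods; the function names and the coordinates are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return all inter-point distances for a list of (x, y, z) tuples."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def distance_rmsd(template, match):
    """RMSD between the internal distances of two equally sized point sets.

    Assumption: point i of `template` has already been paired with
    point i of `match` (the residue correspondence is given).
    """
    if len(template) != len(match):
        raise ValueError("template and match must have the same number of points")
    if len(template) < 2:
        raise ValueError("at least two points are needed")
    d_t = pairwise_distances(template)
    d_m = pairwise_distances(match)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d_t, d_m)) / len(d_t))


# Hypothetical three-residue template (coordinates in angstroms).
template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 3.8, 0.0)]
# The same geometry moved elsewhere in space: internal distances are unchanged.
shifted = [(x + 10.0, y - 2.0, z + 5.0) for x, y, z in template]
# A distorted candidate: one point displaced by 1 angstrom.
stretched = [(0.0, 0.0, 0.0), (4.8, 0.0, 0.0), (0.0, 3.8, 0.0)]

print(distance_rmsd(template, shifted))    # essentially zero
print(distance_rmsd(template, stretched))  # strictly positive
```

A score of this kind reflects geometry only, and dRMSD cannot even distinguish a site from its mirror image, which is one way to see why residue-type labels and further filters are needed to separate relevant matches from random ones.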

The Evolutionary Trace Annotation (ETA) pipeline was developed to address these issues. It exploits the rankings of evolutionary importance produced by the Evolutionary Trace (ET).53 These rankings are obtained when ET correlates residue variation patterns observed during evolution with sequence divergences. Both retrospective computational studies and numerous experimental validation studies show that top-ranked ET residues on protein surfaces have desirable properties: they cluster structurally; these clusters overlap and predict functional sites;54, 55 and mutations targeted to top-ranked residues reliably block and separate functions56–58 or transfer function among homologs upon exchange of cognate residues.59 These data buttress the basic hypothesis that surface clusters of top-ranked residues may indicate essential determinants of function and specificity. Since this is precisely the type of information that 3D templates seek to embody, ETA analyzes these ET clusters to create 3D templates. Crucially, in so doing, it bypasses the need for any prior knowledge of functional mechanisms. By contrast, the ProFunc metaserver's Enzyme Active Site method,60 for example, relies on experimental knowledge of functional sites taken from the Catalytic Site Atlas (CSA);50 another of ProFunc's component methods does use de novo templates, but ETA improved upon it.45 A notable strength of the ETA approach is that it stringently filters the matches between a 3D template, s, extracted from protein S, and a protein structure T, so as to enhance specificity. First, the matched site in T must be evolutionarily important in its own right;51 second, a template t extracted from T must reciprocally match the structure of protein S;61 and finally, T's function must achieve a plurality over all matches.62 ETA thus reached 92% three-digit EC positive predictive value (PPV) in large, retrospective controls of enzyme function prediction using the Enzyme Commission (EC) system. These encouraging results fell short, however, since (a) most proteins are not enzymes; and (b) the EC scheme does not carry over to non-enzymes. Thus, for ETA to be tested realistically it must be (a) controlled over enzymes and non-enzymes alike and (b) adapted to a universal function classification scheme.
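The three match filters described above (evolutionary importance of the matched site, reciprocity, and plurality) can be sketched as a simple pipeline. This is a minimal illustration under stated assumptions, not the ETA implementation: the `Match` record, its boolean fields, the toy function labels, and the rule that a tie yields no prediction are all hypothetical, and the upstream steps that would compute importance and reciprocity are taken as given.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Match:
    target: str            # identifier of the matched protein T (hypothetical)
    function: str          # known annotation of T
    site_important: bool   # filter 1: matched site in T is itself evolutionarily important
    reciprocal: bool       # filter 2: a template from T matches S back


def predict_function(matches: List[Match]) -> Optional[str]:
    """Apply the two per-match filters, then a plurality vote over the survivors."""
    kept = [m for m in matches if m.site_important and m.reciprocal]
    if not kept:
        return None
    ranked = Counter(m.function for m in kept).most_common()
    # Assumption: a tie for first place means no function reaches a plurality.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


matches = [
    Match("T1", "hydrolase", True, True),
    Match("T2", "hydrolase", True, True),
    Match("T3", "kinase", True, True),
    Match("T4", "kinase", False, True),  # dropped: matched site not important
]
print(predict_function(matches))  # hydrolase
```

Filtering before voting means that a function supported mainly by unfiltered, possibly random matches cannot win the vote, which is the sense in which the filters favor specificity over coverage.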

To these ends, this work generalized ETA to the Gene Ontology (GO) classification, and then evaluated its annotation performance in enzymes and in non-enzymes—first on small control sets, and then on all previously annotated proteins solved by structural genomics (SG) projects. Finally, the possibility of predicting ligand type was assessed for the basic case of proteins that bind metal ions.

GO was a natural choice because it provides a controlled vocabulary that hierarchically describes the biological processes, cellular components, and molecular functions that any protein takes part in.63 Mapping a protein to its correct GO term, or terms, is not entirely straightforward, however. Often multiple overlapping GO annotations are relevant, either because a protein has multiple functions or because GO typically splits functions into several component terms, each at a different level of specificity. ETA was thus extended to predict multiple, hierarchical functions. To verify that the quality of ETA annotations depends in part on the choice of template residues, templates built from evolutionarily unimportant residues were also tried, and they performed worse. Surprisingly, and for a number of possible reasons, so did templates built primarily from CSA information. The control studies focused on the PPV down to depth 3 of the GO hierarchy (depth 3 PPV), which for enzymes corresponds to the second of the four hierarchical EC numbers. The PPV for any GO depth is also reported, as is the fraction of predictions that are both accurate at all GO depths and entirely complete. When specificity was maximized (depth 3 PPV, 94%), the prediction availability, or coverage, was moderate, at 53%. Coverage could be increased to 84% at the cost of lower accuracy (depth 3 PPV, 86%). Finally, ETA was applied to SG proteins without known annotations. This produced 529 predictions at an expected GO depth 3 PPV of 94%, including 280 predicted non-enzymes and 21 ion-binding proteins, and 931 additional predictions with a lower expected depth 3 PPV of 71%, yielding a total of 1460 new annotations.
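To make the evaluation measures concrete, the following sketch computes an all-depth PPV, a depth-limited PPV, and coverage on toy data. The exact counting scheme used in the study is not given in this excerpt, so the definitions here are assumptions: PPV is taken as true-positive predicted terms over all predicted terms, pooled across proteins, and coverage as the fraction of proteins that receive at least one prediction. The term labels A to E and their depths are placeholders, not real GO identifiers.

```python
def ppv(predictions, truth, depth_of, max_depth=None):
    """Pooled positive predictive value over predicted terms.

    predictions, truth: dict mapping protein -> set of term labels
    depth_of: dict mapping term label -> depth in the hierarchy
    max_depth: if given, only terms at depth <= max_depth are scored
    """
    tp = fp = 0
    for protein, predicted in predictions.items():
        known = truth.get(protein, set())
        for term in predicted:
            if max_depth is not None and depth_of[term] > max_depth:
                continue
            if term in known:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if (tp + fp) else float("nan")


def coverage(predictions, all_proteins):
    """Fraction of proteins that received at least one predicted term."""
    covered = sum(1 for p in all_proteins if predictions.get(p))
    return covered / len(all_proteins)


# Toy hierarchy depths and annotations (placeholders only).
depth_of = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 4}
predictions = {"p1": {"A", "B", "C", "D"}, "p2": {"A", "B"}, "p3": set()}
truth = {"p1": {"A", "B", "C", "E"}, "p2": {"A", "B"}, "p3": {"A"}}

print(ppv(predictions, truth, depth_of))               # 5 correct of 6 predicted terms
print(ppv(predictions, truth, depth_of, max_depth=3))  # the depth-4 error is not counted
print(coverage(predictions, ["p1", "p2", "p3"]))       # 2 of 3 proteins have predictions
```

In this toy case the only wrong term sits at depth 4, so restricting the score to depth 3 raises the PPV, mirroring the pattern in the text where depth 3 PPV exceeds all-depth PPV.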

Section snippets

Overview of ETA

The ETA pipeline proceeds in a series of steps best illustrated in an example. To annotate Mycobacterium tuberculosis Rv1626 (PDB ID 1sd5, chain A), shown in Fig. 1, ET ranked its residues by evolutionary importance. The first cluster of 10 or more top-ranked residues on the protein's surface appeared at the sixth percentile rank. From these ETA picked a template: the Cα atom coordinates for the six top-ranked amino acid positions, each one labeled by its allowed side-chain type(s), given the

Discussion

This study extends automated function prediction to any type of protein structure regardless of whether its function is enzymatic or not. ETA first identifies structural motifs of key functional residues, then it tallies local similarities of these residues among all other already annotated structures, and finally it transfers GO annotation between matches, using independent arguments to filter those that best reflect functional rather than random similarities. Prior studies suggested that ETA

Conclusions

ETA transfers functional annotations among proteins based on local structural and evolutionary similarities of their functional sites. It differs from related applications in two important ways. First, it relies on evolutionary analysis by ET to pick templates. This circumvents any need for prior information on protein function or mechanisms, instead letting evolution dictate where functionally relevant motifs are likely to lie in the structure. Second, it is biased to make specific predictions

Function definition

Protein functions were defined for controls as the GO molecular function terms found on the GO website and the EC annotations from the PDB, mapped to GO.63 For metal-ion-binding predictions, the type of ion bound in the structure was taken as the function.

Data sets

A training set of 53 enzymes51 was used to train the SVM (see below) that selects functionally relevant matches, and to initially select parameter values for the template search (also below).

The “Non-enzyme Test

Note added in proof

After the original submission, another template method called FLORA was released79 and showed marked improvement in predicting three-digit EC17 functions compared to Reverse Templates,45 CATHEDRAL,80 and CE.81 We benchmarked ETA against FLORA on the exact same data set of 821 protein chains, using the same leave-one-out procedure to match each protein against the other 820. ETA predictions of three-digit EC functions were made as described previously. Although this is not a stringent comparison

Acknowledgements

O.L. gratefully acknowledges partial support from NSF DBI-0547695, CCF-0905536, and NIH-GM079656 and GM066099. Work by S.E. and R.M.W. was also supported by training fellowships from the National Library of Medicine to the Keck Center for Interdisciplinary Bioscience Training of the Gulf Coast Consortia (NLM grant 5T15LM07093).

References (86)

  • de Rinaldis M. et al.

    Three-dimensional profiles: a new tool to identify protein surface similarities

    J. Mol. Biol.

    (1998)
  • Laskowski R.A.

    SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions

    J. Mol. Graph.

    (1995)
  • Shulman-Peleg A. et al.

    Recognition of functional sites in protein structures

    J. Mol. Biol.

    (2004)
  • Schmitt S. et al.

    A new method to detect related function among proteins independent of sequence and fold homology

    J. Mol. Biol.

    (2002)
  • Laskowski R.A. et al.

    Protein function prediction using local 3D templates

    J. Mol. Biol.

    (2005)
  • Redfern O.C. et al.

    Exploring the structure and function paradigm

    Curr. Opin. Struct. Biol.

    (2008)
  • Watson J.D. et al.

    Towards fully automated structure-based function prediction in structural genomics: a case study

    J. Mol. Biol.

    (2007)
  • Lichtarge O. et al.

    An evolutionary trace method defines binding surfaces common to protein families

    J. Mol. Biol.

    (1996)
  • Madabushi S. et al.

    Structural clusters of evolutionary trace residues are statistically significant and common in proteins

    J. Mol. Biol.

    (2002)
  • Yao H. et al.

    An accurate, sensitive, and scalable method to identify functional sites in protein structures

    J. Mol. Biol.

    (2003)
  • Madabushi S. et al.

    Evolutionary trace of G protein-coupled receptors reveals clusters of residues that determine global and class-specific functions

    J. Biol. Chem.

    (2004)
  • Bartlett G.J. et al.

    Analysis of catalytic residues in enzyme active sites

    J. Mol. Biol.

    (2002)
  • Pal D. et al.

    Inference of protein function from protein structure

    Structure

    (2005)
  • Shen H.B. et al.

    EzyPred: a top-down approach for predicting enzyme functional classes and subclasses

    Biochem. Biophys. Res. Commun.

    (2007)
  • Bate P. et al.

    Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods

    J. Mol. Biol.

    (2004)
  • Dobson P.D. et al.

    Distinguishing enzyme structures from non-enzymes without alignments

    J. Mol. Biol.

    (2003)
  • Mihalek I. et al.

    A family of evolution-entropy hybrid methods for ranking protein residues by importance

    J. Mol. Biol.

    (2004)
  • Lee D. et al.

    Predicting protein function from sequence and structure

    Nat. Rev. Mol. Cell Biol.

    (2007)
  • Rentzsch R. et al.

    Protein function prediction—the power of multiplicity

    Trends Biotechnol.

    (2009)
  • Chandonia J.M. et al.

    The impact of structural genomics: expectations and outcomes

    Science

    (2006)
  • Burley S.K.

    An overview of structural genomics

    Nat. Struct. Biol.

    (2000)
  • Brenner S.E.

    A tour of structural genomics

    Nat. Rev. Genet.

    (2001)
  • Xie L. et al.

    Functional coverage of the human genome by existing structures, structural genomics targets, and homology models

    PLoS Comput. Biol.

    (2005)
  • Mercier K.A. et al.

    FAST-NMR: functional annotation screening technology using NMR spectroscopy

    J. Am. Chem. Soc.

    (2006)
  • Waldron K.J. et al.

    How do bacterial cells ensure that metalloproteins get the correct metal?

    Nat. Rev. Microbiol.

    (2009)
  • Friedberg I.

    Automated protein function prediction—the genomic challenge

    Brief Bioinf.

    (2006)
  • Apweiler R. et al.

    The Universal Protein Resource (UniProt) in 2010

    Nucleic Acids Res.

    (2010)
  • Altschul S.F. et al.

    Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

    Nucleic Acids Res.

    (1997)
  • Gerlt J.A. et al.

    Can sequence determine function?

    Genome Biol.

    (2000)
  • Zhang B. et al.

    From fold predictions to function predictions: automation of functional site conservation analysis for functional genome predictions

    Protein Sci.

    (1999)
  • Galperin M.Y. et al.

    Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption

    In Silico Biol.

    (1998)
  • Karp P.D.

    What we do not know about sequence analysis and sequence databases

    Bioinformatics

    (1998)

    These authors contributed equally to this work.
