Genome-scale phylogenetic function annotation of large and diverse protein families

  1. Steven E. Brenner3
  1. Electrical Engineering and Computer Science Department, University of California, Berkeley, California 94720, USA;
  2. Statistics Department, University of California, Berkeley, California 94720, USA;
  3. Plant & Microbial Biology Department, University of California, Berkeley, California 94720, USA
    • Present addresses: 4 Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA;

    Abstract

    The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces predictions as precise as those of the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of its 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (compared with 34.0% for BLAST). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The SIFTER code and protein family data are available at http://sifter.berkeley.edu.

    Footnotes

    • 5 Molecular and Cellular Biology Department, Harvard University, Cambridge, MA 02138, USA.

    • 6 Corresponding author.

      E-mail bee{at}compbio.berkeley.edu.

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.104687.109.

    • Received December 29, 2009.
    • Accepted July 11, 2011.

    Freely available online through the Genome Research Open Access option.
