Improving gene annotation using peptide mass spectrometry

  1. Stephen Tanner1,6,
  2. Zhouxin Shen2,
  3. Julio Ng1,
  4. Liliana Florea3,
  5. Roderic Guigó4,
  6. Steven P. Briggs2, and
  7. Vineet Bafna5
  1. 1 Bioinformatics Program, University of California, San Diego, La Jolla, California 92093-0419, USA;
  2. 2 Department of Biology, University of California, San Diego, La Jolla, California 92093-0346, USA;
  3. 3 Department of Computer Science, George Washington University, Washington, DC 20052, USA;
  4. 4 Centre de Regulació Genòmica, 08003 Barcelona, Spain;
  5. 5 Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093-0404, USA

Abstract

Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct a genomic database and efficiently search spectra against it, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project.

Footnotes

  • 6 Corresponding author. E-mail stanner{at}ucsd.edu; fax (858) 534-7029

  • [Supplemental material is available online at www.genome.org.]

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5646507

    • Received June 15, 2006.
    • Accepted November 9, 2006.