Evaluation of Gene-Finding Programs on Mammalian Sequences

Sanja Rogic (1,5), Alan K. Mackworth (2), and Francis B.F. Ouellette (3)

(1) Computer Science Department, The University of California at Santa Cruz, Santa Cruz 95064, California; (2) Computer Science Department, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada; (3) Centre for Molecular Medicine and Therapeutics, Vancouver, BC V5Z 4H4, Canada

Abstract

We present an independent comparative analysis of seven recently developed gene-finding programs: FGENES, GeneMark.hmm, Genie, Genscan, HMMgene, Morgan, and MZEF. For evaluation purposes we developed a new, thoroughly filtered, and biologically validated dataset of mammalian genomic sequences that does not overlap with the training sets of the programs analyzed. Our analysis shows that the new generation of programs has substantially better results than the programs analyzed in previous studies. The accuracy of the programs was also examined as a function of various sequence and prediction features, such as G + C content of the sequence, length and type of exons, signal type, and score of the exon prediction. This approach pinpoints the strengths and weaknesses of each individual program as well as those of computational gene-finding in general. The dataset used in this analysis (HMR195) as well as the tables with the complete results are available at http://www.cs.ubc.ca/~rogic/evaluation/.

Footnotes

  • 5 Corresponding author.

  • E-MAIL: rogic@cse.ucsc.edu; FAX: (831) 459–4046.

  • Article and publication are at www.genome.org/cgi/doi/10.1101/gr.147901.

    • Received May 17, 2000.
    • Accepted February 27, 2001.