Human–Mouse Gene Identification by Comparative Evidence Integration and Evolutionary Analysis

  Lingang Zhang (1,3), Vladimir Pavlovic (2,3,5,6), Charles R. Cantor (1,3,4), and Simon Kasif (2,3,5)

  1. Center for Advanced Biotechnology
  2. Bioinformatics Program
  3. Department of Biomedical Engineering, Boston University, Boston, Massachusetts 02215, USA
  4. Sequenom Inc., San Diego, California 92121, USA

Abstract

The identification of genes in the human genome remains a challenge: current predictions disagree substantially and vary with the gene-finding methodology used. Because the pattern of conservation in coding regions is expected to differ from that in intronic or intergenic regions, comparative computational analysis against a reference, such as the mouse genome, can in principle improve the computational identification of human genes. However, this comparative methodology depends critically on three factors: (1) the selection of the most appropriate reference genome (in particular, it is not clear whether the mouse is at the correct evolutionary distance from the human to provide sufficiently distinctive conservation levels in different genomic regions); (2) the selection of comparative features that provide the most benefit to gene recognition; and (3) the selection of an evidence integration architecture that effectively interprets the comparative features. We address the first question with a novel evolutionary analysis that allows us to explicitly correlate the performance of the gene recognition system with the evolutionary distance (time) between the two genomes. Our simulation results indicate that reference genomes across a wide range of evolutionary time points appear to deliver reasonable comparative prediction of human genes. In particular, the evolutionary time between human and mouse generally falls in the region of good performance; however, better accuracy might be achieved with a reference genome more distant than mouse. To address the second question, we propose several natural comparative measures of conservation for identifying exons and exon boundaries. Finally, we experiment with Bayesian networks for the integration of comparative and compositional evidence.

Footnotes

  • 6 Although BLOSUM80 is expected to better characterize the divergence between human and mouse proteins, we experimented with both BLOSUM62 and BLOSUM80 and found that BLOSUM62 was slightly better than, although very similar to, BLOSUM80 at identifying protein-coding regions. Hence, we used BLOSUM62 in our experiments. All other BLAST parameters were left at their default values.

  • 7 Kullback-Leibler or KL divergence (Cover and Thomas 1991). For two distributions p and q, the KL divergence is defined as $\mathrm{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$.

    It can be shown that one type of annotation error depends on the KL divergence as error ∼ exp(-KL). See V. Pavlovic, L. Zhang, and S. Kasif (in prep.) for more details.

  • 8 Stated more precisely, maximum likelihood estimation yields not the substitution matrices Q themselves but estimates of the products Q′ = Q · t′, in which t′ is the distance between human and mouse. Hence, substituting Q′ and t = 1 in, for instance, (2) yields the exponent Q′ · t = Q′ · 1 = Q · t′ · 1 = Q · t′ and, thus, the probability of substitutions at the evolutionary distance between human and mouse.

  • 9 Error = exp(-KL[Pc ‖ Pn]). Precision = 1 - error.

  • [Software is available on request from the authors.]

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.703903. Article published online before print in May 2003.

  • 5 Corresponding authors. E-MAIL kasif{at}bu.edu; FAX (617) 353-6766. E-MAIL vladimir{at}cs.rutgers.edu

  • 6 Present address: Dept. of Computer Science, Rutgers University, Piscataway 08854, NJ.

    • Accepted February 3, 2003.
    • Received August 9, 2002.
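
The relations in footnotes 7 and 9 can be illustrated with a minimal sketch. It assumes discrete distributions given as equal-length lists of probabilities, uses the natural logarithm, and reads Pc and Pn as two distributions being compared (for example, a feature's distribution in coding and noncoding regions, which is an interpretation rather than something stated in the footnotes). The function names and the example values are hypothetical and are not taken from the authors' software.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) = sum_x p(x) * log(p(x) / q(x)) using natural logs.

    p and q are equal-length sequences of probabilities (an assumed input
    format). Terms with p(x) == 0 contribute zero; if q(x) == 0 while
    p(x) > 0, the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def error_and_precision(p_c, p_n):
    """Apply footnote 9: error = exp(-KL[Pc || Pn]), precision = 1 - error."""
    error = math.exp(-kl_divergence(p_c, p_n))
    return error, 1.0 - error


if __name__ == "__main__":
    # Hypothetical distributions over four conservation-score bins.
    p_c = [0.05, 0.15, 0.30, 0.50]
    p_n = [0.50, 0.30, 0.15, 0.05]
    error, precision = error_and_precision(p_c, p_n)
    print(f"KL = {kl_divergence(p_c, p_n):.4f}")
    print(f"error = {error:.4f}, precision = {precision:.4f}")
```

Identical distributions give KL = 0 and therefore error = 1, while distributions that separate more strongly drive the error toward 0, which matches the intuition that more distinctive conservation levels make regions easier to tell apart.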