Recent advances in gene structure prediction

https://doi.org/10.1016/j.sbi.2004.05.007

Abstract

De novo gene predictors are programs that predict the exon-intron structures of genes using the sequences of one or more genomes as their only input. In the past two years, dual-genome de novo predictors, which exploit local rates and patterns of mutation inferred from alignments between two genomes, have led to significant improvements in accuracy. Systems that exploit more than two genomes simultaneously have only recently begun to appear and are not yet competitive on practical tasks, but offer the greatest hope for near-term improvements. Dual-genome de novo prediction for compact eukaryotic genomes such as those of Arabidopsis thaliana and Caenorhabditis elegans is already quite accurate. Although mammalian gene prediction lags behind in accuracy, it is yielding ever more useful results. Together with significant improvements in pseudogene detection methods, which have eliminated many false positives, these advances have brought the field to the point where de novo gene predictions are being used as hypotheses to drive experimental annotation via systematic RT-PCR and sequencing.

Introduction

The past two years have seen the flowering of the genomic era, a period during which metazoan genome sequencing has been transformed from a major international event to a common undertaking that barely makes the covers of scientific journals, much less popular newspapers. The wealth of raw data generated by this technological triumph has greatly accelerated scientific progress even while it remains far from fully analyzed. It has also driven a series of advances in computational genome analysis, including methods for predicting the exon-intron structures of genes. Such methods can be divided into those that make use of expression data (including sequences from cDNAs and potentially data from hybridization experiments) and those that use only the sequences of one or more genomes (de novo or ab initio methods). The focus of this review is recent developments in de novo gene prediction for the genomes of higher eukaryotes.

De novo gene predictors can be categorized into those that use a single genome sequence, those that use two genome sequences to infer local rates and patterns of mutation along the genome, and those that use more than two genomes for the same purpose. Single-genome predictors reached a state of relative maturity with the development of systems based on hidden Markov models (HMMs) (e.g. GENSCAN [1], GENIE [2] and HMMGENE [3]) and related models (e.g. GENEID [4] and FGENESH [5]). Dual-genome de novo predictors (e.g. SGP-2 [6••], SLAM [7••] and TWINSCAN [8,9••]) have led to the greatest practical improvement in the accuracy of prediction over the past two years. Systems that exploit more than two genomes simultaneously (e.g. [10••,11]) have only recently begun to appear and are not yet competitive on practical tasks, but offer the greatest hope for near-term improvements in accuracy.

Since the first animal and plant genomes were sequenced, de novo gene finders have been part of the standard toolbox for genome annotation and analysis. With the advent of dual-genome predictors, the accuracy for compact genomes, such as that of Arabidopsis thaliana, has become so good that one-half to two-thirds of all known genes are predicted exactly right, from the start codon through every splice site to the stop codon, and most of the imperfect predictions are only slightly off ([12]; Chaochun Wei, personal communication). The accuracy for mammalian genomes has lagged behind owing to inherent challenges, such as the large number of pseudogenes and small fraction of coding sequence, that affect all mammalian annotation methods. Although dual-genome de novo systems now correctly predict about 75% of all known exons at both splice sites, only 15–20% of known gene structures are predicted correctly throughout the coding region [6••,9••]. Annotation pipelines such as ENSEMBL [13], which require homology to known expressed sequences, are somewhat more accurate at predicting exons of known genes [9••], but they tend to miss many predicted exons and genes that can be verified experimentally [14••,15,16]. Perhaps the most significant development of the past year in mammalian annotation has been the application of recently developed pseudogene detection methods [17,18], which have eliminated many false positives from both de novo and pipeline-style annotation. Indeed, the advent of dual-genome systems, together with the elimination of many pseudogenes, has improved the de novo prediction accuracy to the point where systematic reverse transcription-polymerase chain reaction (RT-PCR) and sequencing of de novo predictions is a cost-effective complement to sequencing of random cDNA clones, even in mammalian genomes [19].
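The gap between exon-level and gene-level accuracy quoted above follows from how the two measures are defined: an exon counts as correct when both of its boundaries are exact, whereas a gene counts as correct only when every coding exon is exact. The following Python sketch illustrates the distinction on toy data. The function names, the coordinate representation and the example coordinates are assumptions made for illustration; they are not taken from any of the evaluation protocols cited here, and strand and chromosome are ignored for brevity.

```python
from typing import List, Tuple

Exon = Tuple[int, int]   # (start, end) of one coding exon; hypothetical representation
Gene = Tuple[Exon, ...]  # ordered coding exons of one gene


def exon_sensitivity(annotated: List[Gene], predicted: List[Gene]) -> float:
    """Fraction of annotated exons predicted with both boundaries exact."""
    annotated_exons = {exon for gene in annotated for exon in gene}
    predicted_exons = {exon for gene in predicted for exon in gene}
    if not annotated_exons:
        return 0.0
    return len(annotated_exons & predicted_exons) / len(annotated_exons)


def gene_sensitivity(annotated: List[Gene], predicted: List[Gene]) -> float:
    """Fraction of annotated genes whose whole exon chain is predicted exactly."""
    if not annotated:
        return 0.0
    predicted_genes = set(predicted)
    hits = sum(1 for gene in annotated if gene in predicted_genes)
    return hits / len(annotated)


if __name__ == "__main__":
    # Toy coordinates, invented for this example.
    annotated = [
        ((100, 200), (300, 400), (500, 600)),
        ((1000, 1100), (1200, 1300)),
    ]
    predicted = [
        ((100, 200), (300, 400), (500, 650)),  # last exon boundary is off
        ((1000, 1100), (1200, 1300)),
    ]
    print(exon_sensitivity(annotated, predicted))
    print(gene_sensitivity(annotated, predicted))
```

In this toy case four of the five annotated exons are exact (0.8), but only one of the two genes is exact (0.5). A single misplaced boundary spoils a whole gene, which is why gene-level figures fall well below exon-level figures for genes with many exons.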


Single-genome predictors

Although methods based on the comparison of two or more genomes have greater accuracy, there are good reasons to develop improved single-genome de novo predictors. First, they are easier to train and faster to run than multi-genome predictors, and are consequently the first systems used to annotate newly sequenced genomes. Second, although comparative methods exploit an underlying sequence alignment, the information from the alignment is usually integrated with information from intrinsic

Dual-genome predictors

Dual-genome gene predictors rely on the fact that functional regions of a genome sequence — protein-coding genes in particular — are more conserved during evolution than non-functional ones (Figure 1, bottom two tracks). Over the past four years, several programs have been developed that exploit sequence conservation between two genomes to predict genes. A wide variety of strategies have been explored. In the pair HMM approach (e.g. SLAM [7••]), a joint probability model for sequence alignment

Multi-genome predictors

Dual-genome de novo systems work by using alignments between two genomes to draw inferences about the rate of evolution at each nucleotide. If the two sequences match at a particular base, that base is conserved; if they do not, it is not conserved. Although this has proven effective in practice, it is clearly a crude measure of evolutionary rate. Using multiple alignments among several genomes can provide a more precise measure of evolutionary rate and, in principle, this should lead to
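The match/mismatch notion of conservation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is supplied as two equal-length rows with '-' for gaps, and the function names are hypothetical. The second function, which reports the fraction of informant genomes that match at each base, is a deliberately naive stand-in for the multi-genome case and is not the phylogenetic modelling used by the systems cited here.

```python
from typing import List


def conservation_track(target: str, informant: str) -> List[int]:
    """Return 1 for each target base that matches the aligned informant base, else 0.

    Both strings are rows of one pairwise alignment, with '-' marking gaps.
    Columns in which the target has a gap are skipped, so the result has
    exactly one entry per target nucleotide.
    """
    if len(target) != len(informant):
        raise ValueError("aligned rows must have equal length")
    track = []
    for t, i in zip(target.upper(), informant.upper()):
        if t == "-":
            continue  # insertion in the informant: no target base here
        track.append(1 if t == i else 0)
    return track


def conservation_fraction(target: str, informants: List[str]) -> List[float]:
    """Naive multi-genome extension: fraction of informants matching each target base.

    Each informant row must be aligned against the same target row.
    """
    tracks = [conservation_track(target, row) for row in informants]
    return [sum(column) / len(tracks) for column in zip(*tracks)]


if __name__ == "__main__":
    print(conservation_track("ACG-TAC", "ACGGTTC"))
    print(conservation_fraction("ACGT", ["ACGT", "A-GA"]))
```

With two genomes each base is either conserved or not, whereas with several informants the score takes intermediate values. That is the sense in which additional genomes can give a finer-grained signal, even in this simplified form.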

Combining the output of gene predictors

Human annotators and automated genome annotation ‘pipelines’ [13,33] generally operate by combining information that ultimately derives from cDNAs (expressed sequence tags [ESTs], full-length cDNA sequences and conceptual translations) with information from one or more de novo gene finders. Human annotators use their intuition and experience to synthesize the often contradictory evidence into a single gene structure, whereas pipelines generally use rules based on the intuition and experience

Improving models of DNA sequences and their roles in protein production

De novo gene prediction programs work primarily by recognizing patterns in genomic sequences that are characteristic of splice sites, translation initiation and termination sites, protein-coding regions, poly-adenylation sites and sites with other specific functions in gene expression. In most systems, pattern recognition is based on probability models for each of these functions. For example, given a DNA sequence, the splice donor model assigns a likelihood to the proposition that the sequence
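As a concrete illustration of a probability model for a sequence signal, the sketch below estimates a simple position weight matrix from example splice donor sites and scores a candidate sequence by its log-odds against a uniform background. The weight matrix is an assumption chosen for brevity and is not necessarily the donor model used by any particular gene finder discussed here. The function names and the training sequences are likewise hypothetical.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def train_pwm(sites: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Estimate per-position base probabilities from equal-length example sites.

    Sites must contain only A, C, G and T; a pseudocount keeps every
    probability above zero so that log-odds scores stay finite.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all training sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm


def log_odds(seq: str, pwm: List[Dict[str, float]], background: float = 0.25) -> float:
    """Log-likelihood ratio of seq under the site model versus a uniform background."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the model length")
    return sum(math.log(pwm[pos][base] / background) for pos, base in enumerate(seq))


if __name__ == "__main__":
    # Invented training examples that begin with the canonical GT dinucleotide.
    donor_sites = ["GTAAGT", "GTGAGT", "GTAAGT", "GTAAGA"]
    model = train_pwm(donor_sites)
    print(log_odds("GTAAGT", model))  # positive: resembles the training sites
    print(log_odds("CCTTCC", model))  # negative: unlike the training sites
```

A positive score means the sequence is more probable under the site model than under the background, and a negative score means the opposite. Treating positions as independent is the main simplification in this sketch.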

Extending the functionality of gene predictors

Despite progress in modeling sequence signals that function in gene expression, overall the models underlying current computational methods are still quite simple, encoding a rather naive view of the eukaryotic gene. Almost without exception, computational gene finders predict only the coding fraction of a single spliced form of non-overlapping, canonical protein-coding genes. They deal poorly, if at all, with untranslated regions (UTRs), alternative spliced forms, overlapping or embedded

Experimental verification and refinement of predicted gene structures

High-throughput sequencing of genomes and cDNA libraries is sometimes described as a ‘data-driven’ approach to biology, in contrast to the traditional hypothesis-driven approach. In keeping with this spirit, gene prediction systems are typically run on entire genomes, the results are published or distributed on web sites, and it is hoped that some of the predictions might influence the hypotheses pursued by experimental biologists. Recently, however, the limitations of the data-driven approach

Concluding remarks

De novo gene prediction for compact eukaryotic genomes is already quite accurate. Although mammalian gene prediction lags behind in accuracy, it is yielding ever more useful results. In particular, the use of de novo gene predictions as hypotheses to drive experimental annotation based on systematic RT-PCR and sequencing will improve mammalian annotation greatly in the coming year. As the new approaches described above are integrated with state-of-the-art gene finders, de novo accuracy can be

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:

  • of special interest

  •• of outstanding interest

Acknowledgements

We thank David Shteynberg for help with Figure 1 and Josep F Abril for help with Figure 2. MB is supported, in part, by grants HG02278, HG003150 and AI051209 from the National Institutes of Health, and DBI-0091270 from the National Science Foundation. RG is supported by grants from the Plan Nacional de I+D (Spain), QLK3-CT-2002-02062 from the European Community and HG003150-01 from the National Institutes of Health.

References (65)

  • C Burge et al. Prediction of complete gene structures in human genomic DNA. J Mol Biol (1997)
  • R Guigó et al. Prediction of gene structure. J Mol Biol (1992)
  • Wang M, Buhler J, Brent MR: The effects of evolutionary distance on TWINSCAN, an algorithm for pairwise comparative...
  • D Kulp et al. A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol (1996)
  • A Krogh. Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol (1997)
  • AA Salamov et al. Ab initio gene finding in Drosophila genomic DNA. Genome Res (2000)
  • G Parra et al. Comparative gene prediction in human and mouse. Genome Res (2003)
  • M Alexandersson et al. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res (2003)
  • I Korf et al. Integrating genomic homology into gene structure prediction. Bioinformatics (2001)
  • P Flicek et al. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res (2003)
  • Siepel AC, Haussler D: Computational identification of evolutionarily conserved exons. In RECOMB 2004: Proceedings of...
  • JS Pedersen et al. Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics (2003)
  • JE Allen et al. Computational gene prediction using multiple sources of evidence. Genome Res (2004)
  • T Hubbard et al. The Ensembl genome database project. Nucleic Acids Res (2002)
  • R Guigó et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci USA (2003)
  • JQ Wu et al. Identification of rat genes by TWINSCAN gene prediction, RT-PCR, and direct sequencing. Genome Res (2004)
  • C Dewey et al. Accurate identification of novel human genes through simultaneous gene prediction in human, mouse, and rat. Genome Res (2004)
  • Z Zhang et al. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res (2003)
  • D Torrents et al. A genome-wide survey of human pseudogenes. Genome Res (2003)
  • The MGC Project Team: The status, quality and expansion of the NIH full-length cDNA project (MGC). Genome Res 2004,...
  • L Zhang et al. Human-mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Res (2003)
  • D Kotlar et al. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. Genome Res (2003)
  • RH Waterston et al. Initial sequencing and comparative analysis of the mouse genome. Nature (2002)
  • A Nekrutenko et al. An evolutionary approach reveals a high protein-coding capacity of the human genome. Trends Genet (2003)
  • A Nekrutenko et al. ETOPE: evolutionary test of predicted exons. Nucleic Acids Res (2003)
  • JE Moore et al. Gene structure prediction in syntenic DNA segments. Nucleic Acids Res (2003)
  • H Noguchi et al. A novel index which precisely derives protein coding regions from cross-species genome alignments. Genome Inform Ser Workshop Genome Inform (2002)
  • G Parra et al. GeneID in Drosophila. Genome Res (2000)
  • Guigó R, Wiehe T: Gene prediction accuracy in large DNA sequences. In Frontiers in Computational Genomics. Edited by...
  • D Boffelli et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science (2003)
  • I Holmes et al. Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics (2001)
  • Siepel AC, Haussler D: Combining phylogenetic and hidden Markov models in biosequence analysis. In RECOMB 2003:...