Recent advances in gene structure prediction
Introduction
The past two years have seen the flowering of the genomic era, a period during which metazoan genome sequencing has been transformed from a major international event to a common undertaking that barely makes the covers of scientific journals, much less popular newspapers. The wealth of raw data generated by this technological triumph has greatly accelerated scientific progress even while it remains far from fully analyzed. It has also driven a series of advances in computational genome analysis, including methods for predicting the exon-intron structures of genes. Such methods can be divided into those that make use of expression data (including sequences from cDNAs and potentially data from hybridization experiments) and those that use only the sequences of one or more genomes (de novo or ab initio methods). The focus of this review is recent developments in de novo gene prediction for the genomes of higher eukaryotes.
De novo gene predictors can be categorized into those that use a single genome sequence, those that use two genome sequences to infer local rates and patterns of mutation along the genome, and those that use more than two genomes for the same purpose. Single-genome predictors reached a state of relative maturity with the development of systems based on hidden Markov models (HMMs) (e.g. GENSCAN [1], GENIE [2] and HMMGENE [3]) and related models (e.g. GENEID [4] and FGENESH [5]). Dual-genome de novo predictors (e.g. SGP-2 [6••], SLAM [7••] and TWINSCAN [8,9••]) have led to the greatest practical improvement in the accuracy of prediction over the past two years. Systems that exploit more than two genomes simultaneously (e.g. [10••,11]) have only recently begun to appear and are not yet competitive on practical tasks, but offer the greatest hope for near-term improvements in accuracy.
Since the first animal and plant genomes were sequenced, de novo gene finders have been part of the standard toolbox for genome annotation and analysis. With the advent of dual-genome predictors, the accuracy for compact genomes, such as that of Arabidopsis thaliana, has become so good that one-half to two-thirds of all known genes are predicted exactly right, from the start codon through every splice site to the stop codon, and most of the imperfect predictions are only slightly off ([12]; Chaochun Wei, personal communication). The accuracy for mammalian genomes has lagged behind owing to inherent challenges, such as the large number of pseudogenes and small fraction of coding sequence, that affect all mammalian annotation methods. Although dual-genome de novo systems now correctly predict about 75% of all known exons at both splice sites, only 15–20% of known gene structures are predicted correctly throughout the coding region [6••,9••]. Annotation pipelines such as ENSEMBL [13], which require homology to known expressed sequences, are somewhat more accurate at predicting exons of known genes [9••], but they tend to miss many predicted exons and genes that can be verified experimentally [14••,15,16]. Perhaps the most significant development of the past year in mammalian annotation has been the application of recently developed pseudogene detection methods [17,18], which have eliminated many false positives from both de novo and pipeline-style annotation. Indeed, the advent of dual-genome systems, together with the elimination of many pseudogenes, has improved the de novo prediction accuracy to the point where systematic reverse transcription-polymerase chain reaction (RT-PCR) and sequencing of de novo predictions is a cost-effective complement to sequencing of random cDNA clones, even in mammalian genomes [19].
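The two accuracy figures quoted above differ in what counts as a hit: an exon is correct when both of its boundaries are predicted exactly, whereas a gene is correct only when every coding exon is. The following Python sketch illustrates that distinction. It is a simplified illustration and not the evaluation code used in the cited studies: the function names and toy coordinates are hypothetical, and strand, chromosome and alternative transcripts are ignored.

```python
# Minimal sketch of exon-level versus gene-level accuracy.
# Assumption: each gene is a list of (start, end) coding-exon coordinates
# on a single strand of a single sequence. All data below are invented.

def exon_sensitivity(annotated, predicted):
    """Fraction of annotated exons predicted with both boundaries exact."""
    annotated_exons = {exon for gene in annotated for exon in gene}
    predicted_exons = {exon for gene in predicted for exon in gene}
    if not annotated_exons:
        return 0.0
    return len(annotated_exons & predicted_exons) / len(annotated_exons)


def gene_sensitivity(annotated, predicted):
    """Fraction of annotated genes whose full exon chain is predicted exactly."""
    annotated_genes = {tuple(gene) for gene in annotated}
    predicted_genes = {tuple(gene) for gene in predicted}
    if not annotated_genes:
        return 0.0
    return len(annotated_genes & predicted_genes) / len(annotated_genes)


# Toy annotation: two genes with three and two coding exons.
annotated = [
    [(100, 200), (300, 400), (500, 600)],
    [(1000, 1100), (1200, 1300)],
]
# Toy prediction: the last exon of the first gene ends 20 bases too late.
predicted = [
    [(100, 200), (300, 400), (500, 620)],
    [(1000, 1100), (1200, 1300)],
]

print(exon_sensitivity(annotated, predicted))  # 4 of 5 exons exact
print(gene_sensitivity(annotated, predicted))  # 1 of 2 genes exact
```

In this toy case a single misplaced boundary leaves four of five exons correct but only one of two genes, which shows why gene-level accuracy can be far lower than exon-level accuracy for genes with many exons.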
Single-genome predictors
Although methods based on the comparison of two or more genomes have greater accuracy, there are good reasons to develop improved single-genome de novo predictors. First, they are easier to train and faster to run than multi-genome predictors, and are consequently the first systems used to annotate newly sequenced genomes. Second, although comparative methods exploit an underlying sequence alignment, the information from the alignment is usually integrated with information from intrinsic
Dual-genome predictors
Dual-genome gene predictors rely on the fact that functional regions of a genome sequence — protein-coding genes in particular — are more conserved during evolution than non-functional ones (Figure 1, bottom two tracks). Over the past four years, several programs have been developed that exploit sequence conservation between two genomes to predict genes. A wide variety of strategies have been explored. In the pair HMM approach (e.g. SLAM [7••]), a joint probability model for sequence alignment
Multi-genome predictors
Dual-genome de novo systems work by using alignments between two genomes to draw inferences about the rate of evolution at each nucleotide. If the two sequences match at a particular base, that base is conserved; if they do not, it is not conserved. Although this has proven effective in practice, it is clearly a crude measure of evolutionary rate. Using multiple alignments among several genomes can provide a more precise measure of evolutionary rate and, in principle, this should lead to
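The match/mismatch notion of conservation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the toy alignment are hypothetical, the input is assumed to be a gapped pairwise alignment with `-` for gaps, and the sketch does not reproduce the representation used by any particular gene predictor.

```python
# Minimal sketch: per-base conservation from a pairwise alignment.
# Assumption: both strings are rows of the same gapped alignment,
# with "-" marking a gap. The sequences below are invented.

def conservation_track(target_aln, informant_aln):
    """Return one boolean per target base: True if the aligned
    informant base is identical, False otherwise."""
    if len(target_aln) != len(informant_aln):
        raise ValueError("aligned sequences must have equal length")
    track = []
    for t, i in zip(target_aln.upper(), informant_aln.upper()):
        if t == "-":
            continue  # gap in the target: no target base at this column
        track.append(t == i)  # a gap in the informant counts as not conserved
    return track


target = "ATGGC-TA"
informant = "ATGACCT-"
track = conservation_track(target, informant)
print("".join("|" if conserved else "." for conserved in track))
```

Each target base receives a binary label, which makes the crudeness of the measure concrete: a base is either conserved or not, with no notion of how fast the site is evolving.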
Combining the output of gene predictors
Human annotators and automated genome annotation ‘pipelines’ [13,33] generally operate by combining information that ultimately derives from cDNAs (expressed sequence tags [ESTs], full-length cDNA sequences and conceptual translations) with information from one or more de novo gene finders. Human annotators use their intuition and experience to synthesize the often contradictory evidence into a single gene structure, whereas pipelines generally use rules based on the intuition and experience
Improving models of DNA sequences and their roles in protein production
De novo gene prediction programs work primarily by recognizing patterns in genomic sequences that are characteristic of splice sites, translation initiation and termination sites, protein-coding regions, poly-adenylation sites and sites with other specific functions in gene expression. In most systems, pattern recognition is based on probability models for each of these functions. For example, given a DNA sequence, the splice donor model assigns a likelihood to the proposition that the sequence
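As a rough illustration of a probability model for a sequence signal, the Python sketch below scores candidate splice donor sites with a position weight matrix, one simple kind of model in which each position is treated independently. It is an assumption-laden sketch: the training sites are invented, the function names are hypothetical, the background is taken to be uniform, and real gene predictors may use richer models than this.

```python
import math

# Minimal sketch: a position weight matrix (PWM) for a sequence signal.
# Assumption: training sites are equal-length uppercase strings over ACGT.
# The "donor sites" below are invented for illustration only.

def train_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned example sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: count / total for base, count in counts.items()})
    return pwm


def log_odds(pwm, seq, background=0.25):
    """Log2 ratio of the sequence's probability under the signal model
    to its probability under a uniform background model."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the model length")
    return sum(math.log2(column[base] / background)
               for column, base in zip(pwm, seq))


toy_donor_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
pwm = train_pwm(toy_donor_sites)

print(log_odds(pwm, "AGGTAAGT"))  # resembles the training sites
print(log_odds(pwm, "AGCCAAGT"))  # lacks the GT dinucleotide
```

A positive score means the sequence is more probable under the signal model than under the background, so candidate sites can be ranked or thresholded by this value.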
Extending the functionality of gene predictors
Despite progress in modeling sequence signals that function in gene expression, overall the models underlying current computational methods are still quite simple, encoding a rather naive view of the eukaryotic gene. Almost without exception, computational gene finders predict only the coding fraction of a single spliced form of non-overlapping, canonical protein-coding genes. They deal poorly, if at all, with untranslated regions (UTRs), alternative spliced forms, overlapping or embedded
Experimental verification and refinement of predicted gene structures
High-throughput sequencing of genomes and cDNA libraries is sometimes described as a ‘data-driven’ approach to biology, in contrast to the traditional hypothesis-driven approach. In keeping with this spirit, gene prediction systems are typically run on entire genomes, the results are published or distributed on web sites, and it is hoped that some of the predictions might influence the hypotheses pursued by experimental biologists. Recently, however, the limitations of the data-driven approach
Concluding remarks
De novo gene prediction for compact eukaryotic genomes is already quite accurate. Although mammalian gene prediction lags behind in accuracy, it is yielding ever more useful results. In particular, the use of de novo gene predictions as hypotheses to drive experimental annotation based on systematic RT-PCR and sequencing will improve mammalian annotation greatly in the coming year. As the new approaches described above are integrated with state-of-the-art gene finders, de novo accuracy can be
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
- • of special interest
- •• of outstanding interest
Acknowledgements
We thank David Shteynberg for help with Figure 1 and Josep F Abril for help with Figure 2. MB is supported, in part, by grants HG02278, HG003150 and AI051209 from the National Institutes of Health, and DBI-0091270 from the National Science Foundation. RG is supported by grants from the Plan Nacional de I+D (Spain), QLK3-CT-2002-02062 from the European Community and HG003150-01 from the National Institutes of Health.
References (65)
- et al. Prediction of complete gene structures in human genomic DNA. J Mol Biol (1997)
- et al. Prediction of gene structure. J Mol Biol (1992)
- Wang M, Buhler J, Brent MR: The effects of evolutionary distance on TWINSCAN, an algorithm for pairwise comparative...
- et al. A generalized hidden Markov model for the recognition of human genes in DNA. Proc Int Conf Intell Syst Mol Biol (1996)
- Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol (1997)
- et al. Ab initio gene finding in Drosophila genomic DNA. Genome Res (2000)
- et al. Comparative gene prediction in human and mouse. Genome Res (2003)
- et al. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res (2003)
- et al. Integrating genomic homology into gene structure prediction. Bioinformatics (2001)
- et al. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res (2003)