Key Points
-
It is not currently possible to determine the precise structure of every protein-coding gene in a complex, eukaryotic genome. However, the past 10 years have seen steady progress in the accuracy and completeness of methods for automated genome annotation.
-
Currently, the gold standard in the annotation of exon–intron structures is the alignment of a full-length cDNA sequence to the sequence of the genomic region from which it was transcribed.
-
For a significant fraction of genes, it is not practical to obtain full-length cDNA sequences by sequencing randomly selected cDNA clones or by screening clone libraries.
-
Some of these genes can be accurately annotated by aligning the sequence of a cDNA (or its translation) to a very similar genomic region other than the one from which it was transcribed.
-
The first driver of recent improvements in annotation is the sequencing of many genomes that can be compared with one another, a trend that is likely to continue.
-
A second source of improvement is the development of better probability models for de novo gene prediction, most recently those based on the conditional random field modelling framework.
-
A third significant source of improvement in mammalian genome annotation has been the development of software for automatically detecting processed pseudogenes.
-
By designing PCR primers for predicted cDNA sequences, it is possible to specifically amplify and sequence thousands of cDNAs, the sequences of which could not be obtained by traditional methods.
-
By using a combiner program to adjudicate among predictions and alignments produced by several methods, one can now come closer than ever before to producing complete and accurate gene catalogues.
Abstract
The sequencing of large, complex genomes has become routine, but understanding how sequences relate to biological function is less straightforward. Although much attention is focused on how to annotate genomic features such as developmental enhancers and non-coding RNAs, there is still no higher eukaryote for which we know the correct exon–intron structure of at least one ORF for each gene. Despite this uncomfortable truth, genome annotation has made remarkable progress since the first drafts of the human genome were analysed. By combining several computational and experimental methods, we are now closer to producing complete and accurate gene catalogues than ever before.
This is a preview of subscription content, access via your institution
Access options
Subscribe to this journal
Receive 12 print issues and online access
$189.00 per year
only $15.75 per issue
Buy this article
- Purchase on Springer Link
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
References
The MGC Project Team. The status, quality, and expansion of the NIH full-length cDNA project: the mammalian gene collection (MGC). Genome Res. 14, 2121–2127 (2004).
Bernal, A., Crammer, K., Hatzigeorgiou, A. & Pereira, F. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput. Biol. 3, e54 (2007). This paper presents CRAIG, a CRF-based, single-genome de novo gene predictor with the best published accuracy for the human genome among programs that do not use comparison with related genome sequences.
Decaprio, D. et al. CONRAD: gene prediction using conditional random fields. Genome Res. 17, 1389–1398 (2007). This paper presents CONRAD, a CRF-based, multi-genome de novo gene predictor with the best published benchmark accuracy on fungal genomes.
Gross, S. S., Do, C. B., Sirota, M. & Batzoglou, S. CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol. (in the press). This paper presents CONTRAST, a CRF-based, multi-genome de novo gene predictor that is currently the most accurate predictor, at least for mammals and flies. CONTRAST is also likely to work well on other complex eukaryotic genomes.
ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).
Gerstein, M. B. et al. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17, 669–681 (2007).
Mott, R. EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13, 477–478 (1997).
Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
Shibata, Y. et al. Cloning full-length, cap-trapper-selected cDNAs by using the single-strand linker ligation method. Biotechniques 30, 1250–1254 (2001).
Suzuki, Y. et al. Statistical analysis of the 5′ untranslated region of human mRNA using 'oligo-capped' cDNA libraries. Genomics 64, 286–297 (2000).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Guigó, R. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc. Natl Acad. Sci. USA 100, 1140–1145 (2003).
Wu, J. Q., Shteynberg, D., Arumugam, M., Gibbs, R. A. & Brent, M. R. Identification of rat genes by TWINSCAN gene prediction, RT-PCR, and direct sequencing. Genome Res. 14, 665–671 (2004).
Eyras, E. et al. Gene finding in the chicken genome. BMC Bioinformatics 6, 131 (2005).
Denoeud, F. et al. Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 17, 746–759 (2007).
Siepel, A. et al. Targeted discovery of novel human exons by comparative genomics. Genome Res. 17, 1763–1773 (2007). This paper shows that de novo gene prediction followed by RT-PCR and direct sequencing can be used to elucidate many novel exons and introns even in a genome as thoroughly studied as the human genome.
Kent, W. J. BLAT — the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).
Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
Birney, E. et al. An overview of ENSEMBL. Genome Res. 14, 925–928 (2004).
Meyer, I. M. & Durbin, R. Gene structure conservation aids similarity based gene prediction. Nucleic Acids Res. 32, 776–783 (2004).
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
Brent, M. R. How does eukaryotic gene prediction work? Nature Biotechnol. 25, 883–885 (2007).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Pavy, N. et al. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15, 887–899 (1999).
Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J. & Tettelin, H. Interpolated Markov models for eukaryotic gene finding. Genomics 59, 24–31 (1999).
Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E. S. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423, 241–254 (2003). This paper presents the RFC method of identifying protein-coding regions using only multi-genome alignments.
Waterston, R. H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Korf, I., Flicek, P., Duan, D. & Brent, M. R. Integrating genomic homology into gene structure prediction. Bioinformatics 17, S140–S148 (2001).
Flicek, P. & Brent, M. R. Using several pair-wise informant sequences for de novo prediction of alternatively spliced transcripts. Genome Biol. 7, S8 (2006).
Parra, G. et al. Comparative gene prediction in human and mouse. Genome Res. 13, 108–117 (2003).
Parra, G., Blanco, E. & Guigo, R. GeneID in Drosophila. Genome Res. 10, 511–515 (2000).
Clamp, M. et al. Distinguishing protein-coding and non-coding genes in the human genome. Proc. Natl Acad. Sci. USA (in the press).
Wang, M., Buhler, J. & Brent, M. R. in The Genome of Homo Sapiens (eds Stillman, B. & Stewart, D.) 125–130 (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 2004).
Zhang, L., Pavlovic, V., Cantor, C. R. & Kasif, S. Human–mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Res. 13, 1190–1202 (2003).
Clark, A. G. et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203–218 (2007).
Flicek, P., Keibler, E., Hu, P., Korf, I. & Brent, M. R. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res. 13, 46–54 (2003). This paper shows that unassembled sequencing reads representing three- to fourfold coverage of an informant genome are almost as useful as a high-coverage informant assembly for de novo gene prediction.
Siepel, A. C. & Haussler, D. in RECOMB (ACM, San Diego, 2004).
Gross, S. S. & Brent, M. R. Using multiple alignments to improve gene prediction. J. Comput. Biol. 13, 379–393 (2006). This paper presents N-SCAN, a multi-genome de novo gene predictor that was the most accurate program for animal genomes until CONTRAST was introduced.
Do, C. B., Woods, D. A. & Batzoglou, S. CONRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 22, e90–e98 (2006).
Gross, S. S., Russakovsky, O., Do, C. B. & Batzoglou, S. Training conditional random fields for maximum labelwise accuracy. Adv. Neural Inf. Process. Syst. 19, (Neural Information Processing Systems Foundation, 2006).
Wei, C. et al. Closing in on the C. elegans ORFeome by cloning TWINSCAN predictions. Genome Res. 15, 577–582 (2005).
Wei, C. & Brent, M. R. Using ESTs to improve the accuracy of de novo gene prediction. BMC Bioinformatics 7, 327 (2006).
Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
Moskal, W. A. Jr. et al. Experimental validation of novel genes predicted in the un-annotated regions of the Arabidopsis genome. BMC Genomics 8, 18 (2007).
Allen, J. E., Pertea, M. & Salzberg, S. L. Computational gene prediction using multiple sources of evidence. Genome Res. 14, 142–148 (2004).
van Baren, M. J. & Brent, M. R. Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res. 16, 678–685 (2006). This paper presents PPFINDER, a program that can remove processed pseudogene fragments from gene predictions even when there is no database of previously known functional genes.
Torrents, D., Suyama, M., Zdobnov, E. & Bork, P. A genome-wide survey of human pseudogenes. Genome Res. 13, 2559–2567 (2003).
Zhang, Z. & Gerstein, M. Large-scale analysis of pseudogenes in the human genome. Curr. Opin. Genet. Dev. 14, 328–335 (2004).
Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 7, S4 (2006). This paper provides useful insights into a modern manual annotation effort and how it compares with both automated annotation and experimental verification.
Pruitt, K., Tatusova, T. & Maglott, D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 1, 501–504 (2005).
Arumugam, M., Wei, C., Brown, R. H. & Brent, M. R. Pairagon+N-SCAN_EST: a model-based gene annotation pipeline. Genome Biol. 7, S5 (2006).
Stanke, M., Schoffmann, O., Morgenstern, B. & Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006).
Stanke, M., Tzvetkova, A. & Morgenstern, B. AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol. 7, S11 (2006).
Howe, K. L., Chothia, T. & Durbin, R. GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res. 12, 1418–1427 (2002).
Elsik, C. G. et al. Creating a honey bee consensus gene set. Genome Biol. 8, R13 (2007).
Allen, J. E. & Salzberg, S. L. Jigsaw: integration of multiple sources of evidence for gene prediction. Bioinformatics 21, 3596–3603 (2005). This paper presents Jigsaw, a highly accurate system for combining predictions that are produced by other methods.
Coghlan, A. & Durbin, R. Genomix: a method for combining gene-finders' predictions, which uses evolutionary conservation of sequence and intron–exon structure. Bioinformatics 23, 1468–1475 (2007).
Guigo, R. et al. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 7, S2 (2006). This paper describes detailed benchmarks on the accuracy of several gene prediction programs that use a range of methods and evaluating them on 30 Mb of the human genome.
Brent, M. R. Genome annotation past, present and future: how to define an ORF at each locus. Genome Res. 15, 1777–1786 (2005).
D'Haeseleer, P. What are DNA sequence motifs? Nature Biotechnol. 24, 423–425 (2006).
Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, ii215–ii225 (2003). This paper presents AUGUSTUS, currently the most accurate GHMM-based, single-genome de novo predictor for flies. AUGUSTUS uses innovative splice-site and intron-length models.
Acknowledgements
I am deeply grateful to M. J. van Baren, M. Schuster and E. Birney for help with the GeneWise analysis, R. Brown for analysis of informant genome utility, M. J. van Baren and S. Gross for comments on the manuscript, and L. Kyro and L. Langton for help with figures. M.R.B. is supported in part by grants from the National Institutes of Health (HG002278, HG003700, HG004271) and Monsanto.
Author information
Authors and Affiliations
Glossary
- cDNA library
-
A collection of clones that propagate and amplify copies of diverse (usually random) cDNA sequences.
- Cis alignment
-
The alignment of a cDNA sequence to the locus that matches it best in its source genome — the presumed template for its transcription.
- Trans alignment
-
The alignment of a cDNA or protein sequence to a homologous locus other than the one from which it was transcribed.
- De novo gene prediction
-
An approach to gene prediction in which the only inputs are genome sequences; no evidence derived from RNA is used.
- Target genome
-
The genome to be annotated, as opposed to informant genomes or other supporting sequences. In gene prediction, informant genomes are genome sequences that are aligned to the target genome and used as auxiliary information for annotating it.
- Conditional random field
-
A type of discriminative model that is used for assigning probabilities to possible annotations of a sequence. A discriminative model is a probability model in which the most likely values of hidden variables (for example, annotations of DNA segments) are calculated directly from the observed variable values (for example, the DNA sequences) without using the probability of the observed values.
- Shotgun mass spectrometry
-
A method for simultaneously identifying many of the protein species present in a complex mixture by fragmenting them and precisely measuring the charge-to-mass ratios of the fragments in a mass spectrometer.
- Processivity
-
The tendency of a polymerase to continue to move along a template molecule rather than falling off prematurely.
- Robustness
-
The ability to function well in difficult circumstances or in unexpected circumstances for which it was not designed.
- Nearly full-length (NFL) protein alignment
-
Alignment of a protein sequence to a genome in which the alignment extends to the ends of the protein, or nearly so.
- Profile hidden Markov model
-
A mathematical model that represents the conserved elements of an entire family of related proteins or a family of conserved functional domains.
- Training data
-
In de novo gene prediction, it is a set of known gene structures with the corresponding genomic sequence (and alignments to informant genomes, if available). Training data are used in specializing the probability model to fit the characteristics of a particular genome.
- Parse
-
A segmentation of a string of letters together with a labelling of the segments.
- Bayes' rule
-
A mathematical identity (Pr(x|y) = Pr(y|x) Pr(x)/Pr(y)) that allows one to swap variables in a conditional probability expression.
- Negative selection
-
Sequences are under negative selection when mutations are deleterious to fitness and hence tend to be weeded out over time.
- Substitutions per synonymous site
-
An estimate of evolutionary distance that makes use of silent substitutions in protein-coding regions, similar to the rate of substitutions in fourfold degenerate sites.
- Generative model
-
A probability model in which, to calculate the most likely values of hidden variables (annotations of DNA segments), one must also calculate the probability of the observed variable values (the DNA sequence).
- Generalized hidden Markov model
-
A type of generative model that is used for assigning probabilities to possible annotations of a sequence. Generalized hidden Markov models are preferred over ordinary hidden Markov models for gene prediction because they make it possible to model the distribution of exon lengths.
Rights and permissions
About this article
Cite this article
Brent, M. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9, 62–73 (2008). https://doi.org/10.1038/nrg2220
Issue Date:
DOI: https://doi.org/10.1038/nrg2220
This article is cited by
-
Comprehensive analysis of LDHAP5 pseudogene expression and potential pathogenesis in ovarian serous cystadenocarcinoma
Cancer Cell International (2020)
-
Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models
BMC Genomics (2019)
-
Recent Perspective of Next Generation Sequencing: Applications in Molecular Plant Biology and Crop Improvement
Proceedings of the National Academy of Sciences, India Section B: Biological Sciences (2018)
-
Identification and characterization of protein coding genes in monsonia (Monsonia burkeana Planch. ex harv) using a combination of approaches
Genes & Genomics (2017)
-
The paralog-to-contig assignment problem: high quality gene models from fragmented assemblies
Algorithms for Molecular Biology (2016)