Steady progress and recent breakthroughs in the accuracy of automated genome annotation

Brent, Michael R.

doi:10.1038/nrg2220

Review Article
Published: January 2008

Nature Reviews Genetics volume 9, pages 62–73 (2008)

Key Points

It is not currently possible to determine the precise structure of every protein-coding gene in a complex, eukaryotic genome. However, the past 10 years have seen steady progress in the accuracy and completeness of methods for automated genome annotation.
Currently, the gold standard in the annotation of exon–intron structures is the alignment of a full-length cDNA sequence to the sequence of the genomic region from which it was transcribed.
For a significant fraction of genes, it is not practical to obtain full-length cDNA sequences by sequencing randomly selected cDNA clones or by screening clone libraries.
Some of these genes can be accurately annotated by aligning the sequence of a cDNA (or its translation) to a very similar genomic region other than the one from which it was transcribed.
The first driver of recent improvements in annotation is the sequencing of many genomes that can be compared with one another, a trend that is likely to continue.
A second source of improvement is the development of better probability models for de novo gene prediction, most recently those based on the conditional random field modelling framework.
A third significant source of improvement in mammalian genome annotation has been the development of software for automatically detecting processed pseudogenes.
By designing PCR primers for predicted cDNA sequences, it is possible to specifically amplify and sequence thousands of cDNAs, the sequences of which could not be obtained by traditional methods.
By using a combiner program to adjudicate among predictions and alignments produced by several methods, one can now come closer than ever before to producing complete and accurate gene catalogues.
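
The idea of a combiner in the last point can be illustrated with a deliberately simple sketch. It assumes each evidence source reports exons as `(start, end)` coordinate pairs and keeps an exon only when enough sources agree on its exact boundaries. The function name `combine_by_vote`, the source labels and the coordinates are all hypothetical, and unweighted voting is a stand-in chosen for brevity; it is not the method used by any combiner program named in this article.

```python
from collections import Counter
from typing import Dict, List, Tuple

# An exon as (start, end) on one strand of one sequence.
# The coordinate convention is an assumption made for this sketch.
Exon = Tuple[int, int]


def combine_by_vote(predictions: Dict[str, List[Exon]],
                    min_support: int = 2) -> List[Exon]:
    """Keep exons reported with identical boundaries by >= min_support sources."""
    votes: Counter = Counter()
    for exons in predictions.values():
        # set() ensures each source votes at most once per exon.
        votes.update(set(exons))
    return sorted(exon for exon, n in votes.items() if n >= min_support)


# Hypothetical evidence from three methods for a single locus.
predictions = {
    "cdna_alignment": [(100, 200), (300, 400)],
    "de_novo": [(100, 200), (300, 410), (500, 600)],
    "protein_alignment": [(100, 200), (300, 400), (500, 600)],
}

consensus = combine_by_vote(predictions)
print(consensus)  # prints [(100, 200), (300, 400), (500, 600)]
```

Here the exon `(300, 410)` is dropped because only one source supports that boundary, whereas `(300, 400)` is kept. A practical combiner would also need to weigh sources by their reliability and ensure that the chosen exons form a consistent gene structure, which this sketch does not attempt.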

Abstract

The sequencing of large, complex genomes has become routine, but understanding how sequences relate to biological function is less straightforward. Although much attention is focused on how to annotate genomic features such as developmental enhancers and non-coding RNAs, there is still no higher eukaryote for which we know the correct exon–intron structure of at least one ORF for each gene. Despite this uncomfortable truth, genome annotation has made remarkable progress since the first drafts of the human genome were analysed. By combining several computational and experimental methods, we are now closer to producing complete and accurate gene catalogues than ever before.

**Figure 1: Performance of GeneWise, a *trans*-alignment program.**

**Figure 2: The steadily increasing accuracy of *de novo* gene prediction algorithms.**

**Figure 3: Criteria for selecting the best informant genome.**

Acknowledgements

I am deeply grateful to M. J. van Baren, M. Schuster and E. Birney for help with the GeneWise analysis, R. Brown for analysis of informant genome utility, M. J. van Baren and S. Gross for comments on the manuscript, and L. Kyro and L. Langton for help with figures. M.R.B. is supported in part by grants from the National Institutes of Health (HG002278, HG003700, HG004271) and Monsanto.

Author information

Authors and Affiliations

Center for Genome Sciences, Campus BOX 8510, Washington University, 4444 Forest Park Blvd, Saint Louis, 63108, Missouri, USA
Michael R. Brent

Glossary

cDNA library: A collection of clones that propagate and amplify copies of diverse (usually random) cDNA sequences.
Cis alignment: The alignment of a cDNA sequence to the locus that matches it best in its source genome — the presumed template for its transcription.
Trans alignment: The alignment of a cDNA or protein sequence to a homologous locus other than the one from which it was transcribed.
De novo gene prediction: An approach to gene prediction in which the only inputs are genome sequences; no evidence derived from RNA is used.
Target genome: The genome to be annotated, as opposed to informant genomes or other supporting sequences. In gene prediction, informant genomes are genome sequences that are aligned to the target genome and used as auxiliary information for annotating it.
Conditional random field: A type of discriminative model that is used for assigning probabilities to possible annotations of a sequence. A discriminative model is a probability model in which the most likely values of hidden variables (for example, annotations of DNA segments) are calculated directly from the observed variable values (for example, the DNA sequences) without using the probability of the observed values.
Shotgun mass spectrometry: A method for simultaneously identifying many of the protein species present in a complex mixture by fragmenting them and precisely measuring the mass-to-charge ratios of the fragments in a mass spectrometer.
Processivity: The tendency of a polymerase to continue to move along a template molecule rather than falling off prematurely.
Robustness: The ability of a system to function well in difficult circumstances, or in unexpected circumstances for which it was not designed.
Nearly full-length (NFL) protein alignment: Alignment of a protein sequence to a genome in which the alignment extends to the ends of the protein, or nearly so.
Profile hidden Markov model: A mathematical model that represents the conserved elements of an entire family of related proteins or a family of conserved functional domains.
Training data: In de novo gene prediction, a set of known gene structures with the corresponding genomic sequence (and alignments to informant genomes, if available). Training data are used in specializing the probability model to fit the characteristics of a particular genome.
Parse: A segmentation of a string of letters together with a labelling of the segments.
Bayes' rule: A mathematical identity (Pr(x|y) = Pr(y|x) Pr(x)/Pr(y)) that allows one to swap variables in a conditional probability expression.
Negative selection: Sequences are under negative selection when mutations are deleterious to fitness and hence tend to be weeded out over time.
Substitutions per synonymous site: An estimate of evolutionary distance that makes use of silent substitutions in protein-coding regions, similar to the rate of substitutions in fourfold degenerate sites.
Generative model: A probability model in which, to calculate the most likely values of hidden variables (annotations of DNA segments), one must also calculate the probability of the observed variable values (the DNA sequence).
Generalized hidden Markov model: A type of generative model that is used for assigning probabilities to possible annotations of a sequence. Generalized hidden Markov models are preferred over ordinary hidden Markov models for gene prediction because they make it possible to model the distribution of exon lengths.
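
The entries for Bayes' rule, generative model and conditional random field can be connected with a short derivation. The symbols are chosen here for illustration and do not come from the article: $S$ stands for the observed DNA sequence, $A$ for a candidate annotation (a parse) of it, and $\hat{A}$ for the most likely annotation.

```latex
\begin{align*}
\hat{A} &= \arg\max_{A} \Pr(A \mid S) \\
        &= \arg\max_{A} \frac{\Pr(S \mid A)\,\Pr(A)}{\Pr(S)}
           && \text{(Bayes' rule)} \\
        &= \arg\max_{A} \Pr(S \mid A)\,\Pr(A)
           && \text{($\Pr(S)$ does not depend on $A$)}
\end{align*}
```

Read this way, a generative model such as a generalized hidden Markov model reaches $\hat{A}$ through the last line, so it must assign probabilities to the sequence itself via $\Pr(S \mid A)$. A discriminative model such as a conditional random field works with the first line directly, scoring annotations given the sequence without modelling the probability of the sequence.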

About this article

Cite this article

Brent, M. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9, 62–73 (2008). https://doi.org/10.1038/nrg2220

Issue Date: January 2008
DOI: https://doi.org/10.1038/nrg2220
