Computational gene finding in plants

Pertea, Mihaela; Salzberg, Steven L.

doi:10.1023/A:1013770123580

Plant Molecular Biology, volume 48, pages 39–48 (2002)

Published: January 2002

Abstract

Automated methods for identifying protein coding regions in genomic DNA have progressed significantly in recent years, but there is still a strong need for more accurate computational solutions to the gene finding problem. Large-scale genome sequencing projects depend greatly on gene finding to generate accurate and complete gene annotation. Improvements in gene finding software are being driven by the development of better computational algorithms, a better understanding of the cell's mechanisms for transcription and translation, and the enormous increases in genomic sequence data. This paper reviews some of the most widely used algorithms for gene finding in plants, including technical descriptions of how they work and recent measurements of their success on the genomes of Arabidopsis thaliana and rice.

Author information

Authors and Affiliations

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
Mihaela Pertea & Steven L. Salzberg

About this article

Cite this article

Pertea, M., Salzberg, S.L. Computational gene finding in plants. Plant Mol Biol 48, 39–48 (2002). https://doi.org/10.1023/A:1013770123580
