Expressed sequence tags: alternative or complement to whole genome sequences?

https://doi.org/10.1016/S1360-1385(03)00131-6

Abstract

Over three million sequences from approximately 200 plant species have been deposited in the publicly available plant expressed sequence tag (EST) sequence databases. Many of the ESTs have been sequenced as an alternative to complete genome sequencing or as a substrate for cDNA array-based expression analyses. This creates a formidable resource from both biodiversity and gene-discovery standpoints. Bioinformatics-based sequence analysis tools have extended the scope of EST analysis into the fields of proteomics, marker development and genome annotation. Although EST collections are certainly no substitute for a whole genome scaffold, this ‘poor man's genome’ resource forms the core foundation for various genome-scale experiments within the as yet unsequenceable plant genomes.

Section snippets

Contemporary uses of ESTs

Plant genome sizes extend over at least four orders of magnitude. Arabidopsis and Oryza sativa (rice), our model plants with fully sequenced genomes, have among the smallest known genomes: 125 Mbp and 430 Mbp, respectively. Tomato has a genome size of ∼950 Mbp [6] and maize has a genome size of ∼2670 Mbp. Cycad and wheat have genome sizes of ∼14 000 Mbp and ∼17 000 Mbp, respectively. The largest known genomes are currently those of Fritillaria assyriaca (125 000 Mbp) and Psilotum nudum

EST sequence availability and biodiversity

With the latest release of the EMBL sequence database [18] and the weekly updates to the EST database (ftp://ftp.ebi.ac.uk/pub/databases/embl/new/), there were ∼16.1 million ESTs available within the public domain by 14 April 2003. Of these, over 3.1 million are from plant species; they account for 1550 Mbp of sequence and represent almost 200 species. Table 1 lists the plant species with the most available ESTs, ranked by EST count.
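The figures quoted above imply two useful derived numbers: the plant share of all public ESTs and the mean length of a plant EST. The following is a minimal sketch of that arithmetic, using only the values stated in the text; the variable names are illustrative. Because the plant count is given as "over 3.1 million", the mean length should be read as an approximate upper estimate.

```python
# Figures quoted in the text (EMBL, 14 April 2003).
total_ests = 16.1e6    # all public ESTs (approximate)
plant_ests = 3.1e6     # plant ESTs ("over 3.1 million", so a lower bound)
plant_bases = 1550e6   # plant EST sequence in base pairs (1550 Mbp)

# Fraction of the public EST collection that comes from plants.
plant_share = plant_ests / total_ests

# Mean read length; an upper estimate because plant_ests is a lower bound.
mean_length_bp = plant_bases / plant_ests

print(f"Plant share of public ESTs: {plant_share:.1%}")
print(f"Mean plant EST length: {mean_length_bp:.0f} bp")
```

Plants therefore account for roughly one fifth of the public ESTs, and a typical plant EST is about 500 bp long, which is consistent with ESTs being single-pass, partial reads of cDNA clones.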

When we consider the overall biodiversity represented

ESTs and their limitations

There are two main problems associated with EST sequences: (1) the overall representation of host genes within a library and (2) the overall quality of any individual sequence within a collection.

Bioinformatics of plant EST collections

Bioinformatics-based sequence resources have been developed that address the quality, redundancy and partial nature of EST sequences. Sequence resources such as the dbEST database [4] and the EMBL database [18] archive all the available ESTs and provide methods to search for individual sequences on the basis of species, clone or homology attributes. However, these searches are limited to the sequence features that are supplied when the sequence is submitted.
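To make the redundancy problem concrete, the sketch below groups ESTs that share short exact subsequences (k-mers) into clusters and leaves unmatched reads as singletons. It is a toy illustration only, not the method used by dbEST, EMBL or any other resource named here; real pipelines rely on alignment-based assembly and quality-aware trimming. The function names, the parameters `k` and `min_shared`, and the three example reads are all hypothetical.

```python
from itertools import combinations


def kmers(seq, k=8):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=3):
    """Group ESTs sharing at least min_shared k-mers (single linkage)."""
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    profiles = {name: kmers(seq, k) for name, seq in ests.items()}
    for a, b in combinations(ests, 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for name in ests:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical reads: est1 and est2 overlap by 13 bases, est3 is unrelated.
example = {
    "est1": "ATGGCGTACGTTAGCCTAGGCTAACGT",
    "est2": "CCTAGGCTAACGTTTGACCGATCGGA",
    "est3": "GGGTTTCCCAAAGGGTTTCCCAAA",
}
print(cluster_ests(example))  # [['est1', 'est2'], ['est3']]
```

Exact k-mer sharing is chosen here only because it is easy to follow. It tolerates no sequencing errors, so a production tool would need an alignment step to cope with the per-base quality issues that ESTs are known for.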

A range of plant specific EST

ESTs as a current alternative to complete genomes

Within the field of ‘reconstructomics’ [30], ESTs have widely been applied as the foundation sequence of some genome-scale analyses. Such reconstructomic analyses use the EST cluster assemblies and singletons as an equivalent to a whole genome's gene collection. EST derived cluster sequences have been widely annotated with tentative functions. Sources of functional annotation have included non-redundant protein databases [31], the Arabidopsis genome annotation [6] and catalogues of functionally

ESTs as a complement to complete genomes

Complete genome sequences have been produced for Arabidopsis [33] and rice [34, 38]. The complete genome scaffolds for Zea mays, Medicago truncatula, Brassica napus and Populus are either within the sequencing or preparation stages and other plant genomes will follow. ESTs really spring into the limelight when we are presented with a new complete genome sequence and wish to start annotating genes to the chromosomes. Although the underlying methods and science required for the detection and

New tricks with old sequences

It is only recently that plant biologists have taken these vast EST datasets in hand and started a concerted effort to mine the data for novel attributes, started de novo annotation of the sequences, used the sequences within proteomics-based analysis pipelines and exploited the sequences for molecular marker development. There has recently been much interest in the field of expression profiling. By clustering and relating genes on the basis of their expression patterns, genes can be identified

Final comment

As long as ESTs continue to be actively sequenced to fill in knowledge gaps from the gene complement of the large plant genomes, our potential knowledge bases will continue to grow. EST sequencing certainly avoids the biggest problems associated with genome size and the accompanying retrotransposon repetitiveness. The EST sequence resources have been shown to have a wide range of applications and novel uses have been found for the resources. There are, however, some fundamental limitations to

Acknowledgements

Thanks to Heiko Schoof and Wojciech Karlowski for critical appraisal of the manuscript. I am funded within the GABI project by the BMBF (0312270/4).

References (56)

  • B.S. Gaut

    Maize as a model for the evolution of plant nuclear genomes

    Proc. Natl. Acad. Sci. U. S. A.

    (2000)
  • J.S. Heslop-Harrison

    Comparative genome organization in plants: from sequence and markers to chromatin and chromosomes

    Plant Cell

    (2000)
  • J.L. Bennetzen

    Mechanisms and rates of genome expansion and contraction in flowering plants

    Genetica

    (2002)
  • N. Carels

    The gene distribution of the maize genome

    Proc. Natl. Acad. Sci. U. S. A.

    (1995)
  • A. Barakat

    The distribution of genes in the genomes of Gramineae

    Proc. Natl. Acad. Sci. U. S. A.

    (1997)
  • R.A. Hoskins et al.

    Heterochromatic sequences in a Drosophila whole-genome shotgun assembly

    Genome Biol.

    (2002)
  • W.J. Kent et al.

    Assembly of the working draft of the human genome with GigAssembler

    Genome Res.

    (2001)
  • M.N. Raizada

    Somatic and germinal mobility of the RescueMu transposon in transgenic maize

    Plant Cell

    (2001)
  • P.D. Rabinowicz

    Differential methylation of genes and retrotransposons facilitates shotgun sequencing of the maize genome

    Nat. Genet.

    (1999)
  • G. Stoesser

    The EMBL nucleotide sequence database: major new developments

    Nucleic Acids Res.

    (2003)
  • D.C. Daly

    Plant systematics in the age of genomics

    Plant Physiol.

    (2001)
  • R. Herwig

    Construction of a ‘unigene’ cDNA clone set by oligonucleotide fingerprinting allows access to 25 000 potential sugar beet genes

    Plant J.

    (2002)
  • M.F. Bonaldo

    Normalization and subtraction: two approaches to facilitate gene discovery

    Genome Res.

    (1996)
  • L.D. Hillier

    Generation and analysis of 280,000 human expressed sequence tags

    Genome Res.

    (1996)
  • B. Ewing et al.

    Base-calling of automated sequencer traces using phred. II. Error probabilities

    Genome Res.

    (1998)
  • M. Seki

    High-efficiency cloning of Arabidopsis full-length cDNA by biotinylated CAP trapper

    Plant J.

    (1998)
  • K. Heumann et al.

    The hashed position tree (HPT): a suffix tree variant for large data sets stored on slow mass storage devices

  • D. Gordon

    Consed: a graphical tool for sequence finishing

    Genome Res.

    (1998)