Introduction

Population genetics analyses can provide data on a variety of important evolutionary parameters, including standing levels of genetic variation, the partitioning of this variability within/between populations, overall levels of inbreeding, selfing vs outcrossing rates, effective population sizes and the dynamics of recent population bottlenecks. Beyond providing basic evolutionary insights, such analyses are also an important tool for developing effective management strategies for endangered and/or invasive species (Hedrick, 2001; Sakai et al., 2001). Owing to their codominant and highly polymorphic nature, simple-sequence repeats (SSRs, a.k.a. microsatellites) have increasingly become the marker of choice for these sorts of analyses.

Unfortunately, the de novo development of SSRs is a costly and time-consuming endeavor (Zane et al., 2002; Squirrell et al., 2003), and these problems are often compounded by a paucity of resources in taxa that lack clear economic importance. Adding to this difficulty is the fact that the polymerase chain reaction primers used to amplify SSRs are frequently species-specific, meaning that markers developed in one taxon cannot be readily transferred to another. Beyond reducing the general utility of existing markers, this lack of transferability also means that interspecific comparisons are often based on disparate sets of markers, effectively confounding species differences with possible locus-specific effects.

One possible solution to these sorts of problems would be to exploit publicly available genomic resources for the development of gene-based SSR markers that are more likely to be transferable across taxonomic boundaries. In fact, the rapid and inexpensive development of SSRs from expressed sequence tag (EST) databases has been shown to be a feasible option for obtaining high-quality nuclear markers (Gupta et al., 2003; Bhat et al., 2005). Moreover, the National Center for Biotechnology Information (NCBI) EST database (dbEST; Boguski et al., 1993) contains an ever-increasing number of these ‘single-pass’ cDNA sequences, meaning that the resources necessary for the efficient development of large numbers of so-called EST-SSRs already exist for a wide variety of taxa.

In general, EST-SSRs have been found to be significantly more transferable across taxonomic boundaries than are traditional ‘anonymous’ SSRs (Chagne et al., 2004; Liewlaksaneeyanawin et al., 2004; Gutierrez et al., 2005; Pashley et al., 2006), and reports of EST-SSR transferability have become increasingly common. This is particularly true in plants, where transferability among economically important crop taxa has been demonstrated on a number of occasions (Decroocq et al., 2003; Thiel et al., 2003; Bandopadhyay et al., 2004; Saha et al., 2004; Varshney et al., 2005b). Until recently, however, little attention had been paid to the potential for transferring EST-SSRs from well-characterized taxa to lesser studied relatives as a means for facilitating evolutionary analyses (but see Arnold et al., 2002; Ellis et al., 2006).

In the present paper, we provide an overview of what is known about EST-SSR transferability with particular reference to the potential for these markers to facilitate population genetic analyses of previously understudied plant taxa. As an example of the utility of these resources, we further cross-reference existing EST databases against the US Fish and Wildlife Service threatened and endangered species database, the 2006 IUCN Red List of threatened species and the US State and Federal Composite List of Noxious Weeds to determine the extent of overlap between suitable EST databases and plant species of conservation concern. Finally, we provide a discussion of the advantages and disadvantages of EST-SSRs in the context of population genetic applications.

Transferability and polymorphism of EST-SSRs

Perhaps the most common applications of EST-SSRs to date have involved analyses of functional diversity, genetic mapping and/or marker-assisted selection in crop species (reviewed by Varshney et al., 2005a). To the extent that these markers are transferable across taxa, however, EST-SSRs also have clear potential for use in basic evolutionary applications, such as population genetic analyses (Ellis et al., 2006). In this section, we provide an overview of what is known about EST-SSR transferability in plants.

As noted in the Introduction, the ability to effectively transfer polymorphic EST-SSRs across taxa has now been demonstrated in a number of cases, most commonly in studies involving economically important crop species (Table 1). Taken together, the results of these studies indicate that EST-SSRs can often be transferred across relatively large taxonomic distances, spanning not just species within a genus, but in some instances multiple genera within a family. For example, Scott et al. (2000) tested the transferability of 10 Vitis EST-SSRs among grape cultivars, other grape species and related genera, and found high levels of transferability, with over 60% of markers tested working across taxa. Moreover, all of the transferable markers proved to be polymorphic at the level of cultivars, Vitis species and between related genera. Similarly, Decroocq et al. (2003) used EST-SSRs developed from grape and apricot sequences to investigate transferability across 46 related grape species and 29 members of the Rosaceae. Overall, the grape primers were transferable to, and revealed polymorphisms within, most Vitaceae accessions tested. In contrast, the apricot primers were found to be most useful within the subgenus Prunophora. In the cereals, Gupta et al. (2003) demonstrated extensive transferability of Triticum aestivum L. (bread wheat) EST-SSRs to 18 related wild species in the Triticum–Aegilops complex and to five other cereal species (barley, oat, rye, rice and maize). Over 80% of primer pairs tested were transferable to the 18 related species, while nearly 60% showed transferability to one or more of the more distantly related cereal species. In other grass species, EST-SSRs from the turf grass Festuca arundinacea Schreb. (tall fescue) were tested for transferability in seven grasses from four genera varying in mating system and ploidy level (Saha et al., 2004). This work revealed greater than 90% transferability to one or more of the other species tested. Moreover, the surveyed loci revealed ample levels of polymorphism for elucidating relationships among these species. Finally, high levels of transferability and substantial polymorphism were observed among 23 cotton (Gossypium) species (Guo et al., 2006).

Table 1 Summary of studies reporting on the transferability of EST-SSRs among plant taxa

In general terms, this sort of transferability is unique to EST-SSRs, with anonymous SSRs being significantly less portable (Chagne et al., 2004; Liewlaksaneeyanawin et al., 2004; Gutierrez et al., 2005; Pashley et al., 2006; but see Fitzsimmons et al., 1995; Dayanandan et al., 1997). EST-SSRs have also been shown to produce substantially ‘cleaner’ data (that is, amplification profiles that are easier to analyze and interpret) as compared to their anonymous counterparts (Pashley et al., 2006). There is also evidence that EST-SSRs located in coding regions are significantly more transferable than those found in untranslated regions (UTRs) (Pashley et al., 2006). Despite their potential to cause selectively deleterious frameshift mutations, however, EST-SSRs located in coding regions appear to reveal levels of polymorphism equivalent to those located in UTRs, most likely due to a general trend toward trinucleotide repeats in coding regions. In fact, this trend toward trinucleotide repeats in exons has been observed in a variety of other taxa, including wheat (Gupta et al., 2003), barrel medic (Eujayl et al., 2004), tall fescue (Saha et al., 2004), and pine (Chagne et al., 2004). Regardless of the cause, if this observed tendency toward higher transferability and equivalent levels of polymorphism turns out to be a general feature of EST-SSRs located in protein-coding regions, the targeting of exonic trinucleotide repeat motifs might be the best strategy for developing portable sets of polymorphic EST-SSR markers.

EST resources and SSR frequencies

Although EST-SSR transferability has now been documented in a number of cases, the utility of these sorts of markers for facilitating evolutionary genetic research in non-target taxa (that is, taxa that lack genomic resources) has received relatively little attention. However, when the generally high transferability of EST-SSRs is combined with the fact that population genetic analyses often rely on a relatively small number of markers (Richards et al., 2004; Vornam et al., 2004; Szczys et al., 2005), it seems likely that even modest EST collections could prove to be of great value to evolutionary biologists. In fact, an estimated 2–5% of all plant-derived ESTs are thought to harbor SSRs (Kantety et al., 2002), although the actual frequency of SSR-bearing ESTs in any particular analysis is highly dependent on the search parameters (see below). Moreover, a large fraction of EST-SSRs (on the order of 80–90%) are typically found to be polymorphic (Bandopadhyay et al., 2004; Fraser et al., 2004; Pashley et al., 2006). Taking into account typical marker development attrition rates, it therefore seems likely that EST databases containing as few as 1000 sequences could provide enough markers to facilitate population genetic analyses.
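The arithmetic behind this estimate can be sketched as follows. This is only a back-of-the-envelope illustration: the function name and the 50% attrition factor are hypothetical placeholders rather than values taken from the studies cited above, and redundancy among ESTs is ignored.

```python
def expected_markers(n_ests, ssr_rate, polymorphic_rate, primer_success=0.5):
    """Rough count of usable polymorphic EST-SSRs from an EST collection.

    primer_success is an assumed attrition factor (hypothetical), standing in
    for loci lost to failed primer design or failed amplification.
    """
    return n_ests * ssr_rate * polymorphic_rate * primer_success


# Low and high ends of the ranges quoted in the text, for 1000 ESTs.
low = expected_markers(1000, 0.02, 0.80)
high = expected_markers(1000, 0.05, 0.90)
print(f"{low:.1f} to {high:.1f} candidate markers")
```

Under these assumptions, a 1000-sequence collection yields on the order of 10 to 20 candidate markers, which is in line with the small marker sets used in many population genetic studies.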

To highlight the potential utility of such resources, we surveyed available EST collections and cross-referenced them against several databases that list either rare/endangered or invasive plant species. As of May 2006, the NCBI EST database (dbEST) contained over 36 million publicly available EST sequences from over 1100 taxa. Of these, 542 taxa accounted for greater than 1000 EST sequences apiece, including 211 different spermatophytes (that is, seed plants), representing 126 unique genera. The taxonomic databases that we cross-referenced these sequence collections against included the US Fish and Wildlife Service threatened and endangered species database (http://www.fws.gov/endangered/wildlife.html), the 2006 IUCN Red List of threatened species (http://www.redlist.org/), and the US State and Federal Composite List of Noxious Weeds (http://plants.usda.gov/).

At the time of our survey, the US Fish and Wildlife list of threatened and endangered species contained 51 plant species (representing 25 different genera) that were congeneric with at least one species for which there were at least 1000 publicly available ESTs (Table 2). The IUCN Red List contained an additional 576 species from 45 genera that had congeners with similar EST resources. Turning to the US Composite List of Noxious Weeds, 80 species from 21 genera had at least one congener with at least 1000 publicly available ESTs. In some cases, the source taxa for the ESTs were themselves either endangered or invasive; these species were excluded from the tallies, as noted in Table 2. In a handful of cases, the invasive species list cited only a genus name without a specific epithet (for example, Vitis L.). Such instances were included in our tabulation, but only counted as a single taxon.
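The cross-referencing logic described above can be sketched in a few lines. This is a minimal illustration, not the procedure actually used for the survey: the function names are hypothetical, the toy input is invented, and taxonomic synonymy and authority strings (for example, the 'L.' in 'Vitis L.') are assumed to have been cleaned up beforehand.

```python
from collections import defaultdict

MIN_ESTS = 1000  # collection-size threshold used in the survey


def genus_of(name):
    """Return the genus, i.e. the first word of a binomial or genus-only name."""
    return name.split()[0]


def congeners_with_ests(est_counts, listed_taxa):
    """Map each genus with a sufficiently large EST collection to the listed
    taxa (rare, endangered or invasive) that belong to that genus.

    est_counts: dict of species name -> number of ESTs
    listed_taxa: iterable of species (or genus-only) names from a list
    """
    source_species = {sp for sp, n in est_counts.items() if n >= MIN_ESTS}
    source_genera = {genus_of(sp) for sp in source_species}
    hits = defaultdict(set)
    for taxon in listed_taxa:
        # Listed species that are themselves the EST source are excluded.
        if taxon in source_species:
            continue
        if genus_of(taxon) in source_genera:
            # A genus-only entry is stored once, so it counts as a single taxon.
            hits[genus_of(taxon)].add(taxon)
    return dict(hits)


# Toy input: the counts and the listed entries are invented for illustration.
est_counts = {"Helianthus annuus": 90000, "Vitis vinifera": 300000, "Genus alpha": 200}
listed = ["Helianthus verticillatus", "Vitis", "Vitis vinifera", "Genus beta"]
result = congeners_with_ests(est_counts, listed)
```

Keeping the hits grouped by genus makes it straightforward to report both the number of listed species and the number of genera involved, as in Table 2.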

Table 2 Summary of the number of rare (R), endangered (E) and invasive (I) species that have a congener with at least 1000 publicly available ESTs

After accounting for overlap across lists, we found that half (68/136) of all plant-derived EST collections of sufficient size (that is, at least 1000 sequences) could potentially serve as a source of EST-SSRs for the genetic analysis of rare, endangered or invasive plant species worldwide (Table 2). It is important to note here that this is most likely a somewhat conservative estimate, as: (1) this survey was primarily based on data from US agencies, although we did include the most critically endangered species from elsewhere, and (2) only those EST collections that were derived from a congener of the focal species were included in the tally. As noted above, EST-SSRs are also often transferable across greater taxonomic distances; for example, Rossetto (2001) found that the average rate of intergeneric transfer was ca. 35% in a variety of plant taxa. It should also be kept in mind that, while rare and invasive plants were chosen to illustrate the likely utility of existing EST resources for population genetic analyses, these resources have the potential to facilitate evolutionary research in a much wider variety of taxa.

In order to better gauge the utility of the smallest EST collections identified above, we surveyed all data sets consisting of 1000–10 000 ESTs for the presence of unique SSRs (Table 3). We did this by first downloading from dbEST all ESTs for each genus that showed overlap with one or more taxa of conservation interest. We then assembled them using CAP3 (Huang and Madan, 1999) and analyzed the resulting unigene set for each genus using SSRIT (Temnykh et al., 2001; http://www.gramene.org/db/searches/ssrtool), which is a perl script that identifies all SSRs within a set of sequences. We set the script to identify all possible di-, tri- and tetranucleotide repeats with a minimum of five, four and three subunits, respectively. While some researchers have employed higher cutoffs (Kantety et al., 2002), relaxing the thresholds maximizes SSR discovery while still producing a high percentage of polymorphic loci (Pashley et al., 2006).
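To make the search criteria concrete, the following sketch applies the same thresholds (di-, tri- and tetranucleotide repeats with at least five, four and three subunits, respectively) using regular expressions. It is a simplified stand-in written for illustration and is not the SSRIT script itself; the function name is hypothetical, and compound or overlapping repeats are not handled specially.

```python
import re

# Minimum repeat counts per motif length, as described in the text.
MIN_REPEATS = {2: 5, 3: 4, 4: 3}


def find_ssrs(seq):
    """Return a sorted list of (start, motif, repeat_count) for SSRs in seq."""
    seq = seq.upper()
    found = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # A motif of motif_len bases followed by at least min_rep - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip runs of a single base (e.g. "AA") and tetranucleotide
            # motifs that are really a repeated dinucleotide (e.g. "AGAG").
            if len(set(motif)) == 1 or (motif_len == 4 and motif[:2] == motif[2:]):
                continue
            found.append((m.start(), motif, len(m.group(0)) // motif_len))
    return sorted(found)


# Example: a made-up sequence containing (AG)6 and (CTT)4.
example = "GGCC" + "AG" * 6 + "TTCA" + "CTT" * 4 + "GACG"
ssrs = find_ssrs(example)
```

Running a function like this over each assembled unigene, and counting the sequences that return at least one hit, gives the kind of per-genus SSR frequency summarized in Table 3.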

Table 3 Frequency of SSRs in each of the ‘overlapping’ EST databases containing 1000–10 000 total sequences

Inspection of Table 3 reveals that nearly one in 10 unique ESTs (9.0±0.1%; mean±s.e.) contained at least one SSR (range=2.5–21.1%). Thus, it seems reasonable to assume that EST collections consisting of 1000 or more sequences have the potential to provide ample candidate SSRs for use in conservation genetic analyses. This is especially true in view of the potentially high percentage of EST-SSRs that turn out to be polymorphic (Bandopadhyay et al., 2004; Fraser et al., 2004; Pashley et al., 2006).

Prospects and pitfalls

As noted at the outset, the codominant and highly polymorphic nature of SSRs has increasingly made them the marker of choice for population genetics analyses. Unfortunately, the development of traditional ‘anonymous’ SSRs requires a substantial investment of both time and money, putting them out of reach for many researchers. Given that EST-based SSRs can be developed directly from existing sequence resources and can often be transferred from one species to another, EST databases are an attractive source of markers for the genetic analysis of understudied taxa.

Looking beyond the relative ease with which EST-SSRs can be developed, one of their clearest advantages is that they allow one to make direct comparisons among taxa without running the risk that locus-specific differences might mask true species-level differences in things like overall levels of genetic diversity, the extent of population structure, and so on. For example, Ellis et al. (2006) used EST-SSRs derived from the cultivated sunflower (Helianthus annuus L.) to investigate levels of genetic diversity in an extremely rare sunflower (H. verticillatus) and a more common congener (H. angustifolius). Based on a simple comparison of the mean level of genetic diversity present within each taxon, the two species are statistically indistinguishable. After controlling for inherent differences in variability from one locus to the next, however, it becomes clear that H. verticillatus actually harbors more genetic diversity than does H. angustifolius despite its rarity (Figure 1). Beyond providing more statistical power in paired comparisons, EST-SSRs also produce cleaner results for scoring as there are fewer null alleles (Leigh et al., 2003; Rungis et al., 2004) and fewer stutter bands (Leigh et al., 2003; Woodhead et al., 2003; Eujayl et al., 2004; Pashley et al., 2006). Despite these advantages, however, EST-SSRs are not without their drawbacks.

Figure 1

Comparison of genetic diversity in H. verticillatus and H. angustifolius. Each point represents one of the 19 loci that the two species have in common, and the lines connect data points derived from an individual locus (Ellis et al., 2006). Note that, although the diversity estimates overlap broadly between species, there is a clear tendency toward decreased genetic diversity in H. angustifolius when viewed on a per-locus basis, with 13 of 19 loci showing clear evidence of a decline.

One concern with SSRs in general is the possibility of null alleles, which fail to amplify due to primer site variation, and thus do not produce a visible amplicon. Individuals that are heterozygous for a null allele appear to be homozygous for the visible allele, whereas null homozygotes appear to be failed reactions. When present in a population, null alleles will bias allele frequencies, reduce the observed heterozygosity, and therefore increase apparent levels of inbreeding (DeWoody et al., 2006). While EST-SSRs are subject to these sorts of difficulties, the same can be said of anonymous SSRs. Moreover, the primers flanking EST-SSRs are derived from relatively conserved sequences; therefore, it is likely that null alleles will be less of a problem for EST-SSRs as compared to their anonymous counterparts. Indeed, Rungis et al. (2004) found that measures of inbreeding were significantly lower in EST-SSRs versus genomic SSRs in spruce, and they suggested that this resulted from a lower frequency of null alleles in the former.
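The effect of a null allele on apparent inbreeding can be illustrated with a short calculation. The sketch below assumes a randomly mating population in Hardy-Weinberg proportions, a single null allele and at least two visible alleles; these are simplifying assumptions made here for illustration, and the function name is hypothetical.

```python
def apparent_inbreeding(visible_freqs, null_freq):
    """Apparent (Ho, He, F) at a locus in Hardy-Weinberg proportions with a null allele.

    visible_freqs: relative frequencies of the amplifiable alleles (sum to 1)
    null_freq: frequency r of the null allele in the population
    """
    # True frequencies of the visible alleles once the null allele is included.
    p = [f * (1.0 - null_freq) for f in visible_freqs]
    # Null homozygotes look like failed reactions and are not scored.
    scored = 1.0 - null_freq ** 2
    # Only genotypes with two different visible alleles are scored as heterozygotes.
    het = sum(2 * p[i] * p[j] for i in range(len(p)) for j in range(i + 1, len(p)))
    h_obs = het / scored
    # Null heterozygotes are scored as homozygotes for their visible allele,
    # so each apparent allele frequency is p_i * (1 + r) / (1 - r^2).
    q = [pi * (1.0 + null_freq) / scored for pi in p]
    h_exp = 1.0 - sum(qi ** 2 for qi in q)
    return h_obs, h_exp, 1.0 - h_obs / h_exp


# Two equally common visible alleles and a null allele at a frequency of 0.1.
h_obs, h_exp, f = apparent_inbreeding([0.5, 0.5], 0.1)
```

In this example the apparent inbreeding coefficient is roughly 0.18 even though the population is mating at random, which shows how even a modest null allele frequency can mimic substantial inbreeding.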

Because the cDNAs from which ESTs are derived lack introns, one possible concern with EST-SSRs is that unrecognized intron splice sites could disrupt priming sites, resulting in failed amplification. Alternatively, large introns could fall between the primers, resulting in either a product that is too large or, in extreme cases, failed amplification. Fortunately, intron locations are relatively well-conserved across taxa (Strand et al., 1997; Ku et al., 2000; Wu et al., 2006). Thus, it is possible to minimize this sort of problem by aligning ESTs of interest against the genomic sequence of model species such as Arabidopsis or rice. Putative intron positions can then be noted, and primers can be designed accordingly. Of course, this is not a perfect solution, as intron gain and loss are still distinct possibilities. In some cases, however, it may be possible to redesign the primers to exclude troublesome introns.

Another obvious concern is that, since EST-SSRs are located within genes and are thus more conserved across species, they may be less polymorphic than anonymous SSRs. This concern has been borne out in a number of taxa, including rice (Cho et al., 2000), bread wheat (Gupta et al., 2003), pines (Liewlaksaneeyanawin et al., 2004), barley (Chabane et al., 2005) and sunflower (Pashley et al., 2006). However, the levels of genetic diversity revealed by these markers are still considerably higher than those revealed by most alternative marker types, such as allozymes (Hamrick and Godt, 1996). Thus, even though EST-SSRs reveal less variability than do anonymous SSRs, these markers still reveal sufficient levels of variation for the vast majority of population genetic applications.

Perhaps the greatest concern with regard to the utility of EST-SSRs in the present context is that selection on these loci might influence the estimation of population genetic parameters. Indeed, divergent selection will increase differentiation among and reduce variability within populations, whereas balancing selection will have the opposite effect. While a recent study by Woodhead et al. (2005) revealed that estimates of population differentiation based on EST-SSRs are comparable to those based on both anonymous SSRs and AFLPs in ferns, and large-scale comparative analyses suggest that only a very small percentage of all genes are experiencing positive selection (Tiffin and Hahn, 2002; Clark et al., 2003), some small fraction of all EST-SSRs will inevitably be subject to selection. Indeed, there are examples from the literature wherein certain genic SSRs are known to be associated with various diseases in animals (Zoghbi and Orr, 2000; Mao et al., 2002; Yamada et al., 2002) or pathogenicity/virulence in microbes (Peak et al., 1996; Grimwood et al., 2001). While more studies are needed before we will have a better understanding of the possible effects of genic SSRs in plants (Li et al., 2004), it seems safe to assume that at least a small percentage of loci will be evolving in a non-neutral manner. It remains unclear, however, whether this problem will be more or less frequent than in other gene-based marker systems, such as allozymes.

There are, of course, a number of potential applications of EST-SSRs that will be less sensitive to the effects of selection. For example, single-generation applications such as paternity studies, mating system analyses and direct estimates of gene flow will be relatively robust to deviations from neutrality. In the case of analyses that rely on equilibrium assumptions, such as studies focusing on population structure and/or indirect estimates of gene flow, the effects of selection can best be minimized by increasing the number of markers utilized, so as to reduce the potential biases introduced by any one locus, and by endeavoring to employ a common set of markers across taxa when working in a comparative manner. Assuming that a sufficiently large number of markers are employed, it should also be possible to statistically identify and exclude loci with extreme FST values (Wright's (1951) measure of population genetic structure), as such outliers are likely the result of selection (Lewontin and Krakauer, 1973; Beaumont and Nichols, 1996).
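As a rough illustration of outlier screening, the sketch below computes a per-locus FST as (HT - HS)/HT from population allele frequencies and flags loci that lie far from the median. The median-based cutoff is a crude stand-in chosen here for simplicity; it is not the test of Lewontin and Krakauer (1973) or the simulation-based approach of Beaumont and Nichols (1996). The function names, the equal weighting of populations and the default threshold k are all assumptions.

```python
from statistics import median


def locus_fst(pop_freqs):
    """FST = (HT - HS) / HT for one polymorphic locus.

    pop_freqs: one list of allele frequencies per population (equal weights assumed).
    """
    n_pops = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    # HS: mean expected heterozygosity within populations.
    h_s = sum(1.0 - sum(f ** 2 for f in pop) for pop in pop_freqs) / n_pops
    # HT: expected heterozygosity computed from the mean allele frequencies.
    mean_freqs = [sum(pop[a] for pop in pop_freqs) / n_pops for a in range(n_alleles)]
    h_t = 1.0 - sum(f ** 2 for f in mean_freqs)
    return (h_t - h_s) / h_t


def flag_outliers(fst_by_locus, k=3.0):
    """Return names of loci whose FST lies more than k MADs from the median."""
    values = list(fst_by_locus.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    return [name for name, v in fst_by_locus.items() if abs(v - med) > k * mad]


# Invented per-locus values: four similar loci and one strongly differentiated locus.
flagged = flag_outliers({"a": 0.05, "b": 0.06, "c": 0.04, "d": 0.07, "e": 0.60})
```

A screen of this kind only becomes meaningful when many loci are available, because the spread of FST among neutral loci has to be estimated from the data themselves.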

Conclusions

The advent of the genomics age has resulted in the production of an ever-expanding body of DNA sequence data, including vast EST collections. These ESTs represent a potentially valuable source of gene-based SSR markers for population genetic analyses. While EST-SSRs are not without their drawbacks, they offer a number of clear benefits, including rapid and inexpensive development and high levels of cross-taxon portability. Thus, EST-SSRs have the potential to facilitate evolutionary analyses in a wide variety of taxa, and may well represent the best way forward for the analysis of species for which only limited resources are available.