Introduction

Single-nucleotide polymorphisms (SNPs) are now widely used for both linkage analyses and biodiversity studies (Collins et al, 1997; Mckenzie et al, 1998; Wang et al, 1998; Moorhead et al, 2003; Pearson et al, 2004), including inferences about human demographic history (Wakeley et al, 2001). SNPs are characterized by (a) high abundance in various genomes, and (b) the efficiency with which they can be genotyped.

Assessments of genetic diversity between yeast strains have been reported using various DNA markers (Versavaud et al, 1995; Cai et al, 1996; Kurtzman and Robnett, 1998, 2003; Zhang et al, 1999; Hennequin et al, 2001; Perez et al, 2001; Granchi et al, 2003). Kellis et al (2003) used whole-genome sequence alignment to analyse the phylogenetic relationships of Saccharomyces cerevisiae and three related species (S. paradoxus, S. mikatae and S. bayanus), which separated from S. cerevisiae 5–20 million years ago. The three species have sufficient sequence similarity to allow reliable alignment of orthologous regions. An alternative approach is that of Winzeler et al (2003), who used high-density oligonucleotide arrays to discover 11 115 single-feature polymorphisms (SFPs) existing in one or more of 14 different yeast strains. These SFPs were used to define genetic variation between common laboratory strains of yeast. They found that S288c is closely related to EM93 and W303. The laboratory strain SK1 was positioned distantly to S288c, as well as to most isolates from nature.

The aim of the present study was to evaluate the utility of SNPs for assessing genetic relationships among various yeast strains, and to examine approaches for overcoming the main difficulties in the analysis.

Materials and methods

Yeast strains

In total, 32 strains of S. cerevisiae originating from various laboratories were used in this study (Table 1). The list includes 11 haploid laboratory strains used for genetic analysis, most of which are believed to have derived from S288c or to be related to it (Mortimer and Johnston, 1986), and 21 diploids. One of the latter (EM93) is believed to be the progenitor of many of the laboratory strains. The 20 other diploid strains, from various ecological niches, were included in order to widen the range of genotypes. These included 14 wine strains, 10 of which were isolated by RK Mortimer from one winery in California (Polsinelli et al, 1996), and four other strains of various types (Sauternes, Champagne, All Purpose and Hock). In addition, the analysis included two American commercial baking yeasts (Red Star and Fleishman's), one beer strain (Danstar Windsor), one diploid strain obtained from South Africa (F2007), and two strains, CISC30 and CISC44, which were isolated from immunosuppressed patients who had systemic fungal infections (Wheeler et al, 2003). Finally, sequence information from three different Saccharomyces non-cerevisiae species (S. paradoxus, S. mikatae, S. bayanus), supplied by Kellis et al (2003), provided the genotypes for use as an outgroup. Thus, in total, 35 strains were used in this study.

Table 1 Description of S. cerevisiae strains

SNPs discovery

SNPs were identified by sequence comparison of 147 genes. Comparisons were made between the S. cerevisiae laboratory strain SK1 (Kane and Roth, 1974) and the published sequence of strain S288c (Goffeau et al, 1996). SK1 was chosen for the discovery of SNPs because it was the laboratory strain most distant from S288c, for which extensive genetic and molecular information is available. These 147 genes had been chosen, based on their expression and annotations, in another study aimed at identifying QTLs involved in sporulation efficiency. For each of the 147 genes, two primers were designed from the S288c sequence. The amplified fragment was 600–700 bp long and included at its center the ATG codon that starts translation in each gene. Thus, we expected equal representation of open reading frame (ORF) sequences and the regulatory 5′ upstream regions. The DNA fragments were PCR amplified from strain SK1 and sequenced. PCR reactions were performed in a final volume of 20 μl containing 50–100 ng of genomic DNA, 1 × Taq reaction buffer (Promega, Madison, WI, USA), 0.5 μM of each primer, 200 μM dNTPs, 2.5 mM MgCl2 and 0.5 U Taq polymerase (Promega, Madison, WI, USA). A total of 35 PCR cycles were performed with the following profile: 1 min at 95°C, 1 min at 45–60°C (depending on the locus) and 1 min at 72°C. An initial denaturing step of 3 min at 95°C and a terminal extension of 10 min at 72°C were applied. Sequence comparison of 147 genes (81 480 bp) from strain SK1 with the published sequence of strain S288c resulted in identification of 554 SNPs (including indels, that is, insertions/deletions of single nucleotides) at an average frequency of 1:151 bp. In total, 30 of these SNPs (in 30 different genes) were chosen for the current study, solely on the basis of their type: 10 are located in promoter regions and 20 are in ORFs, of which 10 are nonsynonymous (resulting in an amino-acid change) and 10 are synonymous (the SNP descriptions are presented at http://www.agri.huji.ac.il/~hillely/giora/ and in the online appendix). In this study, SNPs refer to single-nucleotide sites that are polymorphic among strains.
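The comparison step can be illustrated with a minimal Python sketch. It assumes the two sequences have already been aligned to equal length, with '-' marking a single-nucleotide indel; the function name `find_snps` and the toy fragments are hypothetical and do not reproduce the procedure or data of this study.

```python
def find_snps(ref, alt):
    """Return (position, ref_base, alt_base) for every differing column.

    Assumes `ref` and `alt` are pre-aligned strings of equal length,
    with '-' marking a single-nucleotide insertion/deletion.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# Toy aligned fragments (hypothetical, not real S288c or SK1 sequence).
s288c = "ATGACCGTTA"
sk1 = "ATGATCG-TA"

snps = find_snps(s288c, sk1)
print(snps)
print(f"one difference per {len(s288c) / len(snps):.0f} aligned positions")
```

Positions are 0-based offsets into the alignment; a real pipeline would also need to handle sequencing errors and multi-base indels, which this sketch ignores.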

SNPs genotyping

DNA was extracted from each of the 32 S. cerevisiae strains listed in Table 1 and samples were genotyped at the 30 SNP loci, using the MassArray (product of Sequenom Inc.) located at the Weizmann Institute, Israel. The precise mass measurement of short DNA fragments is achieved via matrix-assisted laser desorption/ionization time of flight (MALDI-TOF) MS. The analyte is co-crystallized with organic matrix crystals, which absorb the energy from a nanosecond laser pulse; this leads to spontaneous volatilization and ionization of the matrix and the analyte. Following acceleration, the gas-phase ions undergo mass-dependent separation over a flight path of approximately 1 m, enabling determination of the mass/charge ratio for each molecule. The assay is based on differential extension of a short primer adjacent to the SNP. The specific amplification products are immobilized on a magnetic solid phase and subsequently denatured. A specific primer is annealed close to the SNP site and a limited primer extension reaction is performed in the presence of at least one dideoxynucleotide triphosphate. Depending on the SNP alleles, specific termination products are generated and then analyzed by MALDI-TOF mass spectrometry (O'Donnell et al, 1997; Gut, 2004).

For the three non-cerevisiae species (S. paradoxus, S. mikatae, S. bayanus), the previously obtained sequences (Kellis et al, 2003) were used to determine the alleles at the SNP loci (all genotypes are presented in http://www.agri.huji.ac.il/~hillely/giora/ and the online appendix).

Genetic distances and construction of phylogenetic trees

The genetic distance between any two strains was estimated as one minus the proportion of alleles shared by the two strains, GD=1−PSA (Bowcock et al, 1994), as implemented in the software MICROSAT (Minch et al, 1998) and the PHYLIP package (Felsenstein, 1993). As such, the value of this estimate is between 0 and 1. The strain SK1 was omitted from most of the phylogenetic analyses (below), since the SNPs were discovered by comparison of SK1 and S288c; thus, for these two strains, by definition PSA=0, and the genetic distance is necessarily maximal, GD=1.
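As a rough illustration of the GD=1−PSA measure, the following Python sketch counts the alleles shared at each locus and averages over loci. It is not the MICROSAT implementation: the coding of haploids as homozygotes, the skipping of loci with a missing genotype, and the names `shared_alleles` and `genetic_distance` are all assumptions made here for the example.

```python
from collections import Counter


def shared_alleles(g1, g2):
    """Number of alleles two diploid genotypes share at one locus (0, 1 or 2)."""
    return sum((Counter(g1) & Counter(g2)).values())


def genetic_distance(strain_a, strain_b):
    """GD = 1 - PSA over loci where both strains have a genotype.

    Each strain is a list of two-character genotypes such as "AA" or "AG";
    None marks a missing genotype. Haploids are coded as homozygotes,
    which is an assumption of this sketch.
    """
    pairs = [(a, b) for a, b in zip(strain_a, strain_b) if a and b]
    if not pairs:
        raise ValueError("no loci genotyped in both strains")
    psa = sum(shared_alleles(a, b) for a, b in pairs) / (2 * len(pairs))
    return 1 - psa


# Toy genotypes (hypothetical): a haploid and a diploid heterozygous at one locus.
haploid = ["AA", "CC", "GG", "TT"]
diploid = ["AA", "CT", "GG", "TT"]
print(genetic_distance(haploid, diploid))
```

Under this coding, every heterozygous locus in a diploid adds to its distance from a haploid relative, even when the haploid carries one of the diploid's two alleles.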

We analyzed the genetic relationship between strains S288c or SK1 and all other strains. In order to compare SK1 and S288c, we used the mean genetic distance (MGD), averaging the GD values between a given strain and the remaining 33 strains.
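The MGD of a strain is simply the average of its pairwise GD values. A minimal Python sketch, assuming the pairwise distances are held in a dictionary keyed by unordered strain pairs; the strain labels and distance values below are hypothetical.

```python
from statistics import mean


def mean_genetic_distance(focal, gd):
    """Average GD between `focal` and every other strain in `gd`.

    `gd` maps frozenset({strain_1, strain_2}) to a distance in [0, 1].
    """
    others = {s for pair in gd for s in pair} - {focal}
    return mean(gd[frozenset((focal, other))] for other in others)


# Hypothetical pairwise distances between three strains.
gd = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.3,
    frozenset(("B", "C")): 0.8,
}
print(mean_genetic_distance("A", gd))
print(mean_genetic_distance("C", gd))
```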

We also performed two further analyses:

We constructed two unrooted neighbor-joining trees: First, based on maximum likelihood (ML) analysis as implemented in the software DNAML in the PHYLIP package (Felsenstein, 1993) and second, based on ‘F84’ genetic distance (Kishino and Hasegawa, 1989; Felsenstein and Churchill, 1996), as implemented in the software DNADIST in the PHYLIP package (Felsenstein, 1993). The ‘F84’ model is based on ML and incorporates different rates of transition and transversion, allowing for different frequencies of the four nucleotides.

Clustering analysis

The 35 strains were grouped into a specified number of clusters (K) using the Structure package (Pritchard et al, 2000). The Structure algorithm is a Bayesian method that uses the multilocus genotypes to estimate the allele frequencies in K clusters, from which the genotypes have been drawn. It allows for the possibility that some genotypes may be derived from admixture between these clusters, and simultaneously estimates the allele frequencies and the proportions, q, of each genotype drawn from each cluster. Structure does not require the membership of the clusters to be specified at the start of the analysis. We followed the recommendation of Pritchard et al (2000) and used 10 000 iterations of the Markov Chain for ‘burn-in’ (before analysis) to minimize the effect of the starting configuration, and 50 000 iterations after ‘burn-in’ to get accurate parameter estimates. Each value of K was used in 10 repeated runs.

Clustering efficiency based on different SNPs number

To evaluate the number of SNP loci required for the cluster analysis, we compared the results obtained from subsets of the original 30 loci with the results from the full set. For each subset, those SNPs exhibiting the highest frequency of their rare allele were selected (in the case of ties, one SNP was chosen at random). In order to quantify how closely the results correspond to those from the full set of 30 SNPs, we estimated the correlation coefficient between the q-values obtained for each strain using the subset and those from the full set.
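The subset selection and the comparison of q-values can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually run: the locus names, genotypes and q-values are hypothetical, Pearson's coefficient is assumed as the correlation measure, and ties are resolved by input order rather than at random.

```python
import math
from collections import Counter


def rare_allele_frequency(genotypes):
    """Frequency of the less common allele at one bi-allelic locus."""
    counts = Counter(allele for g in genotypes if g for allele in g)
    if len(counts) < 2:
        return 0.0  # monomorphic locus
    return min(counts.values()) / sum(counts.values())


def top_loci(loci, n):
    """Names of the n loci with the highest rare-allele frequency."""
    ranked = sorted(loci, key=lambda name: rare_allele_frequency(loci[name]),
                    reverse=True)
    return ranked[:n]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical genotypes of four strains at three loci.
loci = {
    "snp1": ["AA", "AA", "AA", "AG"],
    "snp2": ["CC", "TT", "CT", "CC"],
    "snp3": ["GG", "GG", "GG", "GG"],
}
chosen = top_loci(loci, 2)
print(chosen)

# Hypothetical q-values for one cluster: full set versus a subset of loci.
q_full = [0.95, 0.90, 0.10, 0.05]
q_subset = [0.90, 0.80, 0.20, 0.15]
r = pearson(q_full, q_subset)
print(round(r, 3))
```

In practice the q-values would come from separate Structure runs on the full and reduced sets of loci.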

Results

Frequency of SNPs between strains

Two of the 30 SNPs were found to be monomorphic between all strains except SK1. In total, 12 SNP genotypes (in seven loci) were not available in S. paradoxus, S. mikatae and S. bayanus due to missing regions in the available sequences database of these species (Kellis et al, 2003). SNP loci were bi-allelic across all S. cerevisiae strains. However, when the non-cerevisiae species were included in the analysis, 10 SNPs (out of 30) had three alleles and one SNP had four alleles. This allelic richness was to be expected, considering that there is only 62% sequence homology between S288c and the most distant species, S. bayanus (Kellis et al, 2003).

Genetic distances between pairs of strains

The genetic distance values between pairs of strains are presented in the database at our web site: http://www.agri.huji.ac.il/~hillely/giora/ (and in an online appendix). Genetic distances (GD values) are based on alleles that are not shared between two strains. For instance, GD will be high between a heterozygous diploid and any one of its haploid progeny, depending on the degree of heterozygosity. Likewise, the same high GD will be obtained between the diploid and one of its parent haploids. For example, the strain EM93, which is the diploid ancestor of both S288c and L6705, is genetically distinct from them, with GD values of 0.167 and 0.2, respectively. In contrast, GD values between the diploid wine strains were found to be very small (values of 0–0.067). The strains L6705 and S288c, which were expected to be almost identical, in fact have a genetic distance of 0.1 between them. The genetic distances between S288c and all S. cerevisiae strains are much lower than the genetic distances to SK1 (Table 2). The MGD of strain S288c to all 34 strains (including SK1) is much lower (0.24) than the MGD of strain SK1 (0.85), and the difference is statistically significant (Prob[t]<0.0001). These results are in agreement with the known history of the strains (Mortimer and Johnston, 1986).

Table 2 Genetic distances between S. cerevisiae strains and both SK1 and S288c

The phylogenetic tree based on the estimated genetic distances (1-PSA) is presented in Figure 1. This tree does not include strain SK1 because it was used in the SNP discovery.

Figure 1

Phylogenetic relationships of 34 Saccharomyces strains based on 1-PSA genetic distance. Names of strains are on edges. Numbers on nodes are bootstrap values out of 100 (only bootstrap values above 50 are given).

This tree can be compared with one based on a much more extensive data set: the sequences of 77 randomly chosen genes that were sequenced in SK1, with an average of 460 bp from each (Figure 2). Notice that the distance between SK1 and S288c is much smaller than that between either of them and the three outgroup species.

Figure 2

Hierarchical clustering of two S. cerevisiae strains and three non-cerevisiae species. Analysis is based on sequence data from 77 different genes (460 bp from each gene). The neighbor-joining tree was generated using the Kimura 2-parameter model as implemented in MEGA software version 2.1 (Kumar et al, 2001). Names of strains are on edges, bold numbers on nodes are bootstrap values out of 100, and italic numbers on nodes are branch lengths.

ML methods

In both analyses (ML and ‘F84’), most groups of strains that were putatively related were, indeed, clustered in neighbouring branches of the phylogenetic tree. However, based on the ML analysis, the four wine strains (BU_420, BU_421, BU_422 and BU_423) were clustered closer to the baking strains and the beer strain rather than to the other 10 wine strains. In addition, one of the two strains isolated from humans (CISC44) clustered near the haploid strain L6705 and the diploid strain EM93 and the second (CISC30) clustered near the three non-cerevisiae strains. In the ML analysis only three bootstrap values were above 50 (which is considered as a reasonable threshold), compared to 11 in the 1-PSA analysis (Figure 3).

Figure 3

Phylogenetic relationships of 34 Saccharomyces strains based on maximum likelihood. Names of strains, bootstrap values and codes are as in Figure 1.

In the ‘F84’ analysis, all diploid strains (including those from human patients) were clustered in neighbouring branches. However, as in the ML analysis, the four wine strains clustered in branches neighbouring these diploid strains. In addition, in this analysis, the three non-cerevisiae strains were clustered near the haploid laboratory strains (data not presented).

Clustering analysis

One can apply Structure for a range of K-values (number of clusters) and select the most appropriate K-value. We investigated the range K∈3…30, and calculated the posterior probability [Pr(K)] for each value of K using the estimated log-likelihood of K. For K over 7, we obtained an increased value for the variance of the estimated log-likelihood. For K between 3 and 7 we obtained the highest posterior probability and the lowest variance of the estimated log-likelihood.
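The conversion of estimated log-likelihoods into posterior probabilities of K can be sketched as below, assuming a uniform prior over the K-values considered, in the spirit of the approximation described by Pritchard et al (2000). The log-likelihood values and the function name `posterior_over_k` are hypothetical; they are not the estimates obtained in this study.

```python
import math


def posterior_over_k(log_likelihoods):
    """Pr(K) under a uniform prior: exp(lnL_K) normalised over all K.

    Subtracting the maximum first (log-sum-exp) avoids numerical underflow.
    """
    peak = max(log_likelihoods.values())
    weights = {k: math.exp(v - peak) for k, v in log_likelihoods.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical estimated log-likelihoods for K = 3..7, for illustration only.
ln_pr = {3: -520.0, 4: -505.0, 5: -506.0, 6: -504.5, 7: -507.0}
pr_k = posterior_over_k(ln_pr)
best_k = max(pr_k, key=pr_k.get)
print(best_k, {k: round(p, 3) for k, p in pr_k.items()})
```

Because the log-likelihood itself is estimated with some variance across runs, posterior probabilities computed this way are best read as a rough guide rather than an exact ranking of K-values.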

When K was 4, we obtained a separation between the various strain groups (laboratory, wine, other S. cerevisiae diploids and the non-cerevisiae species), while all the non-cerevisiae species were clustered together. When K was 6, the non-cerevisiae species were split into three groups. The diploid parent of S288c (EM93) seems to be very close to the laboratory strains. When K increased to 7, the haploid laboratory strains split into two clusters (Figure 4).

Figure 4

Cluster analyses by Structure program of all strains using various K-values plotted using Distruct software (Rosenberg, 2004). Each strain is represented by a vertical column, which is partitioned into K colored segments that represent the strain's estimated common fractions in the K clusters. In total, 10 Structure runs for each K-value produced nearly identical strain membership coefficients. (1) laboratory haploid strains; (2) diploid parent of S288c and L6705; (3) wild diploid strain; (4) wine strains; (5) strains isolated from human patients; (6) beer yeast strain; (7) commercial baking strains; and (8) other non-cerevisiae species.

When SK1 is included, it appears solely in a cluster for K-values between 4 and 7 (data not shown). This result is analogous to its placement in the phylogenetic analysis, which was adjacent to the non-cerevisiae outgroup (when SK1 was included in place of S288c).

To demonstrate the behaviour of this clustering analysis in a situation where both haploid and diploid strains are present, we created an artificial data set by drawing alleles at random from the genotypes of the diploid strains. In this manner we created two hypothetical haploid parents for each diploid strain (except the wine strains). Cluster analysis by Structure using K=7 was then carried out for all strains, including the hypothetical haploid strains. The results are illustrated in Figure 5. In the case of strains F1524, 1523 and BU_428, the algorithm inferred that two haploid strains shared a cluster with the diploid, but not with each other.
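One way to build such hypothetical haploids is sketched below in Python. The text does not specify whether the two haploids were drawn independently or as complementary sets; this sketch assumes the latter, dealing the two alleles at each locus to the two haploids in random order. The function name `split_diploid` and the genotypes are hypothetical.

```python
import random


def split_diploid(genotypes, rng):
    """Split a diploid into two complementary hypothetical haploids.

    At each locus the two alleles are dealt at random, one to each haploid.
    """
    parent_a, parent_b = [], []
    for genotype in genotypes:
        first, second = rng.sample(genotype, 2)
        parent_a.append(first)
        parent_b.append(second)
    return parent_a, parent_b


# Hypothetical diploid genotype at four loci; fixed seed for repeatability.
diploid = ["AA", "CT", "GG", "AG"]
hap_a, hap_b = split_diploid(diploid, random.Random(0))
print(hap_a, hap_b)
```

The two haploids are identical at homozygous loci and differ at every heterozygous locus, so their mutual distance grows with the heterozygosity of the diploid.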

Figure 5

Cluster analysis of all strains, including hypothetical haploid strains related to the diploids, using K=7. Graphic presentation is similar to that given in Figure 4. Seven diploid strains are represented by three columns each, which are labeled above the figure. For each tested diploid strain, the left column is the original diploid strain and the other two columns represent the two hypothetical haploid strains. The haploid strains and the wine strains were clustered in one group each. The various non-cerevisiae species (the three right columns) were clustered in three different groups.

Clustering as a function of the number of analyzed loci

The results obtained from a subset of the data approached those for the full data set as the size of the subset was increased. This trend is quantified by the correlation coefficient of the q-values plotted in Figure 6. As clustering the strains into four or six clusters resulted in a separation between the different strain groups and can represent two different stringency levels of separation, we preferred to use these two K-values. For the analysis with K=4, the trend approached an asymptote with a subset of nine loci and a correlation coefficient of 0.93. For K=6, there was no clear asymptote. The flattening of the curve at 28 loci is an artefact, because the last two loci are monomorphic.

Figure 6

Success of clustering. Correlation coefficient between clustering results based on a subset of X loci and clustering results based on all 30 loci, as a function of the number of loci in the subset. Sets of markers were chosen as those with the highest frequency of the rare allele.

Estimation of genetic distance, based on various SNP types

We might have expected low polymorphism in the nonsynonymous SNP loci, similar to findings in other organisms (Cargill et al, 1999; Ben-Ari et al, 2005). However, in the present case, there is ascertainment bias, because the SNPs have been selected to be polymorphic in at least one comparison. The frequency of the rare allele did not differ significantly among the various SNP categories. Makalowski and Boguski (1998) have shown that genetic distances between species based on synonymous SNPs are similar to genetic distances based on untranslated regions. In our case (Table 3), the correlation coefficient between genetic distances based on promoter SNPs (G-P) and on synonymous SNPs (G-S) is significantly higher than the correlation coefficient between genetic distances based on nonsynonymous SNPs (G-NS) and G-S (P=0.01). Correlation coefficients between genetic distances based on all SNPs and genetic distances based on 20 SNPs from all different combinations of two out of the three SNP types were higher than r=0.96 (Table 3). However, due to the small number of SNPs in each category of this study, the effect of SNP type on the genetic relationships of the yeast strains should be interpreted with caution.

Table 3 Correlation coefficients between genetic distances based on different SNP types

Discussion

Genetic relationships within and among Saccharomyces strains

The commercial baking strains and the beer strain were heterozygous at about one-third of their loci. The 14 wine strains are almost identical to each other, and homozygous for almost all SNP loci; they have probably resulted from diploidization via the HO system (Strathern et al, 1982), which allows a haploid strain to double-up producing a diploid that is homozygous at all loci, save the mating type locus.

Each group of strains was clustered in neighbouring branches of the phylogenetic tree (Figure 1). The main groups were the haploid laboratory strains (not including SK1), the wine strains, the baking strains, the two strains isolated from human patients and the three non-cerevisiae strains. In agreement with their known history, all laboratory strains were genetically more similar to S288c than to its counterpart SK1. These strains were derived from S288c or are related to it (Mortimer and Johnston, 1986). More extensive sequence comparisons have also shown that SK1 is genetically distinct from other laboratory strains (Winzeler et al, 2003).

The separation of the haploid laboratory strains into two clusters (when K is 7) while the other clusters remain intact (Figure 4) suggests that they are a more heterogeneous group than the wine strains or the diploid strains. However, this result could be an artefact of comparisons between haploids and diploids. Suppose that there is one widespread allele at an SNP and another that is rarer; the diploids will then tend to share at least one copy of the common allele, whereas the haploids will be divided into those strains that have it and those that do not.

Bootstrap values of the phylogenetic analysis are below 50 in most of the splits (Figure 1). More SNP loci or the use of haplotypes may be required for reliable estimation of the genetic distances between all pairs of strains. In particular, the joint analysis of diploid and haploid strains created several difficulties. In the analysis based on genetic distance, diploids were separated from their haploid parents and haploid progeny to a degree related to their heterozygosity. On the other hand, one should emphasise the success of the clustering analysis based on the Structure approach (Pritchard et al, 2000), which identified groups of strains with common backgrounds even when only 30 SNP loci were used.

In spite of the fact that the 30 analyzed genes are related to sporulation, and the analyzed strains differ in their sporulation efficiency, the strains were clustered in groups according to their biological and historical relationship. For example, strains S288c and W303 are known to be genetically related, although strain W303 sporulates much more efficiently than S288c.

Likewise, the 10 wine strains isolated by Mortimer (Polsinelli et al, 1996) were found to have almost identical SNP genotypes, even though they differ significantly in their sporulation efficiency. We therefore suggest that, although the 30 genes used in this study have not been chosen randomly and are associated with sporulation, the SNPs at these genes seem to be appropriate for the biodiversity and phylogenetic studies and the results appear to reflect the history of the strains.

Comparison of various analyses

Phylogenetic reconstruction using genetic distances proved unsatisfactory because haploids can be classified as genetically distant from their diploid parents or progeny. Nevertheless, this method performed well in some respects: strains within groups clustered in neighbouring branches of the phylogenetic tree, whereas in the other two analyses (ML and ‘F84’) some strains appeared misplaced by this criterion. In addition, bootstrap values were higher in the tree based on the 1-PSA genetic distances.

The Structure clustering analysis was more effective in representing the relationship between haploids and diploids. When a simulated clustering analysis was carried out, and the heterozygosity was not more than 1/3, a diploid and its hypothetical haploid parents clustered together, with at least some shared membership of the same group (Figure 5). Although this outcome is more satisfactory than the phylogenetic analyses, it should be recognised that the genetic model underlying the Structure algorithm may not reflect the origins of the yeast strains. It assumes that the genotypes are drawn from one or more of K contemporary clusters whereas, in reality, some strains may be similar not because of gene flow from a current cluster but because of common ancestry.

Alternative choices of marker

Other types of markers used for biodiversity studies, like RAPDs, CAPs and especially SSR, have shown a sufficient level of resolution to clearly discriminate between S. cerevisiae strains (Hennequin et al, 2001; Perez et al, 2001). Based on the results presented in the current report, we may conclude that SNPs can be considered as an effective class of markers, due to their abundance and their amenability to high throughput analysis.

The approach discussed here has some advantages compared to another SNP-based biodiversity study of yeast strains (Winzeler et al, 2003), which used Affymetrix microarrays. Affymetrix arrays assay the entire genome for SNPs (and other polymorphisms) in one operation, and probably give unbiased phylogenetic relationships between the tested strains. These assays, however, require knowledge of the full genome sequence of the analyzed species. Our approach, on the other hand, does not depend on the availability of the full-genome sequence of the organism under investigation (although we used the published sequence of some strains, to define regions in which SNPs were searched). Moreover, unlike the situation in the current approach, heterozygosity at SNP loci cannot be detected on the Affymetrix chips. Our results have shown how important it is to distinguish heterozygote genotypes.

Despite the phylogenetic potential of SNPs, ascertainment bias occurs when their discovery is based on comparison of particular reference strains. The reliability of genetic distances between analyzed strains depends critically on the choice of reference strains and the relationships with the study strains. SNPs from distantly related reference strains produced relatively accurate representations of the phylogenetic relationships between study strains (Pearson et al, 2004).

In this light, several strategies are available. Outgroup strains that are not included in the analysis may be compared to define new SNPs. Other possibilities make use of the study strains themselves. The two most distantly related strains could be compared (as in the present study) or, alternatively, several strains can be used. The simplest and most useful approach may be to sequence a DNA pool prepared from samples of all the strains that are included in the analysis. Most SNPs should thus be recognised. Furthermore, the large-scale approach reported by Winzeler et al (2003) can avoid the bias caused by other approaches to SNP discovery.

The number of SNP loci to be used depends on the genetic variation between the tested species or groups and the degree of polymorphism of the selected SNPs. A preliminary screening of a small number of strains could be used to identify the most polymorphic loci.

Genetic characterization of yeast strains is needed in cases where new strains are introduced for a genetic study. Our current results suggest that as few as nine SNPs would be sufficient to characterise a new strain as a member of one of the four groups identified here: the laboratory strains, the wine strains, the wild diploid strains or the non-cerevisiae species.