Introduction

Tobacco (Nicotiana tabacum L.) is an important crop plant (Davis and Nielsen 1999) and a member of the nightshade (Solanaceae) family, which is one of the largest and most diverse within the angiosperms. This family harbors 3,000–4,000 species (Olmstead et al. 2008; http://www.nhm.ac.uk/research-curation/research/projects/solanaceaesource/solanaceae), of which a considerable number are of major economic importance throughout the world as crop, vegetable or ornamental species, such as potato (Solanum tuberosum), tomato (Solanum lycopersicum), eggplant (Solanum melongena), pepper (Capsicum species) and petunia (Petunia × hybrida). Because of this, and because members of the Solanaceae are also important model species for systematics, plant biology, genetics and biotechnology, a current objective is to elucidate the genome structure of the family's most important species (Mueller et al. 2005; http://solgenomics.net/solanaceae-project) through various genome sequencing initiatives and comparative genome analyses.

Currently, the best characterized species within the Solanaceae are tomato and potato, for which a large amount of information concerning their genome structure has been generated. This includes high-density genetic and physical maps that contain several thousand anchored markers (Tanksley et al. 1992; van Os et al. 2006; http://solgenomics.net). At present, genome sequencing efforts for tomato and potato are approaching completion, with draft sequences being available (http://solgenomics.net; http://www.potatogenome.net). To obtain a more detailed picture of genome evolution in the Solanaceae, genetic maps of a number of other important species (e.g., pepper, eggplant) have been generated and aligned to the tomato genome as a reference through the mapping of conserved markers such as RFLP markers and conserved ortholog set (COS) markers (Fulton et al. 2002; Wu et al. 2006). These data have indicated that, at the chromosomal level, the Solanaceae genomes have undergone a number of inversions and translocations during speciation (Tanksley et al. 1992; Wu et al. 2009a, b). At the level of gene sequence and gene order, comparative BAC clone sequencing conducted in potato, eggplant, pepper, and petunia has shown that most genes are conserved in their exon/intron structure and sequence order (Wang et al. 2008).

Besides their use for the characterization of genome structure and organization, molecular markers are also important tools for the development of new varieties through marker-assisted breeding (Xu 2010). Since the level of polymorphism in elite material is frequently low, it is important to have access to a large number of markers (usually several thousand), so that even between closely related lines or varieties sufficient polymorphic markers are available for the mapping of qualitatively or quantitatively inherited traits as well as for marker-assisted selection or marker-assisted backcrossing. While in many diploid species (especially those with a high level of polymorphism) there is now a tendency to use single nucleotide polymorphisms (SNPs) for such applications (Ganal et al. 2009), microsatellite markers are still the object of active marker development efforts in other species (Ren et al. 2009). In polyploid species and in species with a low level of polymorphism such as tobacco, the current markers of choice are still microsatellite or simple sequence repeat (SSR) markers. SSR markers detect a higher level of polymorphism than other marker systems and can be easily scored even in polyploid species (Röder et al. 1998), whereas scoring of SNP markers can be technically challenging in allopolyploid and autopolyploid species.

Tobacco is an allopolyploid species (n = 24) and shares its basic chromosome number of n = 12 with many other solanaceous species such as tomato, potato, pepper and eggplant. The species is most likely the result of a tetraploidization event (Lim et al. 2004; Clarkson et al. 2005) involving Nicotiana sylvestris (S-genome) and a species closely related to modern-day Nicotiana tomentosiformis (T-genome). With both genomes together, tobacco is at the high end of genome sizes (4.5 Gbp) in the Solanaceae (Arumuganathan and Earle 1991) and contains a large proportion of repetitive sequences (Kenton et al. 1993; Zimmerman and Goldberg 1977). Until recently, little information was available in terms of genetic mapping and molecular marker development in the Nicotiana species (Lin et al. 2001; Rossi et al. 2001; Suen et al. 1997). This changed with the generation of 637 microsatellite markers and a first version of a genetic map covering 1,920 cM and 282 markers (Bindler et al. 2007). However, this map was not able to fully cover the tetraploid tobacco genome and the number of markers, while sufficient for diversity analyses (Moon et al. 2008), was still insufficient for many breeding purposes, although it was used for locating traits responsible for the formation of certain tobacco leaf surface exudates (Vontimitta et al. 2010). Recently a further study was published (Wu et al. 2009c), concerning the comparative mapping of the diploid ancestor of the T-genome (N. tomentosiformis) and a related species to the S-genome (N. acuminata) using 262 and 133 COS markers, respectively, together with a set of microsatellite markers that were mapped in these two species. The results from these mapping studies revealed that the tetraploid tobacco genome has undergone a number of chromosomal rearrangements compared to these diploid genomes. Furthermore, in the same study, with the COS markers that had also been mapped in the tomato genome, it could be shown that a number of reciprocal translocations and inversions (>10) differentiate the ancestral tobacco genomes from the tomato genome.

With respect to tobacco genome sequencing efforts, the Tobacco Genome Initiative (TGI) (Gadani et al. 2003) aimed to sequence more than 90% of the N. tabacum genomic sequences containing open reading frames using methyl filtration technology (GeneThresher®), which targets unmethylated gene-rich regions. More than 1.3 million genome survey sequences (GSS) are now available in GenBank or through the public website http://www.pngg.org/tgi/. The goals of the research reported here were to exploit the TGI sequences to generate a sufficient number of microsatellite markers for saturating the tobacco genome and to produce a high-density, high-resolution genetic map that could be used for tobacco breeding purposes and for further analysis of the tetraploid tobacco genome.

Materials and methods

DNA isolation and plant material

DNA was extracted from freeze-dried leaves of plants that were grown in the greenhouse as previously described by Bindler et al. (2007). Functionality tests of the developed markers and an initial analysis of polymorphism were performed using a panel of 16 tobacco varieties representing the main types of tobacco (Flue-cured, Burley, Oriental, Dark, and others) plus single accessions representing N. sylvestris and N. tomentosiformis (Table 1). Seeds for this material were obtained from the germplasm collection at Philip Morris International. The F2 population consisted of the same 186 individual plants as published by Bindler et al. (2007) and was originally provided by Dr. Ramsey Lewis (North Carolina State University).

Table 1 List of accessions tested with 5,500 microsatellite primer pairs

Processing of TGI sequence data and primer generation

A total of 1,379,067 raw sequences and 55,411 expressed sequence tags (ESTs) obtained from the TGI (http://www.pngg.org/tgi) were surveyed for the presence of microsatellite sequences and further processed for primer development for 5,500 candidate SSR loci. The sequences and the corresponding quality files were initially prepared by screening for vector and Escherichia coli contamination and by removing redundant sequences using the cross_match© program (http://www.phrap.org/). Sequences containing microsatellite motifs with at least eight repeats and a minimum match of 90% to the respective microsatellite motif were identified using Tandem Repeats Finder (Benson 1999) and BLAST (Altschul et al. 1990) searches. Oligonucleotide primer pairs flanking the respective microsatellite sequence were designed using the Primer 3.0 program (Rozen and Skaletsky 2000). Non-redundant primers were selected to be approximately 20 bases long, to have a GC-content between 20 and 80%, and to have a melting temperature between 57 and 63°C (optimum 60°C). In a second elimination step, potentially remaining duplicated sequences were removed by checking primer sequences, microsatellite motifs, and the cross_match© scores.
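The selection criteria above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (which relied on Tandem Repeats Finder, BLAST and Primer 3.0): the function names are hypothetical, only perfect repeats are detected (the 90% match tolerance is not modelled), and only the GC-content criterion for primers is checked.

```python
import re

def find_ssrs(seq, motif_len=2, min_repeats=8):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    Assumption: only perfect repeats are reported; imperfect repeats
    (the 90% match criterion) are outside the scope of this sketch.
    """
    # A motif of motif_len bases followed by at least min_repeats - 1 copies of itself.
    pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (motif_len, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(2)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAA...
        hits.append((m.start(), motif, len(m.group(1)) // motif_len))
    return hits

def gc_percent(primer):
    """GC content of a primer in percent."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)

def passes_gc_filter(primer, low=20.0, high=80.0):
    """True if the GC content lies within the 20-80% window given in the text."""
    return low <= gc_percent(primer) <= high
```

For example, find_ssrs("ACGTC" + "TA" * 10 + "GGCAT") reports a single TA repeat of ten units starting at position 5, whereas a run of only seven TA units would be ignored.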

Primer testing

Forward primers were labeled with FAM, HEX or ROX for fragment analysis on Applied Biosystems 3100 Genetic Analyzers. The respective fluorescent dyes were selected according to the expected size of the PCR fragment of the microsatellite marker. The FAM dye was used for fragments between approximately 90 and 170 bp, the ROX dye was used for fragments between approximately 170 and 207 bp and the HEX dye for fragments longer than 207 bp. Testing of primer pairs for functionality and linkage mapping were performed in PCR volumes of 10 μl with an annealing temperature of 55°C during 45 PCR cycles according to Bindler et al. (2007).
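The dye selection rule can be written down as a small lookup, shown here as a Python sketch. The size ranges in the text are approximate, so the exact handling of the boundary values (170 and 207 bp) and of fragments shorter than 90 bp is an assumption made for illustration; the function name is hypothetical.

```python
def assign_dye(expected_size_bp):
    """Return the dye label for a forward primer from the expected fragment size."""
    if expected_size_bp < 90:
        return None  # assumption: the text gives no rule below about 90 bp
    if expected_size_bp < 170:
        return "FAM"  # approximately 90-170 bp
    if expected_size_bp <= 207:
        return "ROX"  # approximately 170-207 bp
    return "HEX"      # longer than 207 bp
```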

Mapping

Two different software packages were used for mapping of the SSRs: Map Manager QTXb20 (Manly et al. 2001, http://www.mapmanager.org) and JoinMap® 3.0 (Van Ooijen and Voorrips 2001; Plant Research International BV, Wageningen, Netherlands). Map Manager was used for the main mapping procedure (settings: linkage evaluation F2 intercross, search linkage criterion P = 0.05, map function Kosambi, cross type line cross). The map positions of some dominantly scored markers and of the phenotypic trait flower colour, both generated in the previous analysis (Bindler et al. 2007), were optimized manually, taking into consideration the number of crossovers (as low as possible) and the length of the linkage group (as short as possible). Grouping of markers and traits as well as their segregation patterns were verified using JoinMap with the following settings: used linkages with REC smaller than 0.400 and LOD larger than 1.00, threshold for removal of loci with respect to jumps in goodness-of-fit 5.000, Kosambi mapping function. The final map was drawn using JoinMap.
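For reference, the Kosambi mapping function used in both programs converts a recombination fraction r into a map distance d = 1/4 ln((1 + 2r)/(1 - 2r)) Morgans. The following Python sketch implements this standard formula and its inverse; the function names are illustrative and are not part of Map Manager or JoinMap.

```python
import math

def kosambi_cm(r):
    """Map distance in centiMorgan for a recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    # d [Morgan] = 1/4 * ln((1 + 2r) / (1 - 2r)); multiply by 100 for cM.
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

def kosambi_inverse(d_cm):
    """Recombination fraction for a Kosambi map distance given in centiMorgan."""
    # r = 1/2 * tanh(2d), with d in Morgan.
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)
```

A recombination fraction of 0.10, for instance, corresponds to roughly 10.1 cM, while the REC threshold of 0.400 used in JoinMap corresponds to roughly 55 cM.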

Results

Microsatellite marker development from tobacco genomic sequences

A total of 1,379,067 raw sequences from the TGI and 55,411 ESTs were analyzed for the presence of microsatellite motifs using the set of bioinformatics tools described in “Materials and methods”. After an initial screen of 288 primer pairs derived in equal numbers from ESTs and genomic sequences, EST-derived microsatellites were excluded from further marker development because most of the microsatellites identified in the ESTs contained short repeats (<10 repeat units, mainly of trinucleotide motifs), and these trinucleotide repeats showed a lower than expected level of polymorphism between the parents of the mapping population (11 vs. 36% for the dinucleotide motifs from the genomic sequences). Thus, the developed markers were predominantly derived from single or low copy genomic sequences or introns. The 5,500 markers with the best primer design scores were selected for subsequent marker testing. Approximately 60% of the microsatellite markers contained predominantly TA repeats, 22% contained predominantly GA repeats, and 14% contained predominantly GT repeats. The remaining 4% harbored mixed repeat types that could not be assigned clearly to any of the above types. 88% of the markers contained 8 (minimum) to 25 repeating units while the others were mostly larger and/or contained a mixture of repeat types.

Testing microsatellite markers for functionality

In an initial analysis, 5,500 primer pairs were evaluated for functionality on a test panel consisting of 16 different tobacco lines or varieties (Table 1). The panel included the mapping parents of the Hicks Broadleaf × Red Russian population and, for the assignment of markers to the different genomes, representative accessions of Nicotiana tomentosiformis and Nicotiana sylvestris. Of the 5,500 primer pairs tested, more than 93% (5,119) were functional under our analytical conditions. Functional markers were defined as those producing no more than six amplified fragments, with at least one fragment in the size range defined by the sequence used for primer design. Based on the results of the primer pair tests with the 16-sample panel, the markers were classified according to their number of loci (1, 2, 3, 4 and multiple loci). The number of loci detected by an individual primer pair was determined from the number of fragments amplified in the samples of the test panel, provided that all samples displayed comparable fragment numbers (supplementary material S1, Table 2). Of all functional primer pairs, 74% amplified only single fragments, 22% amplified two and 4% more than two fragments. Markers, especially those amplifying single fragments, were assigned to one of the two genomes (T/S) based on the presence of an amplification product in the N. tomentosiformis or N. sylvestris accession, respectively. All functional primer sequences and other relevant data are available from the electronic supplementary material (S2).

Table 2 Functionality and number of detected loci for the investigated 5,500 tobacco microsatellite markers

Mapping of the microsatellite markers in the Hicks Broadleaf × Red Russian F2 population

A total of 2,415 (47%) of the functional markers were polymorphic between the parents of the mapping population. These candidate markers were then analyzed on the entire mapping population. 379 of these polymorphic markers could not be mapped due to insufficient data quality (low stability, difficult scoring due to strong stuttering, polymorphic between the parents but monomorphic within the mapping population) and thus had to be eliminated from the mapping experiment. This left 2,036 mapped markers, representing 37% of the newly identified primer pairs. Since the same plants were used as in Bindler et al. (2007), the 282 previously mapped markers were also included in the final map.
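The marker counts reported so far can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: all figures are taken from the text, and the variable names are chosen for illustration.

```python
# Counts as reported in the text.
tested = 5500             # primer pairs tested
functional = 5119         # functional primer pairs
polymorphic = 2415        # polymorphic between the mapping parents
unmappable = 379          # dropped because of insufficient data quality
previously_mapped = 282   # markers from Bindler et al. (2007)

def pct(part, whole):
    """Percentage of part relative to whole."""
    return 100.0 * part / whole

newly_mapped = polymorphic - unmappable          # 2,036
total_mapped = newly_mapped + previously_mapped  # 2,318

print(f"functional:   {pct(functional, tested):.1f}% of tested")
print(f"polymorphic:  {pct(polymorphic, functional):.1f}% of functional")
print(f"newly mapped: {pct(newly_mapped, tested):.1f}% of tested")
print(f"total mapped: {total_mapped}")
```

The percentages agree with the rounded values given above (93, 47 and 37%), and the total of 2,318 markers is the number placed on the final map.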

With the 282 previously mapped markers (including the morphological marker of flower colour), 2,318 markers detecting 2,363 loci were mapped on this high-resolution genetic map of tobacco. All mapped markers together generated 24 clearly defined linkage groups and the entire map of the tobacco genome covered 3,270 cM (Fig. 1). According to the amplification tests on the two samples from the ancestral genomes, the markers were assigned to the two different genomes of tobacco. 38% of all functional markers were assigned to the S-genome, 22% were assigned to the T-genome and those remaining could not be assigned due to lack of functionality in the ancestral genomes or because of simultaneous amplification from both genomes. In most of the linkage groups, the markers could be attributed predominantly to one of the two genomes. Eleven linkage groups could be clearly assigned to the S-genome and nine linkage groups to the T-genome. Four linkage groups consisted of both S- and T-genome specific markers mainly grouped in different parts of the respective linkage group. These four linkage groups were stable up to very high LOD-scores. On the S-genome linkage groups, 1,163 markers were located covering 1,810 cM whereas on the T-genome linkage groups, 1,200 markers covered 1,460 cM. On average, each linkage group is represented by 98 SSR markers (Table 3). The marker density on the T-genome was higher than on the S-genome due to the shorter genetic distances. Only two linkage groups (9 and 11) showed major areas of disturbed segregation as already observed previously (Bindler et al. 2007).

Fig. 1

High-density genetic map of the tobacco genome. Red chromosomal segments were assigned to the T-genome. Blue chromosomal segments were assigned to the S-genome. Numbers on the left side are centiMorgan counted from the top of the chromosome. Numbers in brackets behind marker name on the right side display the numbers of additional cosegregating markers (±0.5 cM)

Table 3 Tobacco genetic map data displayed per linkage group

With 186 F2 individuals, the generated map had an average resolution of approximately 0.27 cM per crossover and a 90% probability of separating two markers that are approximately 0.6 cM distant from each other, provided that the scoring of the markers was accurate and determined in most or all of the progeny plants. In order to determine the quality of the map, a number of control analyses were performed. 5% of all data points generated throughout the mapping were randomly checked and 5% of markers classified as difficult to score were selectively checked. Less than 0.05% of the re-checked data points had technical or other problems in the data set. The final quality check included the specific re-analysis of all double crossovers in the mapping data. After this analysis, 371 real double crossovers were still present considering all mapped markers (Table 3). On average, approximately two-thirds of the double crossovers within a chromosome were associated with regions of low marker density and large genetic distances between these markers. In addition, each linkage group had one major region of markers separated by small genetic distances, in which frequently a significant number of markers (up to 28) showed essentially no recombination between each other (electronic supplementary material S3 and Fig. 1).
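The resolution figures quoted above can be reproduced under simple assumptions, sketched here in Python. The sketch assumes that each F2 plant contributes two informative meioses, that markers are scored codominantly without error, and that for small distances the recombination fraction equals the map distance; whether the published values were derived in exactly this way is not stated in the text, and the function names are hypothetical.

```python
def meioses_in_f2(n_plants):
    """Number of informative meioses in an F2 population (two gametes per plant)."""
    return 2 * n_plants

def resolution_cm(n_plants):
    """Map distance corresponding to a single crossover in the population."""
    return 100.0 / meioses_in_f2(n_plants)

def separable_distance_cm(n_plants, prob=0.9):
    """Smallest distance at which at least one crossover is observed with probability prob."""
    m = meioses_in_f2(n_plants)
    # P(no crossover in m meioses) = (1 - r)^m, so solve 1 - (1 - r)^m = prob for r.
    r = 1.0 - (1.0 - prob) ** (1.0 / m)
    return 100.0 * r

print(round(resolution_cm(186), 2))          # cM per crossover
print(round(separable_distance_cm(186), 2))  # cM separable with 90% probability
```

With 186 plants this gives approximately 0.27 cM per crossover and a 90% separation distance of approximately 0.6 cM, in line with the values stated above.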

Although the largest distance between two markers is still 16 cM and a number of markers are cosegregating, the average distance between mapped markers is 1.41 cM, based on a total genetic distance of 3,270 cM. Given a genome size of 4,500 Mbp, this is roughly equivalent to an average distance of 2 million base pairs between individual mapped SSR markers.
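The averages given here follow directly from the map length, the genome size and the number of mapped markers, as the following short Python calculation shows. All inputs are taken from the text; the genome-wide ratio of physical to genetic distance is an average only and does not reflect local variation in recombination.

```python
MAP_LENGTH_CM = 3270     # total genetic map length
GENOME_SIZE_MBP = 4500   # estimated tobacco genome size
MAPPED_MARKERS = 2318    # markers on the final map

avg_spacing_cm = MAP_LENGTH_CM / MAPPED_MARKERS
avg_spacing_mbp = GENOME_SIZE_MBP / MAPPED_MARKERS
mbp_per_cm = GENOME_SIZE_MBP / MAP_LENGTH_CM

print(f"{avg_spacing_cm:.2f} cM between markers")
print(f"{avg_spacing_mbp:.2f} Mbp between markers")
print(f"{mbp_per_cm:.2f} Mbp per cM (genome-wide average)")
```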

435 markers amplified two clearly defined fragments in the tetraploid N. tabacum genome (Table 2). In these cases, identification of the homeologous linkage groups of the T- and S-genome requires both fragments to be polymorphic and mapped, which was rarely the case. For 389 markers (89%) only one fragment was polymorphic and could be mapped while the other fragment was monomorphic. Only for 46 markers could both fragments be mapped. Of these, six duplicated loci were located on the same linkage group and 40 mapped to different linkage groups, producing inconsistent homeologous pair assignments.

Discussion

With the development of 5,119 new and functional microsatellite markers for the N. tabacum genome, the number of available functional microsatellite markers has increased roughly ninefold (from 637 to 5,756). This demonstrates the value of the TGI sequences for development of large numbers of SSR markers. Given a genome size of approximately 4,500 Mbp for the tobacco genome, it is estimated that the 5,756 markers increase the marker density to, on average, more than one available microsatellite marker per million base pairs (4,500 Mbp/5,756 markers). This now makes tobacco the species with the largest number of tested SSR or sequence tagged markers in the Solanaceae. The tobacco genome was previously less well mapped than species such as pepper and eggplant. It is now comparable to tomato and potato in terms of number of available sequence tagged markers. More tested SSR markers are also available for tobacco than for other allopolyploid or autopolyploid plant species (Cheng et al. 2009; Somers et al. 2004; Guo et al. 2007).

The established map in the Hicks Broadleaf × Red Russian F2 population is of high quality and high resolution. Most of the observed and confirmed double crossovers are present only in regions with a low marker density and relatively large distances between markers. Furthermore, skewed segregation patterns are found only in limited chromosomal regions (chromosomes 9 and 11). At an average genetic resolution of 0.27 cM and with most markers being scored in a codominant manner, the map is more accurate than the current saturated tomato map, which is based on approximately half as many individuals (Tanksley et al. 1992; Fulton et al. 2002; http://solgenomics.net).

With a total size of 3,270 cM, the genetic map has increased by 1,350 cM compared to the previously published map of 1,920 cM which was incomplete since not all markers could be linked to each other (Bindler et al. 2007). In the current map, all 2,318 markers could be attributed unequivocally to the 24 linkage groups or chromosomes, indicating that the map should cover the entire tobacco genome, although the telomeres are not included in the map due to the lack of suitable markers. Considering that tobacco is an allotetraploid species with two genomes, each segregating in a diploid inheritance mode, the genome length of each genome in centiMorgan is comparable to the full map of tomato of approximately 1,500 cM (Shirasawa et al. 2010). The position of the centromeres of the 24 tobacco chromosomes is not known, although most chromosomes show a single region with a considerable clustering of markers. In other Solanaceous species such as tomato, such a marker cluster is usually associated with the centromere and its constitutive heterochromatin (Tanksley et al. 1992).

It is interesting to note that the vast majority (>90%) of the developed markers are either specific to one of the ancestral genomes or detect two loci (albeit usually only one locus was polymorphic and could be mapped), indicating that the two genomes are significantly different from each other and that, where two loci are identified, they might be located at the same position on the homeologous chromosomes. It was expected that the SSR markers for which two loci could be mapped should lead to the identification of the homeologous groups in the two genomes. This was however not the case since only a very limited number (46) of SSR markers were polymorphic on both chromosomes. Furthermore, in a number of cases linked marker pairs that were polymorphic in both genomes were definitely not located on the same chromosome, indicating that the two genomes have been rearranged in the current forms of the N. tomentosiformis and N. sylvestris genomes and/or after the speciation event which led to the generation of tetraploid tobacco. For both cases there is evidence based on this study and the results of Wu et al. (2009c). There, it could be shown that the genomes of N. tomentosiformis and N. acuminata (a closely related species to N. sylvestris) have undergone a number of chromosomal rearrangements since their separation. Based on the amplification of genome-specific markers in the two ancestral genomes of N. tomentosiformis and N. sylvestris, followed by the assignment of these markers to linkage groups in the current map, it is likely that translocations have also occurred after the polyploidization event since four linkage groups contain genome-specific markers from both genomes. This is independently supported by cytogenetic analyses (Lim et al. 2007) where it could be shown by GISH (genomic in situ hybridization) that numerous intergenomic translocations exist in natural N. tabacum. It also cannot be excluded that the S- and T-genomes have undergone deletion of specific chromosomal segments since the polyploidization event (Lim et al. 2004; Doyle et al. 2008), and recent data (G.B. unpublished data) suggest that on linkage group 9, some genes are absent in the S-genome.

In terms of genetic mapping and molecular marker-assisted breeding, this map represents a significant improvement over existing marker resources. While it has not previously been possible to identify polymorphic markers spread at roughly similar genetic distances over the entire tobacco genome, even in distantly related material, it is now possible to identify a sufficient number (hundreds) of polymorphic markers in crosses within the four main tobacco types (Burley, Flue-cured, Oriental and Dark), so that genes and quantitative trait loci can be mapped with a marker density/spacing which is suitable for marker-assisted breeding (Xu 2010).

The SSR markers and the sequences from which they have been derived will in the future be important tools to further advance tobacco genome analysis, since they can be used as anchor points in the physical mapping of bacterial artificial chromosome (BAC) clones and as anchor points to align the genetic map with a future genome sequence of tobacco. Furthermore, since the SSR markers are located in predominantly single copy sequences, the flanking DNA sequence from which the respective marker has been generated can be used to further align the tomato and potato genome sequences to the tobacco genome via sequence homology analysis, once these genome sequences are fully available and provided that sequence homology is sufficient.