Introduction

The ideal set of molecular marker data for linkage mapping has no missing values, no genotyping errors and the markers segregate in the expected ratio for that type of population. In practice, however, mapping data is complicated by all of these factors. The effect of genotyping errors, especially errors arising in data entry, was quickly recognised in the human genetics mapping literature (Shields et al, 1991). Simulation studies have shown that undetected typing errors at a rate of 1% can lead to incorrect map orders and inflation of map lengths, particularly as marker density increases (Buetow, 1991). The effects of missing values and distorted segregation ratios on map order and map length have had less attention.

This study was motivated by difficulties in placing several markers on a linkage map constructed from marker data scored on a population of 154 doubled-haploid lines of barley (Hordeum vulgare L.). The population had missing values with a mean proportion 4.0% of genotypes missing, and it was thought likely that typing errors could be present in the dataset. In common with many barley doubled-haploid populations, the population also showed some segregation distortion and it has been suggested that this is because of genes affecting survival in anther or microspore culture (Zivy et al, 1992; Manninen, 2000). Bailey (1961) showed that the equation for the maximum likelihood estimate of the recombination frequency between two loci in a backcross population (and similarly a doubled-haploid population) is unchanged if one of the loci affects the viability. It is straightforward to show that the maximum likelihood estimate of the recombination frequency is likewise unchanged for two loci that do not affect the viability themselves, but are linked to a third locus affecting viability. In principle, distorted segregation ratios should not give difficulties in mapping unless there are two linked loci, both of which affect viability.

To investigate influences on linkage maps, we have simulated populations of a similar size and structure to the experimental one and have introduced combinations of typing errors, missing values, and distorted segregation ratios. We report on their analysis by different locus-ordering methods.

Methods

Doubled-haploid populations were simulated using the software package GREGOR (Tinker and Mather, 1993). Two inbred homozygous parental lines A and B, with genotypes AA and BB at each locus, were simulated initially. These were crossed to give an F1 population with genotype AB at each locus, and doubled-haploid individuals were simulated from the gametes produced by the F1 generation. In all 30 markers were simulated on three chromosomes, each chromosome consisting of 10 markers placed at equal intervals. The interval lengths for the three chromosomes were 2, 6 and 10 cM, respectively. No interference was simulated, and so Haldane's mapping function was used throughout in the subsequent analyses. The number of individuals in each doubled-haploid population was 150, and 100 populations were simulated for each configuration of marker data.

A total of 12 configurations of marker data were simulated to determine the effect of missing values, typing errors, distorted segregation ratios and combinations of these on linkage mapping. Table 1 shows the combinations used. To generate the populations with distorted segregation ratios, a larger population of doubled-haploid individuals was simulated as described above. The sixth marker on each chromosome was chosen to affect viability, with reduced viability for individuals with the AA genotype for the chromosomes with 2 or 6 cM spacing between markers, or with the BB genotype for the chromosome with 10 cM markers. Individuals with any of these genotypes were removed from the population randomly, with a probability of removal of 0.5. This left a population with distorted segregation ratios of approximately 1:2 at the sixth marker of each chromosome. The degree of segregation distortion at linked markers decreases with distance from the sixth marker.

Table 1 Simulated marker data sets

Missing values and typing errors were randomly generated in the nondistorted and distorted populations to give populations with 10% missing values, 20% missing values, 1% typing errors, 3% typing errors and a combination of both 20% missing values and 3% typing errors.

There were 3 criteria were used to compare locus orders: (a) weighted least squares (Stam, 1993) weighted by the LOD scores, (b) maximum likelihood (Lander and Green, 1987) and (c) minimum sum of adjacent recombination fractions (SARF) (Falk, 1989). The minimum SARF criterion is motivated by the consideration that if the true order of three loci is L1–L2–L3, then the recombination frequency between L1 and L3 is greater than between L1 and L2, or between L2 and L3, and so the sum of adjacent recombination frequencies is smallest for the true order. Olson and Boehnke (1990) have shown in a study of eight methods (not including maximum likelihood) that minimum SARF was among the two best overall locus-ordering criteria. Locus-ordering criteria (a)–(c) can be found in linkage analysis software. For example JoinMap (Stam and Van Ooijen, 1995) uses weighted least squares and PGRI (Lu and Liu, 1995) has several criteria for ordering loci, including minimum SARF and maximum likelihood.

A method is required to search among the locus orders for the optimum value of the criterion. JoinMap uses a stepwise algorithm similar to the seriation method (Buetow and Chakravarti, 1987). PGRI has a choice of search methods: we chose a simulated annealing algorithm (Kirkpatrick et al, 1983) with the maximum likelihood criterion and a combination of simulated annealing and branch-and-bound algorithms (Thompson, 1987) with the minimum SARF criterion. Linkage analyses on the 12 configurations of marker data were performed using each locus-ordering method. The linkage maps for the maximum likelihood and minimum SARF criteria were almost identical to each other and so only the results for the maximum likelihood are presented.

Locus order is subject to sampling variation and small changes in the marker data can affect the estimated order. PGRI enables a measure of the confidence in the locus order to be calculated by bootstrapping. From each set of marker data and the corresponding locus order derived. A frequency matrix was constructed for the bootstrap samples with genome position as the row indicator and locus number as the column indicator, as shown in Table 2. The higher the frequencies of entries there are on the leading diagonal, the more consistency there is among the bootstrap samples and the more confidence that can be had in the estimated locus order. Table 2 can be summarised by the mean frequency of correct locus placings λ, equal to the sum of the leading diagonal divided by the number of loci.

Table 2 Bootstrap results for analysis of one simulation of set D with 10 loci at 2 cM intervals

Results

Table 3 presents the number of correct orders out of 100 replicate simulations for each configuration of marker data, the mean rank correlation between the estimated and the true order and the mean total map length for maps calculated using weighted least squares in JoinMap. Table 4 shows the same results using the maximum likelihood criteria in PGRI. Table 5 gives the mean percentage of correct locus placings, λ, obtained from the bootstrap samples.

Table 3 Linkage analysis using weighted least-squares criterion
Table 4 Linkage analysis using maximum likelihood criterion
Table 5 Percentage of correct locus placings λ, from the analysis of 100 bootstrap samples

The performance of both locus-ordering criteria improves as the distances between marker loci increase. The number of maps with correctly estimated orders is lower for all configurations when the distance between marker loci is 2 cM compared to the 6 and 10 cM distances. The percentage of correct orders at the 2 cM distance is between 70 and 83% in sets A and G (with neither missing values nor typing errors) but this is reduced by the presence of missing values or typing errors, especially in sets E, F, K and L, which all have 3% typing errors. However, at 10 cM intervals, locus order is relatively robust to missing values and typing errors, and only when a combination of 3% typing errors and 20% missing values is present (sets F, L) does the number of correctly ordered maps drop dramatically. At 6 cM distances, there is a marked decrease in the percentage of correct orders when the data has 20% missing values and/or 3% typing errors, especially under the weighted least-squares criterion. A comparison between sets A and G, B and H, etc shows that the presence of distorted segregation ratios has little effect on the number of correctly ordered maps.

The number of correct maps when the data are analysed using the weighted least-squares criterion (JoinMap) is consistently lower than for the maximum likelihood criterion (PGRI). This is particularly noticeable when missing values or typing errors are present. However, the mean rank correlations for maps constructed with loci at 6 and 10 cM intervals are very similar for the two criteria, signifying that although the weighted least-squares criterion gets more orders wrong, bigger differences from the true locus order result using the maximum likelihood criterion. Overall, the mean rank correlation between estimated and true order is very high for both criteria and only drops below 0.9 for sets F and L when the loci are spaced at 2 cM intervals.

The results from sets D, E, F, J, K and L (with 1 or 3% typing errors) show substantial increases in the map lengths compared to the original sets A and G. This effect worsens as marker separation decreases and is most noticeable in those simulations where the markers are at 2 and 6 cM intervals. Using the weighted least-squares criterion, the 2 cM distance groups in sets E, F, K and L have average map lengths of 31.2–33.2 cM, almost twice that of the respective original data sets, A and G (17.0–18.1 cM). The average map lengths for E, F, K and L calculated using the maximum likelihood criterion were 57.4–58.6 cM, showing that this criterion is affected more by typing errors. The effect of missing values in the data is to shorten map lengths for more widely spaced markers, especially when there is also segregation distortion. However, this shortening effect is less obvious under the maximum likelihood criterion.

Bootstrapping was used with the maximum likelihood criterion to obtain a measure of confidence in the locus order. Table 2 shows an example of a single simulation of set D with the 10 loci spaced at 2 cM intervals. Of the 100 bootstrap replications, 74 placed locus 1 in position 1, 20 placed it in position 2, 4 in position 3 and 2 in position 5. The mean percentage of correct locus placings λ=62.6% for Table 2. Table 5 shows the mean λ values for each configuration and marker separation distance. The entries in this table may be interpreted as measures of confidence in the estimated locus orders. Confidence in the estimated orders is highest for sets A and G and lowest for sets F and L. It is consistently lowest for those maps constructed from locus markers at 2 cM intervals: the difference between these and 6 cM distances is substantial. Confidence remains high at above 90% for 10 cM distances in all sets except sets F and L, the nondistorted and distorted populations with a combination of 20% missing values and 3% typing errors. At 6 cM distances, confidence in the loci order falls below 90% for sets C, E, H and K (with 20% missing values or 3% typing errors) and falls still further to below 70% for sets F and L. There are no substantial differences between nondistorted and distorted populations.

Discussion

This study was motivated by difficulties in mapping in a population of 154 doubled-haploid lines of barley. The effects of the factors considered here will depend on the population size: the smaller the population, the more severe the effects are likely to be. Our simulation study has shown that when markers are separated by distances of about 10 cM, a map based on 150 individuals is quite robust to the presence of typing errors, missing values and segregation distortion. However, finer scale genetic maps are of increasing interest, and in this case the map order and length are increasingly affected. The bootstrap analysis reveals the extent of variation in locus orders as a result of sampling. There was found to be less than 85% agreement between bootstrap samples without missing values or typing errors for a 2 cM separation, and introducing either of these introduced more variation in the ordering.

As discussed by several earlier authors, even a low frequency of typing errors can have a substantial impact on the order and length of a linkage map. The most likely effect of a typing error is to introduce a double recombination, so that an individual's phenotype at three neighbouring loci might change from a true phenotype of AAA to ABA. This is increasingly the case as the marker density increases and the proportion of true recombinations between neighbouring markers falls. We found that map inflation was more extreme using the maximum likelihood criterion than using weighted least squares. This was also reported by Shields et al (1991). The advantage of the weighted least-squares approach is that the distances between markers are calculated from the map distances between all pairs of markers on a chromosome, and so the impact of typing errors on the distance between adjacent markers is less severe.

In the human genetics literature, several authors have proposed methods to test for, eliminate or model typing errors. Morton and Collins (1990) adjusted their map of human chromosome 10 by subtracting a multiple of the error rate (determined by double typing a subset of marker data) from the length of each interval. Ott (1991) expressed the apparent rate of recombination p in terms of the rate of misclassification recombinants as nonrecombinants and vice versa, s, and the true recombination frequency, θ0, as

Shields et al (1991) incorporated this formulation of the apparent rate of recombination into an iterative weighted least-squares analysis and found it to give more realistic chromosome lengths for simulated data, human chromosome 10, and Drosophila. Lincoln and Lander (1992) worked with the idea of incomplete penetrance between the true marker genotype, and the observed marker phenotype, which may be subject to error. They included a penetrance function in their maximum likelihood analysis of simulated F2 data, and estimated a LOD score for error for each phenotype, equal to the logarithm of the odds ratio of the likelihood of the data given a correct phenotype to the likelihood of the data given an incorrect phenotype. The phenotypes with a high LOD score for error can then be rechecked and corrected if necessary, and a final map estimated with a smaller error rate.

In plants, genetic researchers typically estimate a linkage map assuming no errors present and then look for improbable genotypes, such as those originating from double recombinations, after the map has been calculated. The module JMCHK of JoinMap (Stam and Van Ooijen, 1995) calculates, for each locus, the probability of each genotype, conditional on the genotypes at the two flanking loci and on the distances between them. Improbable genotypes can be checked against the original autoradiograms and corrected where necessary. After correction, the map should be recalculated and rechecked. An alternative approach would be to incorporate a probability of mistyping into the likelihood or weighted least-squares approach, as has been used for human populations.

We found that the maximum likelihood criterion coupled with the simulated annealing search method gave the correct order more often than using the weighted least-squares criterion with the stepwise algorithm from JoinMap. Analysis using a combination of the simulated annealing and branch-and-bound algorithms to minimise the SARF criterion gave results almost identical to the maximum likelihood criterion. This result differs from that of Shields et al (1991), who found weighted least squares (with or without an error probability) to recover the correct locus order more often. It seems likely that the stepwise method may not always find the optimal value of the criterion. Combination of the simulated annealing and the branch-and-bound algorithms, which do guarantee to optimise a criterion, with weighted least squares, seems a promising technique.

Missing values at the proportion investigated in this study had less effect than typing errors, but they reduced the number of correctly ordered maps, especially for a marker separation of 2 cM. The variability among the order in bootstrap replicates also increases when missing values are present. There was also a tendency for missing values to lead to slightly shorter map lengths for more widely separated markers, especially in the presence of distorted segregation ratios and/or when using weighted least squares. The presence of missing values in the marker data means that information about the number of true recombinations that have taken place along the chromosome is lost. In our simulation data, missing values were dispersed randomly throughout the data, but in experimental data the distribution is likely to be nonrandom, with some markers being difficult to score and so having more missing values than others. This may have more effect on the estimates of map order and length.

Segregation distortion at the level introduced in this study had very little effect on marker order or length. For doubled-haploid populations, segregation distortion does not affect the equation for the maximum likelihood estimate of the recombination fraction when a single locus on a chromosome causes differential viability, although problems arise in the form of biased recombination fraction estimates when more than one locus is involved (Lorieux et al, 1995). Cheng et al (1998) have proposed a method for inferring the presence of and mapping a locus causing distortion, which they refer to as a partial lethal-factor locus, and have applied this to the barley Steptoe–Morex population. There is scope for further research on checking the goodness of fit of a single locus affecting viability.