Introduction

Studies of spatial structure in plant populations can reveal the operation of key evolutionary processes. When spatial structure develops, it may influence the patterns of local breeding and evolution. Population genetic theory predicts that plant populations will exhibit local population genetic structure when gene flow is restricted (Wright, 1943, 1978). In support of these models, Turner et al. (1982) showed with computer simulations that under restricted gene flow and selective neutrality, the population (as a whole) develops a patchy distribution of genotypes. Subsequent computer studies have confirmed that positive spatial autocorrelation, declining with distance, develops quickly under restricted gene flow (e.g. Sokal & Wartenberg, 1983; Sokal et al., 1989; Epperson, 1990, 1995a, b; Sokal & Jacquez, 1991).

Contrary to theoretical expectations, however, spatial structure is rarely consistent across loci or sites, and when found, is generally weak, with many studies showing minor spatial structure on a strictly microspatial scale (Heywood, 1991). Schnabel et al. (1991) have found spatial autocorrelation at short distances for some allozyme loci in Gleditsia triacanthos and Maclura pomifera; Perry & Knowles (1991) found similar patterns in Acer saccharum, and weak spatial structure was reported in Pinus banksiana (Xie & Knowles, 1991), Quercus laevis (Berg & Hamrick, 1995) and Psychotria officinalis (Loiselle et al., 1995). On the other hand, Waser (1987) found no apparent spatial autocorrelation at five allozyme loci in Delphinium nelsonii, although reliable estimates of pollen and seed dispersal suggest that it should have existed. Similarly, no spatial autocorrelation for allozyme loci has been found in Pinus contorta (Epperson & Allard, 1989), Picea mariana (Knowles, 1991), Psychotria nervosa (Dewey & Heywood, 1988) or Carpa procera (Doligez & Joly, 1997).

Several factors may explain why spatial structure is often weaker than anticipated. The usual post hoc explanation is that gene flow must have been greater than expected and sufficient to minimize local spatial structure (e.g. Dewey & Heywood, 1988; Epperson & Allard, 1989; Knowles, 1991; Doligez & Joly, 1997). An alternative is that spatial structure varies with life history stage, and that we are sometimes sampling the wrong cohort. In Cecropia obtusifolia, for example, seedlings show marked microspatial structure (spaced out on a cm × cm scale), but spatial structure declines among saplings (spaced out on a m × m scale), and is absent among adults (spaced 10s of m apart) (Epperson & Alvarez-Buylla, 1997).

A lack of sensitivity in the statistical procedures may also be a problem. Epperson (1995c) has shown that joint-count statistics for pairs of individuals are more sensitive to spatial structure than is the more familiar Moran I-statistic (Cliff & Ord, 1981), based on allele frequencies of sets of individuals. He has pointed out that widely cited simulations (e.g. Sokal & Wartenberg, 1983) employed a different computational procedure for Moran’s I than is typically used for empirical work in natural populations. The denominator of the published Moran statistic leads to an underestimate of the autocorrelation under restricted gene flow. Appropriate adjustments (Epperson, 1995c) indicate that natural and simulated levels of spatial structure match more closely than had previously been realized.

Most spatial structure studies to date are based on allozymes. However, a series of new PCR-based genetic markers is becoming widely used in plant studies, in part because they are more numerous and more informative than allozymes. These include microsatellites (or simple sequence repeat markers), henceforth SSRs (Jarne & Lagoda, 1996), RAPDs – random amplified polymorphic DNA markers (Welsh & McClelland, 1990; Williams et al., 1990) – and AFLPs – amplified fragment length polymorphic markers (Vos et al., 1995). SSR markers are single-locus, codominantly inherited systems, not unlike allozymes, but given the often large number of alleles per locus, allele frequency spectra are sometimes dominated by numerous low-frequency alleles. The information per locus is considerably greater than for allozymes, and in selfers exceeds that for any other markers (e.g. Rongwen et al., 1995). The RAPD and AFLP procedures yield multilocus DNA profiles with dominantly inherited polymorphisms. By virtue of the essentially unlimited number of primer combinations available, however, both RAPDs (Peakall et al., 1995) and AFLPs (Powell et al., 1996) represent more variable assay batteries than allozymes.

The procedures and publicly available software for investigating spatial genetic structure are not well designed for the large numbers of codominant alleles typical of many SSR loci, nor are they well designed to deal with multiple loci. Our purpose in this paper is to introduce a new approach to spatial genetic structure analysis that is applicable to multiallelic codominant, multilocus arrays, a method that can easily incorporate differential weighting of low-frequency alleles, if that is desired. The procedure is intrinsically multivariate, avoiding the need for allele-by-allele, locus-by-locus analysis, though such analyses can always be undertaken, if desired. By treating the genetic data set as a whole, we can strengthen the spatial signal by reducing the stochastic (allele-to-allele, and locus-to-locus) noise.

Our technical objectives are: (i) to develop a multiple-character, spatial autocorrelation analysis that is very general, based on genetic distance methods, (ii) to illustrate that treatment for multiallelic codominant loci, and (iii) to provide nonparametric permutational testing procedures for the full correlogram. We will illustrate these new methods with an example allozyme data set from Caladenia tentaculata, and in that context, we will address the following four questions. (i) How does the multiallele treatment compare with separate single-allele treatment, and which alleles convey most of the information, the common or rare alleles? (ii) How do different loci perform, and how is that related to allelic richness and the overall level of polymorphism? (iii) Is the multilocus treatment more powerful than are separate single-locus treatments, and do we lose information with the overarching treatment? (iv) What is the advantage (if any) of weighting alleles differentially, in terms of their respective frequencies?

Mathematical and statistical methods

Genetic distance measures

We begin by defining the genetic distance between a pair of individuals for a multiallelic codominant locus, such as one would have with either allozymes or microsatellite (SSR) markers; it is a slight modification of the scheme shown in Peakall et al. (1995). Consider a trio of codominant alleles (A, B, C ) and a sextet of diploid genotypes, as shown in Fig. 1(a). The homozygotes (AA, BB, CC ) are represented by the vertices of an equilateral triangle, and the linear distances between these vertices are scaled to be ‘2’ units apart, measured along the side between adjacent vertices. The heterozygotes (AB, AC, BC ) are positioned midway between the respective heterozygotes, and are a linear distance of ‘1’ from each homozygote. From the geometry of the triangle, it is also clear that the linear distance between heterozygotes sharing a single allele (say AB and AC) is 1, and that between any heterozygote and the opposite vertex homozygote (e.g. AB to CC) is √3. The squared distances between the various genotypes are thus 1 (AB to AA, BB, AC or BC; AC to AA, CC, AB or BC; BC to BB, CC, AB or AC), 3 (AB to CC, AC to BB, BC to AA), or 4 (AA to BB, AA to CC, BB to CC). The distance between any genotype and itself is clearly ‘0’. We will need squared distances for what follows, but the scoring scheme is ‘additive’.

Fig. 1
figure 1

Additive distance metric for multiallelic, codominant loci; (a) three-allele (A, B, C) equilateral triangle, with linear distances indicated; (b) equivalent vector representation of a three-allele (A, B, C) equilateral triangle.

We can see the connection between the linear scoring convention and squared genetic distance by defining these distances in terms of paired vectors of three linear genetic variables, defined as shown in Fig. 1(b). The squared distance between any two genotypes is one-half the Euclidean distance between their respective vectors,

where the subscript k = 1, …, K indexes the genetic (scoring) character. Just to illustrate, consider the two vectors for AA and BC. Equation 1 takes the form

the same result we had from the triangle. We can extend to four alleles (equilateral tetrahedron or a vector of length 4), five alleles (equilateral pentahedron or a vector of length 5), and so on. The only thing that is new is the distance between two nonoverlapping heterozygotes, for example AB vs. CD, which is d = √2, in linear form, and d 2 = 2, in squared form. We are led to an interindividual distance matrix of the form shown in Table 1.

Table 1 Unweighted squared genetic distances for a four-allele, codominant locus

Unequal character weights

It has been suggested (Epperson, 1995c) that the information available from rare alleles is greater than that from common alleles, and that it should thus be weighted differentially. We can weight inversely by frequency in such a way that the squared distances shown in Table 1 still obtain for equifrequent alleles, as a special case. We require a simple change in the distance metric (eqn 1) to incorporate allele-specific weights:

where the weights are inversely proportional to the allele frequencies (pk) and the total number of alleles, and are given by

with the subscript k = 1, …, K again indexing the genetic characters. A few examples from the four-allele case should suffice to show the pattern:

The weighting scheme gives the rare alleles more impact, on the premise that spatial proximity for two rare genotypes should carry more weight than proximity of two common genotypes.

The only difficulty with this weighting scheme is that the data used to assess spatial autocorrelation will probably also have to be used to establish the allele frequencies. We recommend substantial sample sizes, but even with large sample sizes, the precise frequencies of the rare alleles are not well established. The usual estimates of (1/pk) are biased upwards for rare alleles, and the bias increases as the true frequency decreases, so we will follow Smouse & Chakraborty (1986) and Xu et al. (1994) in recommending less biased estimates, in this case

where nk is the tally of the kth allele in the study population, and N is the total sample size. Unequal weighting of characters has no impact on the subsequent computations, as long as we use eqn (3). Whether weighting actually helps to detect autocorrelation is an empirical matter, to which we will return for the Caladenia illustration.

Multilocus treatment

Whether we have used weighted or unweighted distances, we can define an N × N genetic distance matrix, D, for a single locus, using the appropriate elements extracted from Table 1, or from the analogous treatment in eqn (5), a matrix that takes the form

To obtain a multilocus distance, we simply add across loci. For the ith and jth individuals,

where each locus is separately scored. We assume here that there are no missing genetic data for particular individuals, i.e. that the data set is complete. We could, with some sort of ‘missing value’ technique, adjust for holes in the data set, but that is beyond the scope of this effort. All later aspects of the analysis amount to manipulations of the elements of D, so the choice of metric is the aspect that really matters. Alternative distance measures can be envisaged and defined for these and other sorts of genetic markers, and we have developed alternatives for dominant/recessive RAPDs (Huff et al., 1993) and haplotypic markers, such as seen with mtDNA assay (Excoffier et al., 1992). We can even introduce some other weighting scheme. Suffice it, however, that as long as our chosen distance metric provides Euclidean closure, all the computations that follow are invariant with respect to that choice of metric.

The covariance matrix

The distance matrix D can be used to compute an inter-individual covariance matrix, C, which is what we will need for the autocorrelation analysis. The matrix C takes the form

where the diagonal terms measure the squared Euclidean distance from each individual genotype to the multivariate centroid of the genetic space,

where m = 1, …, M now indexes all the characters (alleles) of the multilocus set, and where the weights are either ½ or are given by eqn (4). Similarly, the inter-individual covariance terms provide a measure of the tendency of the ith and jth individuals to vary in the same (multidimensional) genetic direction from the centroid,

There is a convenient duality between genetic distance matrices and covariance matrices that makes our task simple. For the ith and jth individuals, Gower (1966) has shown that we can derive the squared distance, from eqns (10) and (11), via

In whatever way we choose to define our genetic characters, and whatever we choose as a weighting scheme, we can always convert the covariance matrix, C, into a corresponding genetic distance matrix, D, where the characters are defined in the same fashion. It is more convenient for us, however, to compute D directly, as described earlier, and then to effect a back-translation to C, using the ‘centring’ technique described by Gower (1966). Collapsing several algorithmic steps into one equation, we define cij as

The first summation is over all N elements of the ith row; the second summation is over all N elements of the jth column; the third summation is over all N2 elements of the matrix D, including the implicit diagonal zeroes. We have now converted our genetic characters, embedded in the multiallelic, multilocus distance matrix, D, into a corresponding genetic covariance matrix, C.

The X(h)-matrices

Corresponding to the covariance matrix, C, among individuals, we also need to define a set of corresponding N × N matrices for the spatial distances, X(h), between pairs of individuals separated by h steps (or ‘lags’, as they are sometimes called in geospatial literature),

where xij(h) = 1 for all pairs of individuals (i and j) that are h spatial distance classes apart (h lags apart), and xij(h) = 0 otherwise. The diagonal element is the number of nonzero pairs involving the ith individual at distance (lag) h. We have a separate matrix for each class (h) of spatial separation. The essence of spatial autocorrelation analysis is to compare the elements of C with those of X(h), for sets of paired observations that are h steps apart (h = 1, …, H).

To consolidate the idea of a ‘distance class’, consider the seven-individual hypothetical population in Fig. 2(a). For purposes of illustration, let us suppose that genetic propagule (seed or pollen) movement is animal-mediated, and that the animal vector cannot cross water. The wavy line in Fig. 2(a) represents the stream network, so genetic movements (in our hypothetical example) must accommodate the drainage. Although the “line-of-sight” distances between points A and F, B and F, E and F, and F and G are small, those direct connections are less relevant than the travel paths that would have to be taken by the animal vectors, all routed through positions C or D. Although the physical distances between the adjacent (immediately connected) points are not precisely equal, we arbitrarily denote them as all being members of the first distance class (h = 1). That is to say, all such pairs are said to be ‘one step (lag) apart’. The appropriate form for X(1) is shown in Fig. 1(b). Individual A is adjacent only to B, so x(1)AA=1, and x(1)AB=1=x(1)BA. On the other hand, individual B is adjacent to both A and C, so x(1)BB=2, whereas x(1)BA=1=x(1)AB and x(1)BC=1=x(1)CB. Because individual B is connected to both A and C at distance (h = 1), it is counted twice. We continue in this fashion for points C through G. For two-step connections (h = 2), A vs. C, B vs. D, B vs. F, C vs. E, F vs. E, D vs. G, the analogous treatment is shown in the matrix X(2) of Fig. 1(b). Individuals A and G are only involved in two-step distances once each, so the Ath and Gth diagonal elements are ‘1’, but all the other individuals are involved twice each in two-step connections, so their diagonal elements are each ‘2’. The three-and four-step connections are shown in X(3) and X(4), respectively. There is also a single five-step connection, that between A and G, so X(5) (not shown) has ‘1’s in all four corners and ‘0’s everywhere else. In this fashion, we can formally describe the connections between individuals separated by ‘h distance classes’, for h = 1, 2, 3, …, and so on.

Fig. 2
figure 2

Translation of spatial field sample into connection matrices for an example of a hypothetical seven-individual collection in the headwaters of a small drainage: (a) portrayal of available single-step connections, disallowing traverse of the stream system; (b) the connection matrices, X(h) for first (h = 1), second (h = 2), third (h = 3) and fourth (h = 4) distance classes. The matrix for h = 5 is not shown, but has ‘1’s in all four corners.

The autocorrelation coefficient

For all pairs of individuals that are ‘h steps apart’, we compute a correlation coefficient, according to the formula shown below. The two sets of genetic variables involved in a pairwise comparison of the ith and jth individuals are the same, however, with only their spatial location being different, so we describe the correlation as a spatial autocorrelation of individuals h steps apart (or ‘at lag h’).

where the numerator is the sum of all N (N – 1) off-diagonal ‘element-by-element’ products of C and X(h), and the denominator is the sum of all N diagonal ‘element-by-element’ products of those same two matrices. The coefficient r(h) is a proper correlation coefficient, with a mean of ‘0’ when there is no autocorrelation, and bounded by [–1, +1], closely related to Moran’s –I(h) coefficient, except that the x(h)ii terms for Moran’s coefficient are either ‘1’ or ‘0’, depending on whether the ith individual is (or is not) involved in pairs at distance class h, with no allowance for how many times the ith individual is paired with other individuals at that distance. Although the two coefficients have the same qualitative behaviour, we prefer r(h), because it is both completely general (multivariate) and a proper correlation. We compute an r(h)-value for each matrix X(h), and construct a ‘correlogram’ from the full series (h = 1, …, H ) of lags.

Sampling from the null distribution

Having computed an array of autocorrelations, one per distance class, we turn next to the question of how to test their departures from the null hypothesis that each is ‘0’. If there is no pattern of spatial genetic pattern within the site, then the null hypothesis should be that each of the correlations is drawn from a distribution of mean 0. If we represent the collection of H autocorrelations by a vector of length H, denoted R = [r(1), r(2), …, r(H)]T, then the null hypothesis is that R = 0. If we conclude from analysis of the data that the actual correlations at close distances are positive, we can also expect those at greater distances to be negative. The internal geometry of the covariance matrix C virtually guarantees it. We thus need to treat the vector R as a unit, representing all of the spatial inter-relationships in the data set.

To determine the null distribution for R, we provisionally accept the (null hypothesis) view that genetic affinity and spatial affinity are not related at all, and we randomize spatial positions of the N separate individuals. That amounts to permuting ID numbers (the rows and corresponding columns) in the matrix C, while holding those of the X(h)-matrices constant, effectively permuting whole genotypes amongst the spatial positions occupied. For the mth permutational shuffle, we compute a new array of estimated r(h)-values, packed into a vector of length H, denoted Rm. We permute C many times, say (M – 1) = 999, each time extracting the estimated vector of r-values, packed into (M – 1) sample Rm = {rm−5.5(h)}-vectors, all drawn from the same null distribution with no autocorrelation. On the premise that the genetic data were themselves drawn from that same null distribution, we use the estimated vector, RG = {rG−5.5(h)}, drawn from the actual positions of the individuals, as the Mth random realization.

To test the autocorrelation for a particular distance class, say the hth, we tally the M random estimates of r(h), extracting an estimated mean and standard deviation and generating an empirical confidence interval around that null-hypothesis mean. When testing for a positive autocorrelation (one-tailed test), we can simply tally the number of random r(h)-values that are at least as large as that actually seen, i.e. &α^=Tally(r(h)mr(h)G)/M. Because the actual value is also included in the random set, the tail probability will always be &α^≥M−1. If a two-tailed test is in order, because significant negative r(h)-values are also deemed to be relevant, then we compute

where r̄•(h) is the average of the randomized estimates, and where s2h is their estimated variance. Substituting individual r m ( h ) -values into eqn (16), we compute an empirical null distribution for t2h, m, against which to compare the actual value, t2h, G, with &αcirc;=Tally(t2h,mt2h , G)/M. Alternatively, we can compare rG( h ) with a confidence interval spanning a desired fraction (say 95% or 99%) of the random values, r( h )m. No distributional assumptions are required for this nonparametric test.

It is customary (Morrison, 1976), when testing large numbers of correlations (several distance classes and many separate analyses of single alleles or single loci), to employ a Bonferroni probability criterion, assuming that all the test criteria are independent. Because we have included all loci and all alleles in a single genetic affinity measure, there is no need for a separate test of each allele and each locus, and the multiple-character analysis (measuring overall genetic affinity) is what we want. Moreover, the separate tests of H different r(h)G-values cannot be independent, given the internal Euclidean closure of the distance matrix, D, and the consequent closure of the covariance matrix, C. For an overall assessment of spatial autocorrelation, we test the null hypothesis, R = 0, against the alternative hypothesis, R0. If spatial autocorrelation is positive at some distances, it must be negative at others; the correlations at different lags are not independent. We need a test that allows for both types of simultaneous departures from the null model. We prefer a multivariate analogue of the univariate t2-test in eqn (16).

From the collection of M null hypothesis Rm-vectors (including the actual vector, RG), we compute a null mean vector,

and a covariance matrix,

We invert this covariance matrix, and compute

which we compare with the distribution of similarly computed values from the random vectors to determine how often a random (null-hypothesis) autocorrelation vector deviates as far from the multivariate centroid as does the observed vector, with &α^=Tally(T2mT2G)/M. This multiclass test is patterned on the classic T2-test (Hotelling, 1951), but we rely on permutational testing, rather than parametric theory. The T2-criterion measures departures from the R = 0 hypothesis for all distance classes simultaneously, and thus provides a global test of spatial genetic clustering that allows for all of the interdependencies within the spatial and genetic data sets.

An illustration with Caladenia tentaculata

Background

In a detailed study, Peakall & Beattie (1996) have examined the ecological and genetic consequences of exclusive pollination by sexually attracted male thynnine wasps in the orchid Caladenia tentaculata (Schldl). Pollination in this species is typical of the many Australian orchids, exploiting the reproductive behaviour of thynnine wasps by production of pheromone-like fragrances and presentation of labellum structures that mimic the female. Pollination occurs when male wasps attempt copulation (pseudocopulation) with the labellum. After pollination, wasps immediately leave the patch, rather than visiting additional plants within the patch. As a consequence of this behaviour, pollen movements approximate a linear (rather than leptokurtotic) distribution, with a mean dispersal distance of 17 m (max. = 58 m). This is among the largest mean pollen dispersal distances known in a herbaceous plant (Peakall & Beattie, 1996).

Despite extensive pollen flow in C. tentaculata, an analysis of five allozyme loci revealed significant genetic clustering within a 20 × 40 m quadrat. The existence of spatial structure, in the presence of homogenizing pollen flow, may be a consequence of restricted seed dispersal instead. Although C. tentaculata’s minute seeds are wind-dispersed, most of the seed probably fall close to the parent, where the chances of seedling establishment may be enhanced by the presence of mycorrhizal fungi that are required for germination.

Here, we revisit the data set of Peakall & Beattie (1996), demonstrating various features of our new method of spatial autocorrelation analysis. First, we use the method to compare the spatial structure of single alleles, comparing the results for rare and polymorphic alleles, along with those of a multiallelic analysis for the same locus. We then compare multiallele analyses for the different loci, relating the results to the levels of polymorphism for single loci, and contrast the results with those of a full multilocus analysis. Finally, we examine the impact of differential weights for common and rare alleles, to see whether weighting clarifies the situation by emphasizing the detectable pattern from the rarer alleles.

Sampling and genetic analysis

This spatial autocorrelation analysis focused on a 20 × 40 m plot, embedded within a larger population of C. tentaculata, a plot orientated in such a way that a maximum number of orchids was contained within it. All flowering plants within the plot were sampled, and allozyme analysis of 16 enzyme systems was performed, yielding 22 putative loci. Of these 22 loci, only the five most polymorphic loci were used for the subsequent analysis: Mr, Pgm, Mdh-1, Got-3 and Pgi-2 (Peakall & Beattie, 1996). Table 2 shows the allele frequencies of the loci, listed in order of decreasing heterozygosity (He). Because our multilocus genetic distance is not currently programmed to deal with missing genotypes, the final data set used for this paper consists of only the 384 plants for which the genotypes were known for all loci. Thus, the allele frequencies reported here differ slightly from those published in Peakall & Beattie (1996). Fig. 3 shows the distribution of the 384 plants within the 20 × 40 m plot.

Table 2 Allele frequencies and expected heterozygosities (gene diversities) within the plot of Caladenia tentaculata for the five polymorphic allozyme loci
Fig. 3
figure 3

Distribution of 384 Caladenia tentaculata plants sampled within a 20 m × 40 m plot.

Microgeographical distance classes of 1 m were used in all analyses, so that the first distance class included all distances in the (0,1) interval, (1,2) interval, and so on. Tests of significance were computed for each distance class by comparing the observed value of rG(h) with those obtained from 999 spatial permutations of the 384 sampled plants. The observed data were added as the 1000th permutation, on the null premise that the actual data were genetically random with respect to spatial position. The 1000 sets of pseudocorrelations were then sorted, and a 95% confidence interval was constructed from the 975th value and 25th value, respectively. We assessed significance of a single correlation with the two-tailed t2h-criterion, as computed from eqn (16), and that of the correlogram as a whole with the two-tailed T2-criterion, eqn (19). We would remind the reader at this juncture that if we wished to test all of these separate th2 criteria, as though they were independent, a Bonferroni correction would be in order, but because our preference is for a single test of the total correlogram, via T2 (see below), we will ignore that nuance here.

Single-allele vs. multiallele treatment

Classical spatial autocorrelation analysis of genetic frequencies is executed one allele at a time, and our treatment is multiallelic. There is nothing to prevent us, however, from examining each allele separately, and we illustrate with the Mr locus. Fig. 4 shows the correlograms for each of the Mr alleles, for a comparison with the correlogram for a multiallelic treatment of the locus. Mr2 showed the strongest spatial structure, with r(1) = 0.175, and with r(h) remaining significantly positive out to 7 m. Mr1 shows the same trend, and Mr4 shows a similar (but weaker) trend. Interestingly, no significant spatial structure was detected for Mr3, although it occurred at higher frequencies than Mr4 (Table 2). In general, we are interested in the demographic (propagule flow) processes causing autocorrelation. For those purposes, individual alleles are best viewed as replication, and their differences are best viewed as the stochastic consequences of random genetic sampling of alleles, scattered across a spatial landscape. The multiallele treatment shows the essential pattern of spatial genetic affinity, allows for the correlations among alleles, extracts all the information, and (by virtue of averaging across these discrete genetic variables) ‘smoothes out the bumps’ in the correlogram. The variation among alleles (the ‘noise’) is reduced and the pattern of genetic affinity (the ‘signal’) is clear.

Fig. 4
figure 4

Unweighted correlograms (solid lines) for each of the Mr alleles and the corresponding multiallelic (single-locus) correlogram, with 95% null hypothesis confidence regions indicated by dotted lines; probabilities of total correlogram test criterion t2 (for individual alleles) or T2 (for the full locus) are shown in the top right corner.

Single-locus vs. multilocus analysis

We show a series of separate, unweighted correlograms for each locus and one for the multilocus treatment in Fig. 5. All five loci showed positive values of r(h) in the first three to seven distance classes, and (with the exception of Pgi-2) significant spatial structure was apparent for the shortest distance classes, as shown by the values of the r(h)-values that exceeded the upper 97.5% confidence limits. The most striking result was that for Mr, with r(h) remaining significantly positive out to 7 m (i.e. seven steps, or seven ‘lags’). Oscillation of the correlogram between positive and negative values is apparent for Mr, beyond 7 m, and to a lesser extent for Got-3 as well, consistent with a pattern of strong microspatial structure. The multilocus correlogram shows a similar pattern to that for Mr, but with a somewhat smoother curve. The point at which the curve first crosses the x-axis provides an estimate of patch size (Sokal & Wartenberg, 1983), which is between 2 and 7 m. The overall departure from the null hypothesis of no spatial genetic clustering is resoundingly rejected for the multilocus treatment.

Fig. 5
figure 5

Unweighted correlograms (solid lines) for each locus separately, and the corresponding multilocus correlogram, with 95% null hypothesis confidence regions indicated by dotted lines; probabilities of total correlogram T2 are shown in the top right corner.

With the exception of Got-3, the strength of autocorrelation decreased with decreasing heterozygosity (cf. Table 2), ranging from high in Mr (He = 0.68) to low and not statistically different from zero in Pgi-2 (He = 0.10), showing that the strength of the spatial signal increases with the level of polymorphism, rather than the number of alleles per se. Although Pgi-2 exhibited four alleles, the most common had a frequency close to 0.95, our arbitrary criterion for exclusion of a locus from analytical consideration (Table 2). None of the individual alleles showed any pattern for this locus (not shown). There is little to be gained by attempting to detect spatial pattern for a locus that shows almost no variation!

Weighted vs. unweighted analysis

One possible reason for the absence of autocorrelation for essentially invariant loci is that there are few useful comparisons in the data set, so that even if strong spatial structure were to exist, it would be difficult to detect. One possible solution to this problem is to weight the pairwise genetic distances inversely by allele frequency, as described in the Methods section. We evaluated this option by using the codominant weighting scheme in eqn (5). The analytical results of inverse frequency weighting are shown in Fig. 6. Empirically, weighting had little effect on the outcome, except for Mdh-1 (compare Mdh-1 in Fig. 5 and 6). For Mdh-1, the weighted analysis indicated stronger autocorrelation than was evident from the unweighted analysis, with r(1) substantially above the 97.5% confidence interval and r(h) remaining high and mostly significant out to 12 m. The weighted results for Mdh-1 reinforced the finding of strong and significant spatial autocorrelation for the other loci. Nevertheless, the overall interpretation from the weighted analysis is identical to that from the unweighted analysis. With the multilocus approach, we have the advantages of considerable averaging. Weighting creates no particular problems, but it does not seem to offer much additional resolution.

Fig. 6
figure 6

Weighted correlograms (solid lines) for each locus separately, and the corresponding multilocus correlogram, with 95% null hypothesis confidence regions indicated by dotted lines; probabilities of total correlogram T2 are shown in the top right corner.

Extensions and generalizations

Other types of genetic markers

Although our genetic distance metric in Table 1, and its weighted analogue, eqn (5), are designed specifically for codominant diploid allozyme or SSR loci, there is nothing that restricts us to such markers. With a proper distance metric, we can examine any type of genetic marker available. Elsewhere, we (Huff et al., 1993) have shown how to compute genetic distances for dominant/recessive markers, such as are generated by RAPD (Welsh & McClelland, 1990; Williams et al., 1990) and AFLP (Vos et al., 1995) methods. Both RAPDs and AFLPs can be used in any species, without DNA sequence knowledge. By virtue of the essentially unlimited number of primer combinations available, both RAPDs (Peakall et al., 1995) and AFLPs represent more variable assay batteries than allozymes. With band presence scored as ‘1’ and its absence as ‘0’, the genetic distance between two individuals is a simple tally of the ‘1’ vs. ‘0’ (different bands) tally for those individuals. Unequal weighting for the different loci is easy to add if desired. Given a proper distance metric, of course, everything else is the same.

Relationship to competing methodological alternatives

The distance-based autocorrelation methods described here are related to innovations by earlier workers, and a few words are in order to put this current effort into broader perspective. Sokal et al. (1986) adapted a distance-based version of autocorrelation analysis. More recently, Sokal (see Burgman & Williams, 1995) has formalized that treatment and labelled the outcome a ‘Mantel Correlogram’, which is basically a comparison of the genetic distance matrix, D, with a matrix of counter-variables, patterned after X(h). The method is very similar to that described here, based on a correlational coefficient, but uses D and X, rather than C and X.

Bertorelli & Barbujani (1995) have introduced distance-based versions of both Moran’s I- and Geary’s c-coefficients, labelled AIDA (autocorrelation indices for DNA analysis) coefficients, for the case of linked haplotypes, using either the phenetic or phyletic metrics that we introduced elsewhere (Excoffier et al., 1992). Barring a difference in denominator, their treatment is a binary haploid analogue of the multistate diploid treatment introduced here.

The patterns of spatial structure in C. tentaculata revealed by our new method are qualitatively similar to those in Peakall & Beattie (1996), whose multilocus treatment followed a technique developed by J. Nason (see Loiselle et al., 1995), based on an estimate of Wright’s relatedness coefficient ρ. The estimated value, denoted rij, but defined differently from our eqn (15), is a frequency-weighted average over all alleles and loci. The difference here is that we have allowed for the covariances between alleles, which follow from the fact that the allele frequencies at any one locus must sum to unity. Epperson (1995c) has developed a ‘joint count’ distance, a frequency-weighted measure that reduces each locus to 2-allele form. Each of these methods represents a laudable attempt to allow for inherent information differences that arise with alleles of different frequencies. Our treatment, embodied in eqns (3)–(5), is a more general codominant weighting scheme. Analogues for dominant/recessive loci, as well as for haplotypic marker sets, are easily constructed and fairly obvious. As mentioned earlier, given the choice of distance metric, everything else stays the same.

Beyond technical particulars, we have introduced a generic multivariate method for spatial autocorrelation analysis of genetic data on individuals, requiring only a proper distance metric (and perhaps a weighting scheme). This adds a method for spatial analysis to the battery of tests that can be accessed routinely from the interindividual genetic distance matrix, joining such standard tools as AMOVA (Excoffier et al., 1992; Huff et al., 1993; Peakall et al., 1995) and Matrix Correlation Analysis (Mantel, 1967; Sneath & Sokal, 1973; Smouse et al., 1986; Smouse & Long, 1992). The primary value of generic methods is that they can be used on different sorts of data, a feature that will increase in importance as new genetic methodologies are added to our assay battery. With a truly multivariate approach, we have the additional advantages of averaging over stochastically varying systems. For most purposes, we do not really need a separate test for every allele, or even a separate test for each locus. What we need is one test that addresses the question: ‘Are genotypes in close physical proximity any more similar than those with greater physical separation?’. A multivariate approach provides that, both in the form of the vector RG, and its associated multivariate test criterion, T2G.