Introduction

The limited power of linkage to detect and localize genes of minor or modest effect has led to the widely accepted view that association is the primary tool of gene mapping in humans.1 Unlike linkage, which extends over long genetic distances, allelic associations to a disease are usually restricted to the susceptibility locus itself and very tightly-linked polymorphisms.2 Consequently, the screening of even a megabase of DNA may require 50–100 markers. Although single nucleotide polymorphisms (SNPs) occur at sufficient density in the genome,3 the need to genotype hundreds of individuals for thousands of markers remains prohibitively expensive. One way of considerably reducing cost is to use DNA pooling, whereby DNA samples from multiple individuals are pooled before genotyping. This technique is ideal for screening a large number of markers for associations, although positive results will require confirmation using individual genotype data.4,5,6,7,8,9,10

For a categorically defined disease, DNA pooling is necessarily restricted to a simple case-control design, in which allele frequencies are compared between a pool of DNA from cases and a pool of DNA from controls. The appropriate method of analysis and the power of this simple design have been examined.11 A greater variety of pooling designs is possible for quantitative traits. Bader et al.12 considered symmetric designs under a classical biometrical genetic model and showed that the optimal pooling strategy is to define pools by the top 27% and the bottom 27% of the trait distribution. However, they did not consider asymmetric designs or, more importantly, the impact of different sources of experimental error.

Technical aspects of DNA pooling dictate that errors in allele frequency estimation will arise. For example, DNA quantification, choice of electrophoresis method, ‘plus-A’ stutter, and sensitivity (the minimum reliably detectable difference between pools) are all factors that contribute to discrepancies in allele frequency estimation. This error can be reduced to <5%,13,14 and in the absence of experimental bias may be reduced to as little as 1%.

We have examined the sensitivity of optimal pooling designs for quantitative traits to variations in genetic model parameters and to experimental noise. After confirming the result of Bader et al.12 that a symmetric pooling design with a pooling fraction of 27% in each tail is optimal for a common additive gene, we show the potentially serious loss of power of this design for rare or recessive alleles. We also consider two sources of experimental noise and show that a high level of experimental accuracy is essential for the success of the pooling strategy, and that the impact of experimental noise on optimal design is to lower the optimal pooling fraction. Finally, we provide practical guidelines for optimal sample selection in DNA pooling studies.

Method and results

Genetic model

We assume a diallelic quantitative trait locus (QTL) with alleles $A_1$ and $A_2$, occurring at frequencies $p$ and $q$, respectively. We denote the mean trait effects of the genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ by $a$, $d$, and $-a$, and their frequencies by $P(G)=p^2$, $2pq$, and $q^2$, respectively. The mean effect in the population is therefore $m=a(p-q)+2dpq$. The dominance ratio ($d/a$) is denoted as $c$, while the proportion of trait variance accounted for by the QTL is represented by $\sigma_Q^2$. Under Hardy-Weinberg equilibrium, $\sigma_Q^2=2pq[a-d(p-q)]^2+[2pqd]^2=\sigma_A^2+\sigma_D^2$. The distribution of trait scores ($X$) for each genotype, $G$, is assumed to be normal with mean $\mu_G$ equal to $a-m$, $d-m$, and $-a-m$ for genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$, respectively, and variance $\sigma_R^2=1-\sigma_Q^2$ within each genotype. The trait distribution in the population is thus a mixture of three normal distributions with overall mean 0 and variance 1.
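The quantities defined in this section can be checked numerically. The following is a minimal sketch in Python; the function names `qtl_model` and `effect_for_heritability` are illustrative and do not come from the original analysis. It computes the population mean effect, the additive and dominance variances, and the genotype frequencies and means, and it solves for the value of a that gives a chosen QTL heritability at a given dominance ratio c.

```python
from math import sqrt


def qtl_model(p, a, d):
    """Quantities of the diallelic QTL model under Hardy-Weinberg equilibrium.

    p is the frequency of allele A1; a and d are the genotypic effects.
    Genotype order in the returned tuples is A1A1, A1A2, A2A2.
    """
    q = 1.0 - p
    m = a * (p - q) + 2.0 * d * p * q              # population mean effect
    var_a = 2.0 * p * q * (a - d * (p - q)) ** 2   # additive variance
    var_d = (2.0 * p * q * d) ** 2                 # dominance variance
    var_q = var_a + var_d                          # QTL variance
    return {
        "m": m,
        "var_a": var_a,
        "var_d": var_d,
        "var_q": var_q,
        "var_r": 1.0 - var_q,                      # residual variance within genotype
        "freq": (p * p, 2.0 * p * q, q * q),       # genotype frequencies P(G)
        "mean": (a - m, d - m, -a - m),            # centred genotype means mu_G
    }


def effect_for_heritability(p, c, h2):
    """Value of a for which the QTL variance equals h2, given c = d/a."""
    q = 1.0 - p
    per_unit = 2.0 * p * q * (1.0 - c * (p - q)) ** 2 + (2.0 * p * q * c) ** 2
    return sqrt(h2 / per_unit)


# Example: common additive allele explaining 1% of the trait variance.
a = effect_for_heritability(p=0.5, c=0.0, h2=0.01)
model = qtl_model(p=0.5, a=a, d=0.0)
```

Because the genotype means are centred on m, the frequency-weighted mean of `model["mean"]` is zero and the frequency-weighted mean square equals `model["var_q"]`, which is the sense in which the mixture has overall mean 0 and variance 1.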

Test statistic

We assume that trait values are available for all individuals in a random sample of the population. To test the null hypothesis of no linkage disequilibrium between the marker locus and the QTL, we compare the frequencies of allele $A_1$ in the lower and upper pools. The test statistic for a two-pool design is

$$Z = \frac{\hat{p}_L - \hat{p}_U}{\sigma},$$

where $\hat{p}_L$ and $\hat{p}_U$ are the estimated frequencies of allele $A_1$ in the lower and upper pools, respectively. The variance of $\hat{p}_L - \hat{p}_U$ is $\sigma^2 = V_S + V_U + V_M$, where $V_S$ represents sampling variation, $V_U$ represents variation in the quantity of DNA contributed by the individuals, and $V_M$ represents variation in the measurement of allele frequency. The sampling variance is

$$V_S = pq\left(\frac{1}{2n_L} + \frac{1}{2n_U}\right),$$

where $n_L$ and $n_U$ are the numbers of individuals in the lower and upper pools, respectively.

The accuracy of allele frequency estimation using pooled DNA depends on each individual making an equal contribution of DNA to the pool. However, the process of obtaining equal concentrations of DNA for the individual samples, and then pipetting out equal volumes of the solutions to make up the pool, is subject to experimental error. The variance due to unequal DNA contributions is shown in Appendix A to be

where τ is the coefficient of variation (i.e. standard deviation over mean) of the number of DNA molecules of locus A contributed by each individual.

The frequency of an allele in a pool of DNA is a quantitative measure that is also subject to measurement error. We assume that the effect of measurement error is to increase the variance of the allele frequency estimate of each pool by a constant quantity, denoted $\varepsilon^2$. The contribution of measurement error to the variance of the allele frequency difference between the two independent pools is therefore simply $V_M=2\varepsilon^2$.
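As an illustration of how the three variance components combine, the sketch below computes $\sigma^2 = V_S + V_U + V_M$ and the resulting test statistic. Only $V_M = 2\varepsilon^2$ is taken directly from the text. The forms used for the other two terms are assumptions: $V_S$ is taken to be the binomial sampling variance of $2n$ alleles per pool under the null hypothesis, and $V_U$ is taken to be $\tau^2 V_S$, which is what a first-order (delta-method) argument gives when each individual's DNA contribution varies independently with coefficient of variation τ. The function names are hypothetical.

```python
from math import sqrt


def variance_components(p, n_lower, n_upper, tau=0.0, eps=0.0):
    """Return (V_S, V_U, V_M, sigma^2) for the difference in pool allele frequencies.

    ASSUMPTIONS (not transcribed from the source):
      V_S = p(1-p) * (1/(2 n_L) + 1/(2 n_U))   binomial sampling of alleles
      V_U = tau^2 * V_S                        first-order effect of unequal DNA amounts
    From the text:
      V_M = 2 * eps^2                          measurement error in two independent pools
    """
    v_s = p * (1.0 - p) * (1.0 / (2.0 * n_lower) + 1.0 / (2.0 * n_upper))
    v_u = tau ** 2 * v_s
    v_m = 2.0 * eps ** 2
    return v_s, v_u, v_m, v_s + v_u + v_m


def z_statistic(p_hat_lower, p_hat_upper, sigma2):
    """Two-pool test statistic: frequency difference over its standard error."""
    return (p_hat_lower - p_hat_upper) / sqrt(sigma2)


# Example: pools of 270 individuals each, tau = 0.2, eps = 0.01.
v_s, v_u, v_m, sigma2 = variance_components(0.5, 270, 270, tau=0.2, eps=0.01)
z = z_statistic(0.45, 0.55, sigma2)
```

Under these assumptions $V_S$ and $V_U$ both shrink as the pools grow, whereas $V_M$ does not depend on pool size.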

Under the null hypothesis, $Z^2$ has a $\chi^2$ distribution with one degree of freedom. To compare the efficiencies of different designs, we evaluate the non-centrality parameter (NCP) of $Z^2$ in the presence of a QTL.

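From a random sample of $N$ individuals from the population, individuals with trait values below a threshold $T_L$ are selected for the low pool, while those with trait values above another threshold $T_U$ are selected for the high pool. The expected numbers of individuals in the upper and lower pools are given, respectively, by

$$n_U = N\sum_G P(G)\left[1-\Phi\!\left(\frac{T_U-\mu_G}{\sigma_R}\right)\right], \qquad n_L = N\sum_G P(G)\,\Phi\!\left(\frac{T_L-\mu_G}{\sigma_R}\right),$$

where Φ is the standard normal distribution function. The expected allele frequencies in the two pools are then

$$p_U = \frac{N}{n_U}\sum_G P(G)\,x_G\left[1-\Phi\!\left(\frac{T_U-\mu_G}{\sigma_R}\right)\right]$$

and

$$p_L = \frac{N}{n_L}\sum_G P(G)\,x_G\,\Phi\!\left(\frac{T_L-\mu_G}{\sigma_R}\right),$$

where $x_G$ is the proportion of $A_1$ alleles in genotype $G$ (1, 1/2, and 0 for $A_1A_1$, $A_1A_2$, and $A_2A_2$, respectively). The NCP of the test statistic is then

$$\mathrm{NCP} = \frac{(p_L-p_U)^2}{\sigma^2}.$$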

Optimal asymmetric and symmetric designs

For any set of model parameters, this NCP can be maximised over the thresholds TL and TU, and therefore over the pool sizes nL and nU. Because there are only two variables, the optimisation can be achieved simply by a grid search. Thus, if the true genetic model is known, the optimum selection strategy is to select individuals for the upper and lower pools using the thresholds calculated under the asymmetric pooling scheme for the particular model. In addition, we maximised this NCP subject to the constraint that nL=nU. These symmetric designs are particularly relevant when there is no knowledge regarding allele frequency or dominance, so that there is no reason to treat the two tails differently.
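The grid search described above might be sketched as follows. This is an illustration under stated assumptions rather than the code used for the published results: pool sizes and pool allele frequencies are obtained from the three-component normal mixture, the sampling variance is assumed to be binomial in the alleles, and the variance due to unequal DNA contributions is assumed to be $\tau^2$ times the sampling variance. All function names and the grid resolution are hypothetical choices.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def components(p, a, d):
    """(genotype frequency, A1 allele fraction, centred mean) per genotype, and sigma_R."""
    q = 1.0 - p
    m = a * (p - q) + 2.0 * d * p * q
    var_q = 2.0 * p * q * (a - d * (p - q)) ** 2 + (2.0 * p * q * d) ** 2
    comps = [(p * p, 1.0, a - m), (2.0 * p * q, 0.5, d - m), (q * q, 0.0, -a - m)]
    return comps, sqrt(1.0 - var_q)


def mixture_quantile(prob, comps, sd_r):
    """Trait threshold with mixture probability `prob` below it (bisection)."""
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if sum(f * STD.cdf((mid - mu) / sd_r) for f, _, mu in comps) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ncp(t_lower, t_upper, p, a, d, n_total, tau=0.0, eps=0.0):
    """Expected NCP of the two-pool test for thresholds t_lower <= t_upper."""
    comps, sd_r = components(p, a, d)
    low = [f * STD.cdf((t_lower - mu) / sd_r) for f, _, mu in comps]
    up = [f * (1.0 - STD.cdf((t_upper - mu) / sd_r)) for f, _, mu in comps]
    frac_l, frac_u = sum(low), sum(up)
    if min(frac_l, frac_u) * n_total < 1.0:   # skip pools expected to be empty
        return 0.0
    p_l = sum(w * x for w, (_, x, _) in zip(low, comps)) / frac_l
    p_u = sum(w * x for w, (_, x, _) in zip(up, comps)) / frac_u
    # ASSUMED variance forms: binomial V_S, V_U = tau^2 * V_S, V_M = 2 eps^2.
    v_s = p * (1.0 - p) * (1.0 / (2.0 * n_total * frac_l) + 1.0 / (2.0 * n_total * frac_u))
    return (p_l - p_u) ** 2 / ((1.0 + tau ** 2) * v_s + 2.0 * eps ** 2)


def best_symmetric(p, a, d, n_total, **noise):
    """Search pooling fractions 1%..49% with equal expected pool sizes."""
    comps, sd_r = components(p, a, d)
    best = (0.0, None)
    for k in range(1, 50):
        f = k / 100.0
        t_l = mixture_quantile(f, comps, sd_r)
        t_u = mixture_quantile(1.0 - f, comps, sd_r)
        value = ncp(t_l, t_u, p, a, d, n_total, **noise)
        if value > best[0]:
            best = (value, f)
    return best


def best_asymmetric(p, a, d, n_total, steps=120, **noise):
    """Search all threshold pairs on a grid from -3 to 3 trait standard deviations."""
    grid = [-3.0 + 6.0 * i / steps for i in range(steps + 1)]
    best = (0.0, None, None)
    for t_l in grid:
        for t_u in grid:
            if t_u < t_l:
                continue
            value = ncp(t_l, t_u, p, a, d, n_total, **noise)
            if value > best[0]:
                best = (value, t_l, t_u)
    return best


# Example: common additive QTL with 1% heritability, N = 1000.
a_additive = sqrt(0.01 / 0.5)
ncp_sym, fraction = best_symmetric(0.5, a_additive, 0.0, 1000)
ncp_asym, t_low, t_high = best_asymmetric(0.5, a_additive, 0.0, 1000)
```

Dividing the returned NCP by $N\sigma_Q^2$ gives the proportion of information retained relative to individual genotyping, which allows a direct comparison between designs and between noise levels.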

We calculated the NCP for the optimal asymmetric and symmetric designs for a test of $\sigma_A>0$ under 8 different sets of model parameters, encompassing different levels of QTL heritability, allele frequency and dominance, assuming the absence of experimental errors (Table 1). As expected, with equal allele frequency ($p=0.5$) and no dominance ($c=0$), the optimal design is symmetric.12,15 The optimal design is asymmetric when allele frequencies are not equal or when there is dominance, and the degree of asymmetry and the ratio of the NCP for asymmetric and symmetric designs both depend strongly on the magnitude of $a/\sigma_R$ ($R^2=0.98$ for regression of the natural log of the NCP ratio on $a/\sigma_R$, P-value $=4\times10^{-19}$). When $a/\sigma_R<0.5$, the symmetric and asymmetric designs provide equal information. As $a/\sigma_R$ increases, the QTL has a major gene effect and a multimodal phenotypic distribution arises. The asymmetric design essentially selects the individuals who become separated from the main phenotypic distribution.

Table 1 Optimum thresholds, pool sizes, and frequency of allele A1 in pools, for individual models over a range of heritabilities, under symmetric and asymmetric pooling schemes, sample size N=1000.

Although asymmetric pooling is potentially more informative than symmetric pooling, our usual lack of knowledge on allele frequency and dominance means that we would normally adopt a symmetric design. A symmetric design would also be appropriate for a more general test of $\sigma_Q^2=0$.

Figure 1 shows the expected percentage of total information (obtained by individual genotyping) retained by symmetric designs with different pooling fractions, for the eight models. Here, the NCP for individual genotyping is simply the QTL heritability, $\sigma_Q^2$.2,12,16 The optimal pooling fraction is about 27% for all models with the exception of rare or recessive alleles; the information retained approaches 80% for common additive alleles but is particularly poor for rare recessive alleles.

Figure 1

Symmetric pooling scheme compared with individual genotyping for $h^2=0.01$. Proportion of information is the ratio of the test statistic under the symmetric pooling scheme to the test statistic under individual genotyping.

Experimental error

Next, we investigated the impact of experimental error, focusing only on symmetric pooling designs and employing an additive model with equal allele frequencies and 1% QTL heritability. Increasing the level of measurement error reduces the information retained and decreases the optimal pooling fraction (Figure 2). Random variation in the amount of DNA contributed by individual subjects reduces the information retained, but does not affect the optimal pooling fraction (Figure 3). In absolute terms, even relatively small values of ε (>0.01) or τ (>0.2) can have a large impact on the information retained by pooling.

Figure 2

Effect of measurement error on the symmetric pooling scheme, assuming various values for the standard error of allele frequency measurement (ε), $p=0.5$, $h^2=0.01$, $c=0$, $N=1000$.

Figure 3

Effect of error due to unequal DNA contribution from individuals on the symmetric pooling scheme, assuming various values for the coefficient of variation (τ) of the number of DNA molecules of locus A, $p=0.5$, $h^2=0.01$, $c=0$.

For small and additive QTL effects, the NCP in the presence of experimental noise is shown in Appendix B to be

where the term $\kappa^2$ represents the ratio of the measurement error to other sources of error,

In the absence of experimental noise this reduces to the formula derived by Bader et al.,12 which implies an optimal pooling fraction of 27%. It can be seen from these formulas that the optimal fraction remains at 27% regardless of τ when ε is zero, but having ε>0 reduces the optimal pooling fraction. The reason for this difference is that increasing the pool size does not reduce measurement error. Analytical estimates for the optimal pooling fraction, derived in Appendix C, are

These analytical estimates are shown in Figure 4 to be quite accurate when compared to the numerical results.
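The qualitative behaviour described here can be reproduced with a short numerical sketch. The closed form used below is not transcribed from Appendix B or C; it is a first-order approximation for small additive effects derived under two assumptions: the sampling variance is binomial in the alleles, and unequal DNA contributions inflate it by a factor $1+\tau^2$. Under those assumptions the information retained by a symmetric design with pooling fraction $f$ per tail, relative to individual genotyping, is approximately $2\phi(T)^2/[f(1+\tau^2)(1+\kappa^2)]$, where $T=\Phi^{-1}(1-f)$, $\phi$ is the standard normal density, and $\kappa^2 = 2\varepsilon^2 fN/[(1+\tau^2)pq]$ is the ratio of measurement variance to the other variance terms. The function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal: pdf() and inv_cdf()


def relative_information(f, p, n_total, tau=0.0, eps=0.0):
    """Approximate NCP of symmetric pooling relative to individual genotyping.

    ASSUMED small-effect, additive approximation (see lead-in); f is the
    fraction of the sample placed in each of the two tail pools.
    """
    t = STD.inv_cdf(1.0 - f)                       # upper-tail threshold
    phi = STD.pdf(t)
    other = (1.0 + tau ** 2) * p * (1.0 - p)       # scaled V_S + V_U
    kappa2 = 2.0 * eps ** 2 * f * n_total / other  # V_M / (V_S + V_U)
    return 2.0 * phi ** 2 / (f * (1.0 + tau ** 2) * (1.0 + kappa2))


def optimal_fraction(p, n_total, tau=0.0, eps=0.0):
    """Pooling fraction on a 0.1% grid (1% to 49.9%) that maximises the approximation."""
    fractions = [(10 + i) / 1000.0 for i in range(490)]
    return max(fractions, key=lambda f: relative_information(f, p, n_total, tau, eps))


# Example settings matching the figures: p = 0.5, N = 1000.
f_clean = optimal_fraction(0.5, 1000)              # no experimental error
f_tau = optimal_fraction(0.5, 1000, tau=0.5)       # unequal DNA contributions only
f_eps = optimal_fraction(0.5, 1000, eps=0.01)      # measurement error only
```

In this approximation τ enters only through a multiplicative constant when ε is zero, so it lowers the information retained without moving the optimum, whereas ε enters through a term that grows with the pooling fraction and therefore favours smaller pools.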

Figure 4

Optimal pooling fraction for a symmetric scheme assuming various values for the standard error of allele frequency measurement (ε), $p=0.5$, $h^2=0.01$, $c=0$, $N=1000$, from numerical calculations (solid line) and analytical approximations (dashed line).

Discussion

We have shown that, in the absence of experimental error, a symmetric pooling scheme in which the top and bottom 27% of the trait distribution are separately pooled and genotyped yields a pooled association study that is close to optimally powerful across a wide range of possible genetic models underlying the trait. The information retained relative to individual genotyping approaches 80% for common additive alleles but is particularly poor for rare recessive alleles.

Our results on the impact of experimental error emphasize the importance of accuracy in both the constitution of the pools and the measurement of allele frequencies. We have shown that random variation in the amount of DNA contributed by individuals to a pool reduces the efficiency of the pooling scheme and that allele frequency measurement errors reduce the optimal pooling fraction as well as the overall efficiency of the scheme. Thus, provided that errors in allele frequency estimation can be kept to a standard deviation of about 1%, we recommend symmetric pooling fractions of around 20%, as opposed to the 27% that would be optimal in the absence of experimental errors. It may be preferable to replicate the pooling to reduce the DNA concentration variance or, more importantly, to repeat the allele frequency measurement to reduce the effective experimental measurement error.

In our calculations we have assumed that the true values of ε and τ are known, so that they can be specified correctly in the calculation of the test statistic. Clearly, under-specification of ε and τ will lead to liberal P values, whereas over-specification of ε and τ will lead to a conservative test. In practice, values of ε and τ may be estimated from laboratory experiments prior to the actual association study, or inferred from the distribution of values of the test statistics, in a way similar to the use of genomic control for population stratification.17,18,19,20

While the results we have obtained suppose a single-locus test, biological systems may exhibit multi-locus epistatic effects or gene-environment interactions. The primary effect in the context of the single-locus pooled tests described here is to re-scale the additive variance. For example, suppose that a fraction s of the population is exposed to a sensitising factor that enhances the genotypic effects by a factor λ, from a, d, and −a to λa, λd, and −λa for the genotypes A1A1, A1A2, and A2A2 respectively, and additionally that the sensitising factor is correlated with allele A1 with correlation constant r. The re-scaled value of σA, the correlation between the fraction of allele A1 in a genotype and the phenotypic shift, is . This analysis supposes that the sensitising factor has no independent effect, which is appropriate if phenotypic variables have been conditioned on the applicable covariates. Although interaction terms do not interfere with single-locus pooled tests, estimating the size of the interaction terms would require individual genotyping.

Finally, family-based tests provide additional means to control for environmental effects, and the optimised tests for unrelated populations described here may be extended to family-based studies.