Abstract
The availability of large collections of single nucleotide polymorphisms (SNPs) and of efficient genotyping methods enables linkage and association studies of complex diseases to be extended from small genomic regions to the whole genome. Establishing genome-wide significance for linkage or association requires very small P-values. The original transmission/disequilibrium test (TDT) statistic compares linear functions of the numbers of transmitted and nontransmitted alleles or haplotypes. In this report, we introduce a novel TDT statistic that uses Shannon entropy as a nonlinear transformation of the frequencies of the transmitted and nontransmitted alleles (or haplotypes). The transformation amplifies the difference between the numbers of transmitted and nontransmitted alleles or haplotypes and thereby increases statistical power when a large number of marker loci is tested. The null distribution of the entropy-based TDT statistic and its type I error rates in both homogeneous and admixed populations are validated in a series of simulation studies. By analytical methods, we show that the power of the entropy-based TDT statistic is higher than that of the original TDT, and that this difference increases with the number of marker loci. Finally, the new entropy-based TDT statistic is applied to two real data sets to test the association of the RET gene with Hirschsprung disease and of the Fcγ receptor genes with systemic lupus erythematosus. The results show that the entropy-based TDT statistic can reach P-values small enough to establish genome-wide significance in linkage or association analyses.
References
Borrego S, Ruiz A, Saez ME, Gimm O, Gao X, Lopez-Alonso M, Hernandez A, Wright FA, Antinolo G, Eng C (2000) RET genotypes comprising specific haplotypes of polymorphic variants predispose to isolated Hirschsprung disease. J Med Genet 37:572–578
Bourgain C, Genin E, Margaritte-Jeannin P, Clerget-Darpoux F (2001) Maximum identity length contrast: a powerful method for susceptibility gene detection in isolated populations. Genet Epidemiol 21(Suppl 1):S560–S564
Clayton D, Jones H (1999) Transmission/disequilibrium tests for extended marker haplotypes. Am J Hum Genet 65:1161–1169
Edberg JC, Langefeld CD, Wu J, Moser KL, Kaufman KM, Kelly J, Bansal V, Brown WM, Salmon JE, Rich SS, Harley JB, Kimberly RP (2002) Genetic linkage and association of Fcgamma receptor IIIA (CD16A) on chromosome 1q23 with human systemic lupus erythematosus. Arthritis Rheum 46:2132–2140
Ewens WJ, Spielman RS (1995) The transmission/disequilibrium test: history, subdivision, and admixture. Am J Hum Genet 57:455–464
Freimer N, Sabatti C (2004) The use of pedigree, sib-pair and association studies of common diseases for genetic mapping and epidemiology. Nat Genet 36:1045–1051
Graybill FA (1976) Theory and application of the linear model. Duxbury Press, North Scituate
Hampe J, Schreiber S, Krawczak M (2003) Entropy-based SNP selection for genetic association studies. Hum Genet 114:36–43
Lehmann EL (1983) Theory of point estimation. Wiley, New York
Nothnagel M (2002) Simulation of LD block-structured SNP haplotype data and its use for the analysis of case-control data by supervised learning methods. Am J Hum Genet 71(Suppl 4):A2363
Rabinowitz D, Laird N (2000) A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered 50:211–223
Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516–1517
Schaid DJ (1996) General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol 13:423–449
Sham PC (1997) Transmission/disequilibrium tests for multiallelic loci. Am J Hum Genet 61:774–778
Sham PC, Curtis D (1995a) An extended transmission/disequilibrium test (TDT) for multi-allele marker loci. Ann Hum Genet 59:323–336
Sham PC, Curtis D (1995b) An extended transmission/disequilibrium test (TDT) for multi-allele marker loci. Ann Hum Genet 59(Pt 3):323–336
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423
Spielman RS, Ewens WJ (1996) The TDT and other family-based tests for linkage disequilibrium and association. Am J Hum Genet 59:983–989
Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52:506–516
Wilson SR (1997) On extending the transmission/disequilibrium test (TDT). Ann Hum Genet 61(Pt 2):151–161
Zhang S, Sha Q, Chen HS, Dong J, Jiang R (2003) Transmission/disequilibrium test based on haplotype sharing for tightly linked markers. Am J Hum Genet 73:566–579
Zhao H, Zhang S, Merikangas KR, Trixler M, Wildenauer DB, Sun F, Kidd KK (2000) Transmission/disequilibrium tests using multiple tightly linked markers. Am J Hum Genet 67:936–946
Zhao J, Boerwinkle E, Xiong M (2005) An entropy-based statistic for genomewide association studies. Am J Hum Genet 77:27–40
Acknowledgments
M. M. Xiong is supported by NIH-NIAMS grants IP50AR44888 and HL74735, and by NIH grant ES09912. J. Y. Zhao is supported by NIH grant ES09912. E. Boerwinkle is supported by grant funding from the National Heart, Lung, and Blood Institute and the National Institute of General Medical Sciences.
Appendices
Appendix 1
Let \({X = - \hat{p}\log \hat{p}}\) and \({Y = - \hat{q}\log \hat{q}}.\) Both X and Y are nonlinear functions of the allele frequencies. By the delta method, the difference (X − Y), centered at X(p) − Y(q), is asymptotically normal with mean zero and variance Var(X − Y), where
\({{\rm Var}(X - Y) = \frac{{pq}}{n}(2 + \log p + \log q)^{2}}.\)
Under the null hypothesis of no linkage or no association, the frequency of the transmitted allele \(M_{1}\) is equal to the frequency of the transmitted allele \(M_{2}\); thus, we have X(p) = Y(q). Therefore, (X − Y) is asymptotically normally distributed with mean zero and variance \({\frac{{pq}}{n}(2 + \log p + \log q)^{2}}.\) Under the null hypothesis, \({\hbox{TDT}_{e} = \frac{{(X - Y)^{2}}}{{{\rm Var}(X - Y)}} = \frac{{n(\hat{p}\log \hat{p} - \hat{q}\log \hat{q})^{2}}}{{\hat{p}\hat{q}(2 + \log \hat{p} + \log \hat{q})^{2}}}}\) is asymptotically distributed as a central \(\chi^{2}_{(1)}\) distribution.
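The closed form above is straightforward to evaluate numerically. The following is a minimal sketch rather than the authors' implementation. It assumes that `b` and `c` (hypothetical names) are the numbers of heterozygous parents transmitting \(M_{1}\) and \(M_{2}\), that n = b + c, and that \(\hat{p} = b/n\) and \(\hat{q} = c/n\). The classical statistic (b − c)²/(b + c) of Spielman et al. (1993) is computed alongside for comparison.

```python
import math


def tdt_classical(b: int, c: int) -> float:
    """Classical TDT statistic (b - c)^2 / (b + c)."""
    return (b - c) ** 2 / (b + c)


def tdt_entropy(b: int, c: int) -> float:
    """Entropy-based statistic TDT_e from Appendix 1.

    Assumes p_hat = b / n and q_hat = c / n with n = b + c (an assumption
    of this sketch), and requires b > 0 and c > 0 so the logs are finite.
    """
    if b <= 0 or c <= 0:
        raise ValueError("both transmission counts must be positive")
    n = b + c
    p, q = b / n, c / n
    numerator = n * (p * math.log(p) - q * math.log(q)) ** 2
    slope = 2.0 + math.log(p) + math.log(q)
    denominator = p * q * slope ** 2
    if denominator == 0.0:
        # The delta-method variance vanishes when p_hat * q_hat = exp(-2).
        raise ZeroDivisionError("estimated variance is zero")
    return numerator / denominator


def chi2_1df_pvalue(x: float) -> float:
    """Upper-tail probability of a central chi-square with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


if __name__ == "__main__":
    b, c = 60, 40  # illustrative counts, not data from the paper
    for name, stat in (("TDT", tdt_classical(b, c)),
                       ("TDT_e", tdt_entropy(b, c))):
        print(f"{name}: statistic={stat:.4f}, p-value={chi2_1df_pvalue(stat):.4g}")
```

With these illustrative counts the entropy-based statistic is larger than the classical one (about 4.57 versus 4.0). The estimated variance vanishes when \(\hat{p}\hat{q} = e^{-2}\), that is, near \(\hat{p} \approx 0.16\) or 0.84, so the statistic is numerically unstable in that neighbourhood; the sketch only guards against an exact zero.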
Appendix 2
First, we calculate \({\hbox{var} (\hat{p}_{{i \cdot}})}.\) By definition, we have
where \(p_{ii} = 0\).
Similarly, we have
Next we calculate covariance between \({\hat{p}_{{i \cdot}}}\) and \({\hat{p}_{{j \cdot}}(i \ne j}).\) Again, by definition, we obtain
Similarly, we have
Now we calculate \({\hbox{cov} (\hat{p}_{{i \cdot}}, \hat{q}_{{j \cdot}}).}\) First, we consider i ≠ j. In this case, we have
Similarly, we have \({\hbox{cov} (\hat{q}_{{i \cdot}}, \hat{p}_{{j \cdot}}) = \frac{1}{n}(p_{{ji}} - q_{{i \cdot}} p_{{j \cdot}})}\) when i ≠ j.
Next, we consider i = j, for which we obtain
Thus, we have proven Eq. 6.
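The covariance formulas in this appendix follow from multinomial sampling and can be checked numerically. The sketch below rests on assumptions of the example, not on statements taken from the derivation: the n transmission counts are multinomial with cell probabilities \(p_{ij}\), \(p_{i \cdot} = \sum_j p_{ij}\), and \(q_{i \cdot} = p_{\cdot i} = \sum_j p_{ji}\). It builds the exact covariance of the cell proportions and compares the implied \(\hbox{cov}(\hat{p}_{i \cdot}, \hat{q}_{j \cdot})\) with the closed form \((p_{ij} - p_{i \cdot} q_{j \cdot})/n\). The matrix `P` and the function name are made up for illustration.

```python
from itertools import product


def marginal_cov_check(P, n):
    """Return the largest gap between the exact and closed-form covariances.

    P is a k x k list of cell probabilities p_ij with zero diagonal,
    n is the (hypothetical) number of informative transmissions.
    """
    k = len(P)
    cells = list(product(range(k), repeat=2))
    row = [sum(P[i]) for i in range(k)]                       # p_i. (transmitted)
    col = [sum(P[i][j] for i in range(k)) for j in range(k)]  # q_j. = p_.j

    def cell_cov(a, b):
        # Multinomial covariance of two cell proportions.
        pa, pb = P[a[0]][a[1]], P[b[0]][b[1]]
        return (pa * (a == b) - pa * pb) / n

    worst = 0.0
    for i, j in product(range(k), repeat=2):
        # cov(p_hat_i., q_hat_j.): sum cell covariances over row i and column j.
        exact = sum(cell_cov(a, b) for a in cells for b in cells
                    if a[0] == i and b[1] == j)
        closed = (P[i][j] - row[i] * col[j]) / n
        worst = max(worst, abs(exact - closed))
    return worst


if __name__ == "__main__":
    P = [[0.00, 0.20, 0.15],
         [0.10, 0.00, 0.25],
         [0.05, 0.25, 0.00]]  # illustrative probabilities summing to 1
    print(marginal_cov_check(P, n=200))  # gap at floating-point rounding level
```

Because \(p_{ii} = 0\), the case i = j reduces in this check to \(-p_{i \cdot} q_{i \cdot}/n\).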
Let \({h(p) = [h(p_{{1 \cdot}}), \ldots, h(p_{{k \cdot}})]^{\rm T}}\) and \({h(q) = [h(q_{{1 \cdot}}), \ldots, h(q_{{k \cdot}})]^{\rm T}},\) where \({h(p_{{i \cdot}}) = - p_{{i \cdot}} \log p_{{i \cdot}}}\) and \({h(q_{{i \cdot}}) = - q_{{i \cdot}} \log q_{{i \cdot}}}.\) Then \({h(\hat{p}) - h(p)}\) and \({h(\hat{q}) - h(q)}\) are asymptotically normally distributed with mean zero and covariance matrices \({\frac{1}{n}B\Sigma_{p} B^{\rm T}}\) and \({\frac{1}{n}C\Sigma_{q} C^{\rm T}},\) respectively, where \({B = (b_{{ij}})_{{k \times k}}}\) and \({C = (c_{{ij}})_{{k \times k}}}\) have entries \({b_{{ii}} = \frac{{\partial h(p_{{i \cdot}})}}{{\partial p_{{i \cdot}}}} = - 1 - \log p_{{i \cdot}}},\) \({b_{{ij}} = \frac{{\partial h(p_{{i \cdot}})}}{{\partial p_{{j \cdot}}}} = 0 \; (j \ne i)},\) \({c_{{ii}} = \frac{{\partial h(q_{{i \cdot}})}}{{\partial q_{{i \cdot}}}} = - 1 - \log q_{{i \cdot}}}\) and \({c_{{ij}} = \frac{{\partial h(q_{{i \cdot}})}}{{\partial q_{{j \cdot}}}} = 0 \; (j \ne i)}.\) Under the null hypothesis of no linkage or no association, we have h(p) = h(q); thus \({h(\hat{p}) - h(\hat{q}) = [h(\hat{p}) - h(p)] - [h(\hat{q}) - h(q)]}\) is asymptotically normally distributed:
where
Applying Theorem 4.4.3 (Graybill 1976), we obtain that \({n[h(\hat{p}) - h(\hat{q})]^{\rm T} \Lambda^{-}_{e} [h(\hat{p}) - h(\hat{q})]}\) is asymptotically distributed as a central \(\chi^{2}_{(r)}\) distribution under the null hypothesis of no linkage or no association, where \(r = {\rm rank}(\Lambda_{e})\), \(\Lambda^{-}_{e}\) is a generalized inverse of \(\Lambda_{e}\), and \(\Lambda_{e}\) is the estimator of the matrix Λ obtained by substituting the estimators of \(p_{i \cdot}\) and \(p_{\cdot i}\) into Eq. 11.
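The first-order (delta-method) reasoning behind this step can be sketched as follows. This is an illustration under stated assumptions and not necessarily the exact form of Eq. 11: \(\Sigma_{pq}\) is notation introduced here for \(n\,\hbox{cov}(\hat{p}, \hat{q})\), and B and C are the diagonal derivative matrices defined above.

```latex
% Sketch under stated assumptions; not necessarily the form of Eq. 11.
% \Sigma_{pq} denotes n\,\mathrm{cov}(\hat{p}, \hat{q}) (notation assumed here).
% Under the null hypothesis h(p) = h(q), so the centering terms cancel.
\begin{align*}
\sqrt{n}\,\bigl[h(\hat{p}) - h(\hat{q})\bigr]
  &\approx \sqrt{n}\,\bigl[B(\hat{p} - p) - C(\hat{q} - q)\bigr], \\
n\,\mathrm{Cov}\bigl[h(\hat{p}) - h(\hat{q})\bigr]
  &\approx B\Sigma_{p}B^{\rm T} + C\Sigma_{q}C^{\rm T}
     - B\Sigma_{pq}C^{\rm T} - C\Sigma_{pq}^{\rm T}B^{\rm T}.
\end{align*}
```

Because the transmitted and nontransmitted frequencies each sum to a constant, a covariance matrix of this kind is singular, which is why a generalized inverse and the rank r appear in the quadratic form.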
Appendix 3
Following the approach of Sham and Curtis, we can obtain the joint probability that a heterozygous parent with genotype \(M_{i}M_{j}\) transmits the allele \(M_{i}\) to an affected child. Let TM denote the transmitted marker allele, NM the nontransmitted marker allele, TD the transmitted disease allele, OTD the disease allele transmitted by the other parent, and TH the transmitted haplotype. Let \(P_{i}\) be the frequency of the allele \(M_{i}\) at the marker locus, \(P_{D_{k}}\) be the frequency of the disease allele \(D_{k}\), \(P_{ki}\) be the frequency of the haplotype \(H_{D_{k}M_{i}}\), and θ be the recombination fraction between the marker and disease loci. Define the measure of LD between the marker and disease loci as
Then, the probability that the haplotype \({H_{{{\rm D}_{k} M_{i}}}}\) is transmitted and the marker allele \(M_{j}\) is not transmitted is given by
The joint probability that a heterozygous parent with genotype \(M_{i}M_{j}\) transmits the allele \(M_{i}\) to an affected child is given by
Using Eq. 12, we have
where notations are given in the text.
Summing over all j in the above equation, we obtain the probability of transmitting the marker allele \(M_{i}\) to an affected child:
Similarly, we have
Note that
Thus, the measure \(\delta_{1j}\) of the LD between the disease allele D and the marker allele \(M_{j}\) should satisfy the following constraints:
The measure \(\delta_{1j}\) of the LD between the disease allele and the marker allele \(M_{j}\) can be calculated by
where t is the number of generations since the LD between the marker and disease loci was created, and \(\delta_{1j}(0)\) is the measure of the initial LD at the time it was created. The initial measure \(\delta_{1j}(0)\) of the LD should satisfy the constraints in Eq. 13.
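The decay of LD over generations can be illustrated with the standard geometric model, in which LD shrinks by a factor (1 − θ) per generation. This form is an assumption of the sketch, chosen because it matches the factors \((1 - \theta_{1})^{t}(1 - \theta_{2})^{t}\) that appear later in this appendix; the function and argument names are hypothetical.

```python
def ld_after_generations(delta0: float, theta: float, t: int) -> float:
    """LD after t generations, assuming delta(t) = (1 - theta)**t * delta(0)."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if t < 0:
        raise ValueError("number of generations must be non-negative")
    return delta0 * (1.0 - theta) ** t


if __name__ == "__main__":
    # Illustrative values only: initial LD 0.05, theta = 0.001.
    for t in (0, 100, 1000):
        print(t, ld_after_generations(0.05, 0.001, t))
```

Under this model, tightly linked markers (small θ) retain most of their initial LD for many generations, whereas loosely linked markers lose it quickly.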
Now we study how to calculate the frequencies of the transmitted and nontransmitted haplotypes. Recall that TH and NH denote the transmitted and nontransmitted haplotypes, respectively. The transmitted three-locus haplotype may result from no recombination, a single recombination, or a double recombination event (Wilson 1997). Thus, we have
Let \({b = \frac{{(f_{{11}} - f_{{12}})P_{{\rm D}} + (f_{{12}} - f_{{22}})P_{{\rm d}}}}{{P(A)}}}\) and \({a = \frac{{f_{{12}} P_{{\rm D}} + f_{{22}} P_{{\rm d}}}}{{P(A)}}}.\) Then, after some algebra on Eq. 15, we can obtain
Thus, the probability that the haplotype \(H_{i_{1}i_{2}}\) is transmitted to an affected child is given by
(Note that \(bP_{{\rm D}} + a = 1\).)
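The identity relating a, b and the disease allele frequency can be checked numerically. Since the notation is defined in the main text rather than here, the sketch below assumes that \(f_{11}\), \(f_{12}\) and \(f_{22}\) are the penetrances of the genotypes DD, Dd and dd, that \(P_{\rm d} = 1 - P_{\rm D}\), and that P(A) is the disease prevalence under Hardy-Weinberg proportions. The numerical values and the function name are made up.

```python
def a_and_b(f11: float, f12: float, f22: float, p_D: float):
    """Return (a, b) as defined in Appendix 3 for a two-allele disease locus."""
    p_d = 1.0 - p_D
    # Assumed prevalence under Hardy-Weinberg proportions.
    prevalence = f11 * p_D ** 2 + 2.0 * f12 * p_D * p_d + f22 * p_d ** 2
    b = ((f11 - f12) * p_D + (f12 - f22) * p_d) / prevalence
    a = (f12 * p_D + f22 * p_d) / prevalence
    return a, b


if __name__ == "__main__":
    p_D = 0.1
    a, b = a_and_b(f11=0.8, f12=0.4, f22=0.05, p_D=p_D)  # illustrative penetrances
    print(b * p_D + a)  # equals 1 up to floating-point rounding
```

Algebraically, the numerator of \(bP_{\rm D} + a\) simplifies to \(f_{11}P_{\rm D}^{2} + 2f_{12}P_{\rm D}P_{\rm d} + f_{22}P_{\rm d}^{2}\), which equals P(A) under this assumption.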
Similarly, we have
Using the following relationship between the haplotype and the measure of LD:
We obtain
Measures of LD are random variables. The expectation of \(\delta_{i_{1}si_{2}}\) is equal to
But, it was shown that \({E[P_{{i_{1} si_{2}}}] = \delta_{{i_{1} si_{2}}} (0)(1 - \theta_{1})^{t} (1 - \theta_{2})^{t} + P_{{i_{1}}} E[\delta_{{si_{2}}}] + P_{{i_{2}}} E[\delta_{{i_{1} s}}] + P_{{i_{1}}} P_{{{\rm D}_{s}}} P_{{i_{2}}}},\) where \(\delta_{i_{1}si_{2}} (0)\) is the measure of the initial LD at three loci \(M_{i_{1}}D_{s}M_{i_{2}}.\) Thus, we have
Substituting \({E\left[\delta_{i_{1}s}\right]},\;{E\left[\delta_{si_{2}}\right]}\) in Eq. 14 and \({E\left[\delta_{i_{1}si_{2}}\right]}\) in Eq. 20 into Eq. 19, we obtain
Letting \({P_{{i_{1} i_{2} \cdot}} = P(\hbox{TH} = H_{{i_{1} i_{2}}} |\hbox{Affected})}\) and \({P_{{\cdot i_{1} i_{2}}} = P(\hbox{NH} = H_{{i_{1} i_{2}}} |\hbox{Affected}),}\) we obtain the equations:
Based on the above equations, we can calculate
The matrices \(\Sigma_{p}\), \(\Sigma_{q}\), B, C, and Λ can be similarly defined. Then, for haplotypes produced by two SNP marker loci flanking a disease locus, substituting \(\mu_{T}\), \(\mu_{NT}\) and the other parameters \(\Sigma_{p}\), \(\Sigma_{q}\), B, C, and Λ into Eq. 9, we obtain the noncentrality parameter \(\lambda_{HE}\). Using these analytic formulas for the noncentrality parameters of the distributions of the test statistics, we can calculate the power of the test statistics under a specified alternative hypothesis.
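Given a noncentrality parameter, power is the upper-tail probability of the noncentral χ² distribution beyond the critical value. A minimal sketch for the one-degree-of-freedom case is given below, using only the Python standard library. It relies on the fact that a noncentral \(\chi^{2}_{(1)}\) variable with noncentrality λ is distributed as \((Z + \sqrt{\lambda})^{2}\) for a standard normal Z. The haplotype statistic generally has more than one degree of freedom, for which a noncentral χ² routine from a statistics library would be needed. The function name and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_chi2_1df(noncentrality: float, alpha: float) -> float:
    """Power of a 1-df chi-square test with the given noncentrality parameter."""
    if noncentrality < 0.0:
        raise ValueError("noncentrality parameter must be non-negative")
    std_normal = NormalDist()
    # Square root of the chi-square critical value at level alpha.
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(noncentrality)
    # P((Z + shift)^2 > z_crit^2) = P(Z > z_crit - shift) + P(Z < -z_crit - shift)
    return (1.0 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Illustrative noncentrality values at a stringent significance level.
    for lam in (0.0, 10.0, 30.0, 50.0):
        print(lam, power_chi2_1df(lam, alpha=1e-6))
```

At λ = 0 the function returns the significance level itself, which is a convenient sanity check; power then increases monotonically with λ.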
Cite this article
Zhao, J., Boerwinkle, E. & Xiong, M. An entropy-based genome-wide transmission/disequilibrium test. Hum Genet 121, 357–367 (2007). https://doi.org/10.1007/s00439-007-0322-6
DOI: https://doi.org/10.1007/s00439-007-0322-6