Methods for Handling Multiple Testing
Introduction
The problem of multiple testing in genetics studies arises when many different loci are tested for association with a particular trait or disease, as is common in genome‐wide linkage and whole‐genome association (WGA) scans as well as in gene expression microarray studies. The most common method for dealing with multiple testing problems involves adjusting the significance level (α) of each test to accommodate the total number of tests performed, although the need for this type of adjustment has been questioned (e.g., Rothman, 1990). Historically, the Bonferroni correction has been used to adjust significance levels (Bonferroni, 1935, 1936). The Bonferroni correction provides a simple formula for computing the required pointwise α‐levels (for each test made) from a global or experiment‐wise error rate or α‐level (say 0.05). This method works well when only a few independent tests are performed; in an experiment involving 10–20 independent tests, for example, one would expect no more than one result to be significant at the p < 0.05 level due to chance alone. In fact, the Bonferroni correction formed the basis for the often‐cited Lander and Kruglyak (1995) lod score thresholds for mapping complex trait loci in linkage mapping contexts of 1.9 (α = 1.7 × 10⁻³) for suggestive linkage and 3.3 (α = 4.9 × 10⁻⁵) for significant linkage. These Bonferroni‐based thresholds for declaring significance assume a dense map containing an infinite number of markers with infinitely small intermarker distances. For more realistic genome‐wide linkage scans involving about 300 markers, one would expect only about 15 false positives at a pointwise α‐level of 0.05, and a Bonferroni‐corrected α‐level of 1.7 × 10⁻⁴, corresponding to a lod score of about 2.75.
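The arithmetic behind these numbers is simple enough to sketch directly; the following minimal example uses the 300‐marker scan from the text (function names are illustrative, not from the chapter):

```python
# A minimal sketch of the Bonferroni arithmetic described above.
def bonferroni_alpha(global_alpha, m):
    """Pointwise alpha that holds the experiment-wise error rate at global_alpha."""
    return global_alpha / m

def expected_false_positives(pointwise_alpha, m):
    """Expected number of false positives when all m null hypotheses are true."""
    return m * pointwise_alpha

m = 300  # markers in a realistic genome-wide linkage scan
print(round(expected_false_positives(0.05, m), 2))   # 15.0 false positives at p < 0.05
print(round(bonferroni_alpha(0.05, m), 6))           # 0.000167, i.e., ~1.7e-4
```

The same two lines, with m set to half a million, reproduce the WGA‐scale figures discussed below.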
Even with the use of these more realistic α‐levels in linkage mapping studies, significant results involving complex, multifactorial traits have been few and far between (Altmuller et al., 2001). The problems inherent to multiple testing have been compounded by the explosion of available DNA markers and high‐throughput genotyping technologies, as researchers undertake WGA studies that involve hundreds of thousands of tests, for example in prostate cancer (Gudmundsson et al., 2007) and type 2 diabetes (Sladek et al., 2007). Multiple testing problems this pronounced have been referred to in the data mining community as the “short‐fat data” or “large p, small n” problem, and currently there is no agreed upon resolution. For example, in a single nucleotide polymorphism (SNP)‐based WGA scan involving half a million tests, one would expect to see a whopping 25,000 false positives if a simple p < 0.05 criterion for declaring significance were applied to each SNP; alternatively, to preserve an experiment‐wise error rate of 5%, one would have to adopt a pointwise α‐level as small as 1.0 × 10⁻⁷ if a simple Bonferroni correction were applied to each SNP. Because it is unlikely that any single SNP would have an effect large enough to produce a p value that small in studies with realistic sample sizes for a complex trait, investigators are either doomed to failure at the outset or must be willing to balance the risks associated with including some false positives (i.e., type I errors) in a set that might be worth pursuing in further studies against the chance of missing important loci as false negatives (i.e., type II errors).
One very important consideration in genetic mapping studies is that the tests are not likely to be independent, which runs counter to the assumptions underlying the Bonferroni method [see, e.g., Efron (2007) for discussion]. Mechanisms such as linkage disequilibrium (LD) and the occurrence of multiple neighboring genes implicated in a given metabolic pathway give rise to correlations among subsets of SNPs. Consequently, concurrent with the revolution in molecular genetic technologies, there has been a rekindling of active study into methods of evaluating statistical significance that not only address the massive number of multiple tests, but also accommodate features unique to genetic analysis settings, such as correlation among tests at neighboring loci due to LD and prior information about the phenotypic influence of those loci obtained from other sources, such as linkage evidence and allele frequencies. In general, newer methods consider these complicating factors while attempting to strike a balance between type I and type II errors. In later sections of this chapter, we will review aspects of the various practices that have been described for handling problems arising from multiple testing in genomics studies.
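One general way to respect such correlation structure is a permutation‐based maximum‐statistic adjustment in the spirit of Westfall–Young: permuting phenotype labels preserves the LD‐induced correlation among marker tests while breaking any genotype–phenotype association. The sketch below is illustrative only (not the chapter's own method); a simple Pearson correlation statistic stands in for a full association test, and all names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_t_adjusted_pvalues(genotypes, phenotype, n_perm=500):
    """Permutation-based (max-T) adjusted p-values for correlated marker tests.

    genotypes: (n_subjects, n_markers) array; phenotype: (n_subjects,) array.
    """
    def abs_corr(y):
        # |Pearson correlation| between each marker column and the phenotype
        g = (genotypes - genotypes.mean(0)) / genotypes.std(0)
        z = (y - y.mean()) / y.std()
        return np.abs(g.T @ z) / len(y)

    observed = abs_corr(phenotype)
    # Null distribution of the *maximum* statistic across all markers;
    # using the maximum controls the experiment-wise error rate while
    # the permutations preserve the correlation among markers.
    max_null = np.array([abs_corr(rng.permutation(phenotype)).max()
                         for _ in range(n_perm)])
    return (1 + (max_null[:, None] >= observed[None, :]).sum(0)) / (1 + n_perm)
```

Because the permutations keep the marker columns intact, tightly linked markers are handled jointly rather than penalized as if they were independent tests.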
In summary, the problem inherent to many large‐scale genetic linkage and association studies is that by performing multiple tests researchers increase the likelihood of obtaining false positive results (Storey and Tibshirani, 2003). Classical Bonferroni adjustments were developed to deal with this problem by requiring the signal to be significant at a global level across all tests, but this global α‐level becomes increasingly conservative as more tests are performed because it assumes, essentially, that the tests are independent. Practically speaking, the effects of inherited variations for many complex traits are unlikely to be large, as multiple genes and gene interactions typically influence such traits. Consequently, most researchers are willing to accept higher false positive rates in their studies rather than risk finding no associations or linkages at all. Ultimately, then, the problem of multiple testing in gene‐finding experiments is to choose appropriate methods that will simultaneously control for both false positives and false negatives.
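One widely adopted compromise between the two error types is to control the false discovery rate rather than the experiment‐wise error rate (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003). A minimal sketch of the Benjamini–Hochberg step‐up procedure, with illustrative p‐values only:

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of tests rejected at FDR level q (BH step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k/m) * q, then reject the
    # k smallest p-values (even if some intermediate ones miss their cutoff).
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# e.g., five tests: the two clearly small p-values survive
print(benjamini_hochberg([0.001, 0.009, 0.04, 0.2, 0.7], q=0.05))  # [0, 1]
```

Note that a plain Bonferroni cutoff of 0.05/5 = 0.01 would also admit both of these, but as the number of tests grows the BH threshold stays adaptive to the observed p‐value distribution while Bonferroni shrinks uniformly.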
Section snippets
Types of Errors in Hypothesis Testing (Type I and Type II)
In statistics, hypothesis testing typically involves testing a null hypothesis against an alternative hypothesis. Here we will assume that the null hypothesis is that a given gene or genetic variation has no effect on a trait of interest (i.e., it is not linked or associated with the trait), and the alternative is that there is a linkage or association. Because most contemporary genetic epidemiologic research focuses on association studies rather than linkage studies, we will consider
Striking a Balance Between False Positives and False Negatives
The use of very conservative or stringent significance levels (α) to test hypotheses leads to a loss of power and an increase in the rate of false negatives, whereas the use of significance levels that are too liberal leads to unacceptably high rates of false positives. Todorov and Rao (1997) demonstrated the relationship between these two errors in one linkage analysis scenario by plotting both false positives (F+) and false negatives (F−) against the pointwise significance level. The example,
Alternative Adjustment Methods
Various methods attempt to control for these multiple testing issues and have been discussed in several recent articles (e.g., Balding 2006, Carlson 2004, Cheverud 2001, Elston 2006, Guo 2001, Lander 1995, Morton 1998, Pounds 2006, Province 2001, Rao 1998, Rao 2001, Strug 2006, Thomson 2001). At least two general types of multiple comparison procedures are used: one controlling family‐wise error rates (FWERs) and the other controlling false discovery rates (FDRs, Benjamini 1995, Benjamini
Conclusion
In summary, molecular genetic technologies have advanced to such an extent that the sheer volume of data produced by them often overwhelms researchers' abilities to make valid inferences from those data. Therefore, what are needed are novel statistical methods and insights in order to make expedient, informed, and unbiased use of modern high‐throughput genomic data. Multiple testing issues, which on the surface may seem like a simple problem, can be quite complex for genomic data for a number
Acknowledgments
T. K. R. and D.C.R. are supported in part by grants from the National Institute of General Medical Sciences (GM 28719) and from the National Heart, Lung, and Blood Institute (HL 54473), National Institutes of Health. N.J.S. is supported in part by grants from the National Heart Lung and Blood Institute Family Blood Pressure Program (HL064777), the National Institute on Aging Longevity Consortium (AG023122), the National Institute of Mental Health Consortium on the Genetics of Schizophrenia
References (63)
- et al. A general test of association for quantitative traits in nuclear families. Am. J. Hum. Genet. (2000)
- et al. Multipoint quantitative‐trait linkage analysis in general pedigrees. Am. J. Hum. Genet. (1998)
- et al. Genomewide scans of complex human diseases: True linkage is hard to find. Am. J. Hum. Genet. (2001)
- Am. J. Hum. Genet. (2001)
- et al. One‐stage versus two‐stage strategies for genome scans. Adv. Genet. (2001)
- Complexity and power in case‐control association studies. Am. J. Hum. Genet. (2001)
- Significance levels in complex inheritance. Am. J. Hum. Genet. (1998)
- Am. J. Hum. Genet. (1998)
- Sequential methods of analysis for genome scans. Adv. Genet. (2001)
- et al. False positives and false negatives in genome scans. Adv. Genet. (2001)
- et al. Using linkage genome scans to improve power of association in genome scans. Am. J. Hum. Genet. (2006)
- Significance levels in genome scans. Adv. Genet. (2001)
- Leveraging the HapMap correlation structure in association studies. Am. J. Hum. Genet.
- Haplotypes vs single marker linkage disequilibrium tests: What do we gain? Eur. J. Hum. Genet.
- Group sequential methods and sample size savings in biomarker‐disease association studies. Genetics
- A tutorial on statistical methods for population association studies. Nat. Rev. Genet.
- “Sequential Identification and Ranking Procedures.”
- Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Statist. Soc. B
- The control of the false discovery rate in multiple testing under dependency. Ann. Stat.
- Sequential designs for genetic epidemiological linkage or association studies: A review of the literature. Biom. J.
- Il calcolo delle assicurazioni su gruppi di teste
- Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R. Istituto Superiore di Scienze Economiche e Commerciali di Firenze
- Mapping complex disease loci in whole‐genome association studies. Nature
- Selection of differentially expressed genes in microarray data analysis. Pharmacogenomics J.
- A simple correction for multiple comparisons in interval mapping genome scans. Heredity
- Choosing the lesser evil: Trade‐off between false discovery rate and non‐discovery rate. Stat. Sinica (to appear)
- “Bootstrap Methods and Their Applications”
- Efficiency and power in genetic association studies. Nat. Genet.
- “Randomization Tests.”
- The jackknife, the bootstrap, and other resampling plans
- Correlation and large‐scale simultaneous significance testing. J. Am. Stat. Assoc.
- Advances in statistical human genetics over the last 25 years. Stat. Med.
- Genetic analysis of case/control data using estimated haplotype frequencies: Application to APOE locus variation and Alzheimer's disease. Genome Res.