Advances in Genetics

Volume 60, 2008, Pages 293-308

Methods for Handling Multiple Testing

https://doi.org/10.1016/S0065-2660(07)00412-9

Abstract

The availability of high‐throughput genotyping technologies and massive amounts of marker data for testing genetic associations is a double‐edged sword. On one side is the possibility that the causative gene (or a closely linked one) will be found from among those tested for association, but on the other, testing many loci for association creates potential false positive results and the need to accommodate the multiple testing problem. Traditional solutions involve correcting each test using adjustments such as a Bonferroni correction. This worked well in settings involving a few tests (e.g., 10–20, as is typical for candidate gene studies) and even when the number of tests was somewhat larger (e.g., a few hundred, as in genome‐wide microsatellite scans). However, current dense single nucleotide polymorphism (SNP) and/or whole‐genome association (WGA) studies often consider several thousand to upwards of 500,000 to 1 million SNPs. In these settings, a Bonferroni correction is not practical because it does not take into account correlations between the tests due to linkage disequilibrium and hence can be too conservative. The effect sizes of susceptibility alleles will rarely (if ever) reach the required level of significance in WGA studies if a Bonferroni correction is used, and the number of false negatives is likely to be large. Thus, one of the burning methodological issues in contemporary genetic epidemiology and statistical genetics is how to balance false positives and false negatives in large‐scale association studies. This chapter reviews developments in this area from both historical and current perspectives.

Introduction

The problem of multiple testing in genetics studies arises when many different loci are tested for association with a particular trait or disease, as is common in genome‐wide linkage and whole‐genome association (WGA) scans as well as in gene expression microarray studies. The most common method for dealing with multiple testing problems involves adjusting the significance level (α) of each test to accommodate the total number of tests performed, although the need for this type of adjustment has been questioned (e.g., Rothman, 1990). Historically, the Bonferroni correction has been used to adjust significance levels (Bonferroni 1935, Bonferroni 1936). The Bonferroni correction provides a simple formula for computing the required pointwise α‐levels (for each test made) from a global or experiment‐wise error rate or α‐level (say 0.05). This method works well when there are only a few independent tests being performed. For example, in an experiment involving 10–20 independent tests, one would expect no more than one result to be significant at the p < 0.05 level due to chance alone. In fact, the Bonferroni correction formed the basis for the often‐cited Lander and Kruglyak (1995) lod score thresholds for mapping complex trait loci in linkage mapping contexts: 1.9 (α = 1.7 × 10⁻³) for suggestive linkage and 3.3 (α = 4.9 × 10⁻⁵) for significant linkage. These Bonferroni‐based thresholds for declaring significance assume a dense map containing an infinite number of markers with infinitely small intermarker distances. For more realistic genome‐wide linkage scans involving about 300 markers, one would expect about 15 false positives at a pointwise α‐level of 0.05; holding the experiment‐wise error rate at 0.05 instead requires a Bonferroni‐corrected pointwise α‐level of 0.05/300 ≈ 1.7 × 10⁻⁴, corresponding to a lod score of about 2.75.
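To make the arithmetic concrete, the following minimal sketch (ours, using only the values quoted above) reproduces the 300‐marker calculation; the lod‐score conversion uses the standard one‐sided normal approximation, lod = z²/(2 ln 10).

```python
import math
from scipy.stats import norm

# A sketch of the Bonferroni arithmetic for a 300-marker genome-wide
# linkage scan; the inputs follow the numbers quoted in the text.
m = 300                             # markers tested
alpha_global = 0.05                 # desired experiment-wise error rate

expected_fp = m * 0.05              # expected false positives at p < 0.05
alpha_pointwise = alpha_global / m  # Bonferroni-corrected pointwise level

# One-sided normal approximation linking alpha to a lod score:
# lod = z^2 / (2 ln 10), where z is the normal critical value.
z = norm.isf(alpha_pointwise)
lod = z**2 / (2 * math.log(10))

print(f"expected false positives at p<0.05: {expected_fp:.0f}")  # 15
print(f"Bonferroni pointwise alpha: {alpha_pointwise:.1e}")      # 1.7e-04
print(f"approximate lod threshold: {lod:.2f}")                   # ~2.8
```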

Even with the use of these more realistic α‐levels in linkage mapping studies, significant results involving complex, multifactorial traits have been few and far between (Altmuller et al., 2001). The problems inherent to multiple testing have been compounded by the explosion of available DNA markers and high‐throughput genotyping technologies, as researchers begin to undertake WGA studies that involve hundreds of thousands of tests, for example, in prostate cancer (Gudmundsson et al., 2007) and type 2 diabetes (Sladek et al., 2007). Multiple testing problems this pronounced have been referred to in the data mining community as the “short‐fat data” or “large p, small n” problem, and currently there is no agreed‐upon resolution. For example, in a single nucleotide polymorphism (SNP)‐based WGA scan involving a half‐million tests, one would expect to see a whopping 25,000 false positives if a simple p < 0.05 criterion for declaring significance were applied to each SNP; alternatively, one would have to adopt a pointwise α‐level as small as 1.0 × 10⁻⁷ if a simple Bonferroni correction were applied to each SNP in order to preserve an experiment‐wise error rate of 5%. Because it is unlikely that any single SNP would have an effect large enough to produce a p value that small in studies with realistic sample sizes for a complex trait, investigators are either doomed to failure at the outset or must be willing to balance the risk of including some false positives (i.e., type I errors) in a set worth pursuing in further studies against the chance of missing important loci as false negatives (i.e., type II errors).
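The half‐million‐SNP numbers above are easy to verify by simulation. The sketch below (our illustration, not the chapter's) draws uniform null p‐values, as would arise under a global null hypothesis, and counts how many cross each cutoff.

```python
import numpy as np

# Monte Carlo check of the 500,000-SNP arithmetic: under a global null,
# p-values are uniform on (0, 1), so about m * 0.05 fall below 0.05,
# while essentially none survive the Bonferroni cutoff of 0.05 / m.
rng = np.random.default_rng(0)
m = 500_000                          # SNPs in a hypothetical WGA scan
p = rng.uniform(size=m)              # null p-values

print((p < 0.05).sum())              # ~25,000 "significant" by chance
print((p < 0.05 / m).sum())          # ~0 pass the 1e-7 Bonferroni cutoff
```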

One very important consideration in genetic mapping studies is that the tests are not likely to be independent, which runs counter to the assumptions underlying the Bonferroni method [see, e.g., Efron (2007) for discussion]. Mechanisms such as linkage disequilibrium (LD) and the occurrence of multiple neighboring genes implicated in given metabolic pathways give rise to correlations among subsets of SNPs. Consequently, concurrent with the revolution in the development of molecular genetic technologies, there has been a rekindling of active study into methods of evaluating statistical significance that not only address the massive number of multiple tests that might be pursued, but also consider other mechanisms that are unique to genetic analysis settings, such as correlation structures among tests at neighboring loci due to LD and prior information about the phenotypic influence of those loci obtained from other sources, such as linkage evidence and allele frequencies. In general, newer methods consider these complicating factors while attempting to strike a balance between type I and type II errors. In later sections of this chapter, we review aspects of the various practices that have been described for handling problems arising from multiple testing in genomics studies.
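One standard way to respect LD‐induced correlation among tests is a permutation procedure: permuting the phenotype breaks any genotype–trait association while leaving the correlation structure among SNPs intact, so the null distribution of the maximum test statistic automatically reflects the effective number of independent tests. The sketch below is ours; the inputs (a subjects × SNPs matrix of allele counts and a quantitative trait) and function names are hypothetical.

```python
import numpy as np

def max_stat(genotypes, trait):
    # Per-SNP statistic: squared correlation between allele count and
    # trait; assumes every SNP column is polymorphic (nonzero variance).
    g = (genotypes - genotypes.mean(0)) / genotypes.std(0)
    t = (trait - trait.mean()) / trait.std()
    return np.max((g.T @ t / len(t)) ** 2)

def permutation_threshold(genotypes, trait, n_perm=1000, alpha=0.05, seed=0):
    # Permuting the trait preserves LD among SNPs, so the (1 - alpha)
    # quantile of the max statistic controls the family-wise error rate
    # without assuming the tests are independent.
    rng = np.random.default_rng(seed)
    null_max = [max_stat(genotypes, rng.permutation(trait))
                for _ in range(n_perm)]
    return np.quantile(null_max, 1 - alpha)
```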

In summary, the problem inherent to many large‐scale genetic linkage and association studies is that by performing multiple tests researchers increase the likelihood of obtaining false positive results (Storey and Tibshirani, 2003). Although the classical Bonferroni adjustment was developed to deal with this problem by requiring the signal to be significant at a global level across all tests, this global α‐level becomes increasingly conservative as greater numbers of tests are performed because it assumes, essentially, that the tests are independent. Practically speaking, the effects of inherited variations for many complex traits are unlikely to be large, as multiple genes and gene interactions probably influence such traits. Consequently, most researchers would rather risk higher false positive rates in their studies than risk finding no associations or linkages at all. Ultimately, then, the problem of multiple testing in gene‐finding experiments is to choose appropriate methods that simultaneously control both false positives and false negatives.

Types of Errors in Hypothesis Testing (Type I and Type II)

In statistics, hypothesis testing typically involves testing a null hypothesis against an alternative hypothesis. Here we will assume that the null hypothesis is that a given gene or genetic variation has no effect on a trait of interest (i.e., it is not linked or associated with the trait), and the alternative is that there is a linkage or association. Because most contemporary genetic epidemiologic research focuses on association studies rather than linkage studies, we will consider
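Although this snippet is truncated, the null hypothesis it defines can be illustrated concretely. In the toy simulation below (our own, with arbitrary sizes), genotypes and case/control labels are generated independently, so any "association" detected is a type I error; the rejection rate of a chi‐square test of association should track the nominal α.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Simulate the null: genotype and disease status are independent, so
# every rejection at level alpha is a type I error by construction.
rng = np.random.default_rng(1)
alpha, n_reps, n = 0.05, 2000, 500
false_pos = 0
for _ in range(n_reps):
    geno = rng.integers(0, 3, size=n)     # 0/1/2 allele counts
    status = rng.integers(0, 2, size=n)   # case/control, no true effect
    table = np.zeros((3, 2))
    for g, s in zip(geno, status):
        table[g, s] += 1                  # 3x2 genotype-by-status counts
    _, p, _, _ = chi2_contingency(table)
    false_pos += p < alpha

print(false_pos / n_reps)  # close to 0.05, as expected under the null
```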

Striking a Balance Between False Positives and False Negatives

The use of very conservative or stringent significance levels (α) to test hypotheses leads to a loss of power and an increase in the rate of false negatives, whereas the use of significance levels that are too liberal leads to unacceptably high rates of false positives. Todorov and Rao (1997) demonstrated the relationship between these two errors in one linkage analysis scenario by plotting both false positives (F+) and false negatives (F−) against the pointwise significance level. The example,
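Although the chapter's example is truncated here, a sketch in the spirit of that illustration is straightforward to construct. With assumed counts of null and signal tests and an assumed effect size (all numbers below are ours, not Todorov and Rao's), expected false positives rise linearly with the pointwise α while expected false negatives fall, making the trade‐off explicit.

```python
import numpy as np
from scipy.stats import norm

# Expected F+ and F- as functions of the pointwise alpha, for m0
# true-null tests and m1 true signals with a one-sided z-test.
m0, m1 = 990, 10          # assumed null / signal test counts
effect = 3.0              # assumed standardized effect size (z-scale)

alphas = np.logspace(-6, -1, 50)
z_crit = norm.isf(alphas)                # one-sided critical values
f_pos = m0 * alphas                      # expected false positives
f_neg = m1 * norm.cdf(z_crit - effect)   # expected false negatives

for a, fp, fn in zip(alphas[::10], f_pos[::10], f_neg[::10]):
    print(f"alpha={a:.1e}  F+={fp:7.3f}  F-={fn:6.3f}")
```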

Alternative Adjustment Methods

Various methods attempt to control for these multiple testing issues and have been discussed in several recent articles (e.g., Balding 2006, Carlson 2004, Cheverud 2001, Elston 2006, Guo 2001, Lander 1995, Morton 1998, Pounds 2006, Province 2001, Rao 1998, Rao 2001, Strug 2006, Thomson 2001). At least two general types of multiple comparison procedures are used, one controlling family‐wise error rates (FWERs) and the other controlling false discovery rates (FDRs; Benjamini 1995, Benjamini 2001).
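To make the distinction concrete, the sketch below (with placeholder p‐values) implements both a Bonferroni FWER cutoff and the Benjamini–Hochberg (1995) step‐up FDR procedure; at the same nominal level, the FDR procedure typically rejects more hypotheses.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    # FWER control: each test is held to alpha / (number of tests).
    return pvals < alpha / len(pvals)

def benjamini_hochberg_reject(pvals, alpha=0.05):
    # FDR control (Benjamini-Hochberg step-up): find the largest k with
    # p_(k) <= (k/m) * alpha and reject the k smallest p-values.
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    passing = np.nonzero(ranked <= (np.arange(1, m + 1) / m) * alpha)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size:
        reject[order[: passing[-1] + 1]] = True
    return reject

p = np.array([1e-6, 4e-4, 3e-3, 0.012, 0.04, 0.2, 0.6])  # placeholders
print(bonferroni_reject(p).sum())           # 3 rejections (FWER)
print(benjamini_hochberg_reject(p).sum())   # 4 rejections (FDR)
```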

Conclusion

In summary, molecular genetic technologies have advanced to such an extent that the sheer volume of data they produce often overwhelms researchers' ability to make valid inferences from those data. Novel statistical methods and insights are therefore needed to make expedient, informed, and unbiased use of modern high‐throughput genomic data. Multiple testing issues, which on the surface may seem like a simple problem, can be quite complex for genomic data for a number

Acknowledgments

T. K. R. and D. C. R. are supported in part by grants from the National Institute of General Medical Sciences (GM 28719) and the National Heart, Lung, and Blood Institute (HL 54473), National Institutes of Health. N. J. S. is supported in part by grants from the National Heart, Lung, and Blood Institute Family Blood Pressure Program (HL064777), the National Institute on Aging Longevity Consortium (AG023122), the National Institute of Mental Health Consortium on the Genetics of Schizophrenia

References

  • N. Zaitlen et al., Leveraging the HapMap correlation structure in association studies, Am. J. Hum. Genet. (2007)
  • J. Akey et al., Haplotypes vs single marker linkage disequilibrium tests: What do we gain?, Eur. J. Hum. Genet. (2001)
  • R. Aplenc et al., Group sequential methods and sample size savings in biomarker‐disease association studies, Genetics (2003)
  • D.J. Balding, A tutorial on statistical methods for population association studies, Nat. Rev. Genet. (2006)
  • R.E. Bechhofer et al., “Sequential Identification and Ranking Procedures” (1968)
  • Y. Benjamini et al., Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. R. Statist. Soc. B (1995)
  • Y. Benjamini et al., The control of the false discovery rate in multiple testing under dependency, Ann. Stat. (2001)
  • I.R. Böddeker et al., Sequential designs for genetic epidemiological linkage or association studies: A review of the literature, Biom. J. (2001)
  • C.E. Bonferroni, Il calcolo delle assicurazioni su gruppi di teste (1935)
  • C.E. Bonferroni, Teoria statistica delle classi e calcolo delle probabilità, Pubblicazioni del R. Istituto Superiore di Scienze Economiche e Commerciali di Firenze (1936)
  • C.S. Carlson et al., Mapping complex disease loci in whole‐genome association studies, Nature (2004)
  • J.J. Chen et al., Selection of differentially expressed genes in microarray data analysis, Pharmacogenomics J. (2007)
  • J.M. Cheverud, A simple correction for multiple comparisons in interval mapping genome scans, Heredity (2001)
  • R.V. Craiu et al., Choosing the lesser evil: Trade‐off between false discovery rate and non‐discovery rate, Stat. Sinica (2007, to appear)
  • A.C. Davison et al., “Bootstrap Methods and Their Applications” (1997)
  • P.I. de Bakker et al., Efficiency and power in genetic association studies, Nat. Genet. (2005)
  • E.S. Edgington, “Randomization Tests” (1995)
  • B. Efron, The jackknife, the bootstrap, and other resampling plans (1982)
  • B. Efron, Correlation and large‐scale simultaneous significance testing, J. Am. Stat. Assoc. (2007)
  • R.C. Elston et al., Advances in statistical human genetics over the last 25 years, Stat. Med. (2006)
  • D. Fallin et al., Genetic analysis of case/control data using estimated haplotype frequencies: Application to APOE locus variation and Alzheimer's disease, Genome Res. (2001)