Introduction

Substantial progress in GWAS of complex diseases has been made and at least 300 loci have been found to be significantly associated with as many as 120 diseases and traits in these studies.1 In spite of the great success of GWAS, current GWAS continue to be primarily focused on testing associations of a single SNP with a disease one at a time. As common diseases are often caused by multiple genes and environments that are organized into a myriad of complex networks, to only test for association of a single SNP has limited utility2 and is insufficient to dissect the complex genetic structure of common diseases for the following reasons. First, the common approach to the current GWAS is to select dozens of the most significant SNPs in the list for further investigations. However, the set of most significant SNPs often accounts for only a small proportion of the genetic variants associated with disease and offers limited understanding of complex diseases.3 Common diseases often arise from the joint action of multiple loci within a gene or joint action of multiple genes within a pathway. Although each single SNP may confer only a small disease risk, their joint actions are likely to have a significant role in the development of disease. If one only considers the most significant SNPs, the genetic variants that jointly have significant risk effects, individually making only a small contribution, will be missed. Second, locus heterogeneity, in which alleles at different loci cause disease in different populations, will increase the difficulty in replicating associations of a single marker with a disease.4 The list of significant SNPs from several studies may have little overlap. Therefore, replication of association findings at the SNP level can be difficult if redundant genes have roles. Third, the ultimate purpose of genetic studies of complex diseases is to decipher the path from genotype to phenotype. 
Despite extensive studies in search of genes causing complex diseases, the connections between DNA variation and complex phenotypes, which are essential for unraveling the pathogenesis of complex diseases and predicting variation in human health, have remained elusive. The health state of an individual is a complex, multidimensional phenomenon. Clinical manifestations arise from the integrated actions of multiple genetic and environmental factors through dynamic, epigenetic and regulatory mechanisms.5, 6, 7 What has been generally missing in current GWAS is the context in which DNA variation occurs. It was reported that the location of a gene within a cellular network may have a significant effect on the consequences of a mutation in that gene.8 Genetic variation occurring at multiple loci often perturbs signaling, regulatory and metabolic pathways, resulting in complex changes in phenotype. SNPs and genes carry out their functions through intricate pathways of reactions and interactions. Knowing the list of risk SNPs is not sufficient to understand disease mechanisms.9

To overcome these limitations, Wang et al10 recently suggested extending to GWAS the gene set enrichment analysis that was developed for gene expression data, where it is intended to identify subtle but coordinated expression variations of gene groups. The challenge for this extension is how to represent a gene in GWAS. Wang et al10 suggested choosing the most significant SNP from each gene as its representative. However, in GWAS a gene often contains a variable number of SNPs. Genes that contain a number of SNPs jointly having significant risk effects, but individually making only a small contribution, will be missed by such a representation. Another issue is how to deal with correlations among SNPs and genes. Owing to linkage disequilibrium (LD), there may be high correlations among some SNPs. In the publication of Wang et al, the statistics used for testing association of a pathway with the disease did not take correlations among SNPs into account.

To solve these problems, we consider three basic units of association analysis: SNP, gene and pathway and suggest gene and pathway-based GWAS. In gene and pathway-based GWAS, each gene is represented by all SNPs, which are either located within the gene or are not >500 kb away from the gene.10 Unlike gene set enrichment analysis in which one examines whether significantly associated genes are overrepresented in the set of genes to be analyzed, we formulate the gene and pathway-based GWAS as the problem to jointly test for association of multiple SNPs within the gene or multiple genes within the pathway with disease. This allows us to holistically unravel complex genetic structure of common disease to gain insight into the biological processes and disease mechanism.

The purpose of this report is to develop a general framework for gene and pathway-based GWAS of complex diseases and novel statistics for testing association of a gene or pathway with the disease. To accomplish this, we first formulate the null hypothesis for testing association of the gene or pathway with the disease. Then, we develop three statistics to combine a set of dependent P-values of SNPs into an overall significance level for a gene or a set of dependent P-values of genes into an overall significance level for a pathway. We validate the null distribution and calculate type 1 error rates of the three developed statistics for testing association of the gene or pathway with the disease using extensive simulation studies. To illustrate how to perform the gene and pathway-based GWAS, we examine GWAS of rheumatoid arthritis (RA) in two independent studies: Wellcome Trust Case–Control Consortium (WTCCC) and the North American Rheumatoid Arthritis Consortium (NARAC) studies. Our results show that the suggested new paradigm for GWAS not only can identify the genes that have large genetic effects and can be found by single SNP association analysis, but also can detect new genes in which each single SNP confers a small disease risk, but their joint actions can be implicated in the development of diseases.

A program for implementation can be downloaded from our website http://www.sph.uth.tmc.edu/hgc/faculty/xiong/.

Materials and methods

Gene-based association and its formal null hypothesis testing

A gene-based association analysis uses a gene as the basic unit of analysis. The gene-based association jointly considers all common variation within a gene.4 Instead of testing association of single SNPs with the disease, gene-based association jointly tests for association of all the SNPs within the gene. Formally, suppose that there are k SNPs in the gene. The null hypothesis for testing association of the ith SNP in the gene is represented by

$$H_0^{i}: \theta_i = 0,$$

where $\theta_i$ denotes the parameter of interest, for example, the difference in allele frequencies between cases and controls. Then, the null hypothesis for testing association of a gene with disease is defined as the combined null hypothesis:

$$H_0: \theta_1 = \theta_2 = \cdots = \theta_k = 0.$$

The goal of testing association of the gene is to test all SNPs in the gene as a whole, that is, to test the overall effect of all SNPs in the gene by combining the evidence from each of them. Each SNP in the gene may confer only a small disease risk, while jointly the SNPs make a large contribution.

Statistics for testing association of a gene with disease

A general framework for testing association of a gene with the disease is to combine evidence from all the markers within the gene. In general, correlations among P-values of SNPs within the gene exist because of LD among SNPs. Correlations among SNPs will invalidate the existing methods for combining independent P-values. Therefore, the methods for combining independent P-values cannot be directly applied to combining P-values of SNPs within the gene. We need to develop methods for combining dependent P-values, which take correlations among SNPs into account. We suggest three statistics for combining dependent P-values. In the following discussion, we assume that Pi is the P-value of the statistic with a normal or asymptotic normal distribution.

Before presenting the statistics, we introduce some notation. Consider SNP Mi with two alleles Bi and bi, and SNP Mj with two alleles Bj and bj. For cases, we define indicator variables xi and xj for either the alleles or the genotypes at the two SNPs.

We similarly define the indicator variables yi and yj for controls.

Linear combination test (LCT)

The first suggested statistic takes a linear combination of the transformed P-values of all SNPs within the gene and is referred to as the LCT. Let $e=(1, 1, \ldots, 1)^T$. The statistic based on a linear combination of the vector Z is defined as

$$T_L = \frac{e^T Z}{\sqrt{e^T R_g e}}, \qquad (1)$$

where $Z_i=\Phi^{-1}(1-P_i)$, $Z=(Z_1, \ldots, Z_k)^T$ and $R_g$ is the correlation matrix of Z. A key issue is how to calculate the correlation matrix $R_g$, which is difficult in general. However, if the P-value for each SNP is calculated from a t statistic, we have the following result. Let $Z_i=\Phi^{-1}(1-P_i)=\Phi^{-1}(F_T(t_i))$, where $t_i$ is the t statistic for testing association of the i-th SNP and $F_T$ is its cumulative distribution function. When the sample size is large enough, $F_T$ can be approximated by the standard normal distribution function, which implies $Z_i\approx t_i$. Therefore, under the null hypothesis the correlation matrix of Z among all the SNPs within a gene can be given by the sampling correlation matrix of the data: $\mathrm{corr}(Z_i, Z_j)\approx \mathrm{corr}(x_i-y_i, x_j-y_j)$. Hence, the correlation matrix $R_g$ can be approximated by

$$R_g \approx \big(\mathrm{corr}(x_i-y_i,\; x_j-y_j)\big)_{k\times k}, \qquad (2)$$

where $x_i$ and $y_i$ are indicator variables for either alleles or genotypes in cases and controls, respectively, and $\Phi$ is the standard normal distribution function. Under the null hypothesis, $T_L$ asymptotically follows the standard normal distribution.
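As a minimal sketch of the LCT, assuming that the statistic has the form of a sum of the transformed P-values divided by the square root of the sum of all entries of the correlation matrix, and that an upper-tail P-value is wanted, the computation could look as follows. The function names `lct_statistic` and `lct_p_value` are illustrative and are not taken from the authors' program; only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def lct_statistic(p_values, corr):
    """Linear combination statistic e'Z / sqrt(e'Re), with Z_i = Phi^{-1}(1 - P_i).

    p_values: list of SNP-level P-values strictly between 0 and 1.
    corr:     k x k correlation matrix of Z as a list of lists (assumed known).
    """
    z = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(z)                        # e'Z
    variance = sum(sum(row) for row in corr)  # e'Re: sum of all matrix entries
    return numerator / sqrt(variance)


def lct_p_value(p_values, corr):
    """Upper-tail P-value of the statistic (assumption: one-sided test)."""
    return 1.0 - _STD_NORMAL.cdf(lct_statistic(p_values, corr))


if __name__ == "__main__":
    p = [0.025, 0.025]
    independent = [[1.0, 0.0], [0.0, 1.0]]
    duplicated = [[1.0, 1.0], [1.0, 1.0]]  # two perfectly correlated SNPs
    print(lct_statistic(p, independent))
    print(lct_statistic(p, duplicated))
```

In the second call the two SNPs carry the same information, so the statistic reduces to the value of a single SNP instead of counting the evidence twice, which is the purpose of the correlation term in the denominator.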

Quadratic Test (QT)

A QT that is based on the quadratic form of Z is defined as

$$T_Q = Z^T R_g^{-1} Z,$$

where Z and $R_g$ are as previously defined. Under the null hypothesis, $T_Q$ is asymptotically distributed as a central $\chi^2_{(k)}$ distribution, where k is the number of SNPs within the gene.
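The quadratic form of Z with the inverse correlation matrix can be evaluated without forming the inverse explicitly. The sketch below assumes a positive definite correlation matrix and uses a Cholesky factorization, which is an implementation choice made here for illustration and not necessarily the one in the authors' program; the function names are hypothetical. Converting the statistic to a P-value requires a chi-square survival function with k degrees of freedom from a statistics library, which is omitted.

```python
from math import sqrt
from statistics import NormalDist


def cholesky(r):
    """Lower-triangular C with C C' = r, for a positive definite matrix r."""
    n = len(r)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(c[i][t] * c[j][t] for t in range(j))
            if i == j:
                c[i][j] = sqrt(r[i][i] - partial)
            else:
                c[i][j] = (r[i][j] - partial) / c[j][j]
    return c


def forward_solve(c, z):
    """Solve C w = z for w, where C is lower triangular."""
    w = []
    for i in range(len(z)):
        partial = sum(c[i][t] * w[t] for t in range(i))
        w.append((z[i] - partial) / c[i][i])
    return w


def qt_statistic(p_values, corr):
    """Quadratic statistic Z' R^{-1} Z, computed as the squared length of C^{-1} Z."""
    normal = NormalDist()
    z = [normal.inv_cdf(1.0 - p) for p in p_values]
    w = forward_solve(cholesky(corr), z)
    return sum(value * value for value in w)


if __name__ == "__main__":
    print(qt_statistic([0.025, 0.025], [[1.0, 0.5], [0.5, 1.0]]))
```

Because the inverse of the correlation matrix equals the product of the inverse transposed factor and the inverse factor, the quadratic form equals the sum of squares of the decorrelated variables, and its value does not depend on which factorization is used.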

Decorrelation Test (DT)

Another way to combine dependent P-values is to first transform the dependent variables into independent variables and then combine the independent variables. Let the correlation matrix $R_g$ be decomposed as

$$R_g = C C^T,$$

where C is a nonsingular matrix. Then, the correlated random variables $Z_i$ $(i=1, \ldots, k)$ can be decorrelated by the following transformation:

$$W = C^{-1} Z.$$

It can be easily observed that

$$\mathrm{cov}(W) = C^{-1} R_g (C^{-1})^T = C^{-1} C C^T (C^T)^{-1} = I.$$

Thus, the variables in W are uncorrelated and, being asymptotically jointly normal, independent, which implies that the decorrelated statistics W are asymptotically distributed as a vector of independent standard normal variables. For each $W_i$, we calculate the P-value $P_i^*$, resulting in

$$P^* = (P_1^*, \ldots, P_k^*)^T.$$

All the methods for combining independent P-values can be applied to $P^*$. For example, we can use Fisher's combination test11 to combine $P^*$:

$$-2\sum_{i=1}^{k} \log P_i^*,$$

which follows a $\chi^2_{(2k)}$ distribution under the null hypothesis. Alternatively, the Sidak, Simes or false discovery rate (FDR) method can be used.12
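The decorrelation step followed by Fisher's combination can be sketched as below, under several assumptions that are not stated in the text: the factor C is taken to be the Cholesky factor of the correlation matrix, the P-value of each decorrelated variable is its upper-tail normal probability, and the function names are hypothetical. The chi-square survival function is written in closed form, which is possible because the degrees of freedom of Fisher's statistic are always even.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def cholesky(r):
    """Lower-triangular C with C C' = r, for a positive definite matrix r."""
    n = len(r)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(c[i][t] * c[j][t] for t in range(j))
            if i == j:
                c[i][j] = sqrt(r[i][i] - partial)
            else:
                c[i][j] = (r[i][j] - partial) / c[j][j]
    return c


def forward_solve(c, z):
    """Solve C w = z for w, where C is lower triangular."""
    w = []
    for i in range(len(z)):
        partial = sum(c[i][t] * w[t] for t in range(i))
        w.append((z[i] - partial) / c[i][i])
    return w


def chi2_sf_even(x, df):
    """Survival function of a chi-square variable with even df = 2k:
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!"""
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total


def dt_fisher(p_values, corr):
    """Decorrelate Z = Phi^{-1}(1 - P), then combine the new P-values by Fisher's method.

    Returns (statistic, combined P-value)."""
    normal = NormalDist()
    z = [normal.inv_cdf(1.0 - p) for p in p_values]
    w = forward_solve(cholesky(corr), z)          # W = C^{-1} Z
    p_star = [1.0 - normal.cdf(value) for value in w]
    statistic = -2.0 * sum(log(p) for p in p_star)
    return statistic, chi2_sf_even(statistic, 2 * len(p_star))


if __name__ == "__main__":
    print(dt_fisher([0.025, 0.025], [[1.0, 0.0], [0.0, 1.0]]))
    print(dt_fisher([0.025, 0.025], [[1.0, 0.5], [0.5, 1.0]]))
```

One design point worth noting: unlike the quadratic form, the decorrelated P-values depend on the choice of C, so with a Cholesky factor the result can change with the ordering of the SNPs.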

Pathway-based association test

A general framework for testing association of a pathway with disease that is similar to gene-based association analysis is to combine P-values of the genes within the pathway from gene-based association analysis into an overall significant level of the pathway.

Correlation structure among genes within a pathway

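Consider m genes within a pathway. Suppose that the i-th gene has $k_i$ SNPs. Let $x_{iu}$ and $y_{iu}$ be the indicator variables for the u-th marker in the i-th gene in cases and controls, respectively, and let $x_{jv}$ and $y_{jv}$ be the corresponding indicator variables for the v-th marker in the j-th gene. The correlation between the u-th marker in the i-th gene and the v-th marker in the j-th gene is defined as $r_{iu,jv} = \mathrm{corr}(x_{iu}-y_{iu},\; x_{jv}-y_{jv})$. Let $Z_{ij}=\Phi^{-1}(1-P_{ij})$, where $P_{ij}$ is the P-value for testing association of the j-th SNP in the i-th gene. Define

$$Z_i = (Z_{i1}, \ldots, Z_{ik_i})^T.$$

Define the correlation matrix between the vectors $Z_i$ and $Z_j$ as

$$R_{ij} = (r_{iu,jv})_{k_i\times k_j}.$$

Let $R_i$ be the correlation matrix of the vector $Z_i$ for the i-th gene in the pathway, which is defined in Equation (2), and let the correlation matrix of the vector $Z=(Z_1^T, \ldots, Z_m^T)^T$ for the whole pathway be defined as

$$R = \begin{pmatrix} R_1 & R_{12} & \cdots & R_{1m} \\ R_{21} & R_2 & \cdots & R_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ R_{m1} & R_{m2} & \cdots & R_m \end{pmatrix}.$$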

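Recall that the statistic $T_{Li}$ for the i-th gene defined in Equation (1) is given by

$$T_{Li} = \frac{e_i^T Z_i}{\sqrt{e_i^T R_i e_i}},$$

where $e_i$ is the vector of $k_i$ ones. By simple algebra, we have

$$\mathrm{cov}(T_{Li}, T_{Lj}) = \frac{e_i^T R_{ij} e_j}{\sqrt{e_i^T R_i e_i}\,\sqrt{e_j^T R_j e_j}},$$

where $R_{ij}$ denotes the correlation matrix between $Z_i$ and $Z_j$. Let $T_L=(T_{L1}, \ldots, T_{Lm})^T$ and let $r^g_{ij}=\mathrm{corr}(T_{Li},T_{Lj})$ be the correlation between the test statistic for the i-th gene and the test statistic for the j-th gene. Because each $T_{Li}$ has unit variance under the null hypothesis, $r^g_{ij}$ equals the covariance above. Then, the corresponding correlation matrix $R_p$ for the whole pathway is given by

$$R_p = (r^g_{ij})_{m\times m}. \qquad (6)$$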

Statistics for testing association of a pathway with disease

Similar to testing for association of a gene with the disease, the basic idea for testing association of a pathway with the disease is to combine P-values of genes within the pathway. We have three statistics for testing association of a pathway with the disease.

Linear combination test

Taking a linear combination of the statistics for testing association of the genes within the pathway leads to a statistic for testing association of the pathway with the disease. Formally, we define the statistic for testing association of the pathway with the disease as

$$T_P = \frac{e^T T_L}{\sqrt{e^T R_p e}},$$

where $T_L=(T_{L1}, \ldots, T_{Lm})^T$, e is the vector of m ones and $R_p$ is defined in Equation (6). Then, under the null hypothesis, $T_P$ is asymptotically distributed as the standard normal distribution.
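To illustrate how gene-level results might be carried up to the pathway level, the following sketch first derives the correlation between two gene-level linear combination statistics from the SNP-level correlation blocks and then forms the pathway statistic. It assumes that each gene-level statistic is a sum of transformed P-values scaled to unit variance, so that the correlation between two genes is the sum of the cross-gene block divided by the square root of the product of the sums of the two within-gene blocks. All function names are hypothetical.

```python
from math import sqrt


def block_sum(matrix):
    """Sum of all entries of a matrix given as a list of lists (e'Me)."""
    return sum(sum(row) for row in matrix)


def gene_correlation(r_ii, r_jj, r_ij):
    """Correlation between the linear combination statistics of genes i and j.

    r_ii, r_jj: within-gene SNP correlation matrices.
    r_ij:       cross-gene SNP correlation block (rows: SNPs of i, columns: SNPs of j).
    """
    return block_sum(r_ij) / sqrt(block_sum(r_ii) * block_sum(r_jj))


def pathway_lct(gene_stats, r_p):
    """Pathway statistic e'T / sqrt(e'R_p e) from gene-level statistics T."""
    return sum(gene_stats) / sqrt(block_sum(r_p))


if __name__ == "__main__":
    # Toy pathway: gene 1 has two uncorrelated SNPs, gene 2 has one SNP that is
    # correlated 0.5 with each SNP of gene 1 (illustrative numbers only).
    r_11 = [[1.0, 0.0], [0.0, 1.0]]
    r_22 = [[1.0]]
    r_12 = [[0.5], [0.5]]
    r12 = gene_correlation(r_11, r_22, r_12)
    r_p = [[1.0, r12], [r12, 1.0]]
    print(r12, pathway_lct([1.0, 1.0], r_p))
```

With the toy numbers above the two gene statistics are positively correlated, so the pathway statistic is smaller than it would be if the genes were treated as independent.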

Quadratic test

Similar to the gene-based analysis, we can also define the following QT:

$$T_{PQ} = T_L^T R_p^{-1} T_L.$$

Under the null hypothesis, $T_{PQ}$ is asymptotically distributed as a central $\chi^2_{(m)}$ distribution.

Decorrelation test

The vector of the statistics for testing gene association, $T_L$, can also be decorrelated by

$$T_{PD} = C_P^{-1} T_L,$$

where $R_p = C_P C_P^T$. Then, $T_{PD}$ consists of m independent standard normal variables. Let $P_D = (P_{D1}, \ldots, P_{Dm})^T$ be the vector of P-values corresponding to $T_{PD}$. We can use Fisher's combination test to combine $P_D$:

$$-2\sum_{i=1}^{m} \log P_{Di},$$

which follows a $\chi^2_{(2m)}$ distribution. Other methods for combining independent P-values, such as the Sidak, Simes and FDR methods, can also be used to combine the P-values of the individual genes within the pathway.

Results

Type 1 error rates of test statistics

To validate the statistics presented in this report for testing association of genes and pathways with the disease, we first verified that the Z statistic obtained by the inverse normal transformation of the t statistic follows the standard normal distribution. For simplicity, we present results only for allele indicator variables; the results for genotypes were similar (data not shown). SNaP software13 was used to generate a population of 1 000 000 chromosomes. We sampled 2000 individuals as cases and 2000 individuals as controls from the population and performed 10 000 simulations. Figure 1 plots the empirical distribution of the Z statistic, which is very close to the standard normal distribution. We then calculated the type 1 error rates of the developed statistics. For the statistics testing association of a gene with the disease, SNaP software was used to generate 1 000 000 chromosomes, each having a gene with 20 SNPs. For the statistics testing association of a pathway with the disease, SNaP software was used to generate 1 000 000 chromosomes, each having 5 blocks representing genes, with 20 SNPs per block. We randomly sampled individuals from the population and divided them equally into cases and controls. The number of sampled controls ranged from 1000 to 3000, and 10 000 simulations were performed. Table 1 and Supplementary Table 1 show that the type 1 error rates of the statistics for testing association of the gene and the pathway with the disease, respectively, were not appreciably different from the nominal levels (α=0.05, α=0.01 and α=0.001).
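The logic of a type 1 error check can be illustrated with a much simpler setting than the genotype simulations described above. The sketch below does not use SNaP or genotype data; it assumes that, under the null hypothesis, the transformed statistics of the 20 SNPs are exactly standard normal with a common pairwise correlation rho, and it counts how often a one-sided linear combination statistic exceeds the nominal critical value. The function name, the equicorrelation structure, the value of rho and the seed are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_lct_type1_error(k=20, rho=0.3, alpha=0.05, n_sim=10000, seed=1):
    """Empirical rejection rate of a one-sided linear combination test under the null.

    Z_i = sqrt(rho) * G + sqrt(1 - rho) * E_i gives unit variances and pairwise
    correlation rho, so e'Re = k + k * (k - 1) * rho.
    """
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1.0 - alpha)
    denominator = sqrt(k + k * (k - 1) * rho)
    rejections = 0
    for _ in range(n_sim):
        shared = rng.gauss(0.0, 1.0)  # component common to all SNPs
        z = [sqrt(rho) * shared + sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(k)]
        if sum(z) / denominator > critical:
            rejections += 1
    return rejections / n_sim


if __name__ == "__main__":
    print(simulate_lct_type1_error())
```

If the correlation were ignored in the denominator (that is, if the square root of k were used instead), the same simulation would reject far more often than the nominal level, which is the behaviour the correlation-aware statistics are designed to avoid.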

Figure 1 Empirical distribution of the Z statistic.

Table 1 Type 1 error rates of the statistics for testing association of the gene with the disease

RA in the WTCCC and NARAC studies

To evaluate the performance of the gene and pathway-based GWAS, the developed statistics were applied to RA in the WTCCC14 and NARAC15 studies to identify significantly associated genes and pathways with RA. A total of 459 653 SNPs were typed for 1860 RA patients and 2938 controls in the WTCCC studies and 545 080 SNPs were typed for 866 RA patients and 1194 controls in the NARAC studies. The total number of genes involved in the WTCCC and NARAC studies were 15 732 and 17 773, respectively.

Current GWAS are limited to taking a SNP as the basic unit of association testing. The results obtained by taking a gene or a pathway as the basic unit of association testing are presented below. We assembled 465 pathways from KEGG16 and Biocarta (http://www.biocarta.com). The assignment of SNPs to genes was obtained from the NCBI human9606 database (version b129) (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data/b129/b129_SNPContigLocusId_36_3.bcp.gz). After Bonferroni correction, the P-value thresholds for declaring association of a gene with RA in the WTCCC and NARAC studies were 3.2 × 10^−6 and 2.8 × 10^−6, respectively. All 465 pathways were involved in both the WTCCC and NARAC studies. Thus, the P-value threshold for declaring association of a pathway with RA was 1.1 × 10^−4.
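The reported thresholds are consistent with dividing a family-wise significance level of 0.05 by the number of genes or pathways tested. The level of 0.05 is an assumption inferred from the reported numbers, as it is not stated explicitly; the function name is illustrative.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P-value threshold under a Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    for label, n_tests in [("WTCCC genes", 15732),
                           ("NARAC genes", 17773),
                           ("pathways", 465)]:
        print(f"{label}: {bonferroni_threshold(n_tests):.1e}")
    # WTCCC genes: 3.2e-06
    # NARAC genes: 2.8e-06
    # pathways: 1.1e-04
```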

Table 2 summarizes all 19 genes replicated by the LCT method, with their P-values. Supplementary Tables 2, 3 and 4 list the 49, 47 and 45 genes replicated by the QT, DT(FDR) and DT(Fisher) methods, respectively. Of the replicated genes identified by the QT method, 90% and 92% are also included in the lists of replicated genes identified by the DT(FDR) and DT(Fisher) methods, respectively. Associations of the genes human leukocyte antigen (HLA)-DPB1,17, 18 HLA-DQR1,18 HLA-DQB1,19, 20 and MICA21, 22 with RA were previously reported. MICA is a cell stress-induced glycoprotein localized in the HLA region. Its reaction with T cells and natural killer cells suggests that the MICA gene may have an important role in the development of autoimmune disease. The gene AIF1 (allograft inflammatory factor 1), which is encoded within the HLA class III genomic region on chromosome 6p21 and has an important role in inflammation, was reported to be associated with systemic sclerosis23 and atherosclerosis.24 The RD RNA-binding protein, which is located in the major histocompatibility complex (MHC) class III region on chromosome 6p21.3, was reported to be involved in the immune response and systemic inflammatory stimulation.25 The genes BAT3, BAT4 and AGPAT1 are within the human MHC class III region. The gene ZFP57, which is located on chromosome 6p22 and encodes a zinc-finger transcription factor, is involved in hypomethylation of several imprinted loci in transient neonatal diabetes patients.26 The SNP rs6679677, which is in complete LD with the SNP rs2476601 in the PTPN22 gene, is assigned to the gene RSBN1 in the NCBI database. The PTPN22 gene, which has been reported to be associated with RA several times,14 also showed strong association with RA in the NARAC studies in our analysis.

Table 2 Genes with significant association with RA in both WTCCC and NARAC studies that were identified by the LCT method

Table 3 illustrates that a strategy of considering only the most significant SNPs in association studies may miss genetic variants that jointly have significant risk effects but individually make only a small contribution. Five different markers were typed for the gene ZFP57 in both the WTCCC and NARAC studies. Table 3 shows that none of the SNPs in the gene ZFP57 showed significant association individually, but the gene ZFP57 itself has strong association with RA in both the WTCCC and NARAC studies. We also observe that, although the typed SNPs within the gene ZFP57 differed between the two studies, the association of the gene ZFP57 with RA was still replicated in the two independent studies.

Table 3 P-values of SNPs in the gene ZFP57

Attempting to understand and interpret a number of significant SNPs without any unifying biological theme can be challenging and demanding. SNPs and genes carry out their functions through intricate pathways of reactions and interactions. The function of many SNPs may not be well characterized, whereas the function of pathways is much better understood. Pathway-based association analysis can help unravel the mechanism of complex diseases. Next we present the results of pathway-based GWAS of RA. Supplementary Table 5, Table 4 and Supplementary Tables 6 and 7 list the pathways significantly associated with RA in both the WTCCC and NARAC studies, which were identified by the LCT, QT, DT(FDR) and DT(Fisher) methods, respectively. Figures 2 and 3 plot the MAPK signaling pathway, which was associated with RA, in the WTCCC and NARAC studies, respectively. These tables and figures show several remarkable features that can be used to extract biological insight from GWAS. First, functional pathway analysis is a key to unraveling the mechanism of complex diseases and opens a way for a pathway definition of complex diseases. Biological pathways are sets of genes that work in concert to perform particular cellular functions or biological processes. RA is an autoimmune disease characterized by chronic inflammation of the joints, the tissues around the joints and other organs in the body.27 Associated pathways identified in the WTCCC and NARAC studies can be classified into three groups. The first group consists of three pathways: antigen processing and presentation, cell adhesion molecules and type I diabetes mellitus pathways. Results of all tests (LCT, QT and DT) have shown that these three pathways were significantly associated with RA in the two studies. 
The second group includes six pathways: MAPK signaling pathway, complement pathway, complement and coagulation cascades, alternative complement pathway, cytokines and inflammatory pathway and ether lipid metabolism pathways, which were in common in the lists of associated pathways identified by QT and DT methods. The third group includes B lymphocyte cell surface molecules, IL 5 signaling pathway, Th1/Th2 differentiation pathway, glycerophospholipid metabolism, cell communication, focal adhesion, glycerolipid metabolism, Jak-STAT signaling pathway, bystander B-cell activation pathway and antigen-dependent B-cell activation pathway. The pathways in the third group were identified by either the QT method or DT method.

Table 4 Significant pathways in both WTCCC and NARAC studies that were identified by the QT method

Figure 2 P-values for testing association of the genes within the MAPK signaling pathway with RA in WTCCC studies. Blocks including significant genes are in red color, blocks including mild significant genes are in light red color and blocks including no significant genes are in green color.

Figure 3 P-values for testing association of the genes within the MAPK signaling pathway with RA in NARAC studies. Blocks including significant genes are in red color, blocks including mild significant genes are in light red color and blocks including no significant genes are in green color.

In the first group, the antigen processing and presentation pathway mainly consists of MHC molecules, which are shown on cell surfaces and responsible for lymphocyte recognition and antigen presentation. The antigen processing and presentation pathway and the cell adhesion pathway are crucial for controlling inflammatory and immune responses and involved in the RA.28, 29 Close contact between different populations of cells is fundamental for inflammatory and immune responses. The type I diabetes mellitus pathway induces an uncontrolled immune attack against the insulin producing β-cells.30 These three pathways form the core pathway definition for RA.

In the second group, the MAPK pathway is a key signal transduction pathway of inflammation and was reported to be involved in the development of RA.31 The complement pathway helps clear pathogens from an organism and has a key role in determining the fate of immune status.32 The complement and coagulation cascades pathway is a partner of inflammation33 and is involved in the pathogenesis of RA.34 The pathways in the third group, such as the IL 5 signaling pathway,35 Th1/Th2 differentiation pathway,36 B lymphocyte cell surface molecules pathway,37 lysine degradation,38 antigen-dependent B-cell activation pathway,39 cell communication,40 bystander B-cell activation pathway41 and focal adhesion,42 are involved in inflammation and immune responses and hence are related to RA to some degree.

Second, replication of pathway-level results in independent samples is much easier than replication of genes or SNPs. Replication can be performed at the level of the SNP, the gene and the pathway. As Figures 2 and 3 show, the WTCCC and NARAC studies shared no significantly associated genes within the MAPK pathway; in other words, we failed to replicate significantly associated genes within the MAPK pathway in the two independent studies. However, Table 4 and Supplementary Tables 6 and 7 show that the MAPK pathway was significantly associated with RA in both studies. This example shows that replication at the pathway level is easier than replication at the gene level.

Third, the number of genes showing significant association with RA within a pathway may be very small, but the number of genes showing mild association with RA within the pathway may be quite large. In Figures 2 and 3, we observe only two and four significantly associated genes, but 19 (9.4% of total genes within the pathway) and 29 (12.7% of total genes within the pathway) genes showing mild association with RA within the MAPK pathway in the WTCCC and NARAC studies, respectively. It is interesting that these mildly associated genes included proinflammatory cytokines, stress genes, growth factors, MAPKKK, MAPKK, MAPK and transcription factors, which were distributed among all stages of the MAPK pathway, from upstream to downstream. We also observe that, although the gene CACNA2D3 showed significant association with RA using the LCT in the NARAC studies, the P-value of the best SNP in the gene CACNA2D3 was only 0.000432. This shows that if we consider only the most significant SNPs, the genetic variants that jointly have significant risk effects, but individually make only a small contribution, will be missed. This example also shows that each gene may confer a small contribution, but their joint actions may affect the function of the pathway, which in turn will cause disease.

Discussion

In spite of the great success of large-scale GWAS, the current approach to GWAS has mainly focused on testing association of single SNPs with disease and selecting the best SNPs for further studies. However, single SNP association analysis will miss many SNPs with moderate genetic effects. Moreover, separating association findings from biological interpretation offers limited understanding of the functional basis of complex diseases. To overcome these limitations, in this report we suggest gene and pathway-based GWAS, in which we take a gene and a pathway as basic units of association analysis in addition to single SNP association studies. Gene and pathway-based GWAS assess the significance of genes and predefined pathways, and are intended to identify biological pathways with subtle but coordinated genetic variants that confer risk contributions.

To shift the paradigm from single SNP-based GWAS to gene and pathway-based GWAS, we addressed the following issues. First, unlike the extension of gene set enrichment analysis to GWAS in which we analyze whether significantly associated genes are overrepresented in the set of genes, which are of interest, we formulate the gene and pathway-based GWAS as the traditional hypothesis testing problem. In other words, to test the association of a gene or a pathway with the disease is to jointly test for association of multiple SNPs within the gene or multiple genes within the pathway with the disease. Second, the challenge facing us is how to develop statistics for testing association of a gene or a pathway with the disease. A simple approach to joint analysis of multiple SNPs within the gene and multiple genes within the pathway is to combine their P-values into an overall P-value to represent the significance of a gene or a pathway. We analyzed correlations among SNPs within the gene and correlations among genes within the pathway and found that correlations among SNPs and genes cannot be ignored (owing to space limitation, data were not shown). However, the current popular statistical methods are designed for only combining independent P-values and hence are not appropriate for gene and pathway-based GWAS. Therefore, we developed three novel statistics, which are able to combine dependent P-values of SNPs within the gene or genes within the pathway. We examined the distribution of the suggested statistics under the null hypothesis of no association of the gene or pathway with the disease and calculated their type 1 error rates by simulations. Our results have shown that type 1 error rates were close to nominal significance levels. Third, to assess their merit and limitations, we applied the developed statistical methods for gene and pathway-based association analysis to GWAS of RA in the WTCCC and NARAC studies. 
The results have shown that the new paradigm of GWAS not only confirmed previous association findings, but also discovered a number of new genes and pathways that were significantly associated with RA. Although the results are preliminary, they show that identifying pathways associated with disease makes it much easier to uncover the pathogenesis of disease.

Gene and pathway-based GWAS offer several remarkable features. First, the new paradigm not only can identify the genes that have large genetic effects and can be found by single SNP association analysis, but also can detect new genes in which each single SNP confers small disease risk, but their joint actions can be implicated in the development of diseases. Second, the results of application of pathway analysis to RA strongly show that pathway-based analysis can add structure to genomic data and allows us to gain deep understanding of cellular processes as intricate networks of functionally related genes and to unravel the functional bases of the association finding. Third, replication of association findings at the gene or pathway level is much easier than replication at the individual SNP level. Risk SNPs (or genes) for different individuals may be different, but may be in the same gene (or pathway). Fourth, the new paradigm for GWAS will open a novel avenue to integrate GWAS with other functional analyses such as gene set enrichment analysis for gene expression data and hence will facilitate uncovering the mechanism of complex diseases. Our results strongly challenge the paradigm of GWAS that only tests the association of single SNPs.

The developed statistics for testing association of genes or pathways also have serious limitations. First, presence of both positive and negative correlations among SNPs will dramatically reduce the power to discover association of genes or pathways. Second, when the number of SNPs within the gene or number of genes within the pathway is large, numeric instability will increase the error in calculation of the inverse matrix of the correlation matrix, which in turn will increase the false-positive rate of association finding. We should overcome these limitations in the future.

Millions of dollars are spent for GWAS. Data from GWAS are very expensive, but also contain rich information. Simple statistical methods based on single SNP association analysis might not be the best strategy for deciphering the path from genomic information to clinical phenotypes. Taking full advantage of rich information and huge opportunities provided by GWAS raises great conceptual and technical challenges. To unravel the true nature of complex diseases, we need to integrate multiple approaches and multiple types of data. In the coming years, we will witness the development of a variety of novel methods for GWAS, rapid progress in GWAS and their great success.