Introduction

The advent of high-throughput genotyping technology has made the technology of whole-genome single-nucleotide polymorphisms (SNPs) scanning for susceptible genes easy and popular, resulting in the generation of mass genotyping data. Analysis of these data using statistical and data mining methods have led to the discoveries of many association or predisposing genes, yet few causative genes were determined.1 In addition, it is hard to put forward genetic mechanisms for most common diseases. Therefore, many researches are focusing on factors affecting the power of association study.2, 3, 4 One of the most important factors is gene–gene interaction5, 6 and the other is gene–environment interaction.7 In this paper, we focus on gene–gene interaction, or the so-called epistasis.

As first advanced by Bateson,8 epistasis has been defined as a phenomenon whereby the effects of a given gene on a biological trait are masked or enhanced by one or more genes.9 Several studies8, 9, 10, 11 have provided evidences for the existence of gene–gene interaction or epistasis. Since gene–gene interactions may play a role in the mechanisms of complex diseases and weaken the major effects of single gene, the association study often turns out to be confusing and hard to explain.12 Genetic interaction models that consider two-locus genotype combinations have been proposed; for example (Figure 1), the threshold model, jointly recessive–recessive model and jointly dominant–dominant model.13, 14, 15 Some of these models, such as the additive model, multiplicative model and heterogeneity model, can be presented as deviation formation.15, 16, 17 Li and Reich15 have enumerated all possible two-locus models, some of which had been reported with significant biological meaning.

Figure 1
figure 1

Eight typical two-locus models. 1 denotes high-risk genotype combinations and 0 low-risk. M1, jointly recessive-recessive model (RR); M3, jointly recessive-dominant model (RD); M7, single-gene recessive model (1L: R); M11, threshold model (T); M15, modifying-effect model (Mod); M27, jointly dominant-dominant model (DD); M78, exclusive OR model (XOR); M84. diagonal model (Diagonal).

Statistical18 and data mining methods are the mainstream in the current analysis approaches. One of the most familiar methods is Cordell's unified stepwise regression procedure,5 which can be applied to the additive, multiplicative and heterogeneity model. It is the most familiar method for analyzing interaction effects. Millstein et al19 developed an interaction testing framework called FITF, which is also based on stepwise regression. Recently, Zhao et al20 proposed an LD-based measure between two unlinked loci and the method was proved to be powerful under some two-locus disease models. Evans et al21 investigated the performance of two simple two-stage strategies. Moore et al22 used attribute interaction23 to select potential interacting SNPs and construct interaction graph. While these methods are applicable and useful, they often could not distinguish which two-locus model was proper for the interaction effects but could only tell that interaction exists under certain genetic data. Multilocus statistics such as S-statistic24 and data mining approaches, such as multifactor-dimensionality reduction (MDR),25 restricted partitioning method,26 combinatorial partitioning method,27 dynamic algorithm (DA),28 decision tree29 and random forest,30 are powerful to reduce data dimension and to get a set of SNPs that can interpret the results best. Taking the popular MDR25 as an example, it considers each possible genotype combination of SNPs as high- or low-risk combinations, repeats the calculation from single SNP to multi-SNP combinations and performs cross-validation to get combinations with maximum cross-validation consistency and minimum prediction error at different dimensions. However, as Moore et al pointed out, the combination results of MDR were still hard to interpret. These methods focus on data reduction by mathematical methods but ignore how to interpret the resulting interaction effects from the point of view of biology or genetics. Thus the results are often confusing as statistical significance may not correspond to biological or genetic significance.6, 31 For example, if we know two SNPs have significant statistical interaction, how do they interact in biology or genetics? Does each allele of SNPs act? Do dominant or recessive effects exist?

To solve these problems, we developed a novel entropy-based method called ESNP2 (entropy-based SNP–SNP interaction method) integrating two-locus genetic models. In our ESNP2 system, there are two options: ESNP2-S (ESNP2-standard option) and ESNP2-Mx (ESNP2-model option). The former aims to detect two-locus interactions, whereas the latter, ESNP2-Mx, tests the interaction against various two-locus genetic models and gets the best-fit model. A program implemented by Java for ESNP2 algorithm can be downloaded from our website (http://www.biosino.org/papers/esnp2/).

Methods

Entropy is defined as follows:32

where Ncase denotes the number of cases in the population and Ncontrol the number of controls in the population.

ESNP2-S

Our entropy-based method has three steps (Figure 2).

Figure 2
figure 2

Workflows of ESNP2-S and ESNP2-Mx.

First step: compute entropy of initial data set (H0)

Given an initial data set (or sample), D (case+control), p and (1−p) are the proportions of case and control in the data set, respectively. According to p, we can compute the entropy of the data set:

Second step: compute gain ratio of single SNP (ΔR1 and ΔR2)

The effects of a single SNP that may be involved in interaction can be estimated by computing its gain ratio. Considering each SNP splitting the initial data set D with its possible genotypes into several subsets, S1(D)=S1{DAA, DAa, Daa}. Each subset has its sub-entropy H(Di) and corresponding weighting coefficient P(Di), which is the proportion of a certain genotype or genotype combinations to the total data set.

Gain refers to information gain, which is also called mutual information when considering H1 as conditional entropy. It reflects the relation between SNP and disease status. As gain correlates with entropy of the initial data set that is determined by p, we normalize it to eliminate the effects of p and get gain ratio ΔR1. In addition, ΔR1 is a likelihood ratio that can be approximated to a χ2-test (proof not shown here).

Similarly, we can compute the gain ratio of another SNP ΔR2 involved in interaction.

Third step: compute gain ratio of SNPs' interactions (ΔR1,2)

Interacting SNPs split the initial data set D into nine subsets:

Its gain ratio ΔR12 can be computed as follows:

ΔR1,2 measures interaction effects of SNP1 and SNP2 whenever marginal effects exist. Bootstrapping strategy33 is performed to get P-values corresponding to ΔR1,2 with the initial data set D. The bootstrapping strategy resamples random samples of size (Ncase+Ncontrol) with replacement from the original data. Repeating the sampling procedure a large number of times and counting new ΔR1,2 larger than the original value generates P-values.

ESNP2-Mx

ESNP2-S aims to explore the interaction effects of two SNPs. It explores all the nine possible combinations of two SNPs independently. To bestow ESNP2 with biological or genetic meaning, we extend ESNP2-S to ESNP2-Mx. Mx is the abbreviation of Model x, such as M1, M11 and M27, representing the jointly recessive model, threshold model and jointly dominant model, respectively. It is a binary coding system used by Li and Reich.15 Figure 1 shows eight classical interaction models. Two-locus genetic models are presented as penetrance table where 1 denotes high penetrance or high risk and 0 low penetrance or low risk. High-risk genotype combination leads to higher disease susceptibility in case than in control. In practice, risk can be evaluated by the ratio p (ie, if a certain SNP combination has a p larger than the average of the data set, it can be considered as a high-risk combination). By using ESNP2-S to calculate gain ratios and P-values for each of the possible models, we then get the best-fit model. If both ESNP2-S and ESNP2-Mx result in similar and significant P-values, we can conclude the data are fitting a certain interaction model.

Similar to ESNP2-S, ESNP2-Mx procedure has three steps except the third step (Figure 2). With ESNP2-Mx, the initial data set D is divided into two subsets, namely high- and low-risk subsets:

where H1,2 is the entropy and ΔR1,2 is the gain ratio.

According to different candidate interaction models, new gain ratio ΔR1,2 can be calculated, followed by a bootstrapping strategy, to get the P-values. The procedure is similar to that of ESNP2-S.

Results

Data simulation

To validate the effect of ESNP2-S and ESNP2-Mx on detecting the different association intensities between SNPs and disease, we constructed a simulated data set with respect to certain parameters. Odds ratio (OR) was used as a parameter to define the relationship between disease and two loci. For various interaction models, we set OR to be 1.2, 1.5 and 1.8 ordinally, the sample sizes of case and control to be 1000 and simulation times to be 100 to get the median of the results. Eight classical interaction models were selected as shown in Figure 1. First, the total number of high- and low-risk genotype combinations were randomly generated for case and control according to the above parameters. Then we got the number of all possible genotype combinations with the collapsibility of OR and certain interaction models.

The resulting simulation data were then analyzed by ESNP2-S and ESNP2-Mx, respectively, with bootstrapping times set to be 104 by which we could get the maximal precision of 10−4 (we marked it as 0 instead of <10−4). Even longer bootstrapping times can be consumed if higher precise results are needed. The familiar logistic stepwise regression has also been performed to estimate the effects deviated from additive model in log scale, which is considered to be interaction effects. Single-gene effects have been estimated by the minimum P-value of single-point association results of two genes (Figures 3 and 4). As a control, one single-gene effect model (M7) was included.

Figure 3
figure 3

P-value curves generated by single association, ESNP2-S, ESNP2-Mx and logistic regression for simulated data (M1, M3, M7 and M11). Single, minimum P-value of two genes' association results representing maximum single-gene effects; standard, interaction effects computed by ESNP2-S; model, two-locus model effects computed by ESNP2-Mx; logistic, interaction effects computed by logistic regression representing effects deviation from additive effects. Standard curve overlaps with model curve entirely.

Figure 4
figure 4

P-value curves generated by single association, ESNP2-S, ESNP2-Mx and logistic regression for simulated data (M15, M27, M78 and M84).

As the value of the parameter OR increases, both single-gene effects and interaction effects enhance rapidly as shown by most models. Significant single-gene effects can be observed except in models M78 and M84, whereas M7 is a model of single-gene effects that is used as a control. According to M7 data, as all of the three models initially considered the single gene effect by default, they subsequently failed to detect any interaction effects. For models M1, M3, M11, M15 and M27, single-gene effects change more variously than interaction effects. Contrarily, although M78 and M84 are able to detect stronger interaction effects, yet only weak single-gene effects have been observed. Similarly, M1, M11 and M27 show moderate ability in identifying single-gene effects and can thus be considered to be standard and symmetric interaction models. On the other hand, whereas most of the genetic effects shown by M3 and M15 can be mainly interpreted by a single gene, only a few epistasis effects occur when OR is large. As shown in Figures 3 and 4, ESNP2-Mx using correct models gets approximately the same results with ESNP2-S because the model data are simulated perfectly. For each of the eight models, the corresponding two curves overlap entirely with each other. In the case of real data analysis, results will deviate from the proposed models. So ESNP2-Mx will also deviate from ESNP2-S more or less. As shown in Appendix Table 1 (Supplementary Table 1), which presents with incorrect models, ESNP2-Mx has much larger P-values than ESNP2-S.

Table 1 Malaria admission data of sickle cell anemia and α+-thalassemia combination effects

As shown in Figures 3 and 4, ESNP2 gets similar power with logistic regression, especially when interaction effects weigh much heavier and single-gene effects behave inconspicuously (such as M78, M84). For one thing, our ESNP2-S can detect interaction effects as sensitively as logistic regression; for another thing, ESNP2-Mx bears the power of selecting the best-fit model for the effects by analyzing and integrating different two-locus models with ESNP2-S.

Analysis of real data

We then incorporated a real data set of malaria cohort study (Tables 1 and 2) performed by Williams et al34 in Kenya to further evaluate the availability of ESNP2-S and ESNP2-Mx. Previous studies have shown that an important causative protein in malaria is Hemoglobin (Hb), which has two variants – heterozygote HbAS and homozygote HbSS that can cause sickle cell anemia. While HbSS is a lethal mutation leading to premature death, individuals with HbAS are protective against malaria. Additionally, there exist other mutations that can protect against severe and fatal malaria – heterozygote −α/αα and homozygote −α/−α, which cause α+-thalassemia.

Table 2 Malaria admission data of sickle cell anemia and α+-thalassemia combination effects with adjustment for child years of follow-up

Table 1 shows data of genotype combinations for malaria admission, formatted as case/control. χ2-tests show that sickle cell anemia has strong association (3 × 10−9) with malaria resistance, while α+-thalassemia has small association (0.063). We first applied ESNP2-S on this data set to detect the potential interaction effects. The resulting P-value of this step, as equaling to 6.8 × 10−5, provides supporting evidence for interaction. Then we performed ESNP2-Mx using negative epistasis model (Table 3) to calculate the interaction model effects and got a P-value of 0.008. As Table 1 is a cohort data, we corrected it with surveying time (child years of follow-up) as shown in Table 2. The corrected data have also been analyzed by our models, resulting in the P-value being <10−6 of the interaction model effects. We thus concluded that the negative epistasis model fits the malaria data satisfactorily. While the logistic regression model coupled with Wald test can only tell whether interactions between sickle cell anemia and α+-thalassemia exist and affect malaria resistance, our method goes a step further by not only confirming the results but also finding that the negative epistasis model is proper for the relationship.

Table 3 A negative epistasis model protecting against malaria

Discussion

The epistasis models have been developed and enhanced to be more complex and perfect since it was first proposed by Bateson8 one hundred years ago, resulting in the appearance of numerous published models.15 While these models are effective and powerful in practice application, they demand different number of disease alleles and variously show interaction effects. The classical epistasis model or the so-called modifying-effect model (Mod, M15) is similar to the single-locus recessive model when regardless of Aabb combination. The threshold model (T, M11) demands that at least three disease alleles are affected; the jointly recessive–recessive model (RR, M1) requires the presence of two copies of disease alleles.15 Our simulation results in Figures 3 and 4 show marginal and interaction effects of various two-locus models, which have also been supported by marginal effects computed by Li and Reich.15 Identifying fitting models for the interacting SNPs will contribute to our knowledge about both single gene and interaction effects. Advances in researches on human diseases35 and mouse models36 have proposed some gene–gene interaction models that contribute to discovering the genetic mechanisms of common diseases. Moreover, the models are widely used to simulate data for testing the power of novel methods.20, 21, 22

In the situation of single SNP association, the routine strategy includes performing allele/genotype tests to ascertain susceptible SNP, followed by searching for the best-fitting method to get proper genetic models (additive, dominant, recessive, etc.).37 The existing interaction approaches consider only the first step and calculate interaction effects, while our methods implement the second step: combining genetic models and gene–gene interaction measurements with the information concept, entropy. The advantage of our method has been exhibited fully in the real data analysis where we got a negative epistatic model. The key statistical parameter is called gain ratio, which describes normalized information gain. A similar work has been carried out by Moore et al.22 The authors used attribute interaction23 to select SNPs, which serve as a coordinate format of information gain. Gain ratio has the further advantage of eliminating potential effects of the initial data set (the ratio between case and control) and computing interaction effects whenever marginal effects exist. By integrating genetic models with gain ratio, genotype combinations can be distinguished by the level of risks, and different genotype combinations can be grouped together if they have the same risk. In the algorithm, we only categorize the risk into two classes according to its value, namely high risk and low risk; however, more levels can be used if needed.

As gain ratio of ESNP2 does not belong to any classical statistic models, we performed bootstrapping to compute the P-values. The power of bootstrapping depends on bootstrapping times as well as the intensity of interaction effects. Longer bootstrapping times can be consumed if higher precise results are needed. Simulation results showed that bootstrapping could gain similar power as that of logistic regression under most circumstances.

On the whole, we developed an entropy-based method called ESNP2 to explore gene–gene interaction in SNPs aimed at discovering their potential biological or genetic meaning. Two options are provided: the model-free option ESNP2-S detects gene–gene interaction and the model-based ESNP2-Mx gets the best-fitting model by fitting various two-locus genetic models on the potential interaction. Both simulation data and real data of malaria show that this method is effective in detecting gene–gene interaction and revealing potential genetic models, which may contribute to the understanding of genetic mechanisms of most common diseases. Further researches should try to develop a more powerful statistic strategy that can get the P-values directly instead of bootstrapping. Another direction should be to focus on extending the methods and rationale to multilocus.