Fast calculation of pairwise mutual information for gene regulatory network reconstruction

https://doi.org/10.1016/j.cmpb.2008.11.003Get rights and content

Abstract

We present a new software implementation to more efficiently compute the mutual information for all pairs of genes from gene expression microarrays. Computation of the mutual information is a necessary first step in various information theoretic approaches for reconstructing gene regulatory networks from microarray data. When the mutual information is estimated by kernel methods, computing the pairwise mutual information is quite time-consuming. Our implementation significantly reduces the computation time. For an example data set of 336 samples consisting of normal and malignant B-cells, with 9563 genes measured per sample, the current available software for ARACNE requires 142 hours to compute the mutual information for all gene pairs, whereas our algorithm requires 1.6 hours. The increased efficiency of our algorithm improves the feasibility of applying mutual information based approaches for reconstructing large regulatory networks.

Introduction

Information theoretic approaches are increasingly being used for reconstructing regulatory networks from gene expression microarray data. Examples include the Chow–Liu tree [1], the relevance network (RelNet) [2], the context likelihood of relatedness (CLR) algorithm [3], ARACNE [4] and the maximum relevance minimum redundance network (MRNet) [5]. These specific approaches and others start by computing the pairwise mutual information (MI) for all possible pairs of genes, resulting in an MI matrix. The MI matrix is then manipulated to identify regulatory relationships. The Chow–Liu tree is a maximum spanning tree constructed from the MI matrix, where edges are weighted by the MI values between connected nodes. In relevance networks, an edge exists between a pair of genes if their MI exceeds a given threshold. The CLR algorithm transforms the MI matrix into scores that take into account the empirical distribution of the MI values, and then applies a threshold. ARACNE applies the data processing inequality (DPI) on each connected gene triple, removing potential false positive edges in the relevance networks. In MRNet, the maximum relevance minimum redundance (MRMR) criterion [6] is applied, where the maximum relevance criterion tends to assign edges to gene pairs that share large MI, while the minimum redundance criterion controls false positives. The MRMR criterion is essentially a pairwise approximation of the conditional MI. In [7], the conditional MI is explicitly used for inference. If two genes share large MI but are conditionally independent give a third gene, there is no edge between them. These approaches have been applied successfully to simulated data and real microarray data, identifying interesting regulatory targets and pathways.

The most time-consuming step in these approaches is computing the MI matrix, because all possible pairs of genes need to be examined. When the expression values of genes are treated as continuous random variables and the MI is estimated by kernel methods, computing the pairwise MI can be computationally expensive. For example, in a recently analyzed B-cell data set consisting of 336 samples and 9563 genes per sample, [4], ARACNE takes 142 hours to compute the MI for all gene pairs.

We propose a faster way to calculate the pairwise MI, where the order of the calculations is rearranged so that repeating operations are significantly reduced. For the previously mentioned B-cell data set, the run time of our algorithm is 1.6 hours. To explain our algorithm, in Section 2, we first briefly review kernel based estimation of MI. Then we present the pseudo-code of our approach. Section 3 illustrates the decrease in computation time, and is followed by Conclusions.

Section snippets

Kernel estimation of mutual information

The MI between two continuous random variables X (a gene) and Y (another gene) is defined as follows:I(X;Y)=f(x,y)logf(x,y)f(x)f(y)dxdywhere f(x,y) is the joint probability density of the two random variables, and f(x) and f(y) are the marginal densities. Given M data points drawn from the joint probability distribution, (xj,yj), j=1,,M, the joint and marginal densities can be estimated by the Gaussian kernel estimator [8], wheref(x,y)=1M12πh2e12h2((xxj)2+(yyj)2)f(x)=1M12πh2e12h2(xxj)

Results

We compare the computation time for obtaining the pairwise MI using ARACNE’s Matlab and Java implementations downloaded from [4], and the our fast algorithm written in Matlab. The tests are performed using a personal computer (PC) with Intel Core 2 processor 2.66 GHz. In Fig. 1 (a), we fix the number of samples M=200, vary the number of genes N=100,200,,1000, and compare the computation time of ARACNE and the proposed fast algorithm in log–log scale. In Fig. 1(b), we fix the number of genes N=

Conclusion

When analyzing large gene expression microarray data sets by information theoretic approaches, fast approaches are needed. We demonstrated a fast method for calculating the pairwise mutual information based on kernel estimation. Our method achieves significant gains in speed when compared to existing implementations, with modest increases in memory use. Moreover, our method is not limited to the analysis of gene expression data, and can be generally applied to speed up any algorithm that

Conflict of interest

None declared.

Acknowledgement

We gratefully acknowledge funding from the NCI Integrative Cancer Biology Program (ICBP), Grant U56 CA112973.

References (10)

  • C. Chow et al.

    Approximating discrete probability distributions with dependence trees

    IEEE Trans on Information Theory

    (1968)
  • A.J. Butte et al.

    Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements

    Pacific Symposium on Biocomputing

    (2000)
  • J.J. Faith et al.

    Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles

    Plos Biology

    (2007)
  • A.A. Margolin et al.

    ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context

    BMC Bioinformatics

    (2006)
  • P.E. Meyer et al.

    Information-theoretic inference of large transcriptional regulatory networks

    EURASIP Journal on Bioinformatics and Systems Biology

    (2007)
There are more references available in the full text version of this article.

Cited by (0)

View full text