Fast calculation of pairwise mutual information for gene regulatory network reconstruction
Introduction
Information theoretic approaches are increasingly being used for reconstructing regulatory networks from gene expression microarray data. Examples include the Chow–Liu tree [1], the relevance network (RelNet) [2], the context likelihood of relatedness (CLR) algorithm [3], ARACNE [4] and the maximum relevance minimum redundance network (MRNet) [5]. These specific approaches and others start by computing the pairwise mutual information (MI) for all possible pairs of genes, resulting in an MI matrix. The MI matrix is then manipulated to identify regulatory relationships. The Chow–Liu tree is a maximum spanning tree constructed from the MI matrix, where edges are weighted by the MI values between connected nodes. In relevance networks, an edge exists between a pair of genes if their MI exceeds a given threshold. The CLR algorithm transforms the MI matrix into scores that take into account the empirical distribution of the MI values, and then applies a threshold. ARACNE applies the data processing inequality (DPI) on each connected gene triple, removing potential false positive edges in the relevance networks. In MRNet, the maximum relevance minimum redundance (MRMR) criterion [6] is applied, where the maximum relevance criterion tends to assign edges to gene pairs that share large MI, while the minimum redundance criterion controls false positives. The MRMR criterion is essentially a pairwise approximation of the conditional MI. In [7], the conditional MI is explicitly used for inference. If two genes share large MI but are conditionally independent give a third gene, there is no edge between them. These approaches have been applied successfully to simulated data and real microarray data, identifying interesting regulatory targets and pathways.
The most time-consuming step in these approaches is computing the MI matrix, because all possible pairs of genes need to be examined. When the expression values of genes are treated as continuous random variables and the MI is estimated by kernel methods, computing the pairwise MI can be computationally expensive. For example, in a recently analyzed B-cell data set consisting of 336 samples and 9563 genes per sample, [4], ARACNE takes 142 hours to compute the MI for all gene pairs.
We propose a faster way to calculate the pairwise MI, where the order of the calculations is rearranged so that repeating operations are significantly reduced. For the previously mentioned B-cell data set, the run time of our algorithm is 1.6 hours. To explain our algorithm, in Section 2, we first briefly review kernel based estimation of MI. Then we present the pseudo-code of our approach. Section 3 illustrates the decrease in computation time, and is followed by Conclusions.
Section snippets
Kernel estimation of mutual information
The MI between two continuous random variables X (a gene) and Y (another gene) is defined as follows:where is the joint probability density of the two random variables, and and are the marginal densities. Given M data points drawn from the joint probability distribution, , , the joint and marginal densities can be estimated by the Gaussian kernel estimator [8], where
Results
We compare the computation time for obtaining the pairwise MI using ARACNE’s Matlab and Java implementations downloaded from [4], and the our fast algorithm written in Matlab. The tests are performed using a personal computer (PC) with Intel Core 2 processor 2.66 GHz. In Fig. 1 (a), we fix the number of samples , vary the number of genes , and compare the computation time of ARACNE and the proposed fast algorithm in log–log scale. In Fig. 1(b), we fix the number of genes
Conclusion
When analyzing large gene expression microarray data sets by information theoretic approaches, fast approaches are needed. We demonstrated a fast method for calculating the pairwise mutual information based on kernel estimation. Our method achieves significant gains in speed when compared to existing implementations, with modest increases in memory use. Moreover, our method is not limited to the analysis of gene expression data, and can be generally applied to speed up any algorithm that
Conflict of interest
None declared.
Acknowledgement
We gratefully acknowledge funding from the NCI Integrative Cancer Biology Program (ICBP), Grant U56 CA112973.
References (10)
- et al.
Approximating discrete probability distributions with dependence trees
IEEE Trans on Information Theory
(1968) - et al.
Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements
Pacific Symposium on Biocomputing
(2000) - et al.
Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles
Plos Biology
(2007) - et al.
ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context
BMC Bioinformatics
(2006) - et al.
Information-theoretic inference of large transcriptional regulatory networks
EURASIP Journal on Bioinformatics and Systems Biology
(2007)