
Journal of Econometrics

Volume 171, Issue 2, December 2012, Pages 101–120

A class of adaptive importance sampling weighted EM algorithms for efficient and robust posterior and predictive simulation

https://doi.org/10.1016/j.jeconom.2012.06.011

Abstract

A class of adaptive sampling methods is introduced for efficient posterior and predictive simulation. The proposed methods are robust in the sense that they can handle target distributions that exhibit non-elliptical shapes such as multimodality and skewness. The basic method makes use of sequences of importance weighted Expectation Maximization steps in order to efficiently construct a mixture of Student-t densities that accurately approximates the target distribution (typically a posterior distribution, of which we only require a kernel) in the sense that the Kullback–Leibler divergence between target and mixture is minimized. We label this approach Mixture of t by Importance Sampling weighted Expectation Maximization (MitISEM). The constructed mixture is used as a candidate density for quick and reliable application of either Importance Sampling (IS) or the Metropolis–Hastings (MH) method. We also introduce three extensions of the basic MitISEM approach. First, we propose a method for applying MitISEM in a sequential manner, so that the candidate distribution for posterior simulation is cleverly updated when new data become available. Our results show that the computational effort is reduced enormously, while the quality of the approximation remains almost unchanged. This sequential approach can be combined with a tempering approach, which facilitates the simulation from densities with multiple modes that are far apart. Second, we introduce a permutation-augmented MitISEM approach. This is useful for importance or Metropolis–Hastings sampling from posterior distributions in mixture models without the requirement of imposing identification restrictions on the parameters of the model’s mixture regimes. Third, we propose a partial MitISEM approach, which aims at approximating the joint distribution by estimating a product of marginal and conditional distributions. This division can substantially reduce the dimension of the approximation problem, which facilitates the application of adaptive importance sampling for posterior simulation in more complex models with larger numbers of parameters. Our results indicate that the proposed methods can substantially reduce the computational burden in econometric models such as DCC or mixture GARCH models and a mixture instrumental variables model.

Introduction

For a few decades there has been considerable interest in Bayesian analysis using computer-generated pseudo-random draws from the posterior and predictive distributions. Markov Chain Monte Carlo (MCMC) techniques are useful for this purpose; a popular MCMC technique is the Metropolis–Hastings algorithm, developed by Metropolis et al. (1953) and generalized by Hastings (1970). Several updates of this sampler have been proposed in the literature, in particular the idea of adapting the proposal distribution based on sampled draws.

Monte Carlo procedures based on Importance Sampling (IS), see Hammersley and Handscomb (1964), are an alternative. This idea was introduced into Bayesian inference by Kloek and van Dijk (1978) and further developed by van Dijk and Kloek (1980, 1984) and, in particular, by Geweke (1989). Cappé et al. (2008) note that there is renewed interest in Importance Sampling, owing to its relatively simple properties, which allow for parallel implementation. The increased popularity of Importance Sampling goes hand in hand with the development of multiple-core machines and computer clusters.

In this paper we specify a class of adaptive sampling methods for efficient and reliable posterior and predictive simulation. The proposed methods are robust in the sense that they can handle target distributions that exhibit non-elliptical shapes such as multimodality and skewness. These methods are especially useful for posteriors where the convergence of alternative simulation methods is slow or even doubtful, for example because of high serial correlation in Gibbs sequences caused by large numbers of latent variables or by non-elliptical shapes. Importance Sampling and Gibbs sampling are not necessarily substitutes: given that diagnostic checks can never fully guarantee that results have converged to the true values (that is, that convergence has been reached and that no errors have been made in the derivations and code), the use of two simulation methods with completely different theory and implementation can be a useful validity check. Further, an appropriate candidate distribution can be used to draw initial values for multiple Gibbs sequences, whereas a sample of Gibbs draws can be used to obtain initial values for the mean and covariance matrix in the process of constructing an approximating candidate distribution. Our proposed methods make use of the novel Mixture of t by Importance Sampling weighted Expectation Maximization (MitISEM) approach. This approach uses sequences of importance weighted steps in an Expectation Maximization algorithm in order to construct, relatively quickly, a mixture of Student-t densities, which is used as an efficient and reliable candidate density for Importance Sampling (IS) or the Metropolis–Hastings (MH) method. In addition to assessing possibly non-elliptical posterior distributions, MitISEM is particularly useful for accurately estimating marginal and predictive likelihoods via IS.
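
To make the basic step concrete, the following Python sketch implements a single IS-weighted EM pass for a mixture-of-t candidate. It is a minimal sketch under simplifying assumptions: the degrees of freedom are held fixed and common across components (the paper's Appendix B also derives updates for them), `log_target_kernel` is assumed to evaluate the log posterior kernel at an array of draws, and all function and variable names are ours, not the authors'.

```python
import numpy as np
from scipy.stats import multivariate_t

def is_weighted_em_step(theta, log_target_kernel, mus, Sigmas, etas, nu=5.0):
    """One IS-weighted EM update of a mixture-of-t candidate (sketch).

    theta             : (n, d) array of draws from the current candidate
    log_target_kernel : maps (n, d) draws to (n,) log posterior kernel values
    mus, Sigmas, etas : component means (K, d), scale matrices (K, d, d)
                        and mixture weights (K,)
    """
    n, d = theta.shape
    etas = np.asarray(etas, dtype=float)
    K = etas.size

    # Candidate density: component densities and their mixture at the draws.
    comp_pdf = np.column_stack([
        multivariate_t(loc=mus[k], shape=Sigmas[k], df=nu).pdf(theta)
        for k in range(K)])
    q = comp_pdf @ etas

    # Importance weights w_i = f(theta_i) / q(theta_i), stabilized in logs.
    logw = log_target_kernel(theta) - np.log(q)
    w = np.exp(logw - logw.max())
    w /= w.sum()

    # E-step: responsibilities z and conditional scale expectations u,
    # the usual latent-variable quantities of Student-t mixture EM.
    z = comp_pdf * etas / q[:, None]
    u = np.empty_like(z)
    for k in range(K):
        diff = theta - mus[k]
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigmas[k]), diff)
        u[:, k] = (nu + d) / (nu + maha)

    # M-step: IS-weighted updates of means, scale matrices and weights.
    wz = w[:, None] * z
    for k in range(K):
        wzu = wz[:, k] * u[:, k]
        mus[k] = (wzu @ theta) / wzu.sum()
        diff = theta - mus[k]
        Sigmas[k] = (diff * wzu[:, None]).T @ diff / wz[:, k].sum()
    etas = wz.sum(axis=0) / wz.sum()
    return mus, Sigmas, etas
```

Iterating such passes until the fit stabilizes, and adding a new Student-t component when the approximation is still inadequate, gives the flavor of the bottom-up construction described below.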

Apart from specifying the basic approach of MitISEM, we introduce three extensions. First, we propose a method for applying MitISEM in a sequential manner, so that the candidate distribution for posterior simulation is cleverly updated when new data become available. Our results show that the computational effort is reduced enormously, while the quality of the approximation remains almost unchanged, as compared with an ‘ad hoc’ procedure in which the construction of the MitISEM candidate is performed ‘from scratch’ at every moment in time. This sequential approach can be combined with a tempering approach, which facilitates the simulation from densities with multiple modes that are far apart. The proposed tempering method moves sequentially from a tempered target density kernel (the target density kernel raised to a positive power smaller than one) towards the actual target density kernel. The tempered target distribution is more diffuse, and hence the probability of detecting far-away modes is higher. The idea of tempering was introduced by Geyer (1991); see also Hukushima and Nemoto (1996).
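
A minimal sketch of the tempering idea follows, assuming a hypothetical `fit_mitisem` routine that stands in for the full MitISEM fitting procedure: the candidate is fitted to increasingly less-tempered kernels, with the power rising to one, warm-starting each fit from the previous one.

```python
def tempered_fit(log_target_kernel, init_candidate, fit_mitisem,
                 gammas=(0.1, 0.3, 0.6, 1.0)):
    # Move from a strongly tempered kernel towards the actual target:
    # for gamma < 1 the kernel f(theta)**gamma is more diffuse, so
    # far-apart modes are more likely to be detected early on.
    candidate = init_candidate
    for gamma in gammas:
        tempered = lambda th, g=gamma: g * log_target_kernel(th)
        candidate = fit_mitisem(tempered, start=candidate)  # warm start
    return candidate
```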

Second, we introduce a permutation-augmented MitISEM approach, for importance sampling from posterior distributions in mixture models without the requirement of imposing a priori identification restrictions on the mixture components’ parameters. As discussed by Geweke (2007), the mixture model likelihood function is invariant with respect to permutation of the components of the mixture model. If the functions of interest are permutation sensitive, as in classification applications, then interpretation of the likelihood function requires valid inequality constraints. If the functions of interest are permutation invariant, as in prediction applications, then there are no such problems of interpretation. Geweke (2007) proposes the permutation-augmented Gibbs sampler, which can be considered as an extension of the random permutation sampler of Frühwirth-Schnatter (2001). The practical implementation of the idea of the permutation-augmented Gibbs sampler is that one simulates a Gibbs sequence with total disregard for label switching or the prior’s labeling restrictions. Only after that, and only if the functions of interest are permutation sensitive, does one simply permute the Gibbs sampler’s output so as to satisfy the labeling restrictions. We propose a method of permutation-augmented IS, for which we extend the MitISEM approach to construct an approximation to the unrestricted posterior, taking into account the permutation structure. If m is the number of components of the mixture model, then the addition of a Student-t component to the candidate implies the addition of its m! equivalent permutations. We thereby construct a mixture of mixtures of m! Student-t components, where the restriction is imposed that the m! permutations have equal candidate density. Intuitively stated, we help the basic MitISEM approach by ‘telling’ it about the invariance with respect to permutations. It should be noted that this invariance with respect to permutations is not the only possible cause of non-elliptical shapes in a mixture model’s posterior. For example, if the probability of one of the model’s components tends to zero, the local non-identification of the component’s other parameters causes ridge shapes.
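
The following sketch illustrates how such a permutation-symmetrized candidate density could be evaluated. `base_candidate_logpdf` and `permute_params` (which reorders the regime blocks of the parameter vector according to a permutation) are hypothetical placeholders, and explicit enumeration is practical only for small m, since the number of terms grows as m!.

```python
import math
from itertools import permutations

from scipy.special import logsumexp

def perm_augmented_logpdf(theta, m, base_candidate_logpdf, permute_params):
    # Average the base mixture-of-t candidate over all m! relabelings of
    # the mixture model's regimes, so that all permutations of a draw
    # receive equal candidate density.
    logs = [base_candidate_logpdf(permute_params(theta, perm))
            for perm in permutations(range(m))]
    return logsumexp(logs) - math.log(math.factorial(m))
```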

Third, we propose a partial MitISEM approach, which aims at approximating the joint distribution by estimating a product of marginal and conditional distributions. This division can substantially reduce the dimension of the approximation problem, which facilitates the application of adaptive importance or Metropolis–Hastings sampling for posterior simulation in more complex models with larger numbers of parameters. Approximating the joint posterior density kernel with a mixture of Student-t distributions allows for a huge flexibility of shapes. However, all of this flexibility is rarely required. It is typically enough to use mixtures of Student-t distributions for the dependence within subsets of the parameters. We can often divide the parameters into subsets, where the dependence between different subsets is less complicated. Our partial MitISEM approach divides the model parameters into ordered subsets, where the conditional candidate distributions’ means are linear combinations of (functions of) the parameters in previous subsets. The partial MitISEM approach is a way to provide a usable approximation to the posterior, while preventing problems such as numerical issues with specifying huge covariance matrices for a joint candidate distribution, problems that have led researchers to conclude that IS necessarily suffers from a ‘curse of dimensionality’.
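
As an illustration of this factorization, the sketch below evaluates such a partial candidate as a product of a marginal and successive conditional candidate densities; the subset structure and all function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def partial_candidate_logpdf(subsets, marginal_logpdf, conditional_logpdfs):
    """Evaluate log q(theta) = log q1(theta_1) + sum_j log qj(theta_j | past).

    subsets             : list of parameter sub-vectors [theta_1, theta_2, ...]
    marginal_logpdf     : candidate log density of the first subset
    conditional_logpdfs : conditional_logpdfs[j-1](theta_j, history) evaluates
                          the j-th conditional candidate given the
                          concatenated earlier subsets
    """
    logq = marginal_logpdf(subsets[0])
    history = np.asarray(subsets[0], dtype=float)
    for theta_j, cond in zip(subsets[1:], conditional_logpdfs):
        logq += cond(theta_j, history)   # e.g. a mixture of t whose component
        history = np.concatenate([history, theta_j])  # means are linear in history
    return logq
```

Each factor is low-dimensional, so each mixture-of-t fit involves far smaller covariance matrices than a joint candidate would.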

Several adaptive sampling approaches using mixtures exist in the literature. Keith et al. (2008) developed adaptive independence samplers by minimizing the Kullback–Leibler (KL) divergence in order to provide the best candidate density, which consists of a mixture of Gaussian densities. The minimization of the KL divergence is done by applying the EM algorithm of Dempster et al. (1977), and the number of mixture components is selected through information criteria like AIC (Akaike, 1974), BIC (Schwarz, 1978) or DIC (Gelman et al., 2003). Our basic approach is a ‘bottom-up’ procedure that starts with one Student-t distribution (instead of a Gaussian distribution), and Student-t components are added iteratively until a certain stopping criterion is met. We emphasize that the IS-weighted version of the EM algorithm is applied in order to use all candidate draws, without requiring the Metropolis–Hastings algorithm to transform the candidate draws into a set of posterior draws.

Cappé et al. (2008) and Cornuet et al. (forthcoming) also use IS weights in the EM algorithm with a mixture of Student-t densities as candidate density. Cappé et al. (2008) developed the M-PMC (Mixture Population Monte Carlo) algorithm, an adaptive algorithm that iteratively updates both the weights and component parameters of a mixture importance sampling density. An important difference between Cappé et al. (2008) (and also Cornuet et al., forthcoming) and the present paper concerns the choice of the number of mixture components and the starting values of the means and covariances of the candidate mixture’s Student-t components in the EM optimization procedure. Regarding the first issue, in earlier papers the number of mixture components is chosen a priori, whereas we let the algorithm choose the required number of components. Second, we choose the starting values based on the draws that correspond to the highest IS weights for the previous mixture of Student-t candidate in the algorithm, whereas Cappé et al. (2008) do not provide a strategy for choosing starting values. Although the EM procedure is guaranteed to converge to a local optimum, the choice of the starting values may still be crucial, given that the KL divergence between target and candidate (as a function of the candidate mixture’s means, covariances, degrees of freedom and component weights) is a highly non-elliptical, multimodal function. Moreover, we provide extensions (sequential, tempered, permutation-augmented and partial MitISEM) that facilitate simulation for specific applications and for particular statistical and econometric models.

A different strand of literature concerns adaptive MCMC algorithms, in which the parameters of the candidate density are automatically tuned during the sampling procedure. Learning candidate density parameter values that lead to more efficient sampling, while maintaining the ergodicity property required for asymptotic convergence, is of course important. Roberts and Rosenthal (2009) consider an adaptive random walk Metropolis sampler, and Giordani and Kohn (2010) use a mixture of densities in their adaptive independent MH sampler. We differ from these authors by using a two-stage approach. In a first stage of preliminary adaptation, we fit a flexible candidate to the target with the IS-weighted EM algorithm, using the Kullback–Leibler divergence. In the second stage we insert the obtained candidate into a ‘standard’, non-adaptive IS or MH algorithm.
So, in terms of Giordani and Kohn (2010), we do not perform strict adaptation; our second phase of non-adaptive IS or the MH algorithm ensures that the simulation output converges to the correct distribution. In Section 2.1, we compare the efficiency of our approach with those of Roberts and Rosenthal (2009) and Giordani and Kohn (2010) in the context of a DCC-GARCH model with 17 parameters. The results indicate that our approach compares favorably with these alternative adaptive MCMC schemes, but we emphasize that a systematic study of the relative merits of alternative sampling schemes for a variety of target density shapes is a topic of great interest, which is, however, beyond the scope of the present study.
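
A minimal sketch of the second, non-adaptive stage is given below: an independence MH chain whose fixed proposal is the fitted MitISEM candidate, here assumed to expose `rvs()` and `logpdf()` methods (as a fitted mixture object might). Because the candidate is no longer adapted, standard MH convergence theory applies.

```python
import numpy as np

def independence_mh(log_target_kernel, candidate, n_draws, rng=None):
    rng = rng or np.random.default_rng()
    theta = candidate.rvs()
    logw = log_target_kernel(theta) - candidate.logpdf(theta)
    draws = []
    for _ in range(n_draws):
        prop = candidate.rvs()
        logw_prop = log_target_kernel(prop) - candidate.logpdf(prop)
        # Independence MH: accept with probability min(1, w(prop)/w(current)),
        # where w is the importance weight target/candidate.
        if np.log(rng.uniform()) < logw_prop - logw:
            theta, logw = prop, logw_prop
        draws.append(theta)
    return np.asarray(draws)
```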

A final remark concerning the literature regards the Adaptive Mixture of t (AdMit) approach of Hoogerheide et al. (2007). Although the idea behind AdMit and MitISEM is the same, i.e., iteratively constructing an approximation of a target distribution by a mixture of Student-t distributions, there are three substantial differences. First, AdMit aims at minimizing the variance of the IS estimator directly, whereas MitISEM aims at this goal indirectly by minimizing the Kullback–Leibler divergence. As a result, AdMit optimizes the mixture component weights using a non-linear optimization procedure that requires considerable computational effort. Second, in the AdMit method, the means and covariance matrices of the candidate components are chosen heuristically and are never updated when additional components are added to the mixture, whereas in MitISEM all mixture parameters are optimized jointly by means of the relatively quick EM algorithm. This implies a large reduction of the computing time in the approximation procedure, and is expected to lead to a better candidate in most applications. Third, AdMit requires the joint target density kernel, whereas MitISEM requires candidate draws and importance weights. This implies that AdMit cannot be applied partially to the marginal and conditional posterior distributions of subsets of parameters, whereas we propose a partial MitISEM approach. One relative advantage of the AdMit approach is the step in which the importance weight function is maximized with respect to the parameter vector, which may lead to finding relevant areas of the parameter space that were ‘missed’ by all draws from the previous candidate. We intend to investigate the use of such an AdMit step within MitISEM in further research.

The outline of this paper is as follows. In Section 2 we introduce the MitISEM method, and we show applications in a multivariate GARCH model with 17 parameters and a (Wishart) posterior density kernel with up to 36 parameters in an inverse covariance matrix. Section 3 introduces the sequential MitISEM method and includes a subsection on the tempering method. In Section 4 the permutation-augmented MitISEM approach is proposed. Section 5 introduces the partial MitISEM method. Section 6 concludes. Appendix B provides the derivations of the IS-weighted EM methods and discusses the alternative simulation methods of Roberts and Rosenthal (2009) and Giordani and Kohn (2010).

Section snippets

Mixture of t by Importance Sampling weighted Expectation Maximization (MitISEM)

If one uses Importance Sampling or the Metropolis–Hastings algorithm to conduct posterior analysis, a key issue is to find a candidate density which approximates the target distribution. This can be quite difficult if the target density is not elliptical. This paper proposes to specify the candidate distribution as a mixture of Student-t distributions. As discussed by Hoogerheide et al. (2007), the use of mixtures of Student-t distributions has several advantages. First, they can provide an
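
For illustration, a minimal importance sampling sketch with a fitted candidate q follows; the `candidate` object (assumed to expose `rvs(size)` and `logpdf()`) and the function names are ours. Posterior moments are estimated as weighted averages with weights w_i = f(theta_i)/q(theta_i), where f is the posterior kernel.

```python
import numpy as np

def is_estimate(g, log_target_kernel, candidate, n=10_000):
    theta = candidate.rvs(size=n)                   # (n, d) draws from q
    logw = log_target_kernel(theta) - candidate.logpdf(theta)
    w = np.exp(logw - logw.max())                   # stabilized IS weights
    gv = np.asarray(g(theta), dtype=float).reshape(n, -1)  # (n, p) values
    return (w[:, None] * gv).sum(axis=0) / w.sum()  # self-normalized IS
```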

Sequential MitISEM

In this section, we propose a method for applying MitISEM in a sequential manner, so that the candidate distribution for posterior simulation is cleverly updated when new data become available. Our results show that the computational effort reduces enormously, while the quality of the approximation remains almost unchanged, as compared with an ‘ad hoc’ procedure in which the construction of the MitISEM candidate is performed ‘from scratch’ at every moment in time. In the next subsection we show

Permutation-augmented MitISEM

In this section, we introduce a permutation-augmented MitISEM approach, for importance sampling (or the MH algorithm) from posterior distributions in mixture models without the requirement of imposing a priori identification restrictions on the mixture components’ parameters. As discussed by Geweke (2007), the mixture model likelihood function is invariant with respect to permutation of the components of the mixture. If the functions of interest are permutation sensitive, as in classification

Partial MitISEM

In this section, we propose a partial MitISEM approach, which aims at approximating the joint posterior indirectly, by approximating the product of marginal and conditional posterior distributions of subsets of model parameters. This division can substantially reduce the dimension of the approximation problem, which facilitates the application of adaptive importance or Metropolis–Hastings sampling for posterior simulation in more complex models with larger numbers of parameters. Approximating

Concluding remarks

We have introduced a new class of adaptive sampling methods for efficient and reliable posterior and predictive simulation. Multiple examples have shown the potential relevance of the novel methods, either as a replacement for inferior candidate distributions in Importance Sampling or the Metropolis–Hastings algorithm, or as a substitute for or complement to Gibbs sampling (e.g., as a validity check for estimated posterior moments or marginal likelihoods).

In this paper we dealt with flexible approximations to

Acknowledgments

A preliminary version of this paper was presented at ESWC 2010 in Shanghai, ESOBE 2010 in Rotterdam, and the Statistics Department at Harvard University. We are indebted to several participants at these meetings for constructive comments. Thanks are, in particular, due to the guest-editor Gary Koop and three anonymous referees for very useful suggestions that have led to substantial improvements of our preliminary paper Hoogerheide et al. (2011). The first author was supported by Veni grant

References (51)

  • S. Chib, Marginal likelihood from the Gibbs output, Journal of the American Statistical Association (1995)
  • N. Chopin, A sequential particle filter method for static models, Biometrika (2002)
  • J.M. Cornuet, J.M. Marin, A. Mira, C.P. Robert, Adaptive multiple importance sampling, Scandinavian Journal... (2012)
  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B. Statistical Methodology (1977)
  • A. Doucet et al., Sequential Monte Carlo Methods in Practice (2001)
  • J. Eklund et al., Forecast combination and model averaging using predictive measures, Econometric Reviews (2007)
  • R.F. Engle, Dynamic conditional correlation: a simple class of multivariate generalized autoregressive conditional heteroskedasticity models, Journal of Business & Economic Statistics (2002)
  • S. Frühwirth-Schnatter, Markov chain Monte Carlo estimation of classical and dynamic switching models, Journal of the American Statistical Association (2001)
  • A.E. Gelfand et al., Bayesian model choice: asymptotics and exact calculations, Journal of the Royal Statistical Society. Series B. Statistical Methodology (1994)
  • A. Gelman et al., Bayesian Data Analysis (2003)
  • J. Geweke, Bayesian inference in econometric models using Monte Carlo integration, Econometrica (1989)
  • C.J. Geyer, Markov chain Monte Carlo maximum likelihood, in: E. Keramidas (Ed.), Computing Science and... (1991)
  • P. Giordani et al., Appendix to ‘Adaptive independent Metropolis–Hastings by fast estimation of mixtures of normals’, Journal of Computational and Graphical Statistics (2009)
  • P. Giordani et al., Adaptive independent Metropolis–Hastings by fast estimation of mixtures of normals, Journal of Computational and Graphical Statistics (2010)
  • H. Haario et al., An adaptive Metropolis algorithm, Bernoulli (2001)