Variational inference for Student-t models: Robust Bayesian interpolation and generalised component analysis

doi:10.1016/j.neucom.2005.02.016

Neurocomputing

Volume 69, Issues 1–3, December 2005, Pages 123-141

https://doi.org/10.1016/j.neucom.2005.02.016 Get rights and content

Abstract

We demonstrate how a variational approximation scheme enables effective inference of key parameters in probabilisitic signal models which employ the Student-t distribution. Using the two scenarios of robust interpolation and independent component analysis (ICA) as examples, we illustrate the key feature of the approach: that the form of the noise distribution in the interpolation case, and the source distributions in the ICA case, can be inferred from the data concurrent with all other model parameters.

Introduction

The Gaussian distribution is widely used in parametric models for signal processing, perhaps motivated by physical knowledge of the system of interest, or via appeal to the central limit theorem. However, even when such justification is absent, the Gaussian remains a popular choice due to its attractive analytical properties. For example, log likelihoods are often quadratic in parameters of interest facilitating optimisation, and a Gaussian prior can be combined with a Gaussian likelihood analytically in Bayesian inference.

However, it is clear that not all real-world data sets can be modelled well by Gaussian distributions. In particular, constraining a model to utilise a Gaussian (the tails of which rapidly decay) is a well-characterised disadvantage when the true underlying distribution is heavy-tailed (has high kurtosis). In such cases, it is obviously more appropriate to apply a more heavy-tailed distribution within the analysis, but this raises significant difficulties of tractability.

In this paper, we show how variational inference can be used to render such models tractable and offer greater overall representative power. We exploit the fact that within the variational framework, it is possible to extend and generalise the conventional Gaussian model by placing an inverse-Gamma prior over the variance of that distribution, independently for each data example. The resulting marginalised distribution is a member of the heavier-tailed Student-t family, and still contains the Gaussian as a special case.

We illustrate this modelling approach with two practical example applications: robust Bayesian interpolation with a flexible noise model (covered in Sections 2–5) and independent component analysis (ICA) with a flexible source model (covered in Sections 6–10). In both cases, we demonstrate how we can improve the expressivity of our models and advantageously infer heavy-tailed distributions where evidence from the data supports it, yet not preclude the facility to retain a Gaussian model where appropriate.

Section snippets

Robust interpolation

First, we consider the classic problem of interpolation where the observation variables are assumed to be noisy. Our data comprise N input-observation pairs ${x_{n}, y_{n}}$ and we focus on interpolation models linear in the parameters, where the interpolant $f (x)$ is expressed in terms of M fixed basis functions $φ_{m} (x)$ , $m = 1, \dots, M$ , weighted by some corresponding parameters $θ_{m}$ : $f (x) = \sum_{m = 1}^{M} θ_{m} φ_{m} (x) .$ While the non-linear functions $φ_{m} (x)$ are fixed, $f (x)$ may still be very flexible if a large set of basis functions is

Variational inference for robust interpolation

Given the observations $y = (y_{1}, \dots, y_{N})^{T}$ , we desire to compute the Bayesian posterior distribution over all unknowns by applying Bayes’ rule: $p (θ, α, β | y, a, b, c, d) = \frac{p (y | θ, α, β) p (β | c, d) p (θ | α) p (α | a, b)}{p (y)},$ where the likelihood term is given by $p (y | θ, α, β) = (2 π)^{- N / 2} \prod_{n = 1}^{N} β_{n}^{1 / 2} \exp \{- \frac{β_{n}}{2} {[y_{n} - \sum_{m = 1}^{M} θ_{m} φ_{m} (x_{n})]}^{2}\},$ and the prior terms by $p (θ | α) = \prod_{m = 1}^{M} N (θ_{m} | 0, α_{m}^{- 1})$ $p (α | a, b) = \prod_{m = 1}^{M} Gamma (α_{m} | a, b)$ and $p (β | c, d) = \prod_{n = 1}^{N} Gamma (β_{n} | c, d), as given earlier .$ For the parameters, note that we have specified a Gaussian prior (8) such as utilised in

Synthetic data

We first illustrate performance of the algorithm on univariate synthetic data generated from the function $sinc (x) = (\sin x) / x$ with both additive Gaussian and Student-t noise. We fitted a sparse Bayesian interpolation model using both standard Gaussian and the presented variational Student-t noise models. The intention is to show that the variational procedure can recover the underlying generator in both cases. Note that in the figures which follow, converged posterior mean interpolants $〈 f (x) 〉_{Q (θ, α,}$

Robust interpolation: discussion

Figs. 1–4 illustrate that with even relatively limited data (100 observations here), the variational procedure is capable of recovering appropriate estimates of the underlying character of both Gaussian and Student-t noise processes, at the same time as performing an effective Bayesian estimation of the primary model parameters.

In practice of course, it is not expected that all significantly non-Gaussian noise processes will be exactly Student-t, as synthesised for Figs. 3 and 4. However, even

Independent component analysis

In ICA [3], [8] we assume that an observed data point consists of a d-dimensional vector, $x_{n} = (x_{n}^{(1)}, \dots, x_{n}^{(d)})$ , that is generated by linear mixing of q unobserved source signals, $s_{n} = (s_{n}^{(1)}, \dots, s_{n}^{(q)})$ , $x_{n} = {As}_{n} + \pm η_{n},$ where $\pm η_{n}$ represents additive noise, which we will assume to be Gaussian and isotropic, $A$ is known as the mixing matrix and we have assumed that the set of all data-points, $X$ , is centred.

The most important characteristic of ICA is that the components of the random variables $s_{n}^{(j)}$ are

The probabilistic model for ICA

Recall from (2) the Student-t distribution, controlled by a scale parameter, $σ$ , and its degrees of freedom, $ν$ . Since the estimation of $ν$ for each latent distribution is an integral part of the model fitting process, we may use the Student-t for recovering sources which are both Gaussian (capturing principal components) and super-Gaussian (capturing independent components). In other words, we would suggest that the facility to automatically estimate the character of the source distributions

Variational inference for the latent variable model

As before, let us decompose the Student-t distribution into its hierarchical form which consists of a Gaussian whose precision (or inverse noise), $τ$ , is sampled from a gamma distribution, thereby introducing an additional latent variable that represents the source distribution's precision, $τ$ . We may write the hierarchical form of the Student-t as $p (s_{n}^{(j)} | ν_{j}, σ_{j}) = \int p (s_{n}^{(j)} | τ_{n}^{(j)}) p (τ_{n}^{(j)} | ν_{j}, σ_{j}) d τ_{n}^{(j)},$ where $p (s_{n}^{(j)} | τ_{n}^{(j)}) = \sqrt{\frac{τ_{n}^{(j)}}{2 π}} \exp (- \frac{τ_{n}^{(j)}}{2} {s_{n}^{(j)}}^{2}),$ $p (τ_{n}^{(j)} | ν_{j}, σ_{j}) = \frac{(ν_{j} σ_{j}^{2} / 2)^{(ν_{j} / 2)}}{Γ (ν_{j} / 2)} τ_{n}^{(j)^{(ν_{j} / 2) - 1}} \exp (- ν_{j})$

Variational ICA: examples

One advantage of a probabilistic formulation of the ICA model is that it is possible to compute an undercomplete representation of our data. This may be useful in situations where the data is of higher dimensionality. We explored a large MEG data set [17] containing 122 signals, a subset of which is shown in Fig. 6. We constrained ourselves to a 20 s portion of the data set and 24 s from the start of the sequence, during which the subject had been asked to blink. The resulting components, ordered

GCA: discussion

We have presented an algorithm that automatically estimates the number of independent components in a data set and explains the remaining data through PCA and a noise model. This ‘automatic independence determination’ (AID) is a characteristic shared with other models, but most of them treat the mixing matrix in a Bayesian manner to achieve this goal. The method uses a latent variable which is Student-t distributed and exploits the distribution's ability to ‘interpolate’ between a heavy-tailed

Summary

The main objective of this paper has been to demonstrate that the variational inference framework allows us to extend the limits of tractable computation in models which utilise the heavy-tailed Student-t distribution. At the same time, a key point to underline is that we can obtain good estimates of the shapes of the distributions of interest, including both heavy-tailed and Gaussian, while concurrently estimating all other model parameters, of which there may be many. Illustrations we have

Mike Tipping received a B.Eng. degree in Electronic Engineering from Bristol University in 1990, and an M.Sc. in Artificial Intelligence from the University of Edinburgh in 1992. He received his Ph.D. from Aston University in 1996, in the field of neural computing.

From 1996–1998, he was a research fellow in the Neural Computing Research Group at Aston University, working on probabilistic models for principal component analysis, and their application to data visualisation. He joined Microsoft

References (18)

P. Comon
Independent component analysis: a new concept?
Signal Process.
(1994)
A. Hyvärinen et al.
Independent component analysis: algorithms and applications
Neural Networks
(2000)
H. Attias
Independent factor analysis
Neural Comput.
(1998)
C.M. Bishop, J. Winn, Structured variational distributions in VIBES, in: C.M. Bishop, B.J. Frey (Eds.), Proceedings of...
R.M. Everson et al.
ICA: a flexible non-linearity and decorrelating manifold approach
Neural Comput.
(1999)
A. Faul et al.
A variational approach to robust regression
A. Gelman et al.
Bayesian Data Analysis
(1995)
J.F. Geweke
Bayesian treatment of the independent Student- $t$ linear model
J. Appl. Econom.
(1993)
N.D. Lawrence, C.M. Bishop, Variational Bayesian independent component analysis, Technical Report, 2000...

There are more references available in the full text version of this article.

Cited by (0)

His research interests centre on the development and application of probabilistic models for machine learning, particularly Bayesian implementations of sparse kernel regression and classification models (the ‘relevance vector machine’). Currently he is working on the application of learning techniques to video games (‘Forza Motorsport’).

Neil Lawrence received his Ph.D. from Cambridge University in 2000 after which he spend a year as a post-doctoral researcher at Microsoft Research, Cambridge. Currently he is a senior lecturer in the Department of Computer Science, University of Sheffield. His research interests are probabilistic models with a particular focus on Gaussian processes.

View full text

Variational inference for Student-t models: Robust Bayesian interpolation and generalised component analysis

Abstract

Introduction

Section snippets

Robust interpolation

Variational inference for robust interpolation

Synthetic data

Robust interpolation: discussion

Independent component analysis

The probabilistic model for ICA

Variational inference for the latent variable model

Variational ICA: examples

GCA: discussion

Summary

Signal Process.

Neural Networks

Independent factor analysis

Neural Comput.

ICA: a flexible non-linearity and decorrelating manifold approach

Neural Comput.

A variational approach to robust regression

Bayesian Data Analysis

Bayesian treatment of the independent Student-t linear model

J. Appl. Econom.

Bayesian treatment of the independent Student- $t$ linear model