Variational inference for Student-t models: Robust Bayesian interpolation and generalised component analysis
Introduction
The Gaussian distribution is widely used in parametric models for signal processing, perhaps motivated by physical knowledge of the system of interest, or via appeal to the central limit theorem. However, even when such justification is absent, the Gaussian remains a popular choice due to its attractive analytical properties. For example, log likelihoods are often quadratic in the parameters of interest, facilitating optimisation, and a Gaussian prior can be combined with a Gaussian likelihood analytically in Bayesian inference.
However, it is clear that not all real-world data sets can be modelled well by Gaussian distributions. In particular, constraining a model to utilise a Gaussian (the tails of which decay rapidly) is a well-characterised disadvantage when the true underlying distribution is heavy-tailed (has high kurtosis). In such cases it is more appropriate to employ a heavier-tailed distribution within the analysis, but doing so raises significant difficulties of tractability.
In this paper, we show how variational inference can be used to render such models tractable and offer greater overall representative power. We exploit the fact that within the variational framework, it is possible to extend and generalise the conventional Gaussian model by placing an inverse-Gamma prior over the variance of that distribution, independently for each data example. The resulting marginalised distribution is a member of the heavier-tailed Student-t family, and still contains the Gaussian as a special case.
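The scale-mixture construction described above can be verified numerically: drawing a precision from a Gamma distribution and then sampling a Gaussian with that precision yields, marginally, a Student-t variate. The sketch below (with an arbitrarily chosen value ν = 3 for illustration) checks this by comparing empirical quantiles against the Student-t quantile function:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nu = 3.0        # degrees of freedom, chosen arbitrarily for illustration
n = 200_000

# Hierarchical sampling: precision tau ~ Gamma(shape=nu/2, rate=nu/2),
# then x | tau ~ N(0, 1/tau).  Marginally, x ~ Student-t with nu d.o.f.
tau = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
x = rng.normal(0.0, 1.0 / np.sqrt(tau))

# Compare empirical quantiles with the Student-t quantile function.
qs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
emp = np.quantile(x, qs)
theory = stats.t.ppf(qs, df=nu)
print(np.max(np.abs(emp - theory)))   # small Monte Carlo discrepancy
```

Setting ν large makes the Gamma prior concentrate around a precision of one, recovering the Gaussian special case mentioned above.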
We illustrate this modelling approach with two practical example applications: robust Bayesian interpolation with a flexible noise model (covered in Sections 2–5) and independent component analysis (ICA) with a flexible source model (covered in Sections 6–10). In both cases, we demonstrate how we can improve the expressivity of our models and advantageously infer heavy-tailed distributions where evidence from the data supports it, yet not preclude the facility to retain a Gaussian model where appropriate.
Section snippets
Robust interpolation
First, we consider the classic problem of interpolation where the observation variables are assumed to be noisy. Our data comprise N input–observation pairs and we focus on interpolation models linear in the parameters, where the interpolant is expressed in terms of M fixed basis functions φ_m(x), m = 1, …, M, weighted by corresponding parameters w_m: y(x; w) = Σ_{m=1}^{M} w_m φ_m(x). While the non-linear functions φ_m(x) are fixed, y(x; w) may still be very flexible if a large set of basis functions is
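A linear-in-the-parameters model of this kind can be sketched as follows. The choice of Gaussian radial basis functions, their centres and widths, and the sinc-like target are all illustrative assumptions, not specifics from the text; the point is only that the design matrix makes the weights a linear least-squares problem under Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: N noisy observations of a smooth target function.
N, M = 100, 15
x = np.linspace(-5.0, 5.0, N)
t = np.sinc(x / np.pi) + 0.1 * rng.normal(size=N)   # target chosen for illustration

# M fixed Gaussian basis functions phi_m(x) = exp(-(x - c_m)^2 / (2 r^2));
# the interpolant y(x; w) = sum_m w_m phi_m(x) is linear in the weights w.
centres = np.linspace(-5.0, 5.0, M)
r = 1.0
Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * r ** 2))   # N x M design matrix

# Maximum-likelihood (least-squares) weights under Gaussian noise.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w
print(np.mean((y - t) ** 2))   # residual close to the noise variance
```

With enough basis functions the fitted curve tracks the generator closely while the residual mean square approaches the injected noise variance.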
Variational inference for robust interpolation
Given the observations t = (t_1, …, t_N), we desire to compute the Bayesian posterior distribution over all unknowns by applying Bayes’ rule: p(w, τ, α | t) ∝ p(t | w, τ) p(w | α) p(τ), where the likelihood term is given by p(t | w, τ) = ∏_{n=1}^{N} N(t_n | y(x_n; w), τ_n^{-1}), and the prior terms by p(τ) = ∏_{n=1}^{N} Gamma(τ_n | a, b) and p(w | α) = ∏_{m=1}^{M} N(w_m | 0, α_m^{-1}). For the parameters w, note that we have specified a Gaussian prior (8) such as utilised in
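The flavour of the resulting updates can be conveyed by a simplified EM analogue rather than the full variational scheme: with the degrees of freedom held fixed and unit noise scale, the E-step computes the posterior mean precision E[τ_n] = (ν + 1)/(ν + e_n²) for each residual e_n, and the M-step solves a precision-weighted least-squares problem. This is a sketch under those assumptions, not the paper's exact update equations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic linear model with heavy-tailed (Student-t) noise.
N = 200
x = np.linspace(-1.0, 1.0, N)
Phi = np.column_stack([np.ones(N), x])      # basis: [1, x]
w_true = np.array([0.5, -2.0])
nu = 3.0
t = Phi @ w_true + rng.standard_t(df=nu, size=N)

w = np.zeros(2)
for _ in range(50):
    e = t - Phi @ w
    # E-step: with unit scale, tau_n | e_n ~ Gamma((nu+1)/2, (nu+e_n^2)/2),
    # so E[tau_n] = (nu+1)/(nu+e_n^2) automatically down-weights outliers.
    tau = (nu + 1) / (nu + e ** 2)
    # M-step: precision-weighted least squares for the weights w.
    W = tau[:, None] * Phi
    w = np.linalg.solve(Phi.T @ W, W.T @ t)

print(w)   # close to w_true despite the heavy-tailed noise
```

The down-weighting of large residuals by E[τ_n] is the mechanism behind the robustness demonstrated in the experiments.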
Synthetic data
We first illustrate the performance of the algorithm on univariate synthetic data generated from a fixed underlying function, with both additive Gaussian and Student-t noise. We fitted a sparse Bayesian interpolation model using both the standard Gaussian and the presented variational Student-t noise models. The intention is to show that the variational procedure can recover the underlying generator in both cases. Note that in the figures which follow, converged posterior mean interpolants
Robust interpolation: discussion
Figs. 1–4 illustrate that with even relatively limited data (100 observations here), the variational procedure is capable of recovering appropriate estimates of the underlying character of both Gaussian and Student-t noise processes, at the same time as performing an effective Bayesian estimation of the primary model parameters.
In practice of course, it is not expected that all significantly non-Gaussian noise processes will be exactly Student-t, as synthesised for Figs. 3 and 4. However, even
Independent component analysis
In ICA [3], [8] we assume that an observed data point consists of a d-dimensional vector, x_n, that is generated by linear mixing of q unobserved source signals, s_n, via x_n = A s_n + ε_n, where ε_n represents additive noise, which we will assume to be Gaussian and isotropic, A is known as the mixing matrix, and we have assumed that the set of all data points, {x_n}, is centred.
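The generative model just described can be simulated directly; the dimensions, noise level and Student-t sources below are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Generative model: x_n = A s_n + eps_n, with q heavy-tailed sources,
# a d x q mixing matrix A, and isotropic Gaussian noise.
d, q, N = 4, 2, 1000
A = rng.normal(size=(d, q))                 # mixing matrix
S = rng.standard_t(df=3, size=(q, N))       # super-Gaussian (heavy-tailed) sources
noise = 0.1 * rng.normal(size=(d, N))       # isotropic Gaussian noise
X = A @ S + noise

X -= X.mean(axis=1, keepdims=True)          # centre the data, as assumed in the text
print(X.shape)
```

With d > q this is exactly the undercomplete setting exploited later for the MEG data.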
The most important characteristic of ICA is that the components of the random variables are
The probabilistic model for ICA
Recall from (2) the Student-t distribution, controlled by a scale parameter and its degrees of freedom, ν. Since the estimation of ν for each latent distribution is an integral part of the model-fitting process, we may use the Student-t for recovering sources which are both Gaussian (capturing principal components) and super-Gaussian (capturing independent components). In other words, we would suggest that the facility to automatically estimate the character of the source distributions
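The ‘interpolation’ between source characters that motivates this choice is easy to confirm numerically: as ν grows, the Student-t density converges to the Gaussian, so a single family spans both super-Gaussian and Gaussian sources. A small check, with the two ν values picked only for illustration:

```python
import numpy as np
from scipy import stats

# As nu -> infinity the Student-t tends to the Gaussian, so estimating nu
# lets one family cover both heavy-tailed and Gaussian source behaviour.
xs = np.linspace(-4.0, 4.0, 201)
gauss = stats.norm.pdf(xs)
gap_small = np.max(np.abs(stats.t.pdf(xs, df=3) - gauss))      # clearly non-Gaussian
gap_large = np.max(np.abs(stats.t.pdf(xs, df=1000) - gauss))   # essentially Gaussian
print(gap_small, gap_large)   # the discrepancy shrinks as nu grows
```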
Variational inference for the latent variable model
As before, let us decompose the Student-t distribution into its hierarchical form, which consists of a Gaussian whose precision (or inverse noise), τ, is sampled from a gamma distribution, thereby introducing an additional latent variable that represents the source distribution's precision, τ. We may write the hierarchical form of the Student-t as St(s | ν) = ∫ N(s | 0, τ^{-1}) Gamma(τ | ν/2, ν/2) dτ, where
Variational ICA: examples
One advantage of a probabilistic formulation of the ICA model is that it is possible to compute an undercomplete representation of our data. This may be useful in situations where the data is of higher dimensionality than the number of underlying sources. We explored a large MEG data set [17] containing 122 signals, a subset of which is shown in Fig. 6. We constrained ourselves to a 20 s portion of the data set, beginning 24 s from the start of the sequence, during which the subject had been asked to blink. The resulting components, ordered
GCA: discussion
We have presented an algorithm that automatically estimates the number of independent components in a data set and explains the remaining data through PCA and a noise model. This ‘automatic independence determination’ (AID) is a characteristic shared with other models, but most of them treat the mixing matrix in a Bayesian manner to achieve this goal. The method uses a latent variable which is Student-t distributed and exploits the distribution's ability to ‘interpolate’ between a heavy-tailed
Summary
The main objective of this paper has been to demonstrate that the variational inference framework allows us to extend the limits of tractable computation in models which utilise the heavy-tailed Student-t distribution. At the same time, a key point to underline is that we can obtain good estimates of the shapes of the distributions of interest, including both heavy-tailed and Gaussian, while concurrently estimating all other model parameters, of which there may be many. Illustrations we have
References (18)
- P. Comon, Independent component analysis: a new concept?, Signal Process. (1994)
- A. Hyvärinen, E. Oja, Independent component analysis: algorithms and applications, Neural Networks (2000)
- H. Attias, Independent factor analysis, Neural Comput. (1998)
- C.M. Bishop, J. Winn, Structured variational distributions in VIBES, in: C.M. Bishop, B.J. Frey (Eds.), Proceedings of...
- R. Everson, S. Roberts, ICA: a flexible non-linearity and decorrelating manifold approach, Neural Comput. (1999)
- M.E. Tipping, N.D. Lawrence, A variational approach to robust regression
- A. Gelman et al., Bayesian Data Analysis (1995)
- J. Geweke, Bayesian treatment of the independent Student-t linear model, J. Appl. Econom. (1993)
- N.D. Lawrence, C.M. Bishop, Variational Bayesian independent component analysis, Technical Report, 2000...
Mike Tipping received a B.Eng. degree in Electronic Engineering from Bristol University in 1990, and an M.Sc. in Artificial Intelligence from the University of Edinburgh in 1992. He received his Ph.D. from Aston University in 1996, in the field of neural computing.
From 1996–1998, he was a research fellow in the Neural Computing Research Group at Aston University, working on probabilistic models for principal component analysis, and their application to data visualisation. He joined Microsoft Research, Cambridge, in March 1998.
His research interests centre on the development and application of probabilistic models for machine learning, particularly Bayesian implementations of sparse kernel regression and classification models (the ‘relevance vector machine’). Currently he is working on the application of learning techniques to video games (‘Forza Motorsport’).
Neil Lawrence received his Ph.D. from Cambridge University in 2000, after which he spent a year as a post-doctoral researcher at Microsoft Research, Cambridge. Currently he is a senior lecturer in the Department of Computer Science, University of Sheffield. His research interests are probabilistic models, with a particular focus on Gaussian processes.