Bayesian emulation of complex multi-output and dynamic computer models
Introduction
Large computer codes, implementing sophisticated mathematical models, are widely used in all fields of science and technology to describe and understand complex systems. We refer to any such program as a simulator. The size and complexity of a simulator can become a problem when it is necessary to make very many runs at different input values. For example, the model user may wish to study the sensitivity of model outputs to variations in its inputs, which entails many model evaluations when the number of inputs is large (as is very often the case). In particular, standard Monte Carlo-based methods of sensitivity analysis (extensively reviewed by Saltelli et al., 2000) typically require thousands of model runs. Another example is the practice of calibrating model parameters by varying them to fit a set of physical observations. Such explorations can become infeasible even for moderately large computer models requiring just a few seconds per run.
Following Sacks et al. (1989), a two-stage approach based on meta-modelling (emulation) of the simulator's response has been developed (see Haylock and O’Hagan, 1996, Kennedy and O’Hagan, 2001, Oakley and O’Hagan, 2002), offering substantial efficiency gains in terms of accuracy and computing time over standard Monte Carlo-based methods. These authors represent the simulator as a function which takes as input a vector of parameters and produces an output . A Bayesian formulation assumes a Gaussian process prior distribution for the function , conditional on various hyper-parameters. This prior distribution is updated using as data a preliminary training sample of n selected simulator runs. Formally, the posterior distribution of is regarded as the emulator. This posterior distribution is also a Gaussian process conditional on the hyper-parameters; here conditioning upon the training set forces realisations from the emulator to interpolate the observed data points and induces posterior distributions for the hyper-parameters.
The first stage of the two-stage approach is to build the emulator. Problems such as sensitivity analysis or calibration are then tackled in the second stage using the emulator. Since the emulator runs almost instantaneously, the computational cost of this approach for a large and computationally intensive simulator lies primarily in obtaining the training runs. Gains in efficiency arise through the emulation approach requiring far fewer simulator runs to achieve the same accuracy as Monte Carlo methods in tasks such as sensitivity analysis. Indeed, in practice the number of runs required is typically reduced by a factor of 100 or more, and it is usually possible to emulate the code output to a high degree of precision using only a few hundreds of training runs.
A number of research advances and applications dealing with statistical emulation of an ensemble of computer outputs were noted in recent years. Much of this work relies upon extensions of the univariate Gaussian process-based emulation framework, often in association with some dimension-reducing technique to ameliorate the complexity of the examined system. This was notably achieved through a principal component decomposition of the simulator's covariance structure (Higdon et al., 2008) or some basis function representation of its time-dependent, or otherwise functional, outputs, as attained via wavelets by Bayarri et al. (2007) and more widely discussed by Campbell et al. (2006). In these cases transformed and reduced outputs are treated as independent, effectively using what is referred to in Section 3 as the MS emulator. Although without explicitly referring to multi-output simulators, work developed by Qian et al. (2008) around the incorporation of qualitative, in addition to quantitative, factors into a Gaussian process emulator setting exhibits some similarity with the methodology herein proposed. Extensions to the conventionally employed Gaussian correlation function therein formulated can in principle be utilised to model dependence of the system of interest on time via an ordered categorical input, in a similar fashion to the TI emulator introduced in Section 3. Qian et al. (2008) also elaborate around an alternative ‘independent analysis’ approach, which basically coincides with the MS emulator discussed in Section 3. Outside the field of computer experimentation, Gelfand et al. (2004) formulate a non-stationary multivariate Gaussian point process in terms of a spatially varied linear model of coregionalisation to analyse multivariate data on commercial property transactions in three separate US real estate markets.
Often, the collection of outputs has a spatial and/or temporal structure. For instance, the oilfield simulator studied by Craig et al. (1996) outputs its predictions of the pressure at a given well over time, so that we can view these outputs as a time series. Similarly, the atmospheric dispersion model used by Kennedy et al. (2002) predicts deposition of radioactive particles at points on a spatial grid. Dynamic variation in underlying trends and stochastic volatility in a simulator output are also addressed by Liu and West (2009), whose strategy revolves around a time-varying auto-regressive model with stochastic innovations linked across the input space via a Gaussian process. Although existing theory for single-output emulation may be used to emulate each output individually, this can be a laborious process and may lose important information about correlations between outputs. The purpose of the present article is to propose a multi-output emulator, and to compare it with two other approaches based on single-output emulation.
We will base our analysis particularly on approaches to emulating dynamic simulators that model a system evolving over time, thereby producing a time series of outputs. One such model is the Sheffield Dynamic Global Vegetation Model (henceforth SDGVM), which is used to simulate the carbon dynamics of forests and other kinds of vegetation. The SDGVM will be used as a practical illustration of the performance of alternative emulation approaches. However, much of our discussion is relevant to emulating simulators which produce multiple outputs in other structures, for instance on a spatial grid or at different frequencies in a power spectrum.
The single-output Bayesian methodology elaborated by O’Hagan (1992), Oakley and O’Hagan (2002) is extended in Section 2 to enable the simultaneous emulation of a vector of outputs. In Section 3 we present three alternative approaches to modelling the output of a dynamic simulator based on single-output emulation, and contrast the assumptions of these methods with those of the multi-output emulator. The methods are contrasted in a practical example using SDGVM in Section 4. Section 5 discusses the benefits and limitations of the multi-output emulator, and contrasts it with other approaches to multiple outputs in the literature.
Section snippets
Emulating multiple outputs
We consider a deterministic simulator returning outputs from inputs lying in some (often high-dimensional) input space . The simulator is essentially a function , and due to its deterministic nature it returns the same output if repeatedly executed on the same set of inputs. Despite being in principle known for any , in practice the complexity of the simulator requires the computer code to be executed in order to determine . From a Bayesian perspective, we thus regard
Emulating a dynamic simulator
Suppose that the dynamic model produces a vector of outputs spanning the simulation time period and that a data matrix D as in Section 2 is obtained from some set of training runs. Here we introduce three procedures for emulating such a dynamic simulator.
Multi-output (MO) emulator: The first method consists of just using the multi-output emulator (6), where now the dimension of the output space is .
Many single-output (MS) emulators: The second approach is to emulate the
Application: a process model for ecosystem carbon
The Centre for Terrestrial Carbon Dynamics is a consortium of British academic and governmental institutions, established to improve scientific understanding of the role played by terrestrial ecosystems in the carbon cycle, with particular emphasis on forest ecosystems. The vegetation model SDGVM plays a central role in this research and we will consider emulating its ‘daily’ version (Lomas et al., 2002). Vegetation models of its kind can be used to predict possible long-term responses of
Conclusions
The paper focuses on two intertwined problems. First, we develop the multi-output emulator as an extension of theoretical results already established in the field of Bayesian emulation of a single output. Second, we consider the use of the multi-output emulator to model the time series output from a dynamic simulator, contrasting it with two alternatives approaches: one using multiple single-output emulators and the other based on treating time as an auxiliary input. A discussion of the
Acknowledgements
This research was supported by the Natural Environment Research Council through its funding for the Centre for Terrestrial Carbon Dynamics. The authors also wish to gratefully acknowledge Dr. Marc C. Kennedy for providing the data utilised in the application and two anonymous referees for their thoughtful comments on an earlier draft of the paper.
References (30)
- et al.
Sensitivity analysis when model outputs are functions
Reliability Engineering and System Safety
(2006) - et al.
Minimax and maximin designs
Journal of Statistical Planning and Inference
(1990) - et al.
Exploratory designs for computational experiments
Journal of Statistical Planning and Inference
(1995) - et al.
Computer model validation with functional output
The Annals of Statistics
(2007) - et al.
Objective Bayesian analysis of spatially correlated data
Journal of the American Statistical Association
(2001) - et al.
Reference prior for the orbit in a group model
The Annals of Statistics
(1990) - Craig, P.S., Goldstein, M., Seheult, A.H., Smith, J.A., 1996. Bayes linear strategies for matching hydrocarbon...
- Gamerman, D. (Ed.), 1997. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman & Hall,...
- et al.
Nonstationary multivariate process modeling through spatially varying coregionalization
TEST
(2004) - Haylock, R.G., O’Hagan, A., 1996. On inference for outputs of computationally expensive algorithms with uncertainty on...
Computer model calibration using high-dimensional output
Journal of the American Statistical Association
Bayesian calibration of computer models
Journal of the Royal Statistical Society: Series B Statistical Methodology
Bayesian analysis of computer code outputs
A dynamic modelling strategy for Bayesian computer model emulation
Bayesian Analysis
Cited by (367)
Probabilistic forecast of nonlinear dynamical systems with uncertainty quantification
2024, Physica D: Nonlinear PhenomenaQuantifying the effects of different data streams on the calibration of building energy simulation
2023, Energy and BuildingsMulti-fidelity design optimization of solid oxide fuel cells using a Bayesian feature enhanced stochastic collocation
2023, International Journal of Hydrogen EnergyRobust optimization for functional multiresponse in 3D printing process
2023, Simulation Modelling Practice and TheoryA bi-fidelity model for hydraulic fracturing
2024, International Journal for Numerical and Analytical Methods in Geomechanics