Bayesian emulation of complex multi-output and dynamic computer models

doi:10.1016/j.jspi.2009.08.006

Journal of Statistical Planning and Inference

Volume 140, Issue 3, March 2010, Pages 640-651

Abstract

Computer models are widely used in scientific research to study and predict the behaviour of complex systems. The run times of computer-intensive simulators are often such that it is impractical to make the thousands of model runs that are conventionally required for sensitivity analysis, uncertainty analysis or calibration. In response to this problem, highly efficient techniques have recently been developed based on a statistical meta-model (the emulator) that is built to approximate the computer model. The approach, however, is less straightforward for dynamic simulators, designed to represent time-evolving systems. Generalisations of the established methodology to allow for dynamic emulation are here proposed and contrasted. Advantages and difficulties are discussed and illustrated with an application to the Sheffield Dynamic Global Vegetation Model, developed within the UK Centre for Terrestrial Carbon Dynamics.

Introduction

Large computer codes, implementing sophisticated mathematical models, are widely used in all fields of science and technology to describe and understand complex systems. We refer to any such program as a simulator. The size and complexity of a simulator can become a problem when it is necessary to make very many runs at different input values. For example, the model user may wish to study the sensitivity of model outputs to variations in its inputs, which entails many model evaluations when the number of inputs is large (as is very often the case). In particular, standard Monte Carlo-based methods of sensitivity analysis (extensively reviewed by Saltelli et al., 2000) typically require thousands of model runs. Another example is the practice of calibrating model parameters by varying them to fit a set of physical observations. Such explorations can become infeasible even for moderately large computer models requiring just a few seconds per run.

Following Sacks et al. (1989), a two-stage approach based on meta-modelling (emulation) of the simulator's response has been developed (see Haylock and O’Hagan, 1996; Kennedy and O’Hagan, 2001; Oakley and O’Hagan, 2002), offering substantial efficiency gains in terms of accuracy and computing time over standard Monte Carlo-based methods. These authors represent the simulator as a function $f(\cdot)$ which takes as input a vector $x$ of parameters and produces an output $y = f(x)$. A Bayesian formulation assumes a Gaussian process prior distribution for the function $f(\cdot)$, conditional on various hyper-parameters. This prior distribution is updated using as data a preliminary training sample $\{y_{1} = f(x_{1}), \dots, y_{n} = f(x_{n})\}$ of $n$ selected simulator runs. Formally, the posterior distribution of $f(\cdot)$ is regarded as the emulator. This posterior distribution is also a Gaussian process conditional on the hyper-parameters; conditioning upon the training set forces realisations from the emulator to interpolate the observed data points and induces posterior distributions for the hyper-parameters.
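The conditioning step described above can be illustrated with a minimal sketch. It is not the formulation of the cited authors: it assumes a zero-mean Gaussian process prior with a Gaussian correlation function and fixed hyper-parameters (`length_scale`, `sigma2`), whereas the approach described here places posterior distributions on the hyper-parameters. The function `toy_simulator` and all other names are hypothetical stand-ins.

```python
import numpy as np

def gaussian_corr(A, B, length_scale):
    """Gaussian correlation exp(-|a - b|^2 / length_scale^2) between rows of A and B."""
    diff = (A[:, None, :] - B[None, :, :]) / length_scale
    return np.exp(-(diff ** 2).sum(axis=2))

def build_emulator(X, y, length_scale=0.2, sigma2=1.0, jitter=1e-10):
    """Condition a zero-mean GP prior on the training runs (X, y).

    Hyper-parameters are held fixed here (an assumption of this sketch);
    a full Bayesian treatment would infer them from the training data.
    """
    K = sigma2 * gaussian_corr(X, X, length_scale) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return {"X": X, "L": L, "alpha": alpha, "ls": length_scale, "s2": sigma2}

def emulate(em, X_new):
    """Posterior mean and variance of f at the rows of X_new."""
    k = em["s2"] * gaussian_corr(X_new, em["X"], em["ls"])
    mean = k @ em["alpha"]
    v = np.linalg.solve(em["L"], k.T)
    var = em["s2"] - (v ** 2).sum(axis=0)
    return mean, np.maximum(var, 0.0)

def toy_simulator(X):
    """Hypothetical cheap stand-in for an expensive simulator."""
    return np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 0]

# n = 8 training runs on a one-dimensional input space.
X_train = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
y_train = toy_simulator(X_train)
emulator = build_emulator(X_train, y_train)

# Predictions at the training inputs and halfway between them.
mean_train, var_train = emulate(emulator, X_train)
X_mid = (X_train[:-1] + X_train[1:]) / 2
mean_mid, var_mid = emulate(emulator, X_mid)
```

At the training inputs the posterior mean reproduces the simulator output and the posterior variance collapses to (numerically) zero, which is the interpolation property noted above; between training inputs the variance is strictly positive and quantifies the emulator's uncertainty about the untried runs.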

The first stage of the two-stage approach is to build the emulator. Problems such as sensitivity analysis or calibration are then tackled in the second stage using the emulator. Since the emulator runs almost instantaneously, the computational cost of this approach for a large and computationally intensive simulator lies primarily in obtaining the training runs. Gains in efficiency arise because the emulation approach requires far fewer simulator runs than Monte Carlo methods to achieve the same accuracy in tasks such as sensitivity analysis. Indeed, in practice the number of runs required is typically reduced by a factor of 100 or more, and it is usually possible to emulate the code output to a high degree of precision using only a few hundred training runs.
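The two stages can be sketched as follows, under simplifying assumptions that are not taken from the paper: a one-dimensional input, a fixed-hyper-parameter interpolating emulator (posterior mean only), and a hypothetical cheap function `simulator` standing in for an expensive code. A call counter records how many simulator runs each route consumes.

```python
import numpy as np

rng = np.random.default_rng(0)
calls = {"n": 0}

def simulator(x):
    """Hypothetical stand-in for an expensive simulator; counts its own runs."""
    calls["n"] += len(x)
    return np.exp(-x) * np.sin(3 * x)

def corr(a, b, length_scale=0.2):
    """Gaussian correlation between two 1-D arrays of inputs."""
    return np.exp(-((a[:, None] - b[None, :]) / length_scale) ** 2)

# Stage 1: build the emulator from a small set of training runs.
x_train = np.linspace(0.0, 1.0, 10)
y_train = simulator(x_train)
K = corr(x_train, x_train) + 1e-10 * np.eye(len(x_train))
weights = np.linalg.solve(K, y_train)

def emulator_mean(x):
    """Posterior mean of a zero-mean GP with fixed hyper-parameters."""
    return corr(x, x_train) @ weights

# Stage 2: uncertainty analysis by Monte Carlo on the emulator only.
x_mc = rng.uniform(0.0, 1.0, 10_000)
mean_emulated = emulator_mean(x_mc).mean()
runs_used = calls["n"]  # simulator runs consumed by the two-stage route

# Reference: the same Monte Carlo estimate using the simulator directly.
mean_direct = simulator(x_mc).mean()
runs_direct = calls["n"] - runs_used
```

In this toy setting the second stage never calls the simulator, so the whole analysis costs only the training runs, whereas the direct route needs one simulator run per Monte Carlo sample. The sketch ignores the emulator's own posterior uncertainty, which a full analysis would propagate into the result.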

A number of research advances and applications dealing with statistical emulation of an ensemble of computer outputs have appeared in recent years. Much of this work relies upon extensions of the univariate Gaussian process-based emulation framework, often in association with some dimension-reducing technique to reduce the complexity of the examined system. This has notably been achieved through a principal component decomposition of the simulator's covariance structure (Higdon et al., 2008) or through some basis function representation of its time-dependent, or otherwise functional, outputs, as attained via wavelets by Bayarri et al. (2007) and more widely discussed by Campbell et al. (2006). In these cases the transformed and reduced outputs are treated as independent, effectively using what is referred to in Section 3 as the MS emulator. Although it does not refer explicitly to multi-output simulators, the work of Qian et al. (2008) on incorporating qualitative, in addition to quantitative, factors into a Gaussian process emulator exhibits some similarity with the methodology proposed here. The extensions to the conventional Gaussian correlation function formulated there can in principle be used to model dependence of the system of interest on time via an ordered categorical input, in a similar fashion to the TI emulator introduced in Section 3. Qian et al. (2008) also discuss an alternative ‘independent analysis’ approach, which essentially coincides with the MS emulator discussed in Section 3. Outside the field of computer experimentation, Gelfand et al. (2004) formulate a non-stationary multivariate Gaussian point process in terms of a spatially varying linear model of coregionalisation to analyse multivariate data on commercial property transactions in three separate US real estate markets.

Often, the collection of outputs has a spatial and/or temporal structure. For instance, the oilfield simulator studied by Craig et al. (1996) outputs its predictions of the pressure at a given well over time, so that we can view these outputs as a time series. Similarly, the atmospheric dispersion model used by Kennedy et al. (2002) predicts deposition of radioactive particles at points on a spatial grid. Dynamic variation in underlying trends and stochastic volatility in a simulator output are also addressed by Liu and West (2009), whose strategy revolves around a time-varying auto-regressive model with stochastic innovations linked across the input space via a Gaussian process. Although existing theory for single-output emulation may be used to emulate each output individually, this can be a laborious process and may lose important information about correlations between outputs. The purpose of the present article is to propose a multi-output emulator, and to compare it with two other approaches based on single-output emulation.
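To make the contrast between these approaches concrete, the sketch below shows how one and the same ensemble of time-series runs might be arranged as training data for a multi-output emulator, for many separate single-output emulators, and for a single-output emulator that takes time as an extra input. The array names, dimensions and random placeholder values are assumptions made for illustration; no emulator is fitted.

```python
import numpy as np

n, p, T = 5, 3, 4                 # runs, input dimension, time points (illustrative)
rng = np.random.default_rng(1)
X = rng.uniform(size=(n, p))      # one row of inputs per training run
D = rng.normal(size=(n, T))       # placeholder outputs: row i is the series of run i

# (a) Multi-output emulator: all T outputs are modelled jointly,
#     so the training data are simply the pair (X, D).
mo_data = (X, D)

# (b) Many single-output emulators: one independent data set per time point.
ms_data = [(X, D[:, t]) for t in range(T)]

# (c) Time as an auxiliary input: each (run, time) pair becomes one training
#     point of a single-output emulator with p + 1 inputs.
times = np.arange(1, T + 1)
X_ti = np.column_stack([np.repeat(X, T, axis=0), np.tile(times, n)])
y_ti = D.reshape(-1)

# The T x T empirical correlation between outputs is the kind of
# information that option (b) discards by treating outputs as independent.
between_output_corr = np.corrcoef(D, rowvar=False)
```

Option (c) enlarges the training set from $n$ to $nT$ points, so the correlation matrix to be inverted grows from $n \times n$ to $nT \times nT$, which is one practical consideration when choosing between the layouts.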

We will base our analysis particularly on approaches to emulating dynamic simulators that model a system evolving over time, thereby producing a time series of outputs. One such model is the Sheffield Dynamic Global Vegetation Model (henceforth SDGVM), which is used to simulate the carbon dynamics of forests and other kinds of vegetation. The SDGVM will be used as a practical illustration of the performance of alternative emulation approaches. However, much of our discussion is relevant to emulating simulators which produce multiple outputs in other structures, for instance on a spatial grid or at different frequencies in a power spectrum.

The single-output Bayesian methodology elaborated by O’Hagan (1992) and Oakley and O’Hagan (2002) is extended in Section 2 to enable the simultaneous emulation of a vector of outputs. In Section 3 we present two alternative approaches to modelling the output of a dynamic simulator based on single-output emulation, and contrast the assumptions of these methods with those of the multi-output emulator. The methods are compared in a practical example using SDGVM in Section 4. Section 5 discusses the benefits and limitations of the multi-output emulator, and contrasts it with other approaches to multiple outputs in the literature.

Section snippets

Emulating multiple outputs

We consider a deterministic simulator returning outputs $y \in \mathbb{R}^{q}$ from inputs $x$ lying in some (often high-dimensional) input space $X \subseteq \mathbb{R}^{p}$. The simulator is essentially a function $f : X \to \mathbb{R}^{q}$, and due to its deterministic nature it returns the same output if repeatedly executed on the same set of inputs. Despite $y = f(x)$ being in principle known for any $x$, in practice the complexity of the simulator requires the computer code to be executed in order to determine $y$. From a Bayesian perspective, we thus regard $f$

Emulating a dynamic simulator

Suppose that the dynamic model produces a vector of outputs $y = (y_{1}, \dots, y_{T})$ spanning the simulation time period $t = 1, 2, \dots, T$ and that a data matrix $D$ as in Section 2 is obtained from some set of training runs. Here we introduce three procedures for emulating such a dynamic simulator.

Multi-output (MO) emulator: The first method consists of just using the multi-output emulator (6), where now the dimension of the output space is $q = T$ .

Many single-output (MS) emulators: The second approach is to emulate the

Application: a process model for ecosystem carbon

The Centre for Terrestrial Carbon Dynamics is a consortium of British academic and governmental institutions, established to improve scientific understanding of the role played by terrestrial ecosystems in the carbon cycle, with particular emphasis on forest ecosystems. The vegetation model SDGVM plays a central role in this research and we will consider emulating its ‘daily’ version (Lomas et al., 2002). Vegetation models of its kind can be used to predict possible long-term responses of

Conclusions

The paper focuses on two intertwined problems. First, we develop the multi-output emulator as an extension of theoretical results already established in the field of Bayesian emulation of a single output. Second, we consider the use of the multi-output emulator to model the time series output from a dynamic simulator, contrasting it with two alternative approaches: one using multiple single-output emulators and the other based on treating time as an auxiliary input. A discussion of the

Acknowledgements

This research was supported by the Natural Environment Research Council through its funding for the Centre for Terrestrial Carbon Dynamics. The authors also wish to gratefully acknowledge Dr. Marc C. Kennedy for providing the data utilised in the application and two anonymous referees for their thoughtful comments on an earlier draft of the paper.

References (30)

K. Campbell et al.
Sensitivity analysis when model outputs are functions
Reliability Engineering and System Safety
(2006)
M.E. Johnson et al.
Minimax and maximin designs
Journal of Statistical Planning and Inference
(1990)
M.D. Morris et al.
Exploratory designs for computational experiments
Journal of Statistical Planning and Inference
(1995)
M.J. Bayarri et al.
Computer model validation with functional output
The Annals of Statistics
(2007)
J.O. Berger et al.
Objective Bayesian analysis of spatially correlated data
Journal of the American Statistical Association
(2001)
T. Chang et al.
Reference prior for the orbit in a group model
The Annals of Statistics
(1990)
Craig, P.S., Goldstein, M., Seheult, A.H., Smith, J.A., 1996. Bayes linear strategies for matching hydrocarbon...
Gamerman, D. (Ed.), 1997. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman & Hall,...
A.E. Gelfand et al.
Nonstationary multivariate process modeling through spatially varying coregionalization
TEST
(2004)
Haylock, R.G., O’Hagan, A., 1996. On inference for outputs of computationally expensive algorithms with uncertainty on...
D. Higdon et al.
Computer model calibration using high-dimensional output
Journal of the American Statistical Association
(2008)
M.C. Kennedy et al.
Bayesian calibration of computer models
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
(2001)
M.C. Kennedy et al.
Bayesian analysis of computer code outputs
Koehler, J.R., Owen, A.B., 1996. Computer experiments. In: Design and Analysis of Experiments of Handbook of Statist,...
F. Liu et al.
A dynamic modelling strategy for Bayesian computer model emulation
Bayesian Analysis
(2009)
