Statistica Sinica 33 (2023), 893-918
Chih-Li Sung, Benjamin Haaland, Youngdeok Hwang and Siyuan Lu
Abstract: The Gaussian process is one of the most important approaches for emulating computer simulations. However, the stationarity assumption common to Gaussian process emulation and the computational intractability for large-scale data sets limit accuracy and feasibility in practice. In this article, we propose a clustered Gaussian process model that simultaneously segments the input data into multiple clusters and fits a Gaussian process model in each cluster. The model parameters and the clusters are learned through the efficient stochastic expectation-maximization, which allows for emulations for large-scale computer simulations. Importantly, the proposed method provides valuable model interpretability by identifying clusters, which reveal hidden patterns in the input-output relationship. The number of clusters, which controls the bias-variance trade-off, is efficiently selected using cross-validation to ensure accurate predictions. In our simulations and a real application to solar irradiance emulation, our proposed method has smaller mean squared errors than its main competitors, with competitive computation time, and provides valuable insights from the data by discovering clusters. An R package for the proposed methodology is provided in an open repository.
Key words and phrases: Large-scale data, mixture models, nonstationarity, solar irradiance emulation, uncertainty quantification.