
Statistica Sinica 30 (2020), 1697-1722



Wei-Yin Loh¹, Qiong Zhang², Wenwen Zhang³ and Peigen Zhou¹

¹University of Wisconsin, Madison, ²Clemson University and ³Takeda

Abstract: Missing data are a major hindrance to statistical analysis because standard methods require the missing values to be imputed first. AMELIA and MICE are two popular imputation methods, but their effectiveness has not been scrutinized in complex data. Loh et al. (2019) showed that these imputation methods are impractical in an application where the number of variables and the quantity of missing values are large. They proposed a GUIDE piecewise-constant regression tree as an alternative because it does not require imputation and can handle large numbers of variables. Little (2020) questioned the generality of their conclusions as well as the assumptions behind machine learning methods. This article responds to the criticisms and uses a large simulation experiment to compare the parameter estimation bias of GUIDE and MICE, and the prediction accuracy of several model-based and machine learning regression algorithms after GUIDE and MICE imputation.

Key words and phrases: Machine learning, missing at random, prediction accuracy, regression forest.
