Wei-Yin Loh, Qiong Zhang, Wenwen Zhang and Peigen Zhou (2020). MISSING DATA, IMPUTATION AND REGRESSION TREES. Vol. 30, No. 4, 1697-1722.

Abstract: Missing data are a major hindrance to statistical analysis because standard methods require the missing values to be imputed first. AMELIA and MICE are two popular imputation methods, but their effectiveness has not been scrutinized in complex data. Loh et al. (2019) showed that these imputation methods are impractical in an application where the number of variables and the quantity of missing values are large. They proposed a GUIDE piecewise-constant regression tree as an alternative as it does not require imputation and can handle large numbers of variables. Little (2020) questioned the generality of their conclusions as well as the assumptions behind machine learning methods. This article responds to the criticisms and uses a large simulation experiment to compare the parameter estimation bias of GUIDE and MICE and the prediction accuracy of several model-based and machine learning regression algorithms after GUIDE and MICE imputation.

Key words and phrases: Machine learning, missing at random, prediction accuracy, regression forest.