Statistica Sinica 30 (2020), 1685-1696

DISCUSSION

ON ALGORITHMIC AND MODELING APPROACHES

TO IMPUTATION IN LARGE DATA SETS

Roderick J. Little

University of Michigan

Abstract: The machine learning and statistical modeling cultures provide contrasting approaches to statistical analysis. In an article in this journal, Loh, Eltinge, Cho and Li compare these approaches in the setting of imputation of large data sets, recommending machine-learning methods. All the compared methods make assumptions, and I note that these assumptions receive more critical assessment for the model-based approaches than for the tree-based machine-learning methods. I discuss in particular the assumptions about the missing-data mechanism implied by the differing approaches. I question the extent to which general conclusions can be drawn from their simulation study, given the relatively strong performance of the method that discards the incomplete cases, and the limited exploration of the relevant design space.

Key words and phrases: Imputation, missing data, machine learning, nonresponse weighting, tree and forest methods.