Statistica Sinica 33 (2023), 1483-1505
Kin Yau Wong and Jiahui Feng
Abstract: Analyses of modern biomedical data are often complicated by missing values. When variables of interest are missing for some subjects, it is desirable to use observed auxiliary variables, which are sometimes high dimensional, to impute or predict the missing values and thereby improve statistical efficiency. Although many methods have been developed for prediction using high-dimensional variables, it is challenging to perform valid inference based on such predicted values. In this study, we develop a test of association between an outcome variable and a potentially missing covariate, where the covariate can be predicted using variables selected from a set of high-dimensional auxiliary variables. We establish the validity of the test under data-driven model-selection procedures. We also demonstrate the validity of the proposed method and its advantages over existing methods using extensive simulation studies and an application to a major cancer genomics study.
Key words and phrases: Association test, integrative analysis, missing data, post-selection inference, variable selection.