Statistica Sinica 29 (2019), 1607-1630
Abstract: Cross-validation (CV) methods are popular for selecting the tuning parameter in high-dimensional variable selection problems. We show that a misalignment of CV is one possible reason for its over-selection behavior. To fix this issue, we propose using a version of leave-n_v-out CV (CV(n_v)) to select the optimal model from a restricted candidate model set for high-dimensional generalized linear models. By using the same candidate model sequence and a proper order for the construction sample size n_c in each CV split, CV(n_v) avoids potential difficulties in establishing its theoretical properties. CV(n_v) is shown to exhibit the restricted model-selection consistency property under mild conditions. Extensive simulations and a real-data analysis support the theoretical results and demonstrate the performance of CV(n_v) in terms of both model selection and prediction.
Key words and phrases: Generalized linear models, leave-n_v-out cross-validation, restricted maximum likelihood estimators, restricted model-selection consistency, variable selection.
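The abstract's leave-n_v-out scheme can be illustrated with a minimal sketch: fit each model in a fixed candidate sequence on a construction sample of size n_c, score it on the held-out n_v = n - n_c observations, and average over splits. This is an illustrative simplification, not the paper's procedure: it uses a Gaussian linear model instead of a general GLM, a hypothetical nested candidate set, and random splits to approximate CV(n_v).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: only the first 2 of 6 predictors matter.
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Restricted candidate model set (assumed for illustration):
# nested models using the first k predictors, k = 1, ..., p.
candidates = [list(range(k)) for k in range(1, p + 1)]

def leave_nv_out_cv(X, y, candidates, n_c, n_splits=50):
    """Average validation MSE per candidate model.

    Each split uses a construction sample of size n_c and a
    validation sample of size n_v = n - n_c; random splits
    approximate the full leave-n_v-out average.
    """
    n = len(y)
    scores = np.zeros(len(candidates))
    for _ in range(n_splits):
        perm = rng.permutation(n)
        c_idx, v_idx = perm[:n_c], perm[n_c:]
        # Fit every candidate on the same construction sample.
        for j, cols in enumerate(candidates):
            beta, *_ = np.linalg.lstsq(X[c_idx][:, cols], y[c_idx],
                                       rcond=None)
            resid = y[v_idx] - X[v_idx][:, cols] @ beta
            scores[j] += np.mean(resid ** 2)
    return scores / n_splits

# A small construction sample (n_c << n) mimics the large-n_v regime
# that the abstract associates with avoiding over-selection.
scores = leave_nv_out_cv(X, y, candidates, n_c=30)
best = candidates[int(np.argmin(scores))]
print(best)
```

With a dominant validation fraction, the averaged validation error tends to penalize candidate models that include noise predictors, which is the over-selection behavior the paper targets.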