Statistica Sinica 27 (2017), 1967-1985
Abstract: The C-statistic, measuring the rank concordance between predictors and outcomes, has become a standard metric of predictive accuracy and is therefore a natural criterion for variable screening and selection. However, as the C-statistic is a step function, its optimization requires brute-force search, prohibiting its direct use in the presence of high-dimensional predictors. We propose a smoothed C-statistic sure screening (C-SS) method for screening ultrahigh-dimensional data, and a penalized C-statistic (PSC) variable selection method for regularized modeling based on the screening results. We show that these procedures form an integrated framework for screening and variable selection: the C-SS possesses the sure screening property, and the PSC possesses the oracle property. Our simulations reveal that, compared to existing procedures, our proposal is more robust and efficient. We apply our procedure to a multiple myeloma study and identify several novel genes that can predict patients' response to treatment.
Key words and phrases: C-statistic, false positive rates, sparsity, ultra-high dimensional predictors, variable selection, variable screening.
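
To illustrate why the empirical C-statistic is a step function and how smoothing removes the discontinuity, the following is a minimal sketch for a binary outcome. It is not the authors' implementation: the function names, the logistic (sigmoid) smoother, and the bandwidth parameter h are assumptions made here for illustration only.

```python
import numpy as np

def c_statistic(score, y):
    """Empirical C-statistic for a binary outcome y in {0, 1}.

    Fraction of (case, control) pairs in which the case has the
    higher score; ties count as 1/2. It is a step function of
    the scores, so it cannot be optimized by gradient methods.
    """
    score = np.asarray(score, dtype=float)
    y = np.asarray(y)
    # All pairwise differences: case score minus control score.
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def smoothed_c_statistic(score, y, h=0.1):
    """Smooth surrogate: the indicator 1{diff > 0} is replaced by
    a sigmoid of diff / h (an assumed choice of smoother).
    As h -> 0 this approaches the empirical C-statistic.
    """
    score = np.asarray(score, dtype=float)
    y = np.asarray(y)
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    # Numerically stable sigmoid: 1 / (1 + exp(-x)) = 0.5 * (1 + tanh(x / 2)).
    return float(np.mean(0.5 * (1.0 + np.tanh(diff / (2.0 * h)))))

# Toy data: two controls (y = 0) and two cases (y = 1).
score = [0.1, 0.4, 0.35, 0.8]
y = [0, 0, 1, 1]
print(c_statistic(score, y))                    # 0.75 (3 of 4 pairs concordant)
print(smoothed_c_statistic(score, y, h=0.001))  # close to 0.75 for small h
```

In a marginal screening setting, one would compute such a statistic for each predictor separately and rank the predictors by it; the smooth version is differentiable in the scores, which is what makes penalized optimization feasible.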