 
 
 
 
Statistica Sinica 34 (2024), 1325-1345
Abstract: In this article, we discuss response variable selection and the subsequent estimation of the regression coefficients in multivariate linear regression. Because of the asymmetric roles of the predictors and responses in a regression, response variable selection differs markedly from the usual predictor variable selection. When a response is inferred to have a coefficient of zero, it should not simply be removed from subsequent estimation. Instead, we should analyze its relationship with the responses that have nonzero coefficients, which we call dynamic responses. If it is correlated with the dynamic responses, given all other responses, it should be retained to improve the estimation efficiency of the nonzero coefficients, as an ancillary statistic. Otherwise, it can be removed from further inference (leading to significant resource savings in high-dimensional settings), and we call it a static response. Therefore, we can classify responses into three categories: dynamic responses, ancillary responses, and static responses. We derive an algorithm to identify these response variables, and provide an estimator of the regression coefficients based on the selection result. Applications using synthetic and real data illustrate the efficacy of the proposed response variable selection procedure in both low- and high-dimensional settings. Lastly, we establish the consistency of the variable selection procedures and the asymptotic properties of the estimators for both the large-sample setting and the high-dimensional small-sample setting.
Key words and phrases: Group sparsity, high-dimensional data, oracle property, response variable selection.
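
The three-way classification described in the abstract can be illustrated with a minimal sketch. It is not the selection algorithm derived in the article: it assumes the true coefficient matrix and the precision (inverse covariance) matrix of the responses are known, and it reads "correlated with the dynamic responses, given all other responses" as a nonzero off-diagonal precision entry, which is equivalent to a nonzero partial correlation. The function name `classify_responses`, the arguments `B` and `Omega`, and the tolerance `tol` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def classify_responses(B, Omega, tol=1e-8):
    """Oracle-style illustration (assumed inputs, not the article's algorithm).

    B     : (r, p) coefficient matrix, one row per response.
    Omega : (r, r) precision matrix of the responses given the predictors.
    Returns three lists of response indices: dynamic, ancillary, static.
    """
    B = np.asarray(B, dtype=float)
    Omega = np.asarray(Omega, dtype=float)
    r = B.shape[0]

    # Dynamic responses: rows of B with at least one nonzero coefficient.
    row_norms = np.linalg.norm(B, axis=1)
    dynamic = [j for j in range(r) if row_norms[j] > tol]

    ancillary, static = [], []
    for j in range(r):
        if j in dynamic:
            continue
        # A nonzero precision entry Omega[j, k] means response j has a
        # nonzero partial correlation with dynamic response k, given all
        # other responses.
        linked = any(abs(Omega[j, k]) > tol for k in dynamic)
        (ancillary if linked else static).append(j)
    return dynamic, ancillary, static


# Toy example with three responses and two predictors (made-up numbers).
B = [[1.0, -0.5],
     [0.0, 0.0],
     [0.0, 0.0]]
Omega = [[2.0, 0.8, 0.0],
         [0.8, 2.0, 0.0],
         [0.0, 0.0, 1.0]]
dynamic, ancillary, static = classify_responses(B, Omega)
print(dynamic, ancillary, static)
```

In this toy setup, response 0 is dynamic, response 1 has zero coefficients but is partially correlated with response 0 and is therefore kept as ancillary, and response 2 is static. In practice both matrices are unknown and must be estimated, which is the problem the article addresses.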
 
 
 
