Hanwen Huang (2017). REGRESSION IN HETEROGENEOUS PROBLEMS. Vol. 27, No. 1, 71-88.

Abstract: We develop a new framework for modeling the impact of the sub-cluster structure of data on regression. The proposed framework is specifically designed for situations where the sample is not homogeneous, in the sense that the response variables in different regions of the covariate space are generated through different mechanisms. In such situations, the sample can be viewed as a composition of multiple data sets, each of which is homogeneous. Traditional linear and general nonlinear methods may perform poorly because it is hard to find a single model that fits multiple data sets simultaneously. The proposed method is flexible enough to ensure that data generated from different regions can be modeled using different functions. The key step of our method incorporates the k-means clustering idea into the traditional regression framework so that the regression and clustering tasks can be performed simultaneously. The k-means clustering algorithm is extended to solve the optimization problem in our model, which groups together samples with similar response-covariate relationships. General conditions under which the estimation of the model parameters is consistent are investigated. By adding appropriate penalty terms, the proposed model can conduct variable selection to eliminate uninformative variables. The conditions under which the proposed model can achieve asymptotic selection consistency are also studied. The effectiveness of the proposed method is demonstrated through simulations and real data analysis.
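The idea of performing regression and clustering simultaneously can be illustrated with a generic k-means-style alternating scheme. The sketch below is not the paper's estimator: it assumes a plain clusterwise least-squares objective, the sum over samples of the smallest squared residual across k linear fits, with no penalty terms. The function name `clusterwise_regression`, its parameters, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def clusterwise_regression(X, y, k, n_iter=50, seed=0):
    """Illustrative k-means-style alternation (an assumed, simplified objective).

    Alternates between (1) refitting one least-squares line per cluster and
    (2) reassigning each sample to the cluster whose fit has the smallest
    squared residual for that sample.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])      # design matrix with intercept
    p = Xd.shape[1]
    labels = rng.permutation(n) % k            # balanced random initial partition
    coefs = np.zeros((k, p))
    for _ in range(n_iter):
        # Regression step: refit each cluster that has enough samples.
        for j in range(k):
            mask = labels == j
            if mask.sum() >= p:
                coefs[j] = np.linalg.lstsq(Xd[mask], y[mask], rcond=None)[0]
        # Clustering step: assign each sample to its best-fitting model.
        resid = (y[:, None] - Xd @ coefs.T) ** 2   # shape (n, k)
        new_labels = resid.argmin(axis=1)
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break
    sse = resid[np.arange(n), labels].sum()    # objective at the final assignment
    return labels, coefs, sse

# Synthetic heterogeneous sample: two regions of covariate space, each with
# its own linear mechanism (values chosen arbitrarily for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=200)
y = np.where(x < 0, 2.0 + 3.0 * x, -1.0 - 2.0 * x) + 0.1 * rng.normal(size=200)

labels, coefs, sse = clusterwise_regression(x.reshape(-1, 1), y, k=2)

# Baseline: a single global linear fit to the pooled sample.
Xd_all = np.column_stack([np.ones_like(x), x])
beta_global = np.linalg.lstsq(Xd_all, y, rcond=None)[0]
sse_global = ((y - Xd_all @ beta_global) ** 2).sum()
```

Because each step can only lower the objective, `sse` is never larger than `sse_global`. The alternation may still stop at a local optimum, so in practice several random starts would be compared. The assignment step here uses residuals only and ignores where a sample lies in covariate space, which is a simplification relative to the setting described in the abstract.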

Key words and phrases: Asymptotic consistency, heterogeneous problem, k-means clustering, LASSO, regression.