Statistica Sinica 31 (2021), 1-28

ACCOUNTING FOR FACTOR VARIABLES IN BIG DATA REGRESSION

Tonglin Zhang and Baijian Yang

Purdue University

Abstract: Continuous and factor explanatory variables are both important in linear regression. To fit a linear model with factor variables, the traditional implementation of the least squares approach defines a set of dummy variables. However, this approach is difficult to apply to big data because a factor variable can inflate the size of the design matrix significantly, even if the number of factor levels is only moderately large. By treating the factor variable as an index, this study proposes a new approach, called the index least squares approach, to overcome this difficulty. Combined with the technique of scanning data by rows, the index least squares approach can provide exact solutions simultaneously to a group of linear models with factor variables, and it avoids the memory barrier caused by the size of the design matrix. Because the memory required does not depend on the number of observations, the index least squares approach can be used even when a massive data set is hundreds of times larger than the memory available to the computing system.

Key words and phrases: Big data, factor variables, index array of sufficient statistics, index least squares, parallel or cluster computation, scanning data by rows.
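
The idea summarized in the abstract can be illustrated with a minimal sketch. It assumes the simplest case: one factor variable with additive level effects (one intercept per level) and slopes shared across levels. The function names `scan_rows` and `solve_indexed` are hypothetical, and the code is an illustration of indexing sufficient statistics by factor level, not necessarily the authors' implementation. The data are read one row at a time, the factor level serves as an array index, and no dummy-variable design matrix is ever formed.

```python
import numpy as np

def scan_rows(rows, n_levels, p):
    """Accumulate sufficient statistics in one pass over (level, x, y) rows.

    Memory use is O(n_levels * p + p * p), independent of the number of rows.
    """
    n = np.zeros(n_levels)         # observations per factor level
    sx = np.zeros((n_levels, p))   # per-level sum of x
    sy = np.zeros(n_levels)        # per-level sum of y
    sxx = np.zeros((p, p))         # overall sum of x x^T
    sxy = np.zeros(p)              # overall sum of x * y
    for level, x, y in rows:
        x = np.asarray(x, dtype=float)
        n[level] += 1              # the factor level is used as an index
        sx[level] += x
        sy[level] += y
        sxx += np.outer(x, x)
        sxy += x * y
    return n, sx, sy, sxx, sxy

def solve_indexed(n, sx, sy, sxx, sxy):
    """Least squares fit of y = alpha[level] + x @ beta from the statistics."""
    seen = n > 0                               # levels that occur in the data
    xbar = sx[seen] / n[seen, None]            # per-level mean of x
    ybar = sy[seen] / n[seen]                  # per-level mean of y
    # Within-level cross-products: subtract sum_k sx_k sx_k^T / n_k, etc.
    wxx = sxx - sx[seen].T @ xbar
    wxy = sxy - sx[seen].T @ ybar
    beta = np.linalg.solve(wxx, wxy)           # shared slope coefficients
    alpha = np.full(n.shape, np.nan)           # NaN for levels never observed
    alpha[seen] = ybar - xbar @ beta           # one intercept per level
    return alpha, beta
```

Under these assumptions the result coincides with ordinary least squares on the dummy-variable design matrix, because the per-level intercepts can be profiled out using the level means. In practice, `rows` could be any iterator that streams records from disk, and the statistics from separate chunks could be added together before solving.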