Abstract
Supervised learning under measurement constraints is a common challenge
in statistical and machine learning. In many applications, despite extensive
design points, acquiring responses for all points is often impractical due to resource limitations. Subsampling algorithms offer a solution by selecting a subset
from the design points for observing the response. Existing subsampling methods primarily assume numerical predictors, neglecting the prevalent occurrence
of big data with categorical predictors across various disciplines. This paper proposes a novel balanced subsampling approach tailored for data with categorical
predictors. A balanced subsample significantly reduces the cost of observing the
response and possesses three desired merits. First, it is nonsingular and, therefore, allows linear regression with all dummy variables encoded from categorical
predictors. Second, it offers optimal parameter estimation by minimizing the
generalized variance of the estimated parameters. Third, it allows robust prediction in the sense of minimizing the worst-case prediction error. We demonstrate
the superiority of balanced subsampling over existing methods through extensive
simulation studies and a real-world application.
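As a rough illustration of the first two merits, the sketch below compares three toy subsamples of the same size under dummy coding. It is not the proposed subsampling algorithm. The helper names (`dummy_design`, `generalized_variance`), the two toy predictors, and the main-effects model with level 0 as the reference level are all assumptions made for this example. Generalized variance is taken here, up to the error-variance factor, as the determinant of the inverse information matrix, $\det\{(X^\top X)^{-1}\}$.

```python
import numpy as np

def dummy_design(levels, n_levels):
    """Build a model matrix: intercept plus (q - 1) dummy columns per predictor.

    `levels` is an (n, p) array of integer-coded categories; `n_levels[j]` is
    the number of levels q of predictor j. Level 0 is the reference level.
    (Hypothetical helper, for illustration only.)
    """
    levels = np.asarray(levels)
    cols = [np.ones(len(levels))]
    for j, q in enumerate(n_levels):
        for level in range(1, q):
            cols.append((levels[:, j] == level).astype(float))
    return np.column_stack(cols)

def generalized_variance(levels, n_levels):
    """Return det((X^T X)^{-1}), or infinity if X^T X is singular."""
    X = dummy_design(levels, n_levels)
    M = X.T @ X
    if np.linalg.matrix_rank(M) < M.shape[0]:
        return np.inf
    return 1.0 / np.linalg.det(M)

# Two toy predictors with 2 and 3 levels; every subsample has n = 12 points.
n_levels = (2, 3)

# Balanced: each of the 6 level combinations appears exactly twice.
balanced = [(a, b) for a in range(2) for b in range(3)] * 2

# Imbalanced: one combination dominates, the others appear once each.
imbalanced = [(0, 0)] * 7 + [(0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

# Singular: level 2 of the second predictor is never selected, so its
# dummy column is all zeros and the model cannot be fitted.
singular = [(a, b) for a in range(2) for b in range(2)] * 3

gv_balanced = generalized_variance(balanced, n_levels)
gv_imbalanced = generalized_variance(imbalanced, n_levels)
gv_singular = generalized_variance(singular, n_levels)

print("balanced  :", gv_balanced)
print("imbalanced:", gv_imbalanced)
print("singular  :", gv_singular)
```

In this toy setting the balanced subsample gives the smallest generalized variance of the three, and the subsample that misses a level yields a singular information matrix.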
Information
| Preprint No. | SS-2023-0434 |
|---|---|
| Manuscript ID | SS-2023-0434 |
| Complete Authors | Lin Wang |
| Corresponding Authors | Lin Wang |
| Emails | linwang@purdue.edu |
Acknowledgments
Wang is supported by the U.S. National Science Foundation (DMS-2413741)
and the Central Indiana Corporate Partnership AnalytiXIN Initiative.
Supplementary Materials
The online supplementary material provides proofs of the theoretical results
and discusses the computational complexity of the proposed algorithm.