Abstract
Under measurement constraints, where covariates are always accessible but obtaining
responses is costly or restricted, we propose a unified response-free cluster subsampling
framework for massive longitudinal data, focusing on two aspects. First, when the dimension of covariates is fixed and small, to account for within-subject correlation, we consider
cluster subsampling and formulate a response-free weighted quasi-score to obtain the subsample estimator with consistency and asymptotic normality. An optimal cluster subsampling
scheme is obtained by optimizing a general criterion that encompasses both A-optimality and
L-optimality criteria. To enhance the estimation efficiency, a response-free unweighted estimator is subsequently constructed based on the optimal subsample and a two-step algorithm
is devised to facilitate practical implementation. Second, when the dimension of covariates is
comparable to or exceeds the subsample size, we further construct a response-free weighted
quasi decorrelated score for the preconceived low-dimensional parameter of main interest and
derive the optimal subsampling schemes. The resulting unweighted estimator and a two-step
algorithm are also proposed. Extensive simulation studies, along with a real-data application,
are conducted to empirically demonstrate the effectiveness of the proposed methods.
Author's ORCID: Lei Wang, https://orcid.org/0000-0003-2530-883X
Key words and phrases: A-optimality, decorrelated score, generalized linear models, longitudinal data, Poisson sampling
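As a rough illustration of the response-free Poisson cluster subsampling idea summarized in the abstract, the following minimal sketch selects whole clusters independently, using inclusion probabilities computed from covariates only, and attaches inverse-probability weights for a weighted estimating equation. It is not the optimal scheme derived in the paper: the function name `poisson_cluster_subsample` and the norm-based score are hypothetical placeholders standing in for the A- or L-optimal probabilities, and balanced clusters are assumed.

```python
import numpy as np

def poisson_cluster_subsample(X, r, rng):
    """Poisson subsampling of whole clusters using covariates only.

    X   : array of shape (n, m, p), covariates for n clusters of size m
    r   : expected subsample size (number of clusters)
    rng : numpy.random.Generator
    Returns the selected cluster indices and their inverse-probability weights.
    """
    n = X.shape[0]
    # Hypothetical response-free score: Frobenius norm of each cluster's
    # covariate matrix. This is a placeholder, not the paper's optimal criterion.
    scores = np.linalg.norm(X.reshape(n, -1), axis=1)
    # Inclusion probabilities proportional to the score, truncated at 1,
    # so the expected number of selected clusters is at most r.
    probs = np.minimum(1.0, r * scores / scores.sum())
    # Poisson sampling: each cluster is kept independently of the others.
    keep = rng.random(n) < probs
    idx = np.flatnonzero(keep)
    # Inverse-probability weights for a weighted quasi-score.
    weights = 1.0 / probs[idx]
    return idx, weights

# Usage on simulated covariates; responses would be measured only for idx.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4, 3))
idx, weights = poisson_cluster_subsample(X, r=100, rng=rng)
```

Sampling at the cluster level keeps all repeated measurements of a selected subject together, which is what allows within-subject correlation to be modeled on the subsample. Because selection never touches the responses, they need to be collected only for the clusters in `idx`.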
Information
| Preprint No. | SS-2026-0022 |
|---|---|
| Manuscript ID | SS-2026-0022 |
| Complete Authors | Junhao Shan, Lei Wang, Haiying Wang |
| Corresponding Authors | Lei Wang |
| Emails | lwangstat@nankai.edu.cn |
Acknowledgments
We would like to extend our sincere gratitude to the Editor, an Associate Editor and
two anonymous referees for their insightful comments and constructive suggestions,
which have significantly enhanced the quality of this paper.
Lei Wang was supported by the National Natural Science Foundation of China (Grant No. 12271272).
HaiYing Wang was supported by the NSF (Grant No. 2105571) and UConn CLAS
Research Funding in Academic Themes. The corresponding author is Lei Wang.
Supplementary Materials
The Supplementary Material contains the unbalanced cluster-size scenario, proofs of
theorems and additional simulation results.
Junhao Shan School of Statistics and Data Science, KLMDASR, LEBPS and LPMC, Nankai University,
Lei Wang School of Statistics and Data Science, KLMDASR, LEBPS and LPMC, Nankai University, Tian-