Abstract

Subsampling techniques can effectively reduce the computational cost of processing big data. Practical subsampling plans typically involve an initial uniform sampling followed by a refined sampling. Subsample-based big data inferences are generally built on inverse probability weighting (IPW), which may be unstable and cannot incorporate auxiliary information. In this paper, we consider a two-step Poisson sampling plan, which combines an initial uniform sampling with a second Poisson sampling. Under this sampling plan, we propose an empirical likelihood weighting (ELW) estimation approach for an M-estimation parameter, and then construct a nearly optimal two-step Poisson sampling plan based on the ELW method that improves on the estimation efficiency of IPW-based optimal subsampling. Further, we derive methods for determining the smallest sample sizes with which the proposed sampling-and-estimation method produces estimators of guaranteed precision. Our ELW method overcomes the instability of IPW by circumventing the use of inverse probabilities, and utilizes auxiliary information such as the size and certain sample moments of the big data. We show that the proposed ELW method produces more efficient estimators than IPW, leading to more efficient optimal sampling plans and more economical sample sizes for a prespecified estimation precision. These advantages are confirmed through real-data-based simulations.

Key words and phrases: Big data; Two-step Poisson sampling; Empirical likelihood.
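To make the two-step structure described in the abstract concrete, the following Python sketch illustrates one way such a plan could be carried out for a simple least-squares (M-estimation) problem: a uniform pilot subsample gives a rough parameter estimate, a second Poisson subsampling step then draws observations independently with data-dependent inclusion probabilities, and an IPW-type weighted estimator is computed on the resulting subsample. This is only an illustrative sketch, not the paper's algorithm: the pilot-based probabilities, the sizes n0 and n1, and the final weighted estimator are hypothetical placeholders, and the ELW estimator itself is not shown.

    # Illustrative two-step Poisson subsampling sketch (not the authors' method).
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic "big data": N observations from a linear model y = X @ beta + noise.
    N, p = 100_000, 5
    X = rng.normal(size=(N, p))
    beta_true = np.arange(1, p + 1, dtype=float)
    y = X @ beta_true + rng.normal(size=N)

    # Step 1: uniform pilot subsample of size n0, used to get a rough estimate.
    n0 = 500
    pilot_idx = rng.choice(N, size=n0, replace=False)
    beta_pilot, *_ = np.linalg.lstsq(X[pilot_idx], y[pilot_idx], rcond=None)

    # Step 2: Poisson subsampling with data-dependent inclusion probabilities.
    # As a stand-in for an "optimal" plan, take each probability proportional to
    # the size of that observation's estimating-function contribution at
    # beta_pilot, capped at 1 and scaled to an expected second-step size of n1.
    n1 = 2_000
    scores = np.abs(y - X @ beta_pilot) * np.linalg.norm(X, axis=1)
    pi = np.minimum(1.0, n1 * scores / scores.sum())
    take = rng.random(N) < pi          # independent Bernoulli inclusion decisions
    print("realized second-step sample size:", int(take.sum()))

    # IPW-style estimator on the second-step subsample: each selected observation
    # is weighted by 1 / pi, which can be unstable when some pi are very small.
    w = 1.0 / pi[take]
    Xs, ys = X[take], y[take]
    beta_ipw = np.linalg.solve(Xs.T @ (w[:, None] * Xs), Xs.T @ (w * ys))
    print("IPW estimate:", np.round(beta_ipw, 3))

The last two lines display exactly the 1/pi weighting whose potential instability under small inclusion probabilities motivates the ELW alternative discussed in the abstract.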

Information

Preprint No.: SS-2023-0274
Manuscript ID: SS-2023-0274
Complete Authors: Yan Fan, Yang Liu, Yukun Liu, Jing Qin
Corresponding Authors: Yukun Liu
Emails: ykliu@sfs.ecnu.edu.cn


Acknowledgments

This research is supported by the National Key R&D Program of China (2021YFA1000100 and 2021YFA1000101), the National Natural Science Foundation of China (11971300, 12101239, 12171157, 71931004), the Natural Science Foundation of Shanghai (19ZR1420900), the China Postdoctoral Science Foundation (Grant 2020M681220), and the 111 Project (B14019). The first two authors contributed equally to this paper. The second and third authors are co-corresponding authors.

Supplementary Materials

The online supplementary material contains the proofs of Lemma 1 and Theorems 1–2, and additional simulation results.

