Abstract
Subsampling techniques can effectively reduce the computational costs of processing big data. Practical subsampling plans typically involve an initial uniform sampling step followed by a refined sampling step.
Subsample-based big data inferences are
generally built on inverse probability weighting (IPW), which may be unstable
and cannot incorporate auxiliary information. In this paper, we consider a two-step Poisson sampling plan, which combines an initial uniform sampling with a second
Poisson sampling. Under this sampling plan, we propose an empirical likelihood
weighting (ELW) estimation approach to an M-estimation parameter, and then
construct a nearly optimal two-step Poisson sampling plan based on the ELW
method to improve on the estimation efficiency of IPW-based optimal subsampling methods.
Further, we derive methods for determining the smallest sample sizes with which
the proposed sampling-and-estimation method produces estimators of guaranteed
precision. Our ELW method overcomes the instability of IPW by circumventing
the use of inverse probabilities, and utilizes auxiliary information including the
size and certain sample moments of big data. We show that the proposed ELW
method produces more efficient estimators than IPW, leading to more efficient
optimal sampling plans and more economical sample sizes for a prespecified
estimation precision. These advantages are confirmed through real data based
simulations.
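To make the sampling plan concrete, the following is a minimal sketch of a two-step Poisson subsampling procedure: a uniform pilot sample, followed by Poisson sampling with inclusion probabilities driven by pilot-based importance scores. The score function here (absolute deviation from the pilot mean) is a hypothetical stand-in, not the paper's optimal plan, and `two_step_poisson_subsample` is an illustrative name.

```python
import numpy as np

def two_step_poisson_subsample(scores_fn, data, n0, n1, rng=None):
    """Sketch of two-step Poisson subsampling.

    Step 1: uniform (Bernoulli) pilot subsample of expected size n0.
    Step 2: Poisson sampling with inclusion probabilities proportional
            to pilot-based importance scores, capped at 1 so that they
            remain valid probabilities.
    """
    rng = np.random.default_rng(rng)
    N = len(data)

    # Step 1: each unit enters the pilot independently with probability n0 / N.
    pilot_mask = rng.random(N) < n0 / N
    pilot = data[pilot_mask]

    # Importance scores for the full data, computed from the pilot sample
    # (a surrogate for, e.g., gradient norms of an M-estimation criterion).
    scores = scores_fn(pilot, data)

    # Step 2: Poisson sampling; expected second-step size is at most n1.
    probs = np.minimum(1.0, n1 * scores / scores.sum())
    second_mask = rng.random(N) < probs
    return pilot, data[second_mask], probs[second_mask]

# Toy usage: scores proportional to absolute deviation from the pilot mean.
def scores_fn(pilot, data):
    return np.abs(data - pilot.mean()) + 1e-8

data = np.random.default_rng(0).normal(size=10000)
pilot, sub, probs = two_step_poisson_subsample(scores_fn, data, n0=200, n1=500, rng=1)
```

Because each unit is retained independently, the realized subsample size is random with expected value at most `n1`; the inclusion probabilities returned alongside the subsample are what a downstream weighting estimator (IPW or ELW) would consume.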
Key words and phrases: Big data; Two-step Poisson sampling; Empirical likelihood.
Information
| Preprint No. | SS-2023-0274 |
|---|---|
| Manuscript ID | SS-2023-0274 |
| Authors | Yan Fan, Yang Liu, Yukun Liu, Jing Qin |
| Corresponding Author | Yukun Liu |
| Email | ykliu@sfs.ecnu.edu.cn |
Acknowledgments
This research is supported by the National Key R&D Program of
China (2021YFA1000100 and 2021YFA1000101), the National Natural
Science Foundation of China (11971300, 12101239, 12171157, 71931004),
the Natural Science Foundation of Shanghai (19ZR1420900), the China
Postdoctoral Science Foundation (Grant 2020M681220), and the 111
Project (B14019). The first two authors contributed equally to this paper.
The second and third authors are co-corresponding authors.
Supplementary Materials
The online supplementary material contains the proofs of Lemma 1 and
Theorems 1–2, and additional simulation results.