Abstract
Data removal and data forgetting have become standard user requests since the enactment of regulations such as the GDPR and the CCPA. In this paper, we study the data removal problem for statistical inference with the lasso in an online-forgetting framework, where new data batches arrive sequentially and the earliest batch is removed to satisfy a constant memory constraint. We propose a new algorithm, dOnFL, which has several appealing properties: it is computationally efficient compared with retraining the model, and it avoids accessing the full data history by using only the current batch and the summary statistics of the historical batches within the available time frame. In particular, we develop an efficient debiasing technique to reduce the bias induced by the ℓ1 penalty of the lasso. Theoretically, we establish the asymptotic normality of the proposed estimator as the total sample size of the available data batches goes to infinity. Simulation and real-data experiments demonstrate the merits of the proposed algorithm.
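To make the constant-memory constraint concrete, the following is a minimal sketch of the sliding-window bookkeeping that the online-forgetting framework suggests; the class and method names are illustrative assumptions, not the paper's dOnFL implementation. Each batch is reduced to its summary statistics, so admitting a new batch and forgetting the earliest one are O(p^2) updates that never revisit raw historical data.

```python
# Illustrative sketch only (not the paper's dOnFL): constant-memory
# bookkeeping for a sliding window of data batches.
from collections import deque
import numpy as np

class SlidingWindowStats:
    def __init__(self, p, max_batches):
        self.batches = deque()           # per-batch (X'X, X'y, n_b)
        self.max_batches = max_batches   # memory budget, in batches
        self.gram = np.zeros((p, p))     # running sum of X_b' X_b
        self.xty = np.zeros(p)           # running sum of X_b' y_b
        self.n = 0                       # total samples in the window

    def add_batch(self, X, y):
        g, v = X.T @ X, X.T @ y
        self.batches.append((g, v, len(y)))
        self.gram += g
        self.xty += v
        self.n += len(y)
        if len(self.batches) > self.max_batches:
            # Forget the earliest batch: subtract its summaries,
            # never touching its raw rows again.
            g_old, v_old, n_old = self.batches.popleft()
            self.gram -= g_old
            self.xty -= v_old
            self.n -= n_old
```

Since the lasso objective depends on the data only through X⊤X, X⊤y, and the sample size (up to a constant in the coefficients), the estimator can be refit on the current window from this bookkeeping alone, without storing any forgotten batch.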
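The debiasing step can likewise be illustrated with the generic one-step correction for the lasso, in the spirit of the desparsified lasso of Zhang and Zhang and of Javanmard and Montanari. The sketch below is a stand-in under stated assumptions, not the paper's method: it uses a plain inverse of the sample covariance as the correction matrix, which requires n > p, whereas dOnFL constructs its correction efficiently from summary statistics.

```python
# Illustrative one-step debiasing of the lasso (not the dOnFL method).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 400, 20, 3                      # low-dimensional toy sizes
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0                            # s true nonzero coefficients
y = X @ beta + rng.standard_normal(n)

beta_hat = Lasso(alpha=0.1).fit(X, y).coef_   # biased lasso estimate

# One-step correction: beta_d = beta_hat + M X'(y - X beta_hat) / n,
# where M approximates the inverse sample covariance. A plain inverse
# is used here (valid since n > p); high-dimensional settings need a
# regularized surrogate such as the nodewise lasso.
Sigma_hat = X.T @ X / n
M = np.linalg.inv(Sigma_hat)
beta_d = beta_hat + M @ X.T @ (y - X @ beta_hat) / n

# Plug-in 95% confidence intervals for each coordinate.
sigma2 = np.mean((y - X @ beta_d) ** 2)       # residual variance estimate
se = np.sqrt(sigma2 * np.diag(M @ Sigma_hat @ M.T) / n)
ci = np.stack([beta_d - 1.96 * se, beta_d + 1.96 * se], axis=1)
print(ci[:s])                                  # intervals for true signals
```

The 1.96 multiplier is licensed by the asymptotic normality of the debiased coordinates, which is the type of guarantee the paper establishes for its proposed estimator.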
Information
| Field | Value |
|---|---|
| Manuscript ID | SS-2024-0335 |
| Authors | Xiao Guo, Xu Zhang, Hai Zhang |
| Corresponding Author | Hai Zhang |
| Email | zhanghai@nwu.edu.cn |
Xiao Guo, School of Mathematics, Northwest University, Xi'an, China

Xu Zhang, School of Mathematics, Northwest University, Xi'an, China

Hai Zhang, School of Mathematics, Northwest University, Xi'an, China
Acknowledgments
This research was partially supported by the National Natural Science
Foundation of China (Nos. 12301384, 12326615) and the Major Key Project
of PCL under Grant PCL2024A06. The authors are grateful to the editor, associate editor, and two anonymous reviewers for their helpful and
insightful comments.
Supplementary Materials
The supplementary material contains the proofs of the main theoretical
results, the technical lemmas, further details of the dLASSO algorithm
and its comparison with dOnFL, and additional experimental results.