Abstract
Data removal and data forgetting have become standard user requests since the enactment of regulations such as the GDPR and the CCPA. In this paper, we study the data removal problem for statistical inference with the lasso in an online-forgetting framework, where new data batches arrive sequentially and the earliest batch is removed to satisfy a constant memory constraint. We propose a new algorithm, dOnFL, which has several appealing properties: it is computationally efficient compared with retraining the model, and it avoids accessing the full data history by using only the current batch and the summary statistics of the historical batches within the available time frame. In particular, we develop an efficient debiasing technique to reduce the bias induced by the ℓ1 penalty of the lasso. Theoretically, we establish the asymptotic normality of the proposed estimator as the total sample size of the available data batches goes to infinity. Simulation and real-data experiments demonstrate the merits of the proposed algorithm.
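To make the constant-memory constraint concrete, the following is a minimal sketch of the sliding-window bookkeeping that the online-forgetting framework suggests; the class and method names are illustrative assumptions, not the paper's dOnFL implementation. Each batch is reduced to its summary statistics, so admitting a new batch and forgetting the earliest one are O(p^2) updates that never revisit raw historical data.

```python
# Illustrative sketch only (not the paper's dOnFL): constant-memory
# bookkeeping for a sliding window of data batches.
from collections import deque
import numpy as np

class SlidingWindowStats:
    def __init__(self, p, max_batches):
        self.batches = deque()           # per-batch (X'X, X'y, n_b)
        self.max_batches = max_batches   # memory budget, in batches
        self.gram = np.zeros((p, p))     # running sum of X_b' X_b
        self.xty = np.zeros(p)           # running sum of X_b' y_b
        self.n = 0                       # total samples in the window

    def add_batch(self, X, y):
        g, v = X.T @ X, X.T @ y
        self.batches.append((g, v, len(y)))
        self.gram += g
        self.xty += v
        self.n += len(y)
        if len(self.batches) > self.max_batches:
            # Forget the earliest batch: subtract its summaries,
            # never touching its raw rows again.
            g_old, v_old, n_old = self.batches.popleft()
            self.gram -= g_old
            self.xty -= v_old
            self.n -= n_old
```

Since the lasso objective depends on the data only through X⊤X, X⊤y, and the sample size (up to a constant in the coefficients), the estimator can be refit on the current window from this bookkeeping alone, without storing any forgotten batch.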
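The debiasing step can likewise be illustrated with the generic one-step correction for the lasso, in the spirit of the desparsified lasso of Zhang and Zhang and of Javanmard and Montanari. The sketch below is a stand-in under stated assumptions, not the paper's method: it uses a plain inverse of the sample covariance as the correction matrix, which requires n > p, whereas dOnFL constructs its correction efficiently from summary statistics.

```python
# Illustrative one-step debiasing of the lasso (not the dOnFL method).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 400, 20, 3                      # low-dimensional toy sizes
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0                            # s true nonzero coefficients
y = X @ beta + rng.standard_normal(n)

beta_hat = Lasso(alpha=0.1).fit(X, y).coef_   # biased lasso estimate

# One-step correction: beta_d = beta_hat + M X'(y - X beta_hat) / n,
# where M approximates the inverse sample covariance. A plain inverse
# is used here (valid since n > p); high-dimensional settings need a
# regularized surrogate such as the nodewise lasso.
Sigma_hat = X.T @ X / n
M = np.linalg.inv(Sigma_hat)
beta_d = beta_hat + M @ X.T @ (y - X @ beta_hat) / n

# Plug-in 95% confidence intervals for each coordinate.
sigma2 = np.mean((y - X @ beta_d) ** 2)       # residual variance estimate
se = np.sqrt(sigma2 * np.diag(M @ Sigma_hat @ M.T) / n)
ci = np.stack([beta_d - 1.96 * se, beta_d + 1.96 * se], axis=1)
print(ci[:s])                                  # intervals for true signals
```

The 1.96 multiplier is licensed by the asymptotic normality of the debiased coordinates, which is the type of guarantee the paper establishes for its proposed estimator.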
Information
| Field | Value |
|---|---|
| Manuscript ID | SS-2024-0335 |
| Authors | Xiao Guo, Xu Zhang, Hai Zhang |
| Corresponding Author | Hai Zhang |
| Email | zhanghai@nwu.edu.cn |
Xiao Guo, School of Mathematics, Northwest University, Xi'an, China

Xu Zhang, School of Mathematics, Northwest University, Xi'an, China

Hai Zhang, School of Mathematics, Northwest University, Xi'an, China
Acknowledgments
This research was partially supported by the National Natural Science
Foundation of China (Nos. 12301384, 12326615) and the Major Key Project
of PCL under Grant PCL2024A06. The authors are grateful to the editor, associate editor, and two anonymous reviewers for their helpful and
insightful comments.
Supplementary Materials
The supplementary material contains the proofs of the main theoretical
results, the technical lemmas, further details of the dLASSO algorithm
and its comparison with dOnFL, and additional experimental results.