Abstract
This paper investigates the asymptotic properties of algorithms that
can be viewed as robust analogues of classical empirical risk minimization.
These strategies replace the usual empirical average with a robust
proxy of the mean, such as a variant of the median-of-means estimator recalled below.
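To fix ideas, one common form of the median-of-means estimator of the mean $\mu = \mathbb{E} X_1$ (the block structure below is a generic version; the paper analyzes a variant of this construction) partitions the sample $X_1,\ldots,X_N$ into $k$ disjoint blocks $G_1,\ldots,G_k$ of equal cardinality and sets
$$
\widehat{\mu}_{\mathrm{MOM}} = \mathrm{median}\left( \bar X_{G_1}, \ldots, \bar X_{G_k} \right),
\qquad \bar X_{G_j} = \frac{1}{|G_j|} \sum_{i \in G_j} X_i .
$$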
It is by now well known that the excess risk of the resulting estimators often converges
to zero at optimal rates under much weaker assumptions than those required
by their classical counterparts.
However, less is known about the asymptotic
properties of the estimators themselves: for instance, whether robust analogues
of maximum likelihood estimators are asymptotically efficient. We take a
step towards answering these questions and show that, for a wide class of parametric problems, minimizers of an appropriately defined robust proxy of the risk
converge to the minimizers of the true risk at the same rate, and often have the
same asymptotic variance, as the estimators obtained by minimizing the usual
empirical risk. Finally, we discuss computational aspects of the problem and
illustrate the performance of the methods under consideration in numerical experiments.
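To make the strategy concrete, here is a minimal sketch in Python (the block-splitting scheme, function names, and the quadratic loss are illustrative choices, not the exact variant studied in the paper): a location parameter is estimated by minimizing the median-of-means proxy of the risk $\theta \mapsto \mathbb{E}(X - \theta)^2$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def median_of_means(values, k):
    """Split `values` into k blocks of (nearly) equal size and
    return the median of the within-block averages."""
    blocks = np.array_split(np.asarray(values, dtype=float), k)
    return np.median([block.mean() for block in blocks])

def robust_risk(theta, data, k):
    """Median-of-means proxy of the risk E[(X - theta)^2]."""
    return median_of_means((data - theta) ** 2, k)

rng = np.random.default_rng(0)
data = rng.standard_t(df=2.1, size=5_000)  # heavy-tailed sample with mean 0

# Robust analogue of empirical risk minimization: minimize the proxy.
fit = minimize_scalar(lambda t: robust_risk(t, data, k=25),
                      bounds=(-5.0, 5.0), method="bounded")
print("sample mean      :", data.mean())
print("robust minimizer :", fit.x)
```

Since the proxy is only piecewise smooth in $\theta$, a derivative-free scalar minimizer is used here; the computational aspects of the general problem are discussed in the body of the paper.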
Information
| Manuscript ID | SS-2024-0268 |
|---|---|
| Complete List of Authors | Stanislav Minsker |
| Corresponding Author | Stanislav Minsker |
| Email | stas.minsker@gmail.com |
References
- Alistarh, D., Z. Allen-Zhu, and J. Li (2018). Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 4613–4623.
- Alon, N., Y. Matias, and M. Szegedy (1996). The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pp. 20–29. ACM.
- Audibert, J.-Y. and O. Catoni (2011). Robust linear least squares regression. The Annals of Statistics 39(5), 2766–2794.
- Brownlees, C., E. Joly, and G. Lugosi (2015). Empirical risk minimization for heavy-tailed losses. The Annals of Statistics 43(6), 2507–2536.
- Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, Volume 48, pp. 1148–1185. Institut Henri Poincaré.
- Chen, Y., L. Su, and J. Xu (2017). Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems 1(2), 1–25.
- Cherapanamjeri, Y., S. B. Hopkins, T. Kathuria, P. Raghavendra, and N. Tripuraneni (2019). Algorithms for heavy-tailed statistics: Regression, covariance estimation, and beyond. arXiv preprint arXiv:1912.11071.
- Devroye, L., M. Lerasle, G. Lugosi, and R. I. Oliveira (2016). Sub-Gaussian mean estimators. The Annals of Statistics 44(6), 2695–2725.
- Feller, W. (1968). On the Berry-Esseen theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 10(3), 261–268.
- Holland, M. J. and K. Ikeda (2017). Robust regression using biased objectives. Machine Learning 106(9–10), 1643–1679.
- Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35(1), 73–101.
- Ibragimov, R. and S. Sharakhmetov (2001). The best constant in the Rosenthal inequality for nonnegative random variables. Statistics & Probability Letters 55(4), 367–376.
- Lecué, G. and M. Lerasle (2020). Robust machine learning by median-of-means: theory and practice. The Annals of Statistics 48(2), 906–931.
- Lecué, G., M. Lerasle, and T. Mathieu (2020). Robust classification via MOM minimization. Machine Learning 109, 1635–1665.
- Ledoux, M. and M. Talagrand (1991). Probability in Banach Spaces: isoperimetry and processes. Berlin: Springer-Verlag.
- Lerasle, M. and R. I. Oliveira (2011). Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.
- Li, K., H. Bao, and L. Zhang (2022). Robust covariance estimation for distributed principal component analysis. Metrika 85(6), 707–732.
- Lugosi, G. and S. Mendelson (2019a). Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics 19(5), 1145–1190.
- Lugosi, G. and S. Mendelson (2019b). Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society 22(3), 925–965.
- Mathieu, T. and S. Minsker (2021). Excess risk bounds in robust empirical risk minimization. Information and Inference: A Journal of the IMA 10(4), 1423–1490.
- Minsker, S. (2019). Distributed statistical estimation and rates of convergence in normal approximation. Electronic Journal of Statistics 13(2), 5213–5252.
- Minsker, S. (2025). Uniform bounds for robust mean estimators. To appear in Stochastic Processes and Their Applications.
- Minsker, S. and S. Yao (2025). Generalized median of means principle for Bayesian inference. Machine Learning 114(4).
- Nemirovski, A. and D. Yudin (1983). Problem complexity and method efficiency in optimization. John Wiley & Sons Inc.
- O’Donnell, R. (2014). Analysis of Boolean functions. Cambridge University Press.
- Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830.
- Prasad, A., A. S. Suggala, S. Balakrishnan, and P. Ravikumar (2020). Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society Series B 82(3), 601–627.
- Talagrand, M. (2005). The generic chaining. Springer.
- van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge University Press.
- van der Vaart, A. W. and J. A. Wellner (1996). Weak convergence and empirical processes. Springer Series in Statistics. New York: Springer-Verlag.
Acknowledgments
This research was partially supported by the National Science Foundation
under grants DMS CAREER-2045068 and CCF-1908905. The author would
like to thank Timothée Mathieu for sharing his code used in the simulation
studies (risk-bounds-in-robust-empirical-risk-minimization/).
Supplementary Materials
The online supplementary material includes the proof of Theorem 1, the
proofs of technical results, and a description of the numerical simulations.