Abstract

This paper considers an empirical risk minimization problem in heavy-tailed settings, where the data do not have finite variance but only a finite (1 + ε)-th moment for some ε ∈ (0, 1). Instead of using an estimation procedure based on truncated observations, we choose the optimizer by minimizing robustly estimated risk values, where the estimation relies on Catoni's method. Thanks to the structure of Catoni-type influence functions, we are able to establish excess risk upper bounds using generalized generic chaining methods. Moreover, we take computational issues into consideration and theoretically investigate two types of optimization methods: robust gradient descent algorithms and empirical risk-based methods. Through an extensive numerical study, we find that the optimizer based on Catoni-style empirical risk estimates indeed outperforms other baselines, indicating that estimation based directly on truncated data may lead to unsatisfactory results.
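To make the estimation step concrete, the following is a minimal sketch of Catoni-style robust mean estimation for heavy-tailed data. The particular influence function ϕ(x) = sign(x)·log(1 + |x| + c|x|^(1+ε)) and the scale parameter α are illustrative assumptions; the paper's exact choices may differ.

```python
# A minimal sketch (not the paper's implementation) of a Catoni-type mean estimate
# for data with only a finite (1 + eps)-th moment. phi, c, and alpha are assumptions.
import numpy as np
from scipy.optimize import brentq

def phi(x, eps=0.5, c=1.0):
    # Bounded, non-decreasing Catoni-type influence function (illustrative choice).
    return np.sign(x) * np.log1p(np.abs(x) + c * np.abs(x) ** (1.0 + eps))

def catoni_mean(x, alpha=0.1, eps=0.5):
    # Solve sum_i phi(alpha * (x_i - mu)) = 0 for mu; the left side is monotone in mu.
    f = lambda mu: np.sum(phi(alpha * (x - mu), eps=eps))
    return brentq(f, x.min() - 1.0, x.max() + 1.0)

# Toy comparison on Pareto data with infinite variance (true mean = 3).
rng = np.random.default_rng(0)
x = rng.pareto(1.5, size=2000) + 1.0
print("Catoni estimate:", catoni_mean(x), " sample mean:", x.mean())
```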

Information

Preprint No.: SS-2023-0329
Manuscript ID: SS-2023-0329
Complete Authors: Guanhua Fang, Ping Li, Gennady Samorodnitsky
Corresponding Author: Guanhua Fang
Email: fanggh@fudan.edu.cn

Algorithms

Algorithm 1: Robust Gradient Descent Method
1: Input: Observations {X_i, i ∈ {1, ..., n}}; a bounded Catoni influence function ϕ.
2: Output: Estimated parameter w̃.
3: Initialization: Randomly choose w^(0) from R^d and set the time index t = 0.
4: while not converged do
5:   [Robust gradient] Compute the robust gradient g^(t) by solving, coordinate-wise for j ∈ [d],
       Σ_{i=1}^n ϕ(α(∇f_{w^(t)}(X_i)[j] − g^(t)[j])) = 0.   (5.1)
6:   Update the parameter by w^(t+1) = w^(t) − γ_t g^(t).
7:   Increase the time index: t = t + 1.
8: end while
9: Return the parameter estimate w̃ = w^(t).

Algorithm 2: Empirical Risk Gradient Descent
1: Input: Observations {X_i, i ∈ {1, ..., n}}.
2: Output: Estimated parameter w̃.
3: Initialization: Randomly choose w^(0) from R^d and set the time index t = 0.
4: while not converged do
5:   Find μ̂_{f_{w^(t)}} by solving
       0 = (1/(nα)) Σ_{i=1}^n ϕ(α(f_{w^(t)}(X_i) − μ)).   (5.9)
6:   Compute the gradient g^(t) by
       g^(t) = ∇_w μ̂_{f_{w^(t)}} = [ Σ_i ϕ′(α(f_{w^(t)}(X_i) − μ̂_{f_{w^(t)}})) ∂f_{w^(t)}(X_i)/∂w ] / [ Σ_i ϕ′(α(f_{w^(t)}(X_i) − μ̂_{f_{w^(t)}})) ].   (5.10)
7:   Update the parameter by w^(t+1) = w^(t) − γ_t g^(t).
8:   Increase the time index: t = t + 1.
9: end while
10: Return the parameter estimate w̃ = w^(t).

Algorithm 3: Double-Weighted Gradient Descent
1: Input: Observations {X_i, i ∈ {1, ..., n}}; a Catoni influence function ϕ; stopping threshold ϱ̃.
2: Output: Estimated parameter w̃.
3: Initialization: Set the time index t = 0. Randomly choose w^(0) from R^d and choose μ̂^(0) ∈ R_+ to be an α-approximate solution to
       0 = (1/n) Σ_{i=1}^n ϕ(α(f_{w^(0)}(X_i) − μ)).   (5.12)
   Set the weights ν_i^(0) = ϕ′(α(f_{w^(0)}(X_i) − μ̂^(0))) / Σ_i ϕ′(α(f_{w^(0)}(X_i) − μ̂^(0))).
4: while not converged do
5:   Compute the gradient g^(t) = Σ_i ν_i^(t) ∂f_{w^(t)}(X_i)/∂w.
6:   Update the parameter by w^(t+1) = w^(t) − γ_t g^(t).
7:   Find μ̂^(t+1) by computing
       μ̂^(t+1) = μ̂^(t) + Σ_i ν_i^(t) (f_{w^(t+1)}(X_i) − f_{w^(t)}(X_i)).   (5.13)
8:   Compute the weights ν_i^(t+1) = ϕ′(α(f_{w^(t+1)}(X_i) − μ̂^(t+1))) / Σ_i ϕ′(α(f_{w^(t+1)}(X_i) − μ̂^(t+1))).
9:   Increase the time index: t = t + 1.
10: end while
11: Return the parameter estimate w̃ = w^(t).
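For concreteness, the following is a minimal sketch of Algorithms 1 and 2 for a linear model with squared loss f_w(X_i) = (y_i − w·z_i)². The influence function ϕ, the scale α, the fixed step size, and the synthetic data are illustrative assumptions rather than the paper's exact choices. Algorithm 3 is not sketched; it replaces the exact root-finding step of Algorithm 2 with the incremental update (5.13) for μ̂ and the weights ν_i.

```python
# Illustrative sketch of Algorithms 1 and 2 (not the authors' code).
import numpy as np
from scipy.optimize import brentq

EPS, C = 0.5, 1.0   # assumed moment index epsilon and constant in phi

def phi(x):
    # Bounded, non-decreasing Catoni-type influence function (illustrative choice).
    return np.sign(x) * np.log1p(np.abs(x) + C * np.abs(x) ** (1.0 + EPS))

def phi_prime(x):
    # Derivative of phi, used as a per-sample weight in Eq. (5.10).
    a = np.abs(x)
    return (1.0 + C * (1.0 + EPS) * a ** EPS) / (1.0 + a + C * a ** (1.0 + EPS))

def catoni_root(values, alpha):
    # Solve sum_i phi(alpha * (v_i - mu)) = 0 for mu, as in Eqs. (5.1) and (5.9).
    f = lambda mu: np.sum(phi(alpha * (values - mu)))
    return brentq(f, values.min() - 1.0, values.max() + 1.0)

def robust_gd(Z, y, alpha=0.05, step=0.05, n_iter=200):
    # Algorithm 1: replace the average gradient by a coordinate-wise Catoni estimate.
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        resid = y - Z @ w
        grads = -2.0 * resid[:, None] * Z                  # per-sample gradients of (y_i - w.z_i)^2
        g = np.array([catoni_root(grads[:, j], alpha)      # Eq. (5.1), coordinate-wise
                      for j in range(Z.shape[1])])
        w = w - step * g
    return w

def empirical_risk_gd(Z, y, alpha=0.05, step=0.05, n_iter=200):
    # Algorithm 2: descend on the Catoni estimate mu_hat(w) of the risk itself.
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        resid = y - Z @ w
        losses = resid ** 2                                # f_w(X_i)
        mu_hat = catoni_root(losses, alpha)                # Eq. (5.9)
        wts = phi_prime(alpha * (losses - mu_hat))
        grads = -2.0 * resid[:, None] * Z
        g = (wts[:, None] * grads).sum(axis=0) / wts.sum() # Eq. (5.10)
        w = w - step * g
    return w

# Toy run: linear model with heavy-tailed (infinite-variance) noise.
rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 3))
w_star = np.array([1.0, -2.0, 0.5])
y = Z @ w_star + rng.standard_t(df=1.5, size=500)
print("Algorithm 1:", robust_gd(Z, y))
print("Algorithm 2:", empirical_risk_gd(Z, y))
```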

Acknowledgments

The authors thank the Associate Editor and two anonymous referees for their constructive suggestions and comments, which improved the quality of the paper. An earlier version of this work was completed while the authors were employed by Baidu USA. Guanhua Fang is partly supported by the National Natural Science Foundation of China (No. 12301376) and the Shanghai Educational Development Foundation (23CGA02).

Supplementary Materials

The online Supplementary Material contains additional numerical results, technical proofs, and further explanations and discussion.


Supplementary materials are available for download.