Abstract
The specification of hyperparameters plays a critical role in determining the practical performance of a machine learning method. Hyperparameter Optimization (HPO), i.e., the search for an optimal specification of hyperparameters, however, often faces serious computational challenges due to the vast search space and the high cost of model training under each candidate specification. In this paper, we propose BOPT-HPO, a systematic approach to efficient HPO that leverages Bayesian optimization with Pareto-principled training. It is based on the observation that the training procedure of a machine learning method under a given hyperparameter specification often follows the Pareto principle (the 80/20 rule): about 80% of the total improvement in the objective function is achieved in 20% of the training time. By introducing two levels of training corresponding to the Pareto principle, namely the eighty-percent training (ET) and the complete training (CT), and establishing a joint surrogate model over both CT runs and ET runs, BOPT-HPO significantly reduces the computational cost of HPO within the framework of Bayesian optimization with multi-fidelity measurements. Extensive experimental studies confirm that the proposed approach achieves economical HPO for various machine learning models, including support vector machines, feed-forward neural networks, and convolutional neural networks.
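The ET/CT idea above can be illustrated with a minimal, self-contained sketch. This is not the paper's joint-surrogate algorithm; it replaces the Bayesian surrogate with a simple screen-then-confirm rule, and the toy objective `validation_score`, the candidate pool, and all constants are illustrative assumptions chosen so that roughly 80% of the attainable gain arrives within 20% of the training budget.

```python
import random

# Toy "training curve": `x` is a hypothetical hyperparameter in [0, 1], and
# `budget` is the fraction of the full training time spent (1.0 = complete
# training). The progress factor saturates quickly, mimicking the Pareto-style
# pattern: about 80% of the improvement arrives in the first 20% of training.
def validation_score(x, budget):
    final = 1.0 - (x - 0.3) ** 2          # best hyperparameter near x = 0.3
    progress = 1.0 - (1.0 - budget) ** 7  # 1 - 0.8**7 ~= 0.79 at budget 0.2
    return final * progress

ET, CT = 0.2, 1.0  # eighty-percent training vs. complete training budgets

def screen_then_confirm(candidates, n_promote=3):
    """Run cheap ET evaluations for all candidates, then promote only the
    most promising ones to full CT runs and return the best (score, x)."""
    et_scores = sorted(((validation_score(x, ET), x) for x in candidates),
                       reverse=True)
    promoted = [x for _, x in et_scores[:n_promote]]
    return max((validation_score(x, CT), x) for x in promoted)

random.seed(0)
candidates = [random.random() for _ in range(50)]
best_score, best_x = screen_then_confirm(candidates)
```

Under this cost model, screening 50 candidates at ET cost plus 3 CT runs totals 50 × 0.2 + 3 = 13 training-time units, versus 50 units for running CT everywhere, while the promoted candidates still include those near the optimum.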
Information
| Preprint No. | SS-2023-0310 |
|---|---|
| Manuscript ID | SS-2023-0310 |
| Authors | Yang Yang, Ke Deng, Yu Zhu |
| Corresponding Author | Ke Deng |
| Email | kdeng@tsinghua.edu.cn |
Acknowledgments
This work is supported by the National Key Research and Development Program of China (Grant No. 2023YFF0614702) and the National Natural Science Foundation of China (Grant Nos. 12401353 and 12371269). The authors acknowledge the summer support of Huzhou University.
Supplementary Materials
The proof of Theorem 1, together with additional experimental settings and results, can be found in the supplementary materials.