Abstract

Deep learning has revolutionized modern data science. However, accurately quantifying the uncertainty of predictions from large-scale deep neural networks (DNNs) remains an open problem. To address it, we introduce a novel post-processing approach: we feed the output from the last hidden layer of a pre-trained large-scale DNN into a stochastic neural network (StoNet), train the StoNet with a sparse penalty on a validation dataset, and construct prediction intervals for future observations. We establish a theoretical guarantee for the validity of this approach; in particular, the parameter estimation consistency of the sparse StoNet is essential to its success. Comprehensive experiments demonstrate that the proposed approach constructs honest confidence intervals with shorter interval lengths than conformal methods and achieves better calibration than other post-hoc calibration techniques. In addition, we show that the StoNet formulation provides a platform for adapting sparse learning theory and methods from linear models to DNNs.
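
To make the pipeline concrete, the following is a minimal illustrative sketch, not the paper's method: the sparse StoNet is replaced by an L1-penalized linear head (scikit-learn's Lasso), the last-hidden-layer features are simulated with synthetic data, and the intervals are formed from simple validation-residual quantiles rather than the StoNet-based construction analyzed in the paper.

```python
# Illustrative sketch only: a sparse (L1-penalized) linear head stands in for
# the paper's sparse StoNet, and synthetic data stands in for features from
# the last hidden layer of a pre-trained DNN.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_val, n_test, d = 500, 100, 64

# Stand-ins for last-hidden-layer features on validation and test sets.
Z_val = rng.normal(size=(n_val, d))
Z_test = rng.normal(size=(n_test, d))
beta = np.zeros(d)
beta[:5] = 1.0                                # sparse ground-truth signal
y_val = Z_val @ beta + 0.1 * rng.normal(size=n_val)

# Step 1: train a sparse post-processing model on the validation set.
head = Lasso(alpha=0.01).fit(Z_val, y_val)

# Step 2: build (1 - alpha_level) prediction intervals from validation
# residual quantiles (a simple substitute for the StoNet construction).
alpha_level = 0.1
resid = np.abs(y_val - head.predict(Z_val))
q = np.quantile(resid, 1 - alpha_level)

pred = head.predict(Z_test)
lower, upper = pred - q, pred + q             # interval per test observation
print(f"average interval length: {2 * q:.3f}")
```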

Information

Preprint No.: SS-2024-0294
Manuscript ID: SS-2024-0294
Complete Authors: Yan Sun, Faming Liang
Corresponding Author: Faming Liang
Email: fmliang@purdue.edu
Affiliation: Yan Sun, University of Pennsylvania, Philadelphia, PA 19104, USA


Acknowledgments

Liang’s research is supported in part by NSF grants DMS-2015498 and DMS-2210819, and by NIH grant R01-GM152717. The authors thank the editor, associate editor, and referees for their constructive comments, which have led to significant improvements in this paper.

Supplementary Materials

The online Supplementary Material contains (i) the proofs of Lemma 2, Theorem 1, Corollary 1, and Corollary 2; (ii) Algorithms S1 and S2; (iii) additional formulas and results; and (iv) detailed experimental settings.


Supplementary materials are available for download.