Abstract

Modern statistical analysis often encounters high-dimensional problems with limited sample sizes, which pose great challenges to traditional statistical estimation methods. In this work, we adopt auxiliary learning to solve the estimation problem in high-dimensional settings. We start with the linear regression setup. To improve the statistical efficiency of the parameter estimator for the primary task, we consider several auxiliary tasks that share the same covariates as the primary task. We then develop a weighted estimator for the primary task, formed as a linear combination of the ordinary least squares estimators of both the primary task and the auxiliary tasks. The optimal weight is derived analytically, and the statistical properties of the corresponding weighted estimator are studied. We then extend the weighted estimator to generalized linear regression models. Extensive numerical experiments are conducted to verify our theoretical results. Lastly, a deep learning-related real-data example of smart vending machines is presented for illustration purposes.
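To fix ideas, the following is a minimal sketch of the kind of weighted estimator described above. It is not the paper's analytic weight W or w∗ (those are derived in Appendix A); instead it assumes, purely for illustration, a single auxiliary task that shares the same coefficient vector and noise level as the primary task, in which case a sample-size (precision) weighted combination of the two independent OLS estimators reduces variance.

```python
# Illustrative sketch of a weighted estimator combining primary- and
# auxiliary-task OLS estimates. Assumptions (ours, for illustration only):
# one auxiliary task, identical true coefficients, identical noise variance,
# so sample-size weighting approximates inverse-variance weighting.
import numpy as np

rng = np.random.default_rng(0)
p, n0, n1, reps = 5, 50, 1000, 20   # dimension, primary/auxiliary sizes
beta = rng.normal(size=p)           # shared true coefficient vector

def ols(X, y):
    # Ordinary least squares via least-squares solve.
    return np.linalg.lstsq(X, y, rcond=None)[0]

err_primary = err_weighted = 0.0
for _ in range(reps):
    X0 = rng.normal(size=(n0, p)); y0 = X0 @ beta + rng.normal(size=n0)
    X1 = rng.normal(size=(n1, p)); y1 = X1 @ beta + rng.normal(size=n1)
    b0, b1 = ols(X0, y0), ols(X1, y1)
    w = n1 / (n0 + n1)              # sample-size (precision) weight
    bw = (1 - w) * b0 + w * b1      # weighted estimator
    err_primary += np.linalg.norm(b0 - beta) / reps
    err_weighted += np.linalg.norm(bw - beta) / reps

print(err_primary, err_weighted)
```

Because the auxiliary sample is much larger, the weighted estimator has a visibly smaller average estimation error than the primary-task OLS estimator alone; the paper's contribution is the optimal weight when the auxiliary coefficients are allowed to differ from the primary ones.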

Information

Preprint No.: SS-2024-0310
Manuscript ID: SS-2024-0310
Complete Authors: Hanchao Yan, Feifei Wang, Chuanxin Xia, Hansheng Wang
Corresponding Author: Feifei Wang
Email: feifei.wang@ruc.edu.cn


Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 72371241, 72495123, 12271012), the MOE Project of Key Research Institute of Humanities and Social Sciences (No. 22JJD910002), and the Big Data and Responsible Artificial Intelligence for National Governance, Renmin University of China.

Supplementary Materials

The online Supplementary Material contains four appendices. Appendix A provides the detailed verification of the form of W and w∗. Appendix B provides the proof of Theorem 1.


Supplementary materials are available for download.