Simultaneous Estimation and Dataset Selection for Transfer Learning in High Dimensions by a Non-convex Penalty

Zeyu Li, Dong Liu, Yong He and Xinsheng Zhang

doi:10.5705/ss.202024.0423

Abstract

In this paper, we introduce a method for simultaneous parameter estimation and informative source dataset identification in high-dimensional transfer learning, leveraging the truncated norm penalty function. This integrated approach contrasts with conventional strategies that treat useful dataset selection and transfer learning as separate steps. To solve the resulting non-convex optimization problem, specifically under sparse linear regression and generalized low-rank trace regression models, we adopt difference of convex (DC) programming combined with the alternating direction method of multipliers (ADMM). We theoretically justify the proposed algorithm from both statistical and computational perspectives. Numerical results are reported to validate the theoretical claims. An R package MHDTL is developed to implement the proposed methods.
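As an illustration of why a truncated penalty lends itself to DC programming, the following is a minimal sketch and not the authors' implementation or the MHDTL API. It assumes the penalty takes the common capped form min(||d||_2, tau), where d is the difference between a source parameter and the target parameter; the function names, the threshold tau, and the toy vectors are all hypothetical. The identity min(a, tau) = a - max(a - tau, 0) writes the non-convex penalty as a difference of two convex functions, and a source whose difference norm falls below tau can be read as informative.

```python
import math


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def truncated_penalty(diff, tau):
    """Capped norm penalty min(||diff||_2, tau) (assumed form, see lead-in)."""
    return min(l2_norm(diff), tau)


def dc_parts(diff, tau):
    """Return (g, h), both convex in diff, with g - h equal to the penalty.

    g = ||diff||_2 and h = max(||diff||_2 - tau, 0).
    """
    norm = l2_norm(diff)
    return norm, max(norm - tau, 0.0)


def informative_sources(target, sources, tau):
    """Indices of sources whose parameter lies within tau of the target.

    Hypothetical selection rule: below tau the penalty still shrinks the
    source toward the target; at or above tau the penalty is constant.
    """
    selected = []
    for k, src in enumerate(sources):
        diff = [s - t for s, t in zip(src, target)]
        if l2_norm(diff) < tau:
            selected.append(k)
    return selected


# Toy example: source 0 is close to the target, source 1 is far away.
target = [1.0, 0.0, 0.0]
sources = [[1.1, 0.0, 0.0], [5.0, 5.0, 0.0]]
tau = 0.5

for k, src in enumerate(sources):
    diff = [s - t for s, t in zip(src, target)]
    g, h = dc_parts(diff, tau)
    print(k, truncated_penalty(diff, tau), g - h)

print(informative_sources(target, sources, tau))
```

In a DC iteration the concave part -h is replaced by its linearization at the current iterate, which leaves a convex subproblem; in the paper that subproblem is handled by ADMM, which this sketch does not attempt to reproduce.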

Key words and phrases: clustering analysis; DC-ADMM; knowledge transfer; M-estimators

Information

Preprint No.	SS-2024-0423
Manuscript ID	SS-2024-0423
Complete Authors	Zeyu Li, Dong Liu, Yong He, Xinsheng Zhang
Corresponding Authors	Dong Liu
Emails	liudong_stat@163.com

References

Bastani, H. (2021). Predicting with proxies: Transfer learning in high dimension. Management Science 67(5), 2964–2984.
Boyd, S., N. Parikh, and E. Chu (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc.
Cai, J.-F., E. J. Candès, and Z. Shen (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4), 1956–1982.
Cai, T. T. and H. Wei (2021). Transfer learning for nonparametric classification. The Annals of Statistics 49(1), 100–128.
Candès, E. and T. Tao (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35(6), 2313–2351.
Chen, S., Q. Zheng, Q. Long, and W. J. Su (2021). A theorem of the alternative for personalized federated learning. arXiv preprint arXiv:2103.01901.
Duan, Y. and K. Wang (2023). Adaptive and robust multi-task learning. The Annals of Statistics 51(5), 2015–2039.
Fan, J., W. Gong, and Z. Zhu (2019). Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics 212(1), 177–202.
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456), 1348–1360.
Fan, J., W. Wang, and Z. Zhu (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. The Annals of Statistics 49(3), 1239–1266.
Gross, S. M. and R. Tibshirani (2016). Data shared lasso: A novel tool to discover uplift. Computational Statistics & Data Analysis 101, 226–235.
Hamidi, N. and M. Bayati (2022). On low-rank trace regression under general sampling distribution. Journal of Machine Learning Research 23(321), 1–49.
He, Y., X. Kong, L. Trapani, and L. Yu (2023). One-way or two-way factor model for matrix sequences? Journal of Econometrics 235(2), 1981–2004.
He, Z., Y. Sun, and R. Li (2024). Transfusion: Covariate-shift robust transfer learning for high-dimensional regression. In International Conference on Artificial Intelligence and Statistics, pp. 703–711. PMLR.
Li, S., T. T. Cai, and H. Li (2022). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology 84(1), 149–173.
Li, S., L. Zhang, T. T. Cai, and H. Li (2024). Estimation and inference for high-dimensional generalized linear models with knowledge transfer. Journal of the American Statistical Association 119(546), 1274–1285.
Liu, D., C. Zhao, Y. He, L. Liu, Y. Guo, and X. Zhang (2023). Simultaneous cluster structure learning and estimation of heterogeneous graphs for matrix-variate fMRI data. Biometrics 79(3), 2246–2259.
McMahan, B., E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR.
Negahban, S. N., P. Ravikumar, M. J. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science 27(4), 538–557.
Niu, S., Y. Liu, J. Wang, and H. Song (2020). A decade survey of transfer learning (2010–2020). IEEE Transactions on Artificial Intelligence 1(2), 151–166.
Ollier, E. and V. Viallon (2017). Regression modelling on stratified data with the lasso. Biometrika 104(1), 83–96.
Pan, W., X. Shen, and B. Liu (2013). Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. Journal of Machine Learning Research 14(7), 1865–1889.
Shen, X., W. Pan, and Y. Zhu (2012). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association 107(497), 223–232.
Sun, Y., P. Babu, and D. P. Palomar (2016). Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing 65(3), 794–816.
Thi Hoai An, L. and P. Dinh Tao (1997). Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. Journal of Global Optimization 11(3), 253–285.
Tian, Y. and Y. Feng (2023). Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association 118(544), 2684–2697.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology 58(1), 267–288.
Torrey, L. and J. Shavlik (2010). Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, pp. 242–264. IGI Global.
Van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge University Press.
Wu, C., S. Kwon, X. Shen, and W. Pan (2016). A new algorithm and theory for penalized regression-based clustering. Journal of Machine Learning Research 17(188), 1–25.
Zhou, H. and L. Li (2014). Regularized matrix regression. Journal of the Royal Statistical Society Series B: Statistical Methodology 76(2), 463–483.
Zhuang, F., Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He (2020). A comprehensive survey on transfer learning. Proceedings of the IEEE 109(1), 43–76.

Supplementary Materials

The online Supplementary Material contains extended theoretical arguments concerning the two specific statistical models and remarks on the optional fine-tuning step. It also presents additional numerical details that further support our arguments and provides the proofs of the theoretical results.

Supplementary materials are available for download.
