Abstract
The accessibility of vast volumes of unlabeled data has sparked growing interest in semi-supervised learning (SSL) and covariate shift transfer learning (CSTL). In this paper, we present an inference framework for estimating regression coefficients in conditional mean models within both SSL and CSTL settings, while allowing for misspecification of the conditional mean models. We develop an augmented inverse probability weighted (AIPW) method, employing regularized calibrated estimators for both the propensity score (PS) and outcome regression (OR) nuisance models, with the PS and OR models being sequentially dependent. We show that when the PS model is correctly specified, the proposed estimator achieves consistency, asymptotic normality, and valid confidence intervals, even with possible OR model misspecification and high-dimensional data. Moreover, by suppressing detailed technical choices, we demonstrate that previous methods can be unified within our AIPW framework. Our theoretical findings are verified through extensive simulation studies and a real-world data application.
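As a rough illustration of the AIPW idea described above (a minimal sketch, not the paper's regularized calibrated estimators), the following simulated SSL example estimates a population mean: a hypothetical logistic-regression PS model for the labeling indicator and a least-squares OR model fit on labeled units are combined so that the estimator stays consistent if either nuisance model is correct.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated SSL data: covariates X observed for all N units, outcome Y
# observed only when the labeling indicator R = 1.
N, p = 2000, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -0.5, 0.25])
Y = X @ beta_true + rng.normal(size=N)
R = rng.binomial(1, 0.3, size=N)           # labeling indicator

Xd = np.column_stack([np.ones(N), X])       # design matrix with intercept

# PS model: logistic regression of R on X, fit by Newton's method.
gamma = np.zeros(p + 1)
for _ in range(25):
    pi = 1 / (1 + np.exp(-Xd @ gamma))
    grad = Xd.T @ (R - pi)
    hess = Xd.T @ (Xd * (pi * (1 - pi))[:, None])
    gamma += np.linalg.solve(hess, grad)
pi_hat = 1 / (1 + np.exp(-Xd @ gamma))

# OR model: least squares of Y on X using labeled units only.
lab = R == 1
alpha, *_ = np.linalg.lstsq(Xd[lab], Y[lab], rcond=None)
m_hat = Xd @ alpha                          # fitted means for ALL units

# AIPW estimate of E[Y]: OR prediction plus an inverse-probability-weighted
# residual correction from the labeled units.
theta_aipw = np.mean(m_hat + R * (Y - m_hat) / pi_hat)
print(round(theta_aipw, 3))
```

Here the true mean of Y is zero by construction, so the AIPW estimate should be close to zero; the residual-correction term is what restores consistency under the correctly specified PS model even when the OR model is misspecified.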
Information
| Manuscript ID | SS-2024-0261 |
|---|---|
| Authors | Ye Tian, Peng Wu, Zhiqiang Tan |
| Corresponding Author | Zhiqiang Tan |
| Email | ztan@stat.rutgers.edu |
Ye Tian
Address: School of Mathematics and Statistics, Northeast Normal University, Jilin 130024, China
Acknowledgments
The authors thank the assistant editor and the anonymous reviewers for their helpful
comments and valuable suggestions. Ye Tian conducted this research while at Rutgers University and is now affiliated with Northeast Normal University. Peng Wu was
supported by the National Natural Science Foundation of China (No. 12301370),
the funding from the Beijing Municipal Education Commission for the Emerging
Interdisciplinary Platform for Digital Business at Beijing Technology and Business
University, and the Beijing Key Laboratory of Applied Statistics and Digital Regulation.
Supplementary Materials
The online Supplementary Material contains a heuristic discussion of conditions for the proposed estimator to be √N-consistent and asymptotically normal, a comparison of our paper with several related papers on regression of Y on high-dimensional Z = X and with papers under stratified sampling settings, detailed proofs of the theorems and propositions, and details of the numerical implementation and application.