Abstract
Feature screening is an effective tool for eliminating irrelevant features in high-dimensional
analysis. When a high-dimensional dataset is contaminated with noisy observations, conventional screening methods may suffer from poor screening accuracy.
To tackle this problem, one
practical strategy is to remove noisy observations and irrelevant features simultaneously. In this
paper, we propose a novel hybrid denoising-screening (HDS) procedure for high-dimensional contaminated data. The new method is built upon a dual sample-feature L0 fitting procedure, which
precisely controls both the number of observations and the number of features retained for the analysis. In
the HDS process, only clean observations are selected, and the joint effects among features are
naturally accounted for. These merits give HDS an edge over existing screening methods
when faced with contaminated data. The promising performance of the method is supported by
both theory and numerical examples.
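To make the dual sample-feature L0 fitting idea concrete, the sketch below shows one possible iterative hard-thresholding scheme under an assumed linear mean-shift model for contaminated observations. The function name dual_l0_iht, its arguments, and the specific update rules are illustrative assumptions, not the paper's actual HDS implementation.

```python
import numpy as np


def dual_l0_iht(X, y, k_features, m_outliers, n_iter=200):
    """Illustrative dual sample-feature L0 fit via iterative hard thresholding.

    Assumes y = X @ beta + gamma + noise, where nonzero entries of the
    mean-shift vector gamma mark contaminated observations. At most
    k_features coefficients and m_outliers mean-shift terms are retained.
    """
    n, p = X.shape
    beta, gamma = np.zeros(p), np.zeros(n)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)  # conservative gradient step size

    for _ in range(n_iter):
        # Gradient step on the coefficients, then keep the k_features largest
        # entries (feature-side L0 constraint).
        resid = y - X @ beta - gamma
        beta_tmp = beta + step * (X.T @ resid)
        beta = np.zeros(p)
        keep = np.argsort(np.abs(beta_tmp))[p - k_features:]
        beta[keep] = beta_tmp[keep]

        # Exact update of the mean-shift terms given beta, then keep the
        # m_outliers largest shifts (sample-side L0 constraint).
        gamma_tmp = y - X @ beta
        gamma = np.zeros(n)
        flag = np.argsort(np.abs(gamma_tmp))[n - m_outliers:]
        gamma[flag] = gamma_tmp[flag]

    retained_features = np.flatnonzero(beta)
    clean_obs = np.flatnonzero(gamma == 0)
    return beta, gamma, retained_features, clean_obs
```

The sketch only conveys how a dual L0 constraint simultaneously bounds the number of retained features and the number of flagged observations; screening would then proceed on the observations left unflagged.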
Information
| Field | Value |
|---|---|
| Preprint No. | SS-2024-0248 |
| Manuscript ID | SS-2024-0248 |
| Authors | Liming Wang, Peng Lai, Chen Xu, Xingxiang Li |
| Corresponding Author | Xingxiang Li |
| Email | lxxwlm2013@xjtu.edu.cn |
Acknowledgments
Xu’s research was supported by the National Key R&D Program of China (2023YFA1008703)
and the Major Key Project of PCL (PCL2024A06).
Li’s research was supported by the Natural Science Foundation (NSF) of China (12401394), the Postdoctoral Fellowship of CPSF
(GZB20240611), the China Postdoctoral Science Foundation (2024M752549), the Shaanxi Province
Postdoctoral Research Project Funding (2024BSHSDZZ145), and the Fundamental Research
Funds for the Central Universities (xzy012024035). Wang’s research was supported by the
NSF of the Jiangsu Higher Education Institutions of China (24KJD110002), the Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX23_1288),
and the Qing Lan Project of Jiangsu Province (2022).
Supplementary Materials
The online Supplementary Material contains the comparison of HDS with LTS and
MSMOD, the implementation details of IHT and the final algorithm for HDS, Examples S1 and S2 in the simulations, and the proofs of all theoretical results in the main text.