Abstract
Feature screening is an effective tool for eliminating irrelevant features in high-dimensional
analysis. When a high-dimensional dataset is contaminated with noisy observations, conventional screening methods may suffer from poor screening accuracy.
To tackle this problem, one
practical strategy is to remove noisy observations and irrelevant features simultaneously. In this
paper, we propose a novel hybrid denoising-screening (HDS) procedure for high-dimensional contaminated data. The new method is built upon a dual sample-feature L0 fitting procedure, which
precisely controls both the number of observations and the number of features retained for the analysis. In
the HDS process, only clean observations are selected, and the joint effects among features are
naturally accounted for. These merits give HDS an edge over existing screening methods
when faced with contaminated data. The promising performance of the method is supported by
both theory and numerical examples.
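To make the dual sample-feature L0 fitting idea concrete, the sketch below shows one possible iterative hard-thresholding scheme under an assumed linear mean-shift model for contaminated observations. The function name dual_l0_iht, its arguments, and the specific update rules are illustrative assumptions, not the paper's actual HDS implementation.

```python
import numpy as np


def dual_l0_iht(X, y, k_features, m_outliers, n_iter=200):
    """Illustrative dual sample-feature L0 fit via iterative hard thresholding.

    Assumes y = X @ beta + gamma + noise, where nonzero entries of the
    mean-shift vector gamma mark contaminated observations. At most
    k_features coefficients and m_outliers mean-shift terms are retained.
    """
    n, p = X.shape
    beta, gamma = np.zeros(p), np.zeros(n)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)  # conservative gradient step size

    for _ in range(n_iter):
        # Gradient step on the coefficients, then keep the k_features largest
        # entries (feature-side L0 constraint).
        resid = y - X @ beta - gamma
        beta_tmp = beta + step * (X.T @ resid)
        beta = np.zeros(p)
        keep = np.argsort(np.abs(beta_tmp))[p - k_features:]
        beta[keep] = beta_tmp[keep]

        # Exact update of the mean-shift terms given beta, then keep the
        # m_outliers largest shifts (sample-side L0 constraint).
        gamma_tmp = y - X @ beta
        gamma = np.zeros(n)
        flag = np.argsort(np.abs(gamma_tmp))[n - m_outliers:]
        gamma[flag] = gamma_tmp[flag]

    retained_features = np.flatnonzero(beta)
    clean_obs = np.flatnonzero(gamma == 0)
    return beta, gamma, retained_features, clean_obs
```

The sketch only conveys how a dual L0 constraint simultaneously bounds the number of retained features and the number of flagged observations; screening would then proceed on the observations left unflagged.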
Information
| Field | Value |
|---|---|
| Preprint No. | SS-2024-0248 |
| Manuscript ID | SS-2024-0248 |
| Authors | Liming Wang, Peng Lai, Chen Xu, Xingxiang Li |
| Corresponding Author | Xingxiang Li |
| Email | lxxwlm2013@xjtu.edu.cn |
Acknowledgments
Xu’s research was supported by the National Key R&D Program of China (2023YFA1008703)
and the Major Key Project of PCL (PCL2024A06).
Li’s research was supported by the Natural Science Foundation (NSF) of China (12401394), the Postdoctoral Fellowship of CPSF
(GZB20240611), the China Postdoctoral Science Foundation (2024M752549), the Shaanxi Province
Postdoctoral Research Project Funding (2024BSHSDZZ145), and the Fundamental Research
Funds for the Central Universities (xzy012024035). Wang’s research was supported by the
NSF of the Jiangsu Higher Education Institutions of China (24KJD110002), the Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX23_1288),
and the Qing Lan Project of Jiangsu Province (2022).
Supplementary Materials
The online Supplementary Material contains the comparison of HDS with LTS and
MSMOD, the implementation details of IHT and the final algorithm for HDS, Examples S1 and S2 in the simulations, and the proofs of all theoretical results in the main text.