New Feature Screening Methods for Massive Interval-censored Failure Time Data

Huiqiong Li, Zhimiao Cao, Jianguo Sun and Niansheng Tang

doi:10.5705/ss.202023.0309

Abstract

Screening important features has become a central task in statistical analysis, and correspondingly, various screening procedures have been proposed for various types of studies or data, including both complete and incomplete data. However, these methods can be computationally costly or even infeasible when one faces massive health databases with both high dimensionality and huge sample size, which have become increasingly popular for comparative effectiveness and safety studies of medical products. In this paper, we consider one such type of incomplete data, interval-censored failure time data, which have not been discussed before, and propose two procedures that use distance correlation and orthogonal sampling as well as the jackknife debiased average technique. The proposed approaches can be easily implemented, and their sure screening and rank consistency properties are established. Simulation studies demonstrate that the proposed methods work well in practical situations, and they are applied to the SEER breast cancer data.
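As a rough illustration of the distance-correlation ingredient named in the abstract, the sketch below computes the sample distance correlation of Székely, Rizzo and Bakirov (2007) and uses it to rank features against a fully observed response. It is not the authors' procedure: the interval-censoring step, the orthogonal subsampling, and the jackknife debiased averaging are all omitted, and the function names `distance_correlation` and `screen_features` are hypothetical.

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise Euclidean distance matrix with row, column and grand means removed."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation (V-statistic form) between x and y."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()      # squared sample distance covariance
    dvar_x = (a * a).mean()     # squared sample distance variance of x
    dvar_y = (b * b).mean()     # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def screen_features(X, y, keep):
    """Rank the columns of X by distance correlation with y (assumed fully observed).

    Returns the indices of the `keep` highest-scoring columns and all scores.
    """
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-scores)
    return order[:keep], scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    # Illustrative response depending nonlinearly on column 2 only.
    y = X[:, 2] ** 2 + 0.1 * rng.normal(size=200)
    top, scores = screen_features(X, y, keep=2)
    print(top, np.round(scores, 3))
```

The pairwise distance matrices cost O(n^2) memory per feature, which is one reason a full-sample computation becomes impractical for massive data and subsample-based screening becomes attractive.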

Key words and phrases: Distance correlation, Jackknife debiased average, Orthogonal subsampling, Rank consistency, Sure screening

Information

Preprint No.	SS-2023-0309
Manuscript ID	SS-2023-0309
Complete Authors	Huiqiong Li, Zhimiao Cao, Jianguo Sun, Niansheng Tang
Corresponding Authors	Jianguo Sun
Emails	sunj@missouri.edu

References

Battey, H., Fan, J., Liu, H., Lu, J., and Zhu, Z. (2018). Distributed testing and estimation under sparse high dimensional models. Annals of Statistics 46(3), 1352–1382.
Dai, J., Liu, Y., Chen, J. and Liu, X. (2020). Fast feature selection for interval-valued data through kernel density estimation entropy. International Journal of Machine Learning and Cybernetics 11, 2607–2624.
Fan, J., Feng, Y., and Song, R. (2011). Nonparametric independence screening in sparse ultrahigh-dimensional additive models. Journal of the American Statistical Association 106(494), 544–557.
Fan, J., Li, R., Zhang, C., and Zou, H. (2020). Statistical Foundations of Data Science. Boca Raton, FL: CRC Press.
Fan, J., and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911.
Fan, J., and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics 38(6), 3567–3604.
Fan, Y., and Sun, J. (2021). Subsampling from features in large regression to find “winning features”. Statistical Analysis and Data Mining 14, 168–184.
Kawaguchi, E., Suchard, M., Liu, Z., and Li, G. (2020). A surrogate l0 sparse Cox’s regression with applications to sparse high-dimensional massive sample size time-to-event data. Statistics in Medicine 39(6), 675–686.
Kalbfleisch, J., and Prentice, R. (2002). The statistical analysis of failure time data, 2nd edition. John Wiley and Sons, New York.
Kong, E., and Xia, Y. (2019). On the efficiency of online approach to nonparametric smoothing of big data. Statistica Sinica 29, 185–201.
Li, G., Peng, H., Zhang, J., and Zhu, L. (2012a). Robust rank correlation based screening. The Annals of Statistics 40(3), 1846–1877.
Li, R., Zhong, W., and Zhu, L. (2012b). Feature screening via distance correlation learning. Journal of the American Statistical Association 107(499), 1129–1139.
Li, X., and Xu, C. (2023). Feature screening with conditional rank utility for big-data classification. Journal of the American Statistical Association 119(546), 1385–1395.
Luo, L., and Song, P. (2020). Renewable estimation and incremental inference in generalized linear models with streaming data sets. Journal of the Royal Statistical Society, Series B 82, 69–97.
Ma, P., Mahoney, M., and Yu, B. (2015). A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research 16(1), 861–911.
Pan, R., Zhu, Y., Guo, B., Zhu, X., and Wang, H. (2023). A sequential addressing subsampling method for massive data analysis under memory constraint. IEEE Transactions on Knowledge and Data Engineering 35(9), 9502–9513.
Peng, Y., and Zhang, Q. (2021). Feature Selection for Interval-Valued Data Based on D-S Evidence Theory. IEEE Access 9, 122754–122765.
Schifano, E., Wu, J., Wang, C., Yan, J., and Chen, M. (2016). Online updating of statistical inference in the big data setting. Technometrics 58, 393–403.
Shi, C., Lu, W., and Song, R. (2018). A massive data framework for M-estimators with cubic-rate. Journal of the American Statistical Association 113(524), 1698–1709.
Shu, W., Chen, T., Cao, D. and Qian, Q. (2024). Incremental feature selection based on uncertainty measure for dynamic interval-valued data. International Journal of Machine Learning and Cybernetics 15, 1453–1472.
Sun, J. (2006). The Statistical Analysis of Interval-Censored Failure Time Data. New York, NY: Springer.
Székely, G., Rizzo, M., and Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics 35, 2769–2794.
Wang, H. (2019). More efficient estimation for logistic regression with optimal subsample. Journal of Machine Learning Research 20, 1–59.
Wang, C., Chen, M., Wu, J., Yan, J., Zhang, Y., and Schifano, E. (2018a). Online updating method with new variables for big data streams. Canadian Journal of Statistics 46, 123–146.
Wang, L., Elmstedt, J., Wong, W., and Xu, H. (2021a). Orthogonal subsampling for big data linear regression. The Annals of Applied Statistics 15(3), 1273–1290. doi: 10.1214/21-AOAS1462.
Wang, Y., Hong, C., Palmer, N., Di, Q., Schwartz, J., Kohane, I., and Cai, T. (2021b). A fast divide-and-conquer sparse Cox regression. Biostatistics 22(2), 381–401.
Wang, H., Yang, M., and Stufken, J. (2019). Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association 114(525), 393–405.
Wang, H., Zhu, R., and Ma, P. (2018b). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113(522), 829–844.
Wu, Y., and Cook, R. (2015). Penalized regression for interval-censored times of disease progression: selection of HLA markers in psoriatic arthritis. Biometrics 71, 782–791.
Wu, J., Chen, M., Schifano, E., and Yan, J. (2021). Online updating of survival analysis. Journal of Computational and Graphical Statistics 30(4), 1209–1223. doi: 10.1080/10618600.2020.1870481.
Wu, Y., and Yin, G. (2015). Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika 102(1), 65–76.
Xue, Y., Wang, H., Yan, J., and Schifano, E. (2019). An online updating approach for testing the proportional hazards assumption with streams of survival data. Biometrics 76(1), 171–182.
Yang, J., Schuemie, M., Ji, X., and Suchard, M. (2024). Massive Parallelization of Massive Sample-Size Survival Analysis. Journal of Computational and Graphical Statistics 33(1), 289–302.
Zhang, J., Du, M., Liu, Y., and Sun, J. (2023). A new model-free feature screening procedure for ultrahigh-dimensional interval-censored failure time data. Statistica Sinica 33, 1809–1830.
Zhang, X., and Feng, Z. (2024). Feature selection based on contradictory state sequence for multi-scale interval valued decision table. Information Sciences 677, 120926.
Zhang, Y., Hua, L., and Huang, J. (2010). A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scandinavian Journal of Statistics 37, 338–354.
Zhao, T., Cheng, G., and Liu, H. (2016). A partially linear framework for massive heterogeneous data. Annals of Statistics 44(4), 1400–1437.
Zhong, W., Zhu, L., Li, R., and Cui, H. (2016). Regularized quantile regression and robust feature screening for single index models. Statistica Sinica 26(1), 69–95.
Zhong, W., Qian, C., Liu, W., Zhu, L., and Li, R. (2023). Feature screening for interval-valued response with application to study association between posted salary and required skills. Journal of the American Statistical Association 118(542), 805–817.
Zhou, Q., Zhou, H., and Cai, J. (2017). Case-cohort studies with interval-censored failure time data. Biometrika 104, 17–29.
Zhu, L., Li, L., Li, R., and Zhu, L. (2011). Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association 106, 1464–1475.
Zhu, X., Pan, R., Wu, S., and Wang, H. (2022). Feature screening for massive data analysis by subsampling. Journal of Business & Economic Statistics 40(4), 1892–1903.
Zhu, L., Xu, K., Li, R., and Zhong, W. (2017). Projection correlation between two random vectors. Biometrika 104, 829–843.
Zuo, L., Zhang, H., Wang, H., and Liu, L. (2021). Sampling-based estimation for massive survival data with additive hazards model. Statistics in Medicine 40(2), 441–450.

Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, 2 Cuihu North Road, Kunming City, Yunnan Province, Kunming, 650091, China

Acknowledgments

The authors wish to thank the Co-Editor, Dr. Huixia Wang, the Associate Editor, and two reviewers for their many helpful and insightful comments and suggestions that greatly improved the paper. The research was partially supported by a grant from the National Key R&D Program of China (Grant Number 2022YFA1003701), a grant from the Natural Science Foundation of China (Grant Number 12261102), and grants from the Yunnan Fundamental Research Project, China (Grant Numbers 202201BF070001-004, 202301AS070044, 202401AS070152).

Supplementary Materials

The online Supplementary Material includes the three algorithms mentioned above, some additional simulation results, and the proofs of all the theorems. Supplementary materials are available for download.
