Abstract
We consider feature screening for high-dimensional response data, both in the
absence and in the presence of confounding factors. First, we introduce kernel covariance
and kernel correlation for high-dimensional association analysis, and further propose partial kernel covariance and partial kernel correlation, which can handle situations with confounding factors. Then, based on the kernel correlation and partial
kernel correlation, we propose two feature screening procedures. Both
procedures possess the sure screening property and the ranking consistency property, and
they complement each other: one handles settings without confounding factors
and the other handles settings with them. The proposed procedures make no
model assumptions and are suitable for high-dimensional response variables
and non-Euclidean data. Extensive simulation results and a real data analysis
demonstrate the satisfactory performance of the proposed procedures and their advantages over existing methods.
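To make the idea of correlation-based screening concrete, the following is a minimal sketch and not the estimator defined in the paper. It assumes an HSIC-type kernel covariance (the normalized inner product of double-centered Gaussian Gram matrices) and divides by the two self-covariances to obtain a correlation. Each feature is then ranked by its kernel correlation with the multivariate response. The function names (`centered_gram`, `kernel_cov`, `kernel_corr`, `screen`), the Gaussian kernel with unit bandwidth, and the simulated data are all illustrative assumptions. The partial version for confounding factors is not sketched here.

```python
import math
import random


def centered_gram(rows, bandwidth=1.0):
    """Gaussian Gram matrix of the sample, double-centered (assumed kernel choice)."""
    n = len(rows)
    gram = [[math.exp(-sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
                      / (2.0 * bandwidth ** 2))
             for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in gram]  # equals the column means by symmetry
    grand = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def kernel_cov(a, b):
    """HSIC-type empirical kernel covariance from two centered Gram matrices."""
    n = len(a)
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n ** 2


def kernel_corr(a, b):
    """Kernel covariance scaled by the two self-covariances."""
    denom = math.sqrt(kernel_cov(a, a) * kernel_cov(b, b))
    return kernel_cov(a, b) / denom if denom > 0 else 0.0


def screen(features, response, keep):
    """Rank each feature column by its kernel correlation with the response."""
    gy = centered_gram(response)
    p = len(features[0])
    scores = [kernel_corr(centered_gram([[row[j]] for row in features]), gy)
              for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores


# Hypothetical data: a two-dimensional response driven by features 0 and 3,
# one linearly and one through a square, with no model supplied to the screener.
random.seed(0)
n, p = 120, 15
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
Y = [[row[0] + 0.1 * random.gauss(0, 1),
      row[3] ** 2 + 0.1 * random.gauss(0, 1)] for row in X]
selected, scores = screen(X, Y, keep=2)
print(sorted(selected))
```

Because the score depends on the data only through Gram matrices, the same ranking step applies to any response on which a kernel can be defined, which is what makes screening of this kind usable for non-Euclidean data.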
Information
| Preprint No. | SS-2023-0290 |
|---|---|
| Manuscript ID | SS-2023-0290 |
| Complete Authors | Yuke Shi, Na Li, Qizhai Li, Dongdong Pan, Jinjuan Wang |
| Corresponding Authors | Jinjuan Wang |
| Emails | wangjinjuan@bit.edu.cn |
Acknowledgments
We would like to express our sincere gratitude to Qunqiang Feng for his invaluable contributions to this article. Jinjuan Wang has been supported by
the National Natural Science Foundation of China (NSFC) (Grant No. 12101047)
and the Beijing Institute of Technology Research Fund Program for Young
Scholars. Qizhai Li has been supported by the National Natural Science Foundation of China (NSFC) (Grant No. 12325110) and the CAS Project for Young
Scientists in Basic Research (Grant No. YSBR-034).
Supplementary Materials
The proofs of Theorems 1 and 2, as well as additional numerical simulations
on one-dimensional response variable models, two distinct kernels, and other
popular machine learning approaches, can be found in the Supplementary
Material.