Abstract
Joint factor models are commonly adopted to relate unobservable factors to covariates.
Traditional approaches to joint models often assume linear relationships between latent factors and
covariates, require prior knowledge of the number of latent factors, and typically fail to address
heavy-tailedness or high-dimensional covariates. To overcome these challenges, we propose a general factor-covariate model and introduce a new variable selection procedure to broaden the scope of
application and to alleviate the curse of dimensionality. The procedure unfolds in three steps:
robust estimation of factors via Huber regression, feature screening using an index of mean squared
deviation (MSD) of the conditional quantile, and false discovery rate (FDR) control based on derandomized quantile knockoffs. To facilitate implementation, we employ smoothing quantile regression and
apply a modified bootstrap-based eigenvalue method to determine the number of factors. Theoretical
justifications for the sure screening property, as well as for the control of the FDR, the per-family error rate, and the
k-familywise error rate, are provided. The superiority of our proposed procedure over existing methods
is demonstrated by numerical studies on simulated and real datasets.
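As a rough illustration of the robustness idea behind the first step, the sketch below implements the Huber loss and a one-dimensional Huber location estimate by iteratively reweighted averaging. It is not the factor estimator of the paper: the function names `huber_loss` and `huber_mean`, the threshold `tau`, and the toy data are all assumptions made for this example.

```python
import numpy as np

def huber_loss(r, tau):
    """Huber loss: quadratic for |r| <= tau, linear beyond (tau is an assumed tuning constant)."""
    r = np.asarray(r, dtype=float)
    a = np.abs(r)
    return np.where(a <= tau, 0.5 * r**2, tau * a - 0.5 * tau**2)

def huber_mean(x, tau, n_iter=100, tol=1e-10):
    """Location estimate minimizing the summed Huber loss via iteratively reweighted averaging."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)  # robust starting value
    for _ in range(n_iter):
        r = x - mu
        # Weight psi(r)/r = min(1, tau/|r|); the floor avoids division by zero.
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        new_mu = np.sum(w * x) / np.sum(w)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu

# Toy data: five values near zero and one gross outlier.
x = np.array([0.0, 0.1, -0.1, 0.05, -0.05, 100.0])
print(np.mean(x), huber_mean(x, tau=1.0))
```

On this toy sample the ordinary mean is pulled far toward the outlier, whereas the Huber estimate stays close to the bulk of the data, because the outlier's contribution is capped at `tau`.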
Information
| Preprint No. | SS-2025-0167 |
|---|---|
| Manuscript ID | SS-2025-0167 |
| Complete Authors | Han Pan, Wei Xiong, Mingyao Ai |
| Corresponding Authors | Mingyao Ai |
| Emails | myai@math.pku.edu.cn |

Han Pan, School of Statistics and Mathematics, Shandong University of Finance and Economics, Jinan, China
Acknowledgments
The authors would like to thank the editor, associate editor, and referees for their insightful
comments and suggestions, which have significantly improved the paper. The first two authors
were supported by the Funds for Central Universities in UIBE CXTD14-05. Ai’s work was
supported by NSFC grants 12131001 and W2412023, and LMEQF.
Supplementary Materials
The online Supplementary Materials provide the properties of the MSD index under the Gaussian distribution,
details on the MSD knockoffs procedure, v-quantile knockoffs, and estimation of the MSD index,
figures for the convolution-type smoothed quantile loss, all technical proofs, and additional results
from the numerical studies.