Conditional Quantile-based Variable Screening with FDR Control in Joint Factor Models

Han Pan, Wei Xiong and Mingyao Ai

doi:10.5705/ss.202025.0167

Abstract

Joint factor models are commonly adopted to relate unobservable factors to covariates. Traditional approaches to joint models often assume linear relationships between latent factors and covariates, require prior knowledge of the number of latent factors, and typically fail to address heavy-tailedness or high-dimensionality. To overcome these challenges, we propose a general factor-covariate model and introduce a new variable selection procedure to broaden the scope of application and to alleviate the curse of dimensionality. The procedure unfolds in three steps: robust estimation of factors via Huber regression, feature screening using an index of mean squared deviation (MSD) of conditional quantile, and false discovery rate (FDR) control based on derandomized quantile knockoffs. To facilitate implementation, we employ smoothing quantile regression and apply a modified bootstrap-based eigenvalue method to determine the number of factors. Theoretical justifications on the sure screening property as well as the control of FDR, per family error rate and k family-wise error rate are provided. The superiority of our proposed procedure over existing methods is demonstrated by numerical studies on simulated and real datasets.
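
The loss functions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the Gaussian kernel used for smoothing is an assumption made here for convenience, and the location estimator is only the simplest instance of Huber-type robust estimation, not the factor estimation step of the paper.

```python
import math

def huber_loss(u, tau):
    """Huber loss: quadratic for |u| <= tau, linear beyond that."""
    a = abs(u)
    return 0.5 * u * u if a <= tau else tau * a - 0.5 * tau * tau

def check_loss(u, q):
    """Quantile check loss rho_q(u) = u * (q - 1{u < 0})."""
    return u * (q - (1.0 if u < 0 else 0.0))

def smoothed_check_loss(u, q, h):
    """Check loss convolved with a Gaussian kernel of bandwidth h > 0.

    Uses rho_q(u) = |u|/2 + (q - 1/2) u and the closed form of
    E|a + Z| for a standard normal Z (Gaussian kernel is an assumption).
    """
    a = u / h
    phi_neg = 0.5 * (1.0 + math.erf(-a / math.sqrt(2.0)))  # Phi(-a)
    mean_abs = (math.sqrt(2.0 / math.pi) * math.exp(-0.5 * a * a)
                + a * (1.0 - 2.0 * phi_neg))
    return 0.5 * h * mean_abs + (q - 0.5) * u

def huber_location(xs, tau, iters=100):
    """Huber M-estimate of location via iteratively reweighted averaging."""
    mu = sorted(xs)[len(xs) // 2]  # start from the (upper) median
    for _ in range(iters):
        # Weight 1 inside the quadratic zone, tau/|residual| outside it.
        w = [1.0 if abs(x - mu) <= tau else tau / abs(x - mu) for x in xs]
        mu = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
    return mu
```

For example, `huber_location([1, 2, 3, 4, 100], tau=1.5)` stays close to the bulk of the data rather than being pulled toward the outlier, and `smoothed_check_loss` approaches `check_loss` as the bandwidth `h` shrinks.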

Key words and phrases: Derandomized knockoffs, False discovery rate, High-dimensional screening, Joint models, Smoothing quantile regression

Information

Preprint No.	SS-2025-0167
Manuscript ID	SS-2025-0167
Complete Authors	Han Pan, Wei Xiong, Mingyao Ai
Corresponding Authors	Mingyao Ai
Emails	myai@math.pku.edu.cn

References

Ahn, S. C. and A. R. Horenstein (2013). Eigenvalue ratio test for the number of factors. Econometrica 81(3), 1203–1227.
Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71(1), 135–171.
Bai, J. and K. Li (2012). Statistical analysis of factor models of high dimension. The Annals of Statistics 40(1), 436–465.
Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70(1), 191–221.
Bai, Z., K. P. Choi, and Y. Fujikoshi (2018). Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis. The Annals of Statistics 46(3), 1050–1076.
Barber, R. F. and E. J. Candès (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics 43(5), 2055–2085.
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57(1), 289–300.
Candès, E., Y. Fan, L. Janson, and J. Lv (2018). Panning for gold: model-X knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3), 551–577.
Desai, K. H. and J. D. Storey (2012). Cross-dimensional inference of dependent high-dimensional data. Journal of the American Statistical Association 107(497), 135–151.
Dobriban, E. and A. B. Owen (2019). Deterministic parallel analysis: an improved method for selecting factors and principal components. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81(1), 163–183.
Fan, J., Y. Feng, and R. Song (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association 106(494), 544–557.
Fan, J., J. Guo, and S. Zheng (2022). Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association 117(538), 852–861.
Fan, J., Y. Ke, Q. Sun, and W.-X. Zhou (2019). FarmTest: Factor-adjusted robust multiple testing with approximate false discovery control. Journal of the American Statistical Association 114(528), 1880–1893.
Fan, J., Y. Liao, and M. Mincheva (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(4), 603–680.
Fernandes, M., E. Guerre, and E. Horta (2021). Smoothing quantile regressions. Journal of Business & Economic Statistics 39(1), 338–357.
Fredrickson, B. L., K. M. Grewen, K. A. Coffey, S. B. Algoe, A. M. Firestine, J. M. Arevalo, J. Ma, and S. W. Cole (2013). A functional genomic perspective on human well-being. Proceedings of the National Academy of Sciences 110(33), 13684–13689.
He, X., X. Pan, K. M. Tan, and W.-X. Zhou (2023). Smoothed quantile regression with large-scale inference. Journal of Econometrics 232(2), 367–388.
He, X., L. Wang, and H. G. Hong (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. The Annals of Statistics 41(1), 342–369.
Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35(1), 73–101.
Li, G., Y. Li, and C.-L. Tsai (2015). Quantile correlations and quantile autoregressive modeling. Journal of the American Statistical Association 110(509), 246–261.
Li, Q., G. Cheng, J. Fan, and Y. Wang (2018). Embracing the blessing of dimensionality in factor models. Journal of the American Statistical Association 113(521), 380–389.
Li, R., W. Zhong, and L. Zhu (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association 107(499), 1129–1139.
Liu, J., Y. Si, Y. Niu, and R. Zhang (2022). Projection quantile correlation and its use in high-dimensional grouped variable screening. Computational Statistics & Data Analysis 167, 107369.
Liu, W., Y. Ke, J. Liu, and R. Li (2022). Model-free feature screening and FDR control with knockoff features. Journal of the American Statistical Association 117(537), 428–443.
Onatski, A. (2009). Testing hypotheses about the number of factors in large factor models. Econometrica 77(5), 1447–1479.
Ouyang, M., X. Wang, C. Wang, and X. Song (2018). Bayesian semiparametric failure time models for multivariate censored data with latent variables. Statistics in Medicine 37(28), 4279–4297.
Owen, A. B. and J. Wang (2016). Bi-cross-validation for factor analysis. Statistical Science 31(1), 119–139.
Ren, Z. and R. F. Barber (2024). Derandomised knockoffs: leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 86(1), 122–154.
Ren, Z., Y. Wei, and E. Candès (2023). Derandomizing knockoffs. Journal of the American Statistical Association 118(542), 948–958.
Roy, J. and X. Lin (2000). Latent variable models for longitudinal data with multiple continuous outcomes. Biometrics 56(4), 1047–1054.
Shao, X. and J. Zhang (2014). Martingale difference correlation and its use in high-dimensional variable screening. Journal of the American Statistical Association 109(507), 1302–1318.
Wang, H. (2012). Factor profiled sure independence screening. Biometrika 99(1), 15–28.
Yu, C., W. Guo, X. Song, and H. Cui (2023). Feature screening with latent responses. Biometrics 79(2), 878–890.
Zhu, L., L. Li, R. Li, and L. Zhu (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association 106(496), 1464–1475.

Han Pan, School of Statistics and Mathematics, Shandong University of Finance and Economics, Jinan, China

Acknowledgments

The authors would like to thank the editor, associate editor and referees for their insightful comments and suggestions that have significantly improved the paper. The first two authors were supported by the Funds for Central Universities in UIBE CXTD14-05. Ai’s work was supported by NSFC grants 12131001, W2412023 and 72595831, and LMEQF.

Supplementary Materials

The properties of the MSD index under Gaussian distribution, details on the MSD knockoffs procedure, v-quantile knockoffs and estimation of the MSD index, figures for convolution-type smoothed quantile loss, as well as all technical proofs and additional results from numerical studies, are provided in the online Supplementary Material.

