Abstract
Missing values are one of the common challenges researchers face in modern data analysis. In the context of penalized linear regression, which has been studied extensively over several decades, missing values introduce bias and yield a covariance matrix of the covariates that is not positive definite, rendering the least-squares loss function non-convex. In this paper, we propose a novel procedure, the linear shrinkage positive definite (LPD) modification, to address this issue. The LPD modification adjusts the covariance matrix of the covariates so that the resulting estimator is both consistent and positive definite. Employing the new covariance estimator, we transform the penalized regression problem into a convex one, thereby facilitating the identification of sparse solutions. Notably, the LPD modification is computationally efficient and can be expressed analytically.
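To make the construction concrete, the Python sketch below illustrates one common form of such a modification: a pairwise-complete-observation covariance estimate, which can be indefinite under missingness, is shrunk linearly toward a scaled identity using the smallest weight that lifts its minimum eigenvalue above a small threshold. The target mu*I, the threshold eps, and the closed-form weight are illustrative assumptions, not necessarily the exact LPD estimator analyzed in the paper.

```python
import numpy as np

def pairwise_covariance(X):
    """Covariance of covariates from incomplete data (NaN = missing),
    computed from pairwise-complete observations; the result need not
    be positive definite."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    obs = ~np.isnan(X)
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            both = obs[:, j] & obs[:, k]   # rows where both entries are observed
            xj = X[both, j] - X[both, j].mean()
            xk = X[both, k] - X[both, k].mean()
            S[j, k] = S[k, j] = (xj * xk).sum() / max(both.sum() - 1, 1)
    return S

def lpd_modification(S, eps=1e-4):
    """Linear shrinkage toward a scaled identity,
        Sigma = (1 - alpha) * S + alpha * mu * I,
    using the smallest alpha that raises the minimum eigenvalue to eps.
    The target mu and threshold eps are illustrative choices."""
    p = S.shape[0]
    mu = np.trace(S) / p                       # shrinkage target mu * I
    lam_min = np.linalg.eigvalsh(S).min()      # smallest eigenvalue of S
    if lam_min >= eps:
        return S, 0.0                          # already sufficiently positive definite
    alpha = (eps - lam_min) / (mu - lam_min)   # closed-form shrinkage weight
    return (1.0 - alpha) * S + alpha * mu * np.eye(p), alpha
```

Because the shrinkage weight is available in closed form, the modification costs only a single eigenvalue computation, consistent with the claim that the LPD modification is analytic and computationally efficient.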
In the presence of missing values, we establish selection consistency and derive the convergence rate of the ℓ1-penalized regression estimator with LPD, showing an ℓ2-error rate of order s0^{3/2}√(log p/n), where s0 is the number of non-zero coefficients. To further evaluate the effectiveness of our approach, we analyze real data from the Genomics of Drug Sensitivity in Cancer (GDSC) dataset, which provides incomplete measurements of drug sensitivities of cell lines and their protein expressions. We fit a series of penalized linear regression models, with each sensitivity value serving as the response variable and the protein expressions as explanatory variables.
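To illustrate how the modified covariance enters the regression step, the hypothetical sketch below minimizes the ℓ1-penalized quadratic objective 0.5·βᵀΣβ − ρᵀβ + λ‖β‖₁ by cyclic coordinate descent, where Σ is the LPD-modified covariate covariance and ρ is a covariate-response cross-covariance computed from available cases. This is a generic convex solver under those assumptions, not the paper's exact algorithm or tuning procedure.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in lasso-type coordinate updates."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_penalized_fit(Sigma, rho, lam, n_iter=1000, tol=1e-8):
    """Minimize 0.5 * b'Sigma b - rho'b + lam * ||b||_1 by cyclic
    coordinate descent; the problem is convex because Sigma is
    positive definite after the LPD modification."""
    p = Sigma.shape[0]
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            # rho_j minus the off-diagonal contribution of Sigma[j, :] @ beta
            partial = rho[j] - Sigma[j, :] @ beta + Sigma[j, j] * beta[j]
            beta[j] = soft_threshold(partial, lam) / Sigma[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

In the GDSC analysis described above, such a fit would be repeated with each drug's sensitivity as the response; the cross-covariance ρ can be formed entrywise from cases where the response and the corresponding protein expression are jointly observed.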
Information
| Field | Value |
|---|---|
| Preprint No. | SS-2025-0303 |
| Manuscript ID | SS-2025-0303 |
| Complete Authors | Seongoh Park, Seongjin Lee, Nguyen Thi Hai Yen, Nguyen Phuoc Long, Johan Lim |
| First Author Affiliation | School of Mathematics, Statistics and Data Science, Sungshin Women's University |
| Corresponding Author | Johan Lim |
| Email | johanlim@snu.ac.kr |
Acknowledgments
We thank the anonymous reviewers for their constructive comments and valuable suggestions,
which have greatly improved the quality of this work.
Seongoh Park was supported by the government of the Republic of Korea (MSIT) and the National Research Foundation of Korea (NRF-1711200203), and by the Sungshin Women's University Research Grant of H20240073. Johan Lim was supported by the government of the Republic of Korea (MSIT) and the National Research Foundation of Korea (NRF-
Supplementary Materials
The supplementary material, available online, presents additional simulation results and the technical theorems used to prove the main results.