Abstract
Ridge regression with random coefficients provides a flexible approach for modeling many small but
nonzero effects in high-dimensional data. We embed this framework in transfer learning by leveraging source
samples from related regression models: the informativeness of each source is captured via the correlation between
its coefficients and those of the target. We propose two weighted estimators—one minimizing estimation risk and
the other minimizing prediction risk—each formed as an optimal blend of target and source ridge estimates. Under
the high-dimensional regime p/n → γ, where p is the number of predictors and n is the sample size, random
matrix theory yields closed-form limits for these optimal weights and their associated risks. Through simulations
and applications to lipid-trait and colorectal-cancer microbiome prediction, our methods consistently outperform
both target-only and pooled-data ridge regression.
Key words and phrases: Covariate shift; estimation risk; prediction risk; random matrix theory
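To make the idea of blending target and source ridge estimates concrete, the following is a minimal simulation sketch, not the estimators proposed in the paper. It assumes a single convex weight w on the target estimate (the paper's optimal weights need not take this form), a common ridge penalty `lam`, Gaussian designs, and an oracle estimation risk computed against the true target coefficients, which is available only in simulation. The function names and default parameter values are hypothetical.

```python
import numpy as np


def ridge(X, y, lam):
    """Ridge estimate (X'X/n + lam*I)^{-1} X'y/n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)


def blended_risk_curve(n_target=200, n_source=1000, p=400, rho=0.8,
                       lam=1.0, seed=0):
    """Oracle estimation risk of w*b_target + (1-w)*b_source over a grid of w.

    Assumption: coefficients are random with many small nonzero effects,
    and the source coefficients have correlation rho with the target ones.
    """
    rng = np.random.default_rng(seed)

    # Random coefficients with variance 1/p each, correlated at level rho.
    beta_t = rng.normal(0.0, 1.0 / np.sqrt(p), size=p)
    indep = rng.normal(0.0, 1.0 / np.sqrt(p), size=p)
    beta_s = rho * beta_t + np.sqrt(1.0 - rho**2) * indep

    # Target and source samples from their own linear models.
    X_t = rng.normal(size=(n_target, p))
    y_t = X_t @ beta_t + rng.normal(size=n_target)
    X_s = rng.normal(size=(n_source, p))
    y_s = X_s @ beta_s + rng.normal(size=n_source)

    b_t = ridge(X_t, y_t, lam)
    b_s = ridge(X_s, y_s, lam)

    # Grid search over the blend weight (a stand-in for closed-form weights).
    weights = np.linspace(0.0, 1.0, 101)
    risks = np.array([
        np.sum((w * b_t + (1.0 - w) * b_s - beta_t) ** 2) for w in weights
    ])
    return weights, risks


if __name__ == "__main__":
    weights, risks = blended_risk_curve()
    best = weights[np.argmin(risks)]
    print(f"best weight on target estimate: {best:.2f}")
    print(f"risk target-only: {risks[-1]:.4f}, source-only: {risks[0]:.4f}, "
          f"blended: {risks.min():.4f}")
```

Because the grid includes w = 0 and w = 1, the best blended risk in this sketch can never exceed that of the source-only or target-only estimate; how much it improves depends on `rho`, the sample sizes, and `lam`.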
Information
| Preprint No. | SS-2025-0232 |
|---|---|
| Manuscript ID | SS-2025-0232 |
| Complete Authors | Hongzhe Zhang, Hongzhe Li |
| Corresponding Authors | Hongzhe Li |
| Emails | hongzhe@upenn.edu |
Acknowledgments
We would like to thank Dr. Jiaoyang Huang and Dr. Edgar Dobriban for discussions on random
matrix theorems in the derivations. H.L.’s research is partially supported by NIH grants GM123056
and GM129781.
Supplementary Materials
The supplementary materials available online include details of additional lemmas and corollaries, the
proofs of all lemmas, corollaries, and theorems, and parameter estimation for the real data analysis.