Empirical Bayes Data Integration for Multi-Response Regression

Antik Chakraborty and Fei Xue

doi:10.5705/ss.202025.0115

Abstract

Motivated by applications in transcriptome-wide association studies (TWAS), we develop a flexible and theoretically grounded empirical Bayes approach for integrating data obtained from different sources. We propose a linear shrinkage estimator that effectively shrinks singular values of a data matrix. This problem is closely connected to estimating covariance matrices under a specific loss, for which we develop asymptotically optimal estimators. The basic linear shrinkage estimator is then extended to a local linear shrinkage estimator, offering greater flexibility. Crucially, the proposed method works under sparse or dense and low-rank or non-low-rank parameter settings, unlike well-known sparse or reduced-rank estimators in the literature. Furthermore, the empirical Bayes approach offers greater computational scalability than intensive full Bayes procedures. The method is evaluated through an extensive set of numerical experiments and applied to a real TWAS data set obtained from the Genotype-Tissue Expression (GTEx) project.
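To make the idea of shrinking singular values concrete, the sketch below uses the classical Efron–Morris estimator from the cited literature (Efron and Morris, 1972), not the estimator proposed in this paper. It assumes an n × p data matrix with unit noise variance and n > p + 1; the function names and the simulated data are illustrative assumptions only.

```python
import numpy as np


def efron_morris(X):
    """Classical Efron-Morris estimator X (I - (n - p - 1) (X^T X)^{-1}).

    Assumes X is n x p with unit noise variance and n > p + 1.
    """
    n, p = X.shape
    if n - p - 1 <= 0:
        raise ValueError("requires n > p + 1")
    return X @ (np.eye(p) - (n - p - 1) * np.linalg.inv(X.T @ X))


def shrink_singular_values(X):
    """Same estimator written as an explicit rescaling of singular values."""
    n, p = X.shape
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    # Each singular value d_i is multiplied by 1 - (n - p - 1) / d_i^2.
    d_shrunk = d * (1.0 - (n - p - 1) / d**2)
    return U @ np.diag(d_shrunk) @ Vt


# Simulated example (hypothetical data): rank-2 signal plus unit-variance noise.
rng = np.random.default_rng(0)
n, p = 50, 5
M = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p))
X = M + rng.normal(size=(n, p))

est_a = efron_morris(X)
est_b = shrink_singular_values(X)
print(np.allclose(est_a, est_b))
```

The two functions agree because X = U D Vᵀ implies X(XᵀX)⁻¹ = U D⁻¹ Vᵀ, so the matrix form acts on each singular value separately. Note that the multiplier 1 − (n − p − 1)/dᵢ² can become negative when a singular value is small.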

Key words and phrases: Covariance matrix estimation, GTEx, reduced rank, shrinkage, TWAS

Information

Preprint No.	SS-2025-0115
Manuscript ID	SS-2025-0115
Complete Authors	Antik Chakraborty, Fei Xue
Corresponding Authors	Fei Xue
Emails	feixue@purdue.edu

References

Anderson, T. W. (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics, 327–351.
Bai, R. and M. Ghosh (2018). High-dimensional multivariate posterior consistency under global–local shrinkage priors. Journal of Multivariate Analysis 167, 157–170.
Bai, Z.-D. and J. W. Silverstein (1998). No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. The Annals of Probability 26(1), 316–345.
Banerjee, T., G. Mukherjee, and D. Paul (2021). Improved shrinkage prediction under a spiked covariance structure. Journal of Machine Learning Research 22(180), 1–40.
Boukehil, D., D. Fourdrinier, F. Mezoued, and W. E. Strawderman (2021). Estimation of the inverse scatter matrix for a scale mixture of Wishart matrices under Efron–Morris type losses. Journal of Statistical Planning and Inference 215, 368–387.
Boyle, E. A., Y. I. Li, and J. K. Pritchard (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell 169(7), 1177–1186.
Bunea, F., Y. She, and M. H. Wegkamp (2011). Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics 39(2), 1282–1309.
Bunea, F., Y. She, and M. H. Wegkamp (2012). Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. The Annals of Statistics 40(5), 2359–2388.
Carvalho, C., N. Polson, and J. Scott (2010). The horseshoe estimator for sparse signals. Biometrika 97(2), 465–480.
Chakraborty, A., A. Bhattacharya, and B. K. Mallick (2020). Bayesian sparse multiple regression for simultaneous rank reduction and variable selection. Biometrika 107(1), 205–221.
Chen, L. and J. Z. Huang (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association 107(500), 1533–1545.
GTEx Consortium (2020). The GTEx consortium atlas of genetic regulatory effects across human tissues. Science 369(6509), 1318–1330.
Efron, B. and C. Morris (1972). Empirical Bayes on vector observations: An extension of Stein’s method. Biometrika 59(2), 335–347.
Efron, B. and C. Morris (1976). Multivariate empirical Bayes and estimation of covariance matrices. The Annals of Statistics, 22–32.
Fu, J., M. G. Wolfs, P. Deelen, H.-J. Westra, R. S. Fehrmann, G. J. Te Meerman, W. A. Buurman, S. S. Rensen, H. J. Groen, R. K. Weersma, et al. (2012). Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression. PLoS Genetics 8(1), e1002431.
Geweke, J. (1991). Efficient simulation from the multivariate normal and Student-t distributions subject to linear constraints and the evaluation of constraint probabilities. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp. 571–578. Fairfax, Virginia: Interface Foundation of North America, Inc.
Glynn, P. W. and C.-h. Rhee (2014). Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability 51(A), 377–389.
Gresle, M. M., M. A. Jordan, J. Stankovich, T. Spelman, L. J. Johnson, L. Laverick, A. Hamlett, L. D. Smith, V. G. Jokubaitis, J. Baker, et al. (2020). Multiple sclerosis risk variants regulate gene expression in innate and adaptive immune cells. Life Science Alliance 3(7).
Haff, L. (1979). Estimation of the inverse covariance matrix: Random mixtures of the inverse Wishart matrix and the identity. The Annals of Statistics, 1264–1276.
Heap, G. A., G. Trynka, R. C. Jansen, M. Bruinenberg, M. A. Swertz, L. C. Dinesen, K. A. Hunt, C. Wijmenga, D. A. Vanheel, and L. Franke (2009). Complex nature of SNP genotype effects on gene expression in primary human leucocytes. BMC Medical Genomics 2, 1–13.
Hobert, J. P. (2011). The data augmentation algorithm: Theory and methodology. Handbook of Markov Chain Monte Carlo, 253–293.
Hu, Y., M. Li, Q. Lu, H. Weng, J. Wang, S. M. Zekavat, Z. Yu, B. Li, J. Gu, S. Muchnik, et al. (2019). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature Genetics 51(3), 568–576.
Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5(2), 248–264.
Javanmard, A. and A. Montanari (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research 15(1), 2869–2909.
Josse, J. and S. Wager (2016). Bootstrap-based regularization for low-rank matrix estimation. Journal of Machine Learning Research 17(124), 1–29.
Kim, Y., W. Wang, P. Carbonetto, and M. Stephens (2024). A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression. Journal of Machine Learning Research 25(185), 1–59.
Kubokawa, T. and M. S. Srivastava (2008). Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data. Journal of Multivariate Analysis 99(9), 1906–1928.
Langfelder, P. and S. Horvath (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9(1), 559.
Ledoit, O. and S. Péché (2011). Eigenvectors of some large sample covariance matrix ensembles. Probability Theory and Related Fields 151(1-2), 233–264.
Ledoit, O. and M. Wolf (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88(2), 365–411.
Ledoit, O. and M. Wolf (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics 40(2), 1024–1060.
Ledoit, O. and M. Wolf (2018). Optimal estimation of a large-dimensional covariance matrix under Stein’s loss. Bernoulli 24(4B), 3791–3832.
Ledoit, O. and M. Wolf (2022). Quadratic shrinkage for large covariance matrices. Bernoulli 28(3), 1519–1547.
Lloyd-Jones, L. R., A. Holloway, A. McRae, J. Yang, K. Small, J. Zhao, B. Zeng, A. Bakshi, A. Metspalu, M. Dermitzakis, et al. (2017). The genetic architecture of gene expression in peripheral blood. The American Journal of Human Genetics 100(2), 228–237.
Lonsdale, J., J. Thomas, M. Salvatore, R. Phillips, E. Lo, S. Shad, R. Hasz, G. Walters, F. Garcia, N. Young, et al. (2013). The genotype-tissue expression (GTEx) project. Nature Genetics 45(6), 580–585.
Mai, J., M. Lu, Q. Gao, J. Zeng, and J. Xiao (2023). Transcriptome-wide association studies: recent advances in methods, applications and available databases. Communications Biology 6(1), 899.
Matsuda, T. and F. Komaki (2015). Singular value shrinkage priors for Bayesian prediction. Biometrika 102(4), 843–854.
Matsuda, T. and F. Komaki (2019). Empirical Bayes matrix completion. Computational Statistics & Data Analysis 137, 195–210.
Matsuda, T. and W. E. Strawderman (2022). Estimation under matrix quadratic loss and matrix superharmonicity. Biometrika 109(2), 503–519.
Narisetty, N. N. and X. He (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics 42(2), 789–817.
Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103(482), 681–686.
Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 1617–1642.
Polson, N. G. and J. G. Scott (2010). Shrink globally, act locally: sparse Bayesian regularization and prediction. Bayesian Statistics 9, 501–538.
Silverstein, J. W. (1995). Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. Journal of Multivariate Analysis 55(2), 331–339.
Silverstein, J. W. and Z. Bai (1995). On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate Analysis 54(2), 175–192.
Silverstein, J. W. and S.-I. Choi (1995). Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis 54(2), 295–309.
Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 1135–1151.
Velu, R. and G. C. Reinsel (2013). Multivariate reduced-rank regression: theory and applications, Volume 136. Springer Science & Business Media.
Wainberg, M., N. Sinnott-Armstrong, N. Mancuso, A. N. Barbeira, D. A. Knowles, D. Golan, R. Ermel, A. Ruusalepp, T. Quertermous, K. Hao, et al. (2019). Opportunities and challenges for transcriptome-wide association studies. Nature Genetics 51(4), 592–599.
Wang, J., E. R. Gamazon, B. L. Pierce, B. E. Stranger, H. K. Im, R. D. Gibbons, N. J. Cox, D. L. Nicolae, and L. S. Chen (2016). Imputing gene expression in uncollected tissues within and beyond GTEx. The American Journal of Human Genetics 98(4), 697–708.
Wang, Y. and S. D. Zhao (2021). Linear shrinkage for predicting responses in large-scale multivariate linear regression. arXiv preprint arXiv:2104.08970.
Xue, F. and H. Li (2022). An empirical Bayes regression for multi-tissue eQTL data analysis. arXiv preprint arXiv:2211.13889.
Yuan, M., A. Ekici, Z. Lu, and R. Monteiro (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(3), 329–346.

Acknowledgments

This work was supported in part by the National Science Foundation under Grant DMS

We are also grateful to Yunlong Liu of the Department of Medical and Molecular Genetics at the Indiana University School of Medicine for his helpful discussions on the real data.

Supplementary Materials

Additional simulation results, proofs, and algorithms are provided in the supplementary material.

Supplementary materials are available for download.
