Abstract
Consider fitting a general parametric regression model, such as a generalized linear model, using individual-level data. It is common to have summary information,
such as parameter estimates, available from external studies that use similar regression models. Many methods have been developed to incorporate this external
information into internal model fitting to improve parameter estimation. Some
of these methods aim to reduce estimation variance without introducing estimation bias that could result from study population heterogeneity. Others accept
some bias in exchange for a substantial reduction in variance, following the
bias-variance trade-off. We take the latter approach and develop
James-Stein shrinkage estimators to integrate the external information. These
estimators can reduce the asymptotic risk compared to not using the external information, regardless of the degree of heterogeneity between internal and external
populations. This property is highly desirable because it makes the use of
external information safe. Few existing methods provide such a guaranteed improvement. We also conduct simulation studies and apply the method to
a prostate cancer dataset to illustrate the numerical performance.
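As an illustration of the shrinkage idea described above, the following is a minimal sketch of a generic positive-part James-Stein-type estimator that pulls an internal estimate toward an external one. It is not the estimator developed in this paper. It assumes that the external estimate is treated as a fixed target, that the internal estimate is approximately normal with a consistently estimated covariance matrix, and that at least three parameters are shared between the two models. The function name `james_stein_shrink` and its arguments are hypothetical.

```python
import numpy as np


def james_stein_shrink(theta_int, cov_int, theta_ext):
    """Positive-part James-Stein-type shrinkage of theta_int toward theta_ext.

    Illustrative sketch only (assumptions: theta_ext is a fixed target and
    cov_int estimates the covariance of theta_int). Returns the shrunken
    estimate and the weight placed on the internal estimate.
    """
    theta_int = np.asarray(theta_int, dtype=float)
    theta_ext = np.asarray(theta_ext, dtype=float)
    cov_int = np.asarray(cov_int, dtype=float)

    p = theta_int.size
    if p < 3:
        raise ValueError("James-Stein-type shrinkage needs at least 3 parameters")

    # Squared distance between the two estimates, scaled by the
    # precision of the internal estimate.
    diff = theta_int - theta_ext
    dist = float(diff @ np.linalg.solve(cov_int, diff))
    if dist == 0.0:
        # The estimates coincide, so there is nothing to shrink.
        return theta_ext.copy(), 0.0

    # Positive-part weight: close to 1 when the estimates disagree strongly
    # (keep the internal estimate), 0 when they are very close.
    weight = max(0.0, 1.0 - (p - 2) / dist)
    return theta_ext + weight * diff, weight


if __name__ == "__main__":
    internal = [1.0, 2.0, 3.0, 4.0]
    cov = 0.25 * np.eye(4)
    estimate, w = james_stein_shrink(internal, cov, [0.9, 1.9, 2.9, 3.9])
    print(estimate, w)
```

In this sketch the amount of shrinkage is driven by the data: when the internal and external estimates are far apart relative to the internal sampling variability, the weight stays near 1 and the internal estimate is essentially retained, which is the intuition behind protection against population heterogeneity.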
Information
| Preprint No. | SS-2025-0225 |
|---|---|
| Manuscript ID | SS-2025-0225 |
| Complete Authors | Peisong Han, Haoyue Li, Jeremy M. G. Taylor |
| Corresponding Authors | Peisong Han |
| Emails | peisong@umich.edu |
References
- Baranchik, A. (1964). Multiple regression and estimation of the mean of the multivariate normal distribution. Technical Report No. 51, Department of Statistics, Stanford University.
- Chatterjee, N., Y.-H. Chen, P. Maas, and R. J. Carroll (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association 111, 107–117.
- Chen, Z., J. Ning, Y. Shen, and J. Qin (2021). Combining primary cohort data with external aggregate information without assuming comparability. Biometrics 77, 1024–1036.
- Cheng, W., J. M. G. Taylor, P. S. Vokonas, S. K. Park, and B. Mukherjee (2018). Improving estimation and prediction in linear regression incorporating external information from an established reduced model. Statistics in Medicine 37, 1515–1530.
- Estes, J. P., B. Mukherjee, and J. M. G. Taylor (2018). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statistics in Biosciences 10, 568–586.
- George, E. I. (1986). Minimax multiple shrinkage estimation. The Annals of Statistics 14, 188–205.
- Gu, T., J. M. G. Taylor, W. Cheng, and B. Mukherjee (2019). Synthetic data method to incorporate external information into a current study. Canadian Journal of Statistics 47, 580–603.
- Guo, Z., X. Li, L. Han, and T. Cai (2025). Robust inference for federated meta-learning. Journal of the American Statistical Association 120, 1695–1710.
- Han, L., J. Hou, K. Cho, R. Duan, and T. Cai (2025). Federated adaptive causal estimation (FACE) of target treatment effects. Journal of the American Statistical Association 120, 1503–1516.
- Han, P. and J. F. Lawless (2016). Discussion of “constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources”. Journal of the American Statistical Association 111, 118–121.
- Han, P. and J. F. Lawless (2019). Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statistica Sinica 29, 1321–1342.
- Han, P., H. Li, S. Park, B. Mukherjee, and J. M. G. Taylor (2024). Improving prediction of linear regression models by integrating external information from heterogeneous populations: James–Stein estimators. Biometrics 80, 10.1093/biomtc/ujae072.
- Hansen, B. (2015). Shrinkage efficiency bounds. Econometric Theory 31, 860–879.
- Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics 190, 115–132.
- Hector, E. C. and R. Martin (2024). Turning the information-sharing dial: efficient inference from different data sources. Electronic Journal of Statistics 18, 2974–3020.
- Huang, C.-Y., J. Qin, and H.-T. Tsai (2016). Efficient estimation of the Cox model with auxiliary subgroup survival information. Journal of the American Statistical Association 111, 787–799.
- Imbens, G. W. and T. Lancaster (1994). Combining micro and macro data in microeconometric models. Review of Economic Studies 61, 655–680.
- James, W. and C. Stein (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 361–379. University of California Press.
- Ki, Y.-C. F. (1992). Multiple shrinkage estimators in multiple linear regression. Communications in Statistics - Theory and Methods 21, 111–136.
- Kundu, P., R. Tang, and N. Chatterjee (2019). Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika 106, 567–585.
- Li, S., T. Cai, and H. Li (2022). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology 84, 149–173.
- Newey, W. K. and D. L. McFadden (1994). Large Sample Estimation and Hypothesis Testing. Handbook of Econometrics, Vol 4. Amsterdam, The Netherlands: Elsevier Science.
- Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika 87, 484–490.
- Taylor, J. M. G., K. Choi, and P. Han (2023). Data integration - exploiting ratios of parameter estimates from a reduced external model. Biometrika 110, 119–134.
- Thompson, I. M., D. P. Ankerst, C. Chi, P. J. Goodman, C. M. Tangen, M. S. Lucia, Z. Feng, H. L. Parnes, and C. A. Coltman Jr (2006). Assessing prostate cancer risk: results from the prostate cancer prevention trial. Journal of the National Cancer Institute 98, 529–534.
- Tian, Y. and Y. Feng (2023). Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association 118, 2684–2697.
- Tomlins, S. A., J. R. Day, R. J. Lonigro, D. H. Hovelson, J. Siddiqui, L. P. Kunju, R. L. Dunn, S. Meyer, P. Hodge, J. Groskopf, J. T. Wei, and A. M. Chinnaiyan (2016). Urine TMPRSS2:ERG plus PCA3 for individualized prostate cancer risk assessment. European Urology 70, 45–53.
- van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
- Zhai, Y. and P. Han (2022). Data integration with oracle use of external information from heterogeneous populations. Journal of Computational and Graphical Statistics 31, 1001–1012.
- Zhai, Y. and P. Han (2024). Integrating external summary information under population heterogeneity and information uncertainty. Electronic Journal of Statistics 18, 5304–5329.
- Zhang, H., L. Deng, M. Schiffman, J. Qin, and K. Yu (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika 107, 689–703.

Biostatistics Innovation Group, Gilead Sciences
Acknowledgments
We would like to thank the Editor, Associate Editor, and two referees for
their helpful comments that improved the quality of this work. This research was partially supported by National Institutes of Health grants CA-129102
and CA-46592 to Taylor.