Doubly Robust Transfer Learning Under Sub-group Shift for Cohort-level Missing Indicator Covariates

Huali Zhao and Tianying Wang

doi:10.5705/ss.202025.0245

Abstract

Modern biomedical research increasingly relies on integrating multiple cohort studies, yet faces a critical challenge: indicator covariates such as smoking status, vaccination records, or diagnostic codes that are entirely absent in some cohorts due to differences in data collection protocols. This cohort-level missingness violates the assumptions underlying traditional missing data methods, as the complete absence of covariates across entire populations fundamentally differs from sporadic individual-level missingness. To address this gap, we develop a doubly robust transfer learning framework based on a novel sub-group shift assumption, which posits that the conditional distribution of the missing indicator given other variables remains stable across cohorts while allowing marginal distributions to vary. Our approach combines importance weighting with imputation in augmented estimating equations, achieving robustness to misspecification of either the density ratio model or the imputation model. We establish that the proposed estimator is $n^{1/2}$-consistent and asymptotically normal under mild regularity conditions. Through extensive simulations and an application to UK Biobank data, we demonstrate superior performance compared to existing approaches. This work provides a rigorous framework for handling cohort-level missing indicators, addressing a pervasive challenge in large-scale biomedical data integration.

Key words and phrases: Completely missing, distribution shift, importance weighting, model heterogeneity
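
The double robustness property described in the abstract can be illustrated with a minimal simulation sketch. This is not the authors' estimator: it assumes a deliberately simplified setting in which the target quantity is the target-cohort mean of a binary indicator Z, the density ratio is known in closed form, and the conditional model P(Z = 1 | X) is shared across cohorts. All names (`dr_mean`, `true_ratio`, `wrong_impute`, and so on) are hypothetical and chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000


def expit(x):
    """Logistic function, used here as the shared model P(Z = 1 | X = x)."""
    return 1.0 / (1.0 + np.exp(-x))


# Source cohort: X ~ N(0, 1) and the indicator Z is observed.
x_src = rng.normal(0.0, 1.0, n)
z_src = rng.binomial(1, expit(x_src))

# Target cohort: X ~ N(0.5, 1). Z is simulated only to benchmark the
# estimators; the estimators themselves never see it.
x_tgt = rng.normal(0.5, 1.0, n)
z_tgt_hidden = rng.binomial(1, expit(x_tgt))
truth = z_tgt_hidden.mean()


def dr_mean(weight, impute):
    """Augmented estimate of E_target[Z]: weighted source residuals
    plus the average imputed value over the target cohort."""
    residual_term = (weight(x_src) * (z_src - impute(x_src))).mean()
    imputation_term = impute(x_tgt).mean()
    return residual_term + imputation_term


def true_ratio(x):
    # Density ratio of N(0.5, 1) to N(0, 1).
    return np.exp(0.5 * x - 0.125)


def wrong_ratio(x):
    # Misspecified: ignores the covariate shift entirely.
    return np.ones_like(x)


def wrong_impute(x):
    # Misspecified: ignores the dependence of Z on X.
    return np.full_like(x, 0.5)


naive = z_src.mean()                          # source mean, no correction
dr_w_ok = dr_mean(true_ratio, wrong_impute)   # correct weights only
dr_m_ok = dr_mean(wrong_ratio, expit)         # correct imputation only

print(f"benchmark target mean : {truth:.3f}")
print(f"naive source mean     : {naive:.3f}")
print(f"DR, weights correct   : {dr_w_ok:.3f}")
print(f"DR, imputation correct: {dr_m_ok:.3f}")
```

In this toy setting the augmented estimate should stay close to the benchmark when either nuisance component is correct, while the uncorrected source mean should not. The paper's framework addresses the harder problem of regression with a covariate that is missing for an entire cohort.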

Information

Preprint No.	SS-2025-0245
Manuscript ID	SS-2025-0245
Complete Authors	Huali Zhao, Tianying Wang
Corresponding Authors	Tianying Wang
Emails	tianyingw0905@outlook.com

References

Allen, N. E., B. Lacey, D. A. Lawlor, J. P. Pell, J. Gallacher, L. Smeeth, P. Elliott, P. M. Matthews, R. A. Lyons, A. D. Whetton, A. Lucassen, M. E. Hurles, M. Chapman, A. W. Roddam, N. K. Fitzpatrick, A. L. Hansell, R. Hardy, R. E. Marioni, V. B. O’Donnell, J. Williams, C. M. Lindgren, M. Effingham, J. Sellors, J. Danesh, and R. Collins (2024). Prospective study design and data analysis in UK Biobank. Science Translational Medicine 16(729), eadf4428.
Amatruda, J. M., M. C. Statt, and S. L. Welle (1993). Total and resting energy expenditure in obese women reduced to ideal body weight. The Journal of Clinical Investigation 92(3), 1236–1242.
Arem, H., J. Reedy, J. Sampson, L. Jiao, A. R. Hollenbeck, H. Risch, S. T. Mayne, and R. Z. Stolzenberg-Solomon (2013). The Healthy Eating Index 2005 and risk for pancreatic cancer in the NIH–AARP study. Journal of the National Cancer Institute 105(17), 1298–1305.
Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61(4), 962–973.
Blankers, M., E. S. Smit, P. van der Pol, H. de Vries, C. Hoving, and M. van Laar (2016). The missing=smoking assumption: a fallacy in internet-based smoking cessation trials? Nicotine & Tobacco Research 18(1), 25–33.
Bycroft, C., C. Freeman, D. Petkova, G. Band, L. T. Elliott, K. Sharp, A. Motyer, D. Vukcevic, O. Delaneau, J. O’Connell, et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562(7726), 203–209.
Cai, T., M. Li, and M. Liu (2025). Semi-supervised triply robust inductive transfer learning. Journal of the American Statistical Association 120(550), 1037–1047.
Dare, S., D. F. Mackay, and J. P. Pell (2015). Relationship between smoking and obesity: a cross-sectional study of 499,504 middle-aged adults in the UK general population. PLoS One 10(4), e0123579.
Denny, J. C., J. L. Rutter, D. B. Goldstein, A. Philippakis, J. W. Smoller, G. Jenkins, E. Dishman, J. L. McCauley, and All of Us Research Program Investigators (2019). The “All of Us” Research Program. The New England Journal of Medicine 381(7), 668–676.
Desai, R. J., D. H. Solomon, N. Shadick, C. Iannaccone, and S. C. Kim (2016). Identification of smoking using medicare data—a validation study of claims-based algorithms. Pharmacoepidemiology and Drug Safety 25(4), 472–475.
Gaziano, J. M., J. Concato, M. Brophy, L. Fiore, S. Pyarajan, J. Breeling, S. Whitbourne, J. Deen, C. Shannon, D. Humphries, et al. (2016). Million Veteran Program: A megabiobank to study genetic influences on health and disease. Journal of Clinical Epidemiology 70, 214–223.
Guh, D. P., W. Zhang, N. Bansback, Z. Amarsi, C. L. Birmingham, and A. H. Anis (2009). The incidence of co-morbidities related to obesity and overweight: a systematic review and meta-analysis. BMC Public Health 9, 88.
Hedeker, D., R. J. Mermelstein, and H. Demirtas (2007). Analysis of binary outcomes with missing data: missing=smoking, last observation carried forward, and a little multiple imputation. Addiction 102(10), 1564–1573.
Kennedy, E. H., J. A. Mauro, M. J. Daniels, N. Burns, and D. S. Small (2019). Handling missing data in instrumental variable methods for causal inference. Annual Review of Statistics and Its Application 6(1), 125–148.
Kpotufe, S. and G. Martinet (2021). Marginal singularity and the benefits of labels in covariate-shift. The Annals of Statistics 49(6), 3299–3323.
Lee, S.-h., Y. Ma, and J. Zhao (2025). Doubly flexible estimation under label shift. Journal of the American Statistical Association 120(549), 278–290.
Li, S. and L. Zhang (2025). Multi-dimensional domain generalization with low-rank structures. Journal of the American Statistical Association 120(552), 2522–2534.
Li, Y., Y. Wei, and M. Liu (2024). Adaptive learning with blockwise missing and semi-supervised data. arXiv preprint arXiv:2405.18722.
Liu, M., Y. Zhang, K. P. Liao, and T. Cai (2023). Augmented transfer regression learning with semi-non-parametric nuisance models. Journal of Machine Learning Research 24(293), 1–50.
Millard, L. A., M. R. Munafò, K. Tilling, R. E. Wootton, and G. Davey Smith (2019). MR-pheWAS with stratification and interaction: searching for the causal effects of smoking heaviness identified an effect on facial aging. PLoS Genetics 15(10), e1008353.
Plassier, V., M. Makni, A. Rubashevskii, E. Moulines, and M. Panov (2023). Conformal prediction for federated uncertainty quantification under label shift. In Proceedings of the 40th International Conference on Machine Learning, pp. 27907–27947. PMLR.
Reich, C., A. Ostropolets, P. Ryan, P. Rijnbeek, M. Schuemie, A. Davydov, D. Dymshyts, and G. Hripcsak (2024). OHDSI standardized vocabularies—a large-scale centralized reference ontology for international data harmonization. Journal of the American Medical Informatics Association 31(3), 583–590.
Reinau, D., C. Surber, S. S. Jick, and C. R. Meier (2014). Epidemiology of basal cell carcinoma in the United Kingdom: incidence, lifestyle factors, and comorbidities. British Journal of Cancer 111(1), 203–206.
Schulz, M.-A., B. T. T. Yeo, J. T. Vogelstein, J. Mourao-Miranda, J. N. Kather, K. Kording, B. Richards, and D. Bzdok (2020). Different scaling of linear models and deep learning in UKBiobank brain images versus machine-learning datasets. Nature Communications 11(1), 4238.
van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge University Press.
Varewyck, M., S. Vansteelandt, M. Eriksson, and E. Goetghebeur (2016). On the practice of ignoring center-patient interactions in evaluating hospital performance. Statistics in Medicine 35(2), 227–238.
Zhai, Y. and P. Han (2022). Data integration with oracle use of external information from heterogeneous populations. Journal of Computational and Graphical Statistics 31(4), 1001–1012.
Zhou, D., M. Li, T. Cai, and M. Liu (2024). Model-assisted and knowledge-guided transfer regression for the underrepresented population. arXiv preprint arXiv:2410.06484.
Zhou, D., M. Liu, M. Li, and T. Cai (2025). Doubly robust augmented model accuracy transfer inference with high dimensional features. Journal of the American Statistical Association 120(549), 524–534.

Huali Zhao, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China

Acknowledgments

The authors thank the Editor, Associate Editor, and reviewers for insightful and constructive comments, which have greatly helped improve the paper. We thank all participants and researchers who contributed to the UK Biobank datasets (www.ukbiobank.ac.uk). This research has been conducted using the UK Biobank Resource under Application Number 207159.

Supplementary Materials

The online Supplementary Material contains the potential nuisance models, the cross-fitted version of the proposed method, detailed proofs of all technical results, and extended simulation and data analysis results. The R code is provided on GitHub at: https://github.com/tianyingw/DRTL-comb.

Supplementary materials are available for download.
