Abstract
Modern biomedical research increasingly relies on integrating multiple
cohort studies, yet faces a critical challenge: indicator covariates such as smoking
status, vaccination records, or diagnostic codes that are entirely absent in some
cohorts due to differences in data collection protocols. This cohort-level missingness violates the assumptions underlying traditional missing data methods, as
the complete absence of covariates across entire populations fundamentally differs from sporadic individual-level missingness. To address this gap, we develop
a doubly robust transfer learning framework based on a novel sub-group shift
assumption, which posits that the conditional distribution of the missing indicator given other variables remains stable across cohorts while allowing marginal
distributions to vary. Our approach combines importance weighting with imputation in augmented estimating equations, achieving robustness to misspecification of either the density ratio model or the imputation model. We establish
that the proposed estimator is $n^{1/2}$-consistent and asymptotically normal under
mild regularity conditions. Through extensive simulations and an application to
UK Biobank data, we demonstrate superior performance compared to existing
approaches. This work provides a rigorous framework for handling cohort-level
missing indicators, addressing a pervasive challenge in large-scale biomedical data
integration.
ORCID IDs: Huali Zhao: 0009-0001-2358-8113; Tianying Wang: 0000-0002-2826-5364
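
To make the double robustness property described in the abstract concrete, the following is a minimal Python sketch of a generic augmented (importance weighting plus imputation) estimator. It is not the authors' estimator or their R implementation. It assumes a toy setting in which a binary indicator `a` is observed only in a source cohort, a single covariate `x` is observed in both cohorts, and the conditional distribution of `a` given `x` is the same in both cohorts. All names (`dr_target_mean`, `weight_fn`, `impute_fn`) and the simulated distributions are hypothetical choices made for illustration.

```python
import numpy as np


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def dr_target_mean(a_src, x_src, x_tgt, weight_fn, impute_fn):
    """Augmented estimate of the target-cohort mean of an indicator
    that is observed only in the source cohort.

    weight_fn: working model for the density ratio p_target(x) / p_source(x)
    impute_fn: working model for E[a | x]
    """
    # Imputation term: average the working imputation over the target cohort.
    imputation_term = impute_fn(x_tgt).mean()
    # Augmentation term: importance-weighted source residuals.
    augmentation_term = np.mean(weight_fn(x_src) * (a_src - impute_fn(x_src)))
    return imputation_term + augmentation_term


# Toy data (assumed for illustration): covariate shift between cohorts,
# with the same conditional law of the indicator given x in both.
rng = np.random.default_rng(0)
n = 100_000
x_src = rng.normal(0.0, 1.0, n)   # source cohort covariate
x_tgt = rng.normal(0.5, 1.0, n)   # target cohort covariate
a_src = rng.binomial(1, expit(-0.5 + x_src))  # indicator, source only

# Correct and deliberately misspecified nuisance models.
true_weight = lambda x: np.exp(0.5 * x - 0.125)  # N(0.5, 1) / N(0, 1) density ratio
true_impute = lambda x: expit(-0.5 + x)
wrong_weight = lambda x: np.ones_like(x)         # ignores the shift
wrong_impute = lambda x: np.full_like(x, 0.3)    # ignores x

# In this design the target mean is exactly 0.5 by symmetry of expit.
est_wrong_impute = dr_target_mean(a_src, x_src, x_tgt, true_weight, wrong_impute)
est_wrong_weight = dr_target_mean(a_src, x_src, x_tgt, wrong_weight, true_impute)
est_both_wrong = dr_target_mean(a_src, x_src, x_tgt, wrong_weight, wrong_impute)

print("correct weights, wrong imputation:", est_wrong_impute)
print("wrong weights, correct imputation:", est_wrong_weight)
print("both models wrong:", est_both_wrong)
```

In this toy design the first two estimates should land close to the target value of 0.5, because one of the two working models is correct in each case. The third estimate reduces to the source-cohort mean and is biased for the target.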
Information
| Preprint No. | SS-2025-0245 |
|---|---|
| Manuscript ID | SS-2025-0245 |
| Complete Authors | Huali Zhao, Tianying Wang |
| Corresponding Authors | Tianying Wang |
| Emails | tianyingw0905@outlook.com |

Huali Zhao, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China;
Acknowledgments
The authors thank the Editor, Associate Editor, and reviewers for insightful and constructive comments, which have greatly helped improve the paper. We thank all participants and researchers who contributed to the UK
Biobank datasets (www.ukbiobank.ac.uk).
Supplementary Materials
The supplementary materials include the potential nuisance models, the cross-fitted version of the proposed method, detailed proofs of all technical results,
and extended simulation and data analysis results. The R code is provided
on GitHub at: https://github.com/tianyingw/DRTL-comb.