Abstract
Data collection costs can vary widely across variables in data science tasks. Two-phase designs
can be employed to save data collection costs. This paper considers two-phase studies in which inexpensive variables are collected for all subjects in the first phase, and expensive variables are measured for a
subsample of subjects in the second phase based on a predetermined sampling rule. The estimation efficiency under two-phase designs relies heavily on the sampling rule. Existing literature primarily focuses on
designing sampling rules for estimating a scalar parameter in certain parametric models or specific estimating problems. However, in real-world scenarios the model is usually unknown, and two-phase designs are needed
for model-free estimation of a scalar or multi-dimensional parameter. This paper proposes a maximin criterion to design an optimal sampling rule based on semiparametric efficiency bounds. The proposed method
is model-free and applicable to general estimating problems. The resulting sampling rule can minimize the
semiparametric efficiency bound when the parameter is scalar and improve the bound for every component
when the parameter is multi-dimensional. Simulation studies demonstrate that the proposed designs reduce
the variance of the resulting estimator in various settings. The implementation of the proposed design is
illustrated in a real data analysis.
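To make the role of the sampling rule concrete, the following is a minimal sketch of a textbook special case, not the maximin criterion proposed in the paper. It assumes the target is the mean of an expensive variable Y, that the first-phase variable X is discrete, and that the conditional standard deviations of Y given X are known. Under these assumptions the design-dependent part of the semiparametric efficiency bound is E[Var(Y | X) / π(X)], and for a fixed expected sampling fraction it is minimized by sampling probabilities proportional to the conditional standard deviation. The function names `design_term` and `proportional_rule` and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def design_term(p: Sequence[float], sigma: Sequence[float],
                pi: Sequence[float]) -> float:
    """Design-dependent variance term E[sigma(X)^2 / pi(X)] for discrete X.

    p[k]     : P(X = k), known from the first phase
    sigma[k] : sd of Y given X = k (assumed known for this illustration)
    pi[k]    : second-phase sampling probability for subjects with X = k
    """
    return sum(pk * sk ** 2 / pik for pk, sk, pik in zip(p, sigma, pi))


def proportional_rule(p: Sequence[float], sigma: Sequence[float],
                      budget: float) -> List[float]:
    """Sampling probabilities proportional to sigma with E[pi(X)] = budget.

    This sketch does not handle the case where a probability would
    exceed 1 (which requires truncation); it raises an error instead.
    """
    scale = budget / sum(pk * sk for pk, sk in zip(p, sigma))
    pi = [scale * sk for sk in sigma]
    if any(pik <= 0.0 or pik > 1.0 for pik in pi):
        raise ValueError("rule leaves (0, 1]; truncation is not implemented")
    return pi


# Hypothetical setting: two equally likely strata, the second much noisier.
p = [0.5, 0.5]
sigma = [1.0, 3.0]
budget = 0.2  # expected fraction of subjects measured in the second phase

uniform = [budget] * len(p)
proportional = proportional_rule(p, sigma, budget)

print("uniform rule:     ", design_term(p, sigma, uniform))
print("proportional rule:", design_term(p, sigma, proportional))
```

In this hypothetical setting the design-dependent term is about 25 under uniform sampling and about 20 under the proportional rule, at the same expected second-phase cost. The sketch covers only a scalar mean with known conditional variances; it does not address the model-free, multi-dimensional setting that the paper targets.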
Information
| Field | Value |
|---|---|
| Preprint No. | SS-2024-0359 |
| Manuscript ID | SS-2024-0359 |
| Complete Authors | Ruoyu Wang, Qihua Wang, Wang Miao |
| Corresponding Authors | Qihua Wang |
| Emails | qhwang@amss.ac.cn |

Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 55 Zhongguancun East Road, Haidian District, Beijing 100190, China
Acknowledgments
We would like to thank the editor, associate editor, and the anonymous reviewer for their
very insightful and helpful comments, which led to a significant improvement of our paper. Wang’s research was supported by the National Natural Science Foundation of China
(General program 12271510), and a grant from the Key Lab of Random Complex Structure
and Data Science, CAS. Miao’s research was supported by the National Key R&D Program
(2022YFA1008100) and the National Natural Science Foundation of China (General program
12071015).
Supplementary Materials
The supplementary materials contain proofs of the theoretical results; a nonparametric procedure
to estimate the conditional mean and variance simultaneously; a description of the one-step
estimation method based on the data collected in the two phases and the pilot sample; and
additional simulation results.