Abstract
Missing data is a pervasive problem in regression analysis, compromising
the accuracy and efficiency of parameter estimates. This paper focuses
on the challenging scenario of missing not at random (MNAR) data, where the
missingness of a value is linked to the value itself.
Traditional approaches to
addressing MNAR data confront a trade-off: imposing stringent assumptions
about the missingness mechanism can enhance efficiency but curtail robustness,
whereas accommodating model misspecification can bolster robustness but at the
expense of efficiency. In addition, assuming a nonparametric MNAR mechanism
leads to model identifiability issues. We propose a novel approach that overcomes these limitations. First, we address the model identifiability issue using
a shadow variable. Then, by leveraging the sieve method, we model the
MNAR mechanism nonparametrically. This approach achieves the best of both
worlds: it gains robustness by avoiding strict assumptions about the missingness
mechanism while simultaneously achieving the semiparametric efficiency bound
for the parameter of interest (meaning our estimator attains the lowest possible
asymptotic variance among regular estimators). The paper develops the theoretical framework, outlining
conditions for identifiability, constructing the semiparametric likelihood function,
and rigorously proving the estimator’s semiparametric efficiency. Additionally,
we present an EM-type algorithm for practical implementation, discussing the
E-step and M-step iterations and variance estimation methods. Finally, simulations and a real-data application demonstrate the effectiveness of our proposed
method compared to existing approaches.
Information
| Preprint No. | SS-2024-0204 |
|---|---|
| Manuscript ID | SS-2024-0204 |
| Complete Authors | Qinglong Tian, Donglin Zeng, Jiwei Zhao |
| Corresponding Authors | Jiwei Zhao |
| Emails | jiwei.zhao@wisc.edu |
Acknowledgments
Tian is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2023-03479. Zeng is supported
in part by the U.S. National Institutes of Health (R01HL173128). Zhao
is supported in part by the U.S. National Science Foundation (DMS 1953526,
2122074 and 2310942), the U.S. National Institutes of Health (R01DC021431),
and the American Family Funding Initiative of UW-Madison.
Supplementary Materials
In the online supplementary material, Section S1 provides proofs for all the
lemmas and theorems in the main paper, and Section S2 provides more
details on the Gauss-Hermite quadrature used in the main paper.
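
For readers unfamiliar with the technique, the following is a minimal sketch of Gauss-Hermite quadrature in general, not the specific integrals treated in Section S2. It assumes the quantity of interest is an expectation E[g(Z)] with Z normally distributed; the function name `gauss_hermite_expectation` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def gauss_hermite_expectation(g, mu=0.0, sigma=1.0, n_nodes=20):
    """Approximate E[g(Z)] for Z ~ N(mu, sigma^2).

    Illustrative helper (not from the paper). hermgauss returns nodes x_i
    and weights w_i for integrals of the form
        int exp(-x^2) f(x) dx  ~  sum_i w_i f(x_i).
    The change of variables z = mu + sqrt(2) * sigma * x turns the normal
    expectation into that form, up to a factor 1 / sqrt(pi).
    """
    nodes, weights = hermgauss(n_nodes)
    z = mu + np.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights * g(z)) / np.sqrt(np.pi))


# E[Z^2] for Z ~ N(1, 2^2) equals mu^2 + sigma^2 = 5.
second_moment = gauss_hermite_expectation(lambda z: z**2, mu=1.0, sigma=2.0)

# E[exp(Z)] for Z ~ N(0, 1) equals exp(1/2).
mgf_at_one = gauss_hermite_expectation(np.exp)
```

An n-node rule integrates polynomials of degree up to 2n - 1 exactly against the Gaussian weight, so a modest number of nodes is usually enough for smooth integrands; the appropriate number in any given application is a tuning choice.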