Identification and Efficient Estimation in Regression Analysis with Response Missing Not At Random

Qinglong Tian, Donglin Zeng and Jiwei Zhao

doi:10.5705/ss.202024.0204

Abstract

Missing data is a pervasive problem in regression analysis, compromising the accuracy and efficiency of parameter estimates. This paper focuses on the challenging scenario of missing not at random (MNAR) data, where the missingness of a value is linked to the value itself. Traditional approaches to addressing MNAR data confront a trade-off: imposing stringent assumptions about the missingness mechanism can enhance efficiency but curtail robustness, whereas accommodating model misspecification can bolster robustness but at the expense of efficiency. In addition, assuming a nonparametric MNAR mechanism leads to model identifiability issues. We propose a novel approach that overcomes these limitations. First, we address the model identifiability issue using a shadow variable. Then, by leveraging the sieve method, we model the MNAR mechanism nonparametrically. This approach achieves the best of both worlds: it gains robustness by avoiding strict assumptions about the missingness mechanism while simultaneously achieving the semiparametric efficiency bound for the parameter of interest (meaning our estimator has the lowest possible asymptotic variance). The paper develops the theoretical framework, outlining conditions for identifiability, constructing the semiparametric likelihood function, and rigorously proving the estimator's semiparametric efficiency. Additionally, we present an EM-type algorithm for practical implementation, discussing the E-step and M-step iterations and variance estimation methods. Finally, simulations and a real-data application demonstrate the effectiveness of our proposed method compared to existing approaches.
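To make the MNAR setting concrete, the following is a minimal simulation sketch, not the estimator proposed in the paper. It assumes a hypothetical linear model with true slope 2 and a hypothetical logistic missingness mechanism in which larger responses are less likely to be observed; all names and parameter values are illustrative. It shows that a complete-case least-squares fit can be noticeably biased when missingness depends on the response itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating model: y = 1 + 2 x + noise (true slope is 2).
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Assumed MNAR mechanism: the chance of observing y depends on y itself.
# Here P(observed | y) = 1 / (1 + exp(y)), so large responses go missing more often.
p_obs = 1.0 / (1.0 + np.exp(y))
observed = rng.random(n) < p_obs


def ols_slope(x_vals, y_vals):
    """Least-squares slope of y on x (np.polyfit returns [slope, intercept])."""
    slope, _intercept = np.polyfit(x_vals, y_vals, deg=1)
    return slope


full_slope = ols_slope(x, y)                      # uses all responses (infeasible in practice)
cc_slope = ols_slope(x[observed], y[observed])    # complete-case analysis

print(f"full-data slope:     {full_slope:.3f}")
print(f"complete-case slope: {cc_slope:.3f}")
```

In this sketch the full-data fit recovers a slope close to 2, while the complete-case fit is attenuated, which is the kind of bias that motivates modeling the missingness mechanism.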

Key words and phrases: Missing Data; Missing Not at Random; Identification; Semiparametric Efficiency; Efficient Estimation; Method of Sieves

Information

Preprint No.	SS-2024-0204
Manuscript ID	SS-2024-0204
Complete Authors	Qinglong Tian, Donglin Zeng, Jiwei Zhao
Corresponding Authors	Jiwei Zhao
Emails	jiwei.zhao@wisc.edu

Acknowledgments

Tian is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2023-03479. Zeng is supported in part by U.S. National Institutes of Health (R01HL173128). Zhao is supported in part by U.S. National Science Foundation (DMS 1953526, 2122074 and 2310942), U.S. National Institutes of Health (R01DC021431) and the American Family Funding Initiative of UW-Madison.

Supplementary Materials

In the online supplementary material, Section S1 provides proofs for all the lemmas and theorems in the main paper, and Section S2 provides more details on the Gauss-Hermite quadrature used in the main paper.
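For readers unfamiliar with Gauss-Hermite quadrature, here is a minimal generic sketch of the rule. This page does not state which integrals the paper evaluates with it, so the example assumes a simple illustrative task: approximating an expectation under a normal distribution. The function name `normal_expectation` and its parameters are hypothetical.

```python
import numpy as np


def normal_expectation(g, mu, sigma, n_nodes=20):
    """Approximate E[g(Y)] for Y ~ N(mu, sigma^2) by Gauss-Hermite quadrature."""
    # Nodes t_i and weights w_i for integrals of the form
    # int exp(-t^2) f(t) dt ~= sum_i w_i f(t_i).
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Substituting y = mu + sqrt(2) * sigma * t turns the normal density
    # into exp(-t^2) / sqrt(pi).
    y = mu + np.sqrt(2.0) * sigma * nodes
    return np.sum(weights * g(y)) / np.sqrt(np.pi)


# Check against a known value: E[Y^2] = mu^2 + sigma^2 = 5 for mu = 1, sigma = 2.
second_moment = normal_expectation(lambda y: y**2, mu=1.0, sigma=2.0)
print(second_moment)
```

The rule is exact (up to floating-point error) for polynomial integrands of degree at most `2 * n_nodes - 1`; for other smooth integrands, accuracy generally improves as `n_nodes` grows.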

Supplementary materials are available for download.
