Abstract
Learning causal relationships among a set of variables, as encoded by a directed
acyclic graph, from observational data is complicated by the presence of unobserved confounders. Instrumental variables (IVs) are a popular remedy for this issue, but most
existing methods either assume the validity of all IVs or postulate a specific form of relationship, such as a linear model, between the primary variables and the IVs. To overcome
these limitations, we introduce a partially linear model for causal discovery and inference that accommodates potentially invalid IVs and allows for general dependence of
the primary variables on the IVs. We establish identification under this semiparametric
model by constructing surrogate valid IVs, and develop a finite-sample procedure for
estimating the causal structures and effects. Theoretically, we show that our procedure
consistently learns the causal structures, yields asymptotically normal estimates, and
effectively controls the false discovery rate in edge recovery. Simulation studies demonstrate the superiority of our method over existing competitors, and an application to
inferring gene regulatory networks in Alzheimer’s disease illustrates its usefulness.
Information
| Field | Value |
|---|---|
| Preprint No. | SS-2025-0331 |
| Manuscript ID | SS-2025-0331 |
| Complete Authors | Jing Zou, Wei Li, Wei Lin |
| Corresponding Authors | Wei Li |
| Emails | weilistat@ruc.edu.cn |
Acknowledgments
We sincerely thank the editor, associate editor, and two reviewers for their valuable comments, which led to a significant improvement of our paper. Zou and
Lin’s research was supported by the National Natural Science Foundation of China
(12171012, 12292980, and 12292981). Li’s research was supported by the National
Natural Science Foundation of China (12471269) and the National Key R&D Program
of China (2022YFA1008100). The public computing cloud of Renmin University of China was used to perform the simulations and data analysis.
Supplementary Materials
The Supplementary Material includes examples, proofs of the theoretical results,
details on simulation settings, additional simulation studies, and additional analysis results for the application.