Abstract
Learning directionality between variables is crucial yet challenging, especially for mechanistic
relationships without a priori ordering assumptions. We propose a coefficient of asymmetry that quantifies
directional asymmetry using Shannon’s entropy within a generative exposure mapping (GEM) framework.
GEMs arise from experiments in which a generative function g maps an exposure X to an outcome Y
through Y = g(X); noise-perturbed GEMs (NPGEMs) extend this to Y = g(X) + ϵ. Our approach considers a
rich class of generative functions while providing statistical inference for uncertainty quantification,
a gap in existing bivariate causal discovery techniques. We establish large-sample theoretical guarantees
through data-splitting and cross-fitting techniques, and implement fast Fourier transform-based
density estimation to avoid parameter tuning. The methodology accommodates contamination in
outcome measurements. Extensive simulations demonstrate superior performance compared to competing
causal discovery methods. Applied to epigenetic data examining the relationship between DNA methylation
and blood pressure, our method unveils novel pathways for the cardiovascular disease genes FGF5 and HSD11B2.
This framework serves as a discovery tool for improving the rigor of scientific research, with GEM-induced
asymmetry representing a low-dimensional imprint of underlying causality.
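To make the noise-perturbed GEM setup concrete, the following minimal sketch simulates Y = g(X) + ϵ and compares plug-in estimates of the marginal Shannon (differential) entropies of X and Y. Everything specific here is an assumption for illustration: the choice g(x) = x³, the uniform exposure, the noise level, the histogram estimator, and the helper names `simulate_npgem` and `histogram_entropy`. In particular, the entropy gap computed below is not the paper's coefficient of asymmetry, whose definition is not given in this abstract, and the histogram estimator stands in for the paper's FFT-based density estimation.

```python
import numpy as np


def simulate_npgem(n, g, noise_sd, rng):
    """Draw (X, Y) from a noise-perturbed GEM: Y = g(X) + eps.

    Assumption for illustration: X ~ Uniform(0, 1), eps ~ Normal(0, noise_sd^2).
    """
    x = rng.uniform(0.0, 1.0, size=n)
    y = g(x) + rng.normal(0.0, noise_sd, size=n)
    return x, y


def histogram_entropy(sample, bins=50):
    """Plug-in estimate of differential entropy (in nats) from a histogram.

    Uses -sum_k p_k * log(p_k / w_k), where p_k is the share of the sample
    in bin k and w_k is the bin width.
    """
    counts, edges = np.histogram(sample, bins=bins)
    widths = np.diff(edges)
    probs = counts / counts.sum()
    nonzero = probs > 0  # empty bins contribute nothing to the sum
    return -np.sum(probs[nonzero] * np.log(probs[nonzero] / widths[nonzero]))


rng = np.random.default_rng(0)
x, y = simulate_npgem(n=20000, g=lambda t: t**3, noise_sd=0.01, rng=rng)

h_x = histogram_entropy(x)
h_y = histogram_entropy(y)

# Illustrative asymmetry only: difference of the two marginal entropy estimates.
entropy_gap = h_x - h_y
print(f"H(X) estimate: {h_x:.3f}, H(Y) estimate: {h_y:.3f}, gap: {entropy_gap:.3f}")
```

In this toy setting the contracting map g(x) = x³ concentrates the outcome near zero, so the estimated entropy of Y falls below that of X. This is the kind of marginal asymmetry a GEM can induce; the paper's inference procedure (data splitting, cross-fitting, uncertainty quantification) is not reproduced here.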
Information
| Preprint No. | SS-2025-0236 |
|---|---|
| Manuscript ID | SS-2025-0236 |
| Complete Authors | Soumik Purkayastha, Peter Xuekun Song |
| Corresponding Authors | Soumik Purkayastha |
| Emails | soumik@pitt.edu |
Acknowledgments
This work is supported by NSF DMS-2113564 and NIH R01ES033656 (for Song), and the
University of Michigan Rackham Predoctoral Fellowship (for Purkayastha). This research
was further supported in part by the University of Pittsburgh Center for Research Computing
and Data, RRID:SCR_022735, through the resources provided. Specifically, this work used
the HTC cluster, which is supported by NIH award number S10OD028483.
Supplementary Materials
• Section I: Proofs.
• Section II: Behavior of $\hat{C}_{X \to Y}$ in NPGEMs.
• Section III: Methylation Data Application.
• Section IV: Data-splitting and Cross-fitting.
• Section V: Resolving ambiguity in causal direction (X → Y or Y → X) when the nature of the generative function is unknown.
• Section VI: Diagnostics.