Abstract

A basic task in causal inference is to determine whether a cause-effect relationship exists between two sets of variables, akin to a binary classification problem. Given a sequence of independent and identically distributed paired vectors, one can use the kernel mean embedding of probability distributions to map empirical distributions into a reproducing kernel Hilbert space and then train a classifier in that feature space to predict the causal direction for future pairs. This strategy, however, is vulnerable to label noise (mislabeling), a common issue in causation studies. In this paper, we analyze and quantify mislabeling effects. We develop a valid learning method that explicitly accounts for label noise and establish theoretical results accordingly.
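The classification strategy described above can be illustrated with a minimal simulation. The sketch below is not the paper's method: it approximates the kernel mean embedding with random Fourier features for a Gaussian kernel, simulates cause-effect pairs from a hypothetical nonlinear additive-noise mechanism (the tanh model, sample sizes, and ridge classifier are all illustrative assumptions), embeds each empirical distribution, and fits a linear classifier on the embeddings to predict the causal direction:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pair(n, direction, rng):
    """Simulate n i.i.d. points from a nonlinear additive-noise model.
    direction = +1 encodes X -> Y; direction = -1 encodes Y -> X
    (the same mechanism with the coordinates swapped)."""
    x = rng.normal(size=n)
    y = np.tanh(x) + 0.3 * rng.normal(size=n)
    pts = np.column_stack([x, y] if direction == 1 else [y, x])
    return (pts - pts.mean(axis=0)) / pts.std(axis=0)  # standardize each pair

def mean_embedding(pts, W, b):
    """Empirical kernel mean embedding of a sample, approximated with
    random Fourier features for an RBF kernel: apply the feature map
    to each point and average over the sample."""
    Z = np.sqrt(2.0 / W.shape[1]) * np.cos(pts @ W + b)
    return Z.mean(axis=0)

D = 200                                    # number of random features
W = rng.normal(size=(2, D))                # RBF frequency draws
b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phase offsets

# Training set: each "example" is one embedded cause-effect pair
# together with its +/-1 causal-direction label.
train_dirs = rng.choice([-1, 1], size=300)
X_train = np.stack([mean_embedding(simulate_pair(200, d, rng), W, b)
                    for d in train_dirs])

# Linear classifier in the feature space, fit by ridge regression
# on the +/-1 direction labels.
lam = 1e-3
w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(D),
                    X_train.T @ train_dirs)

# Accuracy on freshly simulated pairs.
test_dirs = rng.choice([-1, 1], size=100)
X_test = np.stack([mean_embedding(simulate_pair(200, d, rng), W, b)
                   for d in test_dirs])
accuracy = np.mean(np.sign(X_test @ w) == test_dirs)
```

This sketch assumes the training labels are correct; the point of the paper is precisely that, in practice, the direction labels attached to observed pairs may themselves be noisy, which degrades any classifier trained this way unless the mislabeling is explicitly accounted for.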

Information

Preprint No.: SS-2023-0202
Manuscript ID: SS-2023-0202
Complete Authors: Pingbo Hu, Grace Y. Yi
Corresponding Author: Grace Y. Yi
Email: gyi5@uwo.ca


Acknowledgments

Yi is a Tier 1 Canada Research Chair in Data Science. Her research was supported by the Canada Research Chairs Program and the Natural Sciences and Engineering Research Council of Canada (NSERC).

Supplementary Materials

The online Supplementary Material contains additional theorems, detailed technical derivations, extended numerical studies, and supporting material for the manuscript. Supplementary materials are available for download.