A Robust Framework for Graph-Based Two-Sample Tests Using Weights

Yichuan Bai and Lynna Chu

doi:10.5705/ss.202023.0408

Abstract

Graph-based tests are a class of non-parametric two-sample tests useful for analyzing high

dimensional data. The test statistics are constructed from similarity graphs (such as K-minimum

spanning tree), and consequently, their performance is sensitive to the structure of the graph. When

the graph has problematic structures (for example, hubs), as is common for high-dimensional data,

this can result in low power and unstable performance among existing graph-based tests.

address this challenge by proposing new test statistics that are robust to problematic structures of

the graph and can provide reliable inferences. We employ an edge-weighting strategy using intrinsic

characteristics of the graph that are computationally simple and efficient to obtain. The limiting

null distribution of the robust test statistics is derived and shown to work well for finite sample

sizes. Simulation studies and data analysis of Chicago taxi-trip travel patterns demonstrate the

new tests’ improved performance across a range of settings.

Key words and phrases: curse of dimensionality, graph-based tests, high-dimensional data, non- parametric tests, robustness, similarity graphs

Information

Preprint No.	SS-2023-0408
Manuscript ID	SS-2023-0408
Complete Authors	Yichuan Bai, Lynna Chu
Corresponding Authors	Yichuan Bai
Emails	ycbai@iastate.edu

References

Banerjee, T., B. B. Bhattacharya, and G. M. and (2025). Bootstrapped edge count tests for nonparametric twosample inference under heterogeneity. Journal of Computational and Graphical Statistics 34(1), 306–317.
Banerjee, T., B. B. Bhattacharya, and G. Mukherjee (2020). A nearest-neighbor based nonparametric test for viral remodeling in heterogeneous single-cell proteomic data. The Annals of Applied Statistics 14(4), 1777 – 1805.
Baringhaus, L. and C. Franz (2004). On a new multivariate two-sample test. Journal of multivariate analysis 88(1), 190–206.
Biswas, M. and A. K. Ghosh (2014). A nonparametric two-sample test applicable to high dimensional data. Journal of Multivariate Analysis 123, 160–171.
Biswas, M., M. Mukhopadhyay, and A. K. Ghosh (2014). A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 101(4), 913–926.
Chen, H., X. Chen, and Y. Su (2018). A weighted edge-count two-sample test for multivariate and object data. Journal of the American Statistical Association 113(523), 1146–1155.
Chen, H. and J. H. Friedman (2017). A new graph-based two-sample test for multivariate and object data. Journal of the American Statistical Association 112(517), 397–409.
Chen, L. H. and Q.-M. Shao (2005). Stein’s method for normal approximation. An introduction to Stein’s method 4, 1–59.
Chu, L. and H. Chen (2019). Asymptotic distribution-free change-point detection for multivariate and noneuclidean data. The Annals of Statistics 47(1), 382–414.
Friedman, J. H. and L. C. Rafsky (1979). Multivariate generalizations of the wald-wolfowitz and smirnov twosample tests. The Annals of Statistics 7(4), 697–717.
Gretton, A., K. M. Borgwardt, M. J. Rasch, B. Sch¨olkopf, and A. Smola (2012). A kernel two-sample test. The Journal of Machine Learning Research 13(1), 723–773.
Hall, P. and N. Tajvidi (2002). Permutation tests for equality of distributions in high-dimensional settings. Biometrika 89(2), 359–374.
Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. The Annals of Statistics 16(2), 772–783.
Li, J. (2018). Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem. Biometrika 105(3), 529–546.
Liu, R. Y. and K. Singh (1993). A quality index based on data depth and multivariate rank tests. Journal of the American Statistical Association 88(421), 252–260.
Radovanovic, M., A. Nanopoulos, and M. Ivanovic (2010). Hubs in space: Popular nearest neighbors in highdimensional data. Journal of Machine Learning Research 11(sept), 2487–2531.
Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(4), 515–530.
Schilling, M. F. (1986). Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association 81(395), 799–806.
Song, H. and H. Chen (2023, 11). Generalized kernel two-sample tests. Biometrika 111(3), 755–770.
Sz´ekely, G. J., M. L. Rizzo, et al. (2004). Testing for equal distributions in high dimension. InterStat 5(16.10), 1249–1272.
Zhou, D. and H. Chen (2023, 12–15 Jul). A new ranking scheme for modern data and its application to two-sample hypothesis testing. In G. Neu and L. Rosasco (Eds.), Proceedings of Thirty Sixth Conference on Learning
Theory, Volume 195 of Proceedings of Machine Learning Research, pp. 3615–3668. PMLR.
Zhu, C. and X. Shao (2021). Interpoint distance based two sample tests in high dimension. Bernoulli 27(2), 1189 – 1211.
Zhu, Y. and H. Chen (2024a). Limiting distributions of graph-based test statistics on sparse and dense graphs. Bernoulli 30(1), 770 – 796.
Zhu, Y. and H. Chen (2024b). A new robust graph for graph-based methods. arXiv:2307.15205.

Acknowledgments

The authors thank the anonymous referees, an Associate Editor and the Editor for their

constructive comments that improved the quality of this article.

Supplementary Materials

The supplement contains additional simulations, figures, and technical proofs for Lemma

1, Remark 1, Theorem 1, Theorem 2 and Theorem 3.

Supplementary materials are available for download.

[1] Banerjee, T., B. B. Bhattacharya, and G. M. and (2025). Bootstrapped edge count tests for nonparametric twosample inference under heterogeneity. Journal of Computational and Graphical Statistics 34(1), 306–317.

[2] Banerjee, T., B. B. Bhattacharya, and G. Mukherjee (2020). A nearest-neighbor based nonparametric test for viral remodeling in heterogeneous single-cell proteomic data. The Annals of Applied Statistics 14(4), 1777 – 1805.

[3] Baringhaus, L. and C. Franz (2004). On a new multivariate two-sample test. Journal of multivariate analysis 88(1), 190–206.

[4] Biswas, M. and A. K. Ghosh (2014). A nonparametric two-sample test applicable to high dimensional data. Journal of Multivariate Analysis 123, 160–171.

[5] Biswas, M., M. Mukhopadhyay, and A. K. Ghosh (2014). A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 101(4), 913–926.

[6] Chen, H., X. Chen, and Y. Su (2018). A weighted edge-count two-sample test for multivariate and object data. Journal of the American Statistical Association 113(523), 1146–1155.

[7] Chen, H. and J. H. Friedman (2017). A new graph-based two-sample test for multivariate and object data. Journal of the American Statistical Association 112(517), 397–409.

[8] Chen, L. H. and Q.-M. Shao (2005). Stein’s method for normal approximation. An introduction to Stein’s method 4, 1–59.

[9] Chu, L. and H. Chen (2019). Asymptotic distribution-free change-point detection for multivariate and noneuclidean data. The Annals of Statistics 47(1), 382–414.

[10] Friedman, J. H. and L. C. Rafsky (1979). Multivariate generalizations of the wald-wolfowitz and smirnov twosample tests. The Annals of Statistics 7(4), 697–717.

[11] Gretton, A., K. M. Borgwardt, M. J. Rasch, B. Sch¨olkopf, and A. Smola (2012). A kernel two-sample test. The Journal of Machine Learning Research 13(1), 723–773.

[12] Hall, P. and N. Tajvidi (2002). Permutation tests for equality of distributions in high-dimensional settings. Biometrika 89(2), 359–374.

[13] Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. The Annals of Statistics 16(2), 772–783.

[14] Li, J. (2018). Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem. Biometrika 105(3), 529–546.

[15] Liu, R. Y. and K. Singh (1993). A quality index based on data depth and multivariate rank tests. Journal of the American Statistical Association 88(421), 252–260.

[16] Radovanovic, M., A. Nanopoulos, and M. Ivanovic (2010). Hubs in space: Popular nearest neighbors in highdimensional data. Journal of Machine Learning Research 11(sept), 2487–2531.

[17] Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(4), 515–530.

[18] Schilling, M. F. (1986). Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association 81(395), 799–806.

[19] Song, H. and H. Chen (2023, 11). Generalized kernel two-sample tests. Biometrika 111(3), 755–770.

[20] Sz´ekely, G. J., M. L. Rizzo, et al. (2004). Testing for equal distributions in high dimension. InterStat 5(16.10), 1249–1272.

[21] Zhou, D. and H. Chen (2023, 12–15 Jul). A new ranking scheme for modern data and its application to two-sample hypothesis testing. In G. Neu and L. Rosasco (Eds.), Proceedings of Thirty Sixth Conference on Learning

[22] Theory, Volume 195 of Proceedings of Machine Learning Research, pp. 3615–3668. PMLR.

[23] Zhu, C. and X. Shao (2021). Interpoint distance based two sample tests in high dimension. Bernoulli 27(2), 1189 – 1211.

[24] Zhu, Y. and H. Chen (2024a). Limiting distributions of graph-based test statistics on sparse and dense graphs. Bernoulli 30(1), 770 – 796.

[25] Zhu, Y. and H. Chen (2024b). A new robust graph for graph-based methods. arXiv:2307.15205.