Abstract

In this paper, we propose a new test of homogeneity for two high-dimensional random vectors. Our test is built on a new measure, the so-called characteristic distance, which completely characterizes the homogeneity of two distributions. The proposed metric has several desirable properties: for example, it has a clear and intuitive probabilistic interpretation, and it can be used for high-dimensional distance inference. Theoretically, the limiting behaviors under the conventional fixed-dimension setting and under high-dimensional distance inference are thoroughly investigated. Simulation studies and a real data analysis are presented to illustrate the finite-sample performance of the proposed test statistic.

Key words and phrases: Characteristic distance, high dimensionality, test of homogeneity, U-statistic, permutation procedure

Information

Preprint No.: SS-2023-0299
Manuscript ID: SS-2023-0299
Complete Authors: Xu Li, Gongming Shi, Baoxue Zhang
Corresponding Authors: Baoxue Zhang
Emails: zhangbaoxue@cueb.edu.cn

Acknowledgments

Zhang’s work is supported by the National Natural Science Foundation of China (12271370). Li’s work is supported by the National Natural Science Foundation of China (12401356) and the Natural Science Foundation of Shanxi Province, China (202203021222223, 20210302124262, 20210302124531).

Supplementary Materials

Additional supporting materials can be found in the Supplementary Materials, including proofs of the theoretical results presented in Sections 2-4, as well as some additional simulation results.


Supplementary materials are available for download.