Abstract

Heavy-tailed high-dimensional data are common in applications such as disease prediction, gene expression analysis, and risk management. Dependence measures for symmetric α-stable (SαS) random vectors are commonly applied to such data. However, for the existing measures, a value of zero does not imply independence. To address this problem, we introduce a novel measure of dependence, the extended codifference, for the SαS and non-symmetric α-stable families of heavy-tailed distributions, which characterizes independence between heavy-tailed variables for 0 < α < 2. We propose an efficient non-parametric estimator that does not require estimating tail indices, and we derive its asymptotic distribution. Furthermore, we provide a guideline for selecting a suitable measure of dependence based on the properties of each measure. Finally, we present several simulation studies for illustration and apply the extended codifference to clustering single cells by their RNA-seq expression to identify cell types in adipose tissue.

Key words and phrases: Measure of dependence, stable random variables, infinite variance, codifference
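
For orientation, the classical codifference of two random variables is usually written as τ(X, Y) = ln E exp(i(X − Y)) − ln E exp(iX) − ln E exp(−iY), and it can be estimated by replacing each expectation with a sample mean, with no tail-index estimate involved. The sketch below illustrates that plug-in idea only; it is not the extended codifference or the estimator proposed in this paper, whose definitions are not given in this section. The function name `empirical_codifference`, the use of standard Cauchy samples (the α = 1 symmetric stable case), the sample size, and the seed are all assumptions made for illustration.

```python
import numpy as np


def empirical_codifference(x, y):
    """Plug-in estimate of the classical codifference.

    tau(X, Y) = ln E exp(i(X - Y)) - ln E exp(iX) - ln E exp(-iY),
    with each expectation replaced by a sample mean (an empirical
    characteristic function evaluated at a fixed argument).
    Illustrative only: this is NOT the paper's extended codifference.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    phi_diff = np.mean(np.exp(1j * (x - y)))   # estimate of E exp(i(X - Y))
    phi_x = np.mean(np.exp(1j * x))            # estimate of E exp(iX)
    phi_neg_y = np.mean(np.exp(-1j * y))       # estimate of E exp(-iY)
    tau = np.log(phi_diff) - np.log(phi_x) - np.log(phi_neg_y)
    # For symmetric laws the population value is real; keep the real part.
    return float(tau.real)


# Hypothetical demo data: standard Cauchy draws (symmetric stable, alpha = 1).
rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_cauchy(n)
z = rng.standard_cauchy(n)

tau_indep = empirical_codifference(x, z)  # X and Z drawn independently
tau_dep = empirical_codifference(x, x)    # perfectly dependent pair
```

For standard Cauchy variables, E exp(iX) = exp(−1), so the population value is 0 for the independent pair and 2 for the pair (X, X); the estimates should land close to those values at this sample size.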

Information

Preprint No.: SS-2025-0370
Manuscript ID: SS-2025-0370
Complete Authors: Vahed Maroufy, Mohsen Rezapour, Mahmoud Zarepour, Bahareh Afhami
Corresponding Author: Vahed Maroufy
Email: vahed.maroufy@uth.tmc.edu

Acknowledgments

Drs. Maroufy and Rezapour were supported in part by Dr. Maroufy’s faculty start-up funds from The University of Texas Health Science Center at Houston. Dr. Zarepour was supported by NSERC grant RGPIN/2018-04008.

Supplementary Materials

The online Supplementary Material contains selected graphs for the simulation example, along with proofs and detailed derivations of the theorems and remarks.


Supplementary materials are available for download.