Abstract
Heavy-tailed high-dimensional data are common in practice, arising in applications such as disease
prediction, gene expression analysis, and risk management. Dependence measures for symmetric
α-stable (SαS) random vectors are commonly applied to such data. However, for the existing
measures, a value of zero does not imply independence. To address this problem, we introduce a
novel measure of dependence, the extended codifference, for the SαS and non-symmetric α-stable
families of heavy-tailed distributions; it characterizes independence between heavy-tailed
variables for 0 < α < 2. We propose an efficient non-parametric estimator that does not require
estimation of tail indices, and we derive its asymptotic distribution. Furthermore, we provide a
guideline for selecting a suitable measure of dependence based on the properties of each measure
of association. Finally, we present several simulation studies for illustration and apply the
extended codifference to clustering single cells based on their RNA-seq expression to identify
cell types in adipose tissue.
Key words and phrases: Measure of dependence, stable random variables, infinite variance, codifference
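For orientation, the sketch below estimates the classical codifference, τ(X, Y) = ln E exp(i(X − Y)) − ln E exp(iX) − ln E exp(−iY), by plugging in empirical characteristic functions. It is not the extended codifference or the estimator proposed in this paper, whose definitions are not given in the abstract. The function name `empirical_codifference` and the scale argument `s` are illustrative assumptions. The example only shows how a characteristic-function-based measure can be computed without estimating a tail index.

```python
import numpy as np


def empirical_codifference(x, y, s=1.0):
    """Plug-in estimate of the classical codifference at scale s.

    Illustrative helper (hypothetical name), not the paper's estimator.
    Uses tau = ln E e^{is(X-Y)} - ln E e^{isX} - ln E e^{-isY},
    keeping the real part of each log (the log-modulus).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")

    # Empirical characteristic functions; each is a bounded sample mean,
    # so no finite-variance assumption on x or y is needed.
    phi_diff = np.mean(np.exp(1j * s * (x - y)))
    phi_x = np.mean(np.exp(1j * s * x))
    phi_y = np.mean(np.exp(-1j * s * y))

    # Real part of the complex log equals the log of the modulus.
    return float(
        np.log(np.abs(phi_diff)) - np.log(np.abs(phi_x)) - np.log(np.abs(phi_y))
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100_000
    # Independent standard Cauchy samples (alpha = 1, infinite variance).
    x = rng.standard_cauchy(n)
    y = rng.standard_cauchy(n)
    print("independent Cauchy:", empirical_codifference(x, y))
    # Dependent pair sharing a common Cauchy component.
    print("dependent Cauchy:  ", empirical_codifference(x, x + y))
```

For jointly Gaussian inputs and `s = 1`, this quantity reduces to the covariance, which gives a quick sanity check. For independent inputs it is zero in the population, although the converse need not hold for the classical measure.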
Information
| Preprint No. | SS-2025-0370 |
|---|---|
| Manuscript ID | SS-2025-0370 |
| Complete Authors | Vahed Maroufy, Mohsen Rezapour, Mahmoud Zarepour, Bahareh Afhami |
| Corresponding Authors | Vahed Maroufy |
| Emails | vahed.maroufy@uth.tmc.edu |
Acknowledgments
Dr. Maroufy and Rezapour were supported in part by Dr. Maroufy’s faculty start-up
funds from The University of Texas Health Science Center at Houston. Dr. Zarepour
was supported by NSERC grant RGPIN/2018-04008.
Supplementary Materials
The online Supplementary Material contains selected graphs for the simulation example,
along with proofs and detailed derivations of the theorems and remarks.