Grouped Heterogeneous Gaussian Graphical Models for High-Dimensional Clustered Data

Xin Zeng, Shuangge Ma and Qingzhao Zhang

doi:10.5705/ss.202024.0258

Abstract

Clustered data-based analysis has been extensively conducted in various studies. Recent research has demonstrated that a network-based heterogeneity analysis, which adopts a system perspective and incorporates the interconnections among variables while considering heterogeneity between components, can provide more informative results compared to approaches based on simpler statistics. Moreover, incorporating grouping strategies in analysis can better delineate the sources of heterogeneity and enable more flexible modeling for clustered data. In this article, we introduce a novel approach called the grouped heterogeneous Gaussian graphical models (Grouped-HGGM) for network analysis of high-dimensional clustered data. Our approach assumes that clusters can be divided into distinct groups, and any heterogeneity across clusters is captured through the cluster-wise mixture probabilities. Unlike most previous approaches that assume that the number of components is known in advance, an appealing feature of our method is the automatic determination of the number of components and sparse estimation using a fusion technique. Consistency properties are rigorously established, and an effective computational algorithm is developed. Extensive simulations demonstrate the practical superiority of the proposed approach over closely related alternatives. In the analysis of breast cancer data, the proposed approach identifies heterogeneity structures different from the alternatives.

Key words and phrases: Clustered data, Gaussian graphical models, Grouping strategies, Heterogeneity analysis
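
The data structure described in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the authors' estimation procedure: it only generates clustered data in which every cluster belongs to a group, each group has its own mixture probabilities, and the Gaussian components are shared across groups. All names and values (`GROUP_PROBS`, `COMPONENTS`, the round-robin group assignment, the AR(1) chain used to obtain a sparse precision matrix) are assumptions made for illustration. The fusion-penalized estimation and the automatic choice of the number of components are not implemented here.

```python
import random


def sample_chain_gaussian(p, rho, mean, rng):
    """Draw one p-vector with covariance rho^|i-j| (a stationary AR(1) chain).

    The precision matrix of this distribution is tridiagonal, so the
    corresponding Gaussian graph is a sparse chain.
    """
    x = [rng.gauss(0.0, 1.0)]
    scale = (1.0 - rho ** 2) ** 0.5
    for _ in range(p - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return [mean + v for v in x]


def simulate_grouped_mixture(n_clusters, n_per_cluster, p,
                             group_probs, components, seed=0):
    """Simulate clustered data whose mixture probabilities depend on the group.

    Returns a list of (cluster, group, component, observation) tuples.
    """
    rng = random.Random(seed)
    n_groups = len(group_probs)
    data = []
    for cluster in range(n_clusters):
        # Hypothetical assignment: clusters are spread over groups round-robin.
        group = cluster % n_groups
        probs = group_probs[group]
        for _ in range(n_per_cluster):
            # Latent component label drawn with the group's mixture probabilities.
            k = rng.choices(range(len(components)), weights=probs)[0]
            rho, mean = components[k]
            obs = sample_chain_gaussian(p, rho, mean, rng)
            data.append((cluster, group, k, obs))
    return data


# Assumed illustration values: two groups, two shared components.
GROUP_PROBS = [[0.8, 0.2], [0.3, 0.7]]   # one probability vector per group
COMPONENTS = [(0.6, 0.0), (-0.4, 2.0)]   # (rho, mean) per component

sample = simulate_grouped_mixture(6, 50, 5, GROUP_PROBS, COMPONENTS)
```

In this sketch the heterogeneity across clusters enters only through the group-specific probability vectors, while the component-level networks are common to all clusters.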

Information

Preprint No.	SS-2024-0258
Manuscript ID	SS-2024-0258
Complete Authors	Xin Zeng, Shuangge Ma, Qingzhao Zhang
Corresponding Authors	Qingzhao Zhang
Emails	zhangqingzhao@amss.ac.cn

References

Cai, T. T., Liu, W. and Zhou, H. H. (2016). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Ann. Statist. 44, 455–488.
Chen, X., Feng, Z. and Peng, H. (2023). Estimation and order selection for multivariate exponential power mixture models. J. Multivariate Anal. 195, 105140.
Danaher, P., Wang, P. and Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. J. Roy. Statist. Soc. B 76, 373–397.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Fokkema, M., Smits, N., Zeileis, A., Hothorn, T. and Kelderman, H. (2018). Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees. Behav. Res. Methods 50, 2016–2034.
Galbraith, S., Daniel, J. A. and Vissel, B. (2010). A study of clustered data and approaches to its analysis. J. Neurosci. 30, 10601–10608.
Gao, C., Zhu, Y., Shen, X. and Pan, W. (2016). Estimation of multiple networks in Gaussian mixture models. Electron. J. Stat. 10, 1133–1154.
Göbler, K., Drton, M., Mukherjee, S. and Miloschewski, A. (2024). High-dimensional undirected graphical models for arbitrary mixed data. Electron. J. Stat. 18, 2339–2404.
Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika 98, 1–15.
Hao, B., Sun, W. W., Liu, Y. and Cheng, G. (2018). Simultaneous clustering and estimation of heterogeneous graphical models. J. Mach. Learn. Res. 18, 7981–8038.
KEGG. (Kyoto Encyclopedia of Genes and Genomes). https://www.genome.jp/pathway/hsa05224. Accessed on 7/16/2023.
Li, Y., Xu, S., Ma, S. and Wu, M. (2022). Network-based cancer heterogeneity analysis incorporating multi-view of prior information. Bioinformatics 38, 2855–2862.
McLachlan, G. J. and Peel, D. (2000). Finite mixture models, New York: Wiley.
Pei, Y., Peng, H. and Xu, J. (2022). A latent class Cox model for heterogeneous time-to-event data. J. Econometrics 239, 105351.
Pereda-Fernandez, S. (2021). Copula-based random effects models for clustered data. J. Bus. Econom. Statist. 39, 575–588.
Ren, M., Zhang, S., Zhang, Q. and Ma, S. (2022). Gaussian graphical model-based heterogeneity analysis via penalized fusion. Biometrics 78, 524–535.
Rodriguez, A., Dunson, D. B. and Gelfand, A. E. (2008). The nested Dirichlet process. J. Amer. Statist. Assoc. 103, 1131–1154.
Sugasawa, S. (2021). Grouped heterogeneous mixture modeling for clustered data. J. Amer. Statist. Assoc. 116, 999–1010.
Sung, H., Ferlay, J., Siegel, R. L., Laversanne, M., Soerjomataram, I., Jemal, A., et al. (2021). Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249.
TCGA. (The Cancer Genome Atlas). https://portal.gdc.cancer.gov/projects/TCGA-BRCA. Accessed on 7/16/2023.
Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006). Hierarchical Dirichlet processes. J. Amer. Statist. Assoc. 101, 1566–1581.
Wang, B., Zhang, Y., Sun, W. W. and Fang, Y. (2018). Sparse convex clustering. J. Comput. Graph. Statist. 27, 393–403.
Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38, 894–942.

Xin Zeng, Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, China

Acknowledgments

We thank the Editor, Associate Editor, and two reviewers for their careful review and insightful comments. This study is supported by the Humanities and Social Science Foundation of the Ministry of Education of China (24YJA910007), NIH CA204120, and NSF 2209685.

Supplementary Materials

The online supplementary materials contain additional computational, theoretical, and numerical results.

