Abstract

Under supervised heterogeneity analysis, samples within a population form groups, and different groups have different regression models. In most of the existing analyses, a single level of heterogeneity structure is considered. Partly motivated by multi-level unsupervised analysis such as hierarchical clustering, we consider multi-level supervised heterogeneity analysis. Consider for example a two-level analysis. At the higher level, “coarse” information is used, and samples form a smaller number of groups. At the lower level, “more subtle” information is used, and samples form a larger number of subgroups. To achieve more lucid interpretations, we further consider the scenario where only some variables are relevant at each level, different groups (subgroups) have the same set of relevant variables, and the important variables at the higher level are nested in those at the lower level. A penalized estimation and selection approach is developed, and its theoretical and computational properties are established. Simulation demonstrates competitive performance of the proposed approach. In the analysis of TCGA breast cancer data, the proposed approach leads to sensible grouping/subgrouping, identification, and estimation results. Overall, this study expands the scope of heterogeneity analysis and delivers a practically useful tool.

Information

Preprint No.: SS-2025-0147
Manuscript ID: SS-2025-0147
Complete Authors: Ruiyue Wang, Sanguo Zhang, Shuangge Ma
Corresponding Authors: Shuangge Ma
Emails: shuangge.ma@yale.edu

Ruiyue Wang, School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China; Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, U.S.A.

Acknowledgments

We thank the editors and reviewers for their careful review and insightful comments. This work was supported by the National Natural Science Foundation of China No. 12571298, Fundamental Research Funds for the Central Universities, NIH CA204120, and NSF 220968.

Supplementary Materials

The online Supplementary Materials contain additional theoretical (referenced in Section 3), numerical (referenced in Sections 5 and 6), and methodological (referenced in Section 7) developments, and are available on the Statistica Sinica website.

