Reproducible Learning in Large-Scale Multiple Graphical Models

Jia Zhou, Guangming Pan, Zeming Zheng and Changchun Tan

doi:10.5705/ss.202023.0099

Abstract

Reproducible learning of the underlying structure in large-scale network data is important in many contemporary applications. Despite the fast-growing literature on this subject, the practical issue of data heterogeneity has rarely been addressed. In this paper, we propose a new method called the multiple graphical knockoff filter to efficiently recover the underlying sparse connected structure of a general population from a high-dimensional heterogeneous dataset. We provide theoretical justification for the asymptotic false discovery rate control and also establish the theory for the power analysis. To the best of our knowledge, this is the first formal theoretical result on the power of the graphical knockoffs procedure. Our new methodology and results are supported by numerical studies.

Key words and phrases: False discovery rate, Heterogeneity, Multiple graphical models, High-dimensionality, Power

Information

Preprint No.	SS-2023-0099
Manuscript ID	SS-2023-0099
Complete Authors	Jia Zhou, Guangming Pan, Zeming Zheng, Changchun Tan
Corresponding Authors	Jia Zhou
Emails	tszhjia@mail.ustc.edu.cn

Acknowledgments

We thank the editor, associate editor, and referees for their insightful comments. Zheng’s research is supported by the National Key Research and Development Program of China (Grant No. 2022YFA1008000). Pan’s research is supported by the Ministry of Education, Singapore (Grant No. MOE-…). Zhou’s research is supported by the Natural Science Foundation of Hefei University of Technology (Grant No. JZ2023HGQA0085).

Supplementary Materials

The supplementary materials available online include four auxiliary lemmas, the proofs of all lemmas and Theorems 1-2, and two figures for the real data analysis mentioned in Section 5.

Supplementary materials are available for download.
