Inference of Community Numbers in Partial Networks

Xiyue Zhu, Xiao Han and Qing Yang

doi:10.5705/ss.202024.0305

Abstract

Despite considerable efforts towards community detection from a com

prehensive network perspective, the investigation of this problem becomes less

explored when individuals only have access to their local view. In this paper, we

propose an approach for testing and estimating the unknown number of communities K in the global network model, but only with limited partial information

available. Our procedure constructs a test statistic based on singular values and

eigenvalues of partitioned matrices derived from a centered and rescaled partial

adjacency matrix. We establish the asymptotic null distribution for testing and

demonstrate consistency in estimating K using results from random matrix theory. The effectiveness and usefulness of our proposed method are demonstrated

through extensive simulations, including both directed and undirected graphs, as

well as several real data examples.

Key words and phrases: Partial network, community structure, Tracy-Widom distribution

Information

Preprint No.	SS-2024-0305
Manuscript ID	SS-2024-0305
Complete Authors	Xiyue Zhu, Xiao Han, Qing Yang
Corresponding Authors	Qing Yang
Emails	yangq@ustc.edu.cn

References

Airoldi, E. M., D. M. Blei, S. E. Fienberg, and E. P. Xing (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9, 1981–2014.
Alidaee, H., E. Auerbach, and M. P. Leung (2020). Recovering network structure from aggregated relational data using penalized regression. arXiv preprint arXiv:2001.06052.
Banerjee, A., A. G. Chandrasekhar, E. Duflo, and M. O. Jackson (2013). The diffusion of microfinance. Science 341(6144), 1236498.
Banerjee, D. and Z. Ma (2017). Optimal hypothesis testing for stochastic block models with growing degrees. arXiv preprint arXiv:1705.05305.
Bickel, P. J. and P. Sarkar (2016). Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 253–273.
Breza, E., A. G. Chandrasekhar, S. Lubold, T. H. McCormick, and M. Pan (2023). Consistently estimating network statistics using aggregated relational data. Proceedings of the National Academy of Sciences 120(21), e2207185120.
Chatterjee, S. et al. (2015). Matrix estimation by universal singular value thresholding. Annals of Statistics 43(1), 177–214.
Chaudhuri, K., F. Chung, and A. Tsiatas (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. In Conference on Learning Theory, pp. 35–1. JMLR Workshop and Conference Proceedings.
Chen, K. and J. Lei (2018). Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association 113(521), 241–251.
Efron, B. and R. J. Tibshirani (1994). An introduction to the bootstrap. CRC press.
Freeman, L. C. (1982). Centered graphs and the structure of ego networks. Mathematical Social Sciences 3(3), 291–304.
Girvan, M. and M. E. Newman (2002). Community structure in social and biological networks. Proceedings of the national academy of sciences 99(12), 7821–7826.
Han, X., Y. R. Wang, Q. Yang, and X. Tong (2024). Individual-centered partial information in social networks. Journal of Machine Learning Research 25(230), 1–60.
Han, X., Q. Yang, and Y. Fan (2023). Universal rank inference via residual subsampling with application to large networks. The Annals of Statistics 51(3), 1109–1133.
Hofman, J. M. and C. H. Wiggins (2008). Bayesian approach to network modularity. Physical review letters 100(25), 258701.
Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). Stochastic blockmodels: First steps. Social networks 5(2), 109–137.
Hwang, N., J. Xu, S. Chatterjee, and S. Bhattacharyya (2024). On the estimation of the number of communities for sparse networks. Journal of the American Statistical Association 119(547), 1895–1910.
Jin, J. (2015). Fast community detection by score. The Annals of Statistics 43(1), 57–89.
Jin, J., Z. T. Ke, S. Luo, and M. Wang (2023). Optimal estimation of the number of network communities. Journal of the American Statistical Association 118(543), 2101–2116.
Karrer, B. and M. E. Newman (2011). Stochastic blockmodels and community structure in networks. Physical review E 83(1), 016107.
Krzakala, F., C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborov´a, and P. Zhang (2013). Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences 110(52), 20935–20940.
Latouche, P., E. Birmele, and C. Ambroise (2012). Variational bayesian inference and complexity control for stochastic block models. Statistical Modelling 12(1), 93–115.
Le, C. M. and E. Levina (2015). Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827.
Lei, J. (2016). A goodness-of-fit test for stochastic block models. The Annals of Statistics 44(1), 401–424. Li, T., E. Levina,
and J. Zhu (2020). Network cross-validation by edge sampling. Biometrika 107(2), 257–276.
Li, T., Y.-J. Wu, E. Levina, and J. Zhu (2023). Link prediction for egocentrically sampled networks. Journal of Computational and Graphical Statistics 32(4), 1296–1319.
McDaid, A. F., T. B. Murphy, N. Friel, and N. J. Hurley (2013). Improved bayesian inference for the stochastic block model with application to large networks. Computational Statistics & Data Analysis 60, 12–31.
Ren, M., S. Zhang, and J. Wang (2023). Consistent estimation of the number of communities via regularized network embedding. Biometrics 79(3), 2404–2416.
Riolo, M. A., G. T. Cantwell, G. Reinert, and M. E. Newman (2017). Efficient method for estimating the number of communities in a network. Physical review E 96(3), 032310.
Rohe, K. (2019). A critical threshold for design effects in network sampling. The Annals of Statistics 47(1), 556–582.
Rohe, K., S. Chatterjee, and B. Yu (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39(4), 1878–1915.
Rohe, K., T. Qin, and B. Yu (2016). Co-clustering directed graphs to discover asymmetries and directional communities. Proceedings of the National Academy of Sciences 113(45), 12679–12684.
Saldana, D. F., Y. Yu, and Y. Feng (2017). How many communities are there? Journal of Computational and Graphical Statistics 26(1), 171–181.
Salganik, M. J. and D. D. Heckathorn (2004). 5. sampling and estimation in hidden populations using respondent-driven sampling. Sociological methodology 34(1), 193–240.
Sarkar, P. and P. J. Bickel (2015). Role of normalization in spectral clustering for stochastic blockmodels. The Annals of Statistics 43(3), 962–990.
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing 17(4), 395–416.
Wang, Y. J. and G. Y. Wong (1987). Stochastic blockmodels for directed graphs. Journal of the American Statistical Association 82(397), 8–19.
Wang, Y. R. and P. J. Bickel (2017). Likelihood-based model selection for stochastic block models. The Annals of Statistics 45(2), 500–528.
Yan, Y., B. Hanlon, S. Roch, and K. Rohe (2020). Asymptotic seed bias in respondent-driven sampling. Electronic Journal of Statistics 14, 1577–1610.
Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of anthropological research 33(4), 452–473. Xiyue Zhu

Acknowledgments

This work was supported by National Natural Science Foundation of China

(Grant No.12371278), National Natural Science Foundation of China (Grant

No.12571297), National Key R&D Program of China-2022YFA1008000,

National Natural Science Foundation of China (Grant No.12101585), the

Talents Introduction Program of the Chinese Academy of Sciences (Category B), and the Young Elite Scientist Sponsorship Program by Cast (No.

The authors would like to thank the editorial team and reviewers for

their invaluable insights and constructive suggestions, which have significantly improved the quality of this manuscript.

Supplementary Materials

The Supplementary Material contains all technical proofs and some additional real data analysis results.

Supplementary materials are available for download.

[1] Airoldi, E. M., D. M. Blei, S. E. Fienberg, and E. P. Xing (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9, 1981–2014.

[2] Alidaee, H., E. Auerbach, and M. P. Leung (2020). Recovering network structure from aggregated relational data using penalized regression. arXiv preprint arXiv:2001.06052.

[3] Banerjee, A., A. G. Chandrasekhar, E. Duflo, and M. O. Jackson (2013). The diffusion of microfinance. Science 341(6144), 1236498.

[4] Banerjee, D. and Z. Ma (2017). Optimal hypothesis testing for stochastic block models with growing degrees. arXiv preprint arXiv:1705.05305.

[5] Bickel, P. J. and P. Sarkar (2016). Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 253–273.

[6] Breza, E., A. G. Chandrasekhar, S. Lubold, T. H. McCormick, and M. Pan (2023). Consistently estimating network statistics using aggregated relational data. Proceedings of the National Academy of Sciences 120(21), e2207185120.

[7] Chatterjee, S. et al. (2015). Matrix estimation by universal singular value thresholding. Annals of Statistics 43(1), 177–214.

[8] Chaudhuri, K., F. Chung, and A. Tsiatas (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. In Conference on Learning Theory, pp. 35–1. JMLR Workshop and Conference Proceedings.

[9] Chen, K. and J. Lei (2018). Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association 113(521), 241–251.

[10] Efron, B. and R. J. Tibshirani (1994). An introduction to the bootstrap. CRC press.

[11] Freeman, L. C. (1982). Centered graphs and the structure of ego networks. Mathematical Social Sciences 3(3), 291–304.

[12] Girvan, M. and M. E. Newman (2002). Community structure in social and biological networks. Proceedings of the national academy of sciences 99(12), 7821–7826.

[13] Han, X., Y. R. Wang, Q. Yang, and X. Tong (2024). Individual-centered partial information in social networks. Journal of Machine Learning Research 25(230), 1–60.

[14] Han, X., Q. Yang, and Y. Fan (2023). Universal rank inference via residual subsampling with application to large networks. The Annals of Statistics 51(3), 1109–1133.

[15] Hofman, J. M. and C. H. Wiggins (2008). Bayesian approach to network modularity. Physical review letters 100(25), 258701.

[16] Holland, P. W., K. B. Laskey, and S. Leinhardt (1983). Stochastic blockmodels: First steps. Social networks 5(2), 109–137.

[17] Hwang, N., J. Xu, S. Chatterjee, and S. Bhattacharyya (2024). On the estimation of the number of communities for sparse networks. Journal of the American Statistical Association 119(547), 1895–1910.

[18] Jin, J. (2015). Fast community detection by score. The Annals of Statistics 43(1), 57–89.

[19] Jin, J., Z. T. Ke, S. Luo, and M. Wang (2023). Optimal estimation of the number of network communities. Journal of the American Statistical Association 118(543), 2101–2116.

[20] Karrer, B. and M. E. Newman (2011). Stochastic blockmodels and community structure in networks. Physical review E 83(1), 016107.

[21] Krzakala, F., C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborov´a, and P. Zhang (2013). Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences 110(52), 20935–20940.

[22] Latouche, P., E. Birmele, and C. Ambroise (2012). Variational bayesian inference and complexity control for stochastic block models. Statistical Modelling 12(1), 93–115.

[23] Le, C. M. and E. Levina (2015). Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827.

[24] Lei, J. (2016). A goodness-of-fit test for stochastic block models. The Annals of Statistics 44(1), 401–424. Li, T., E. Levina,

[25] and J. Zhu (2020). Network cross-validation by edge sampling. Biometrika 107(2), 257–276.

[26] Li, T., Y.-J. Wu, E. Levina, and J. Zhu (2023). Link prediction for egocentrically sampled networks. Journal of Computational and Graphical Statistics 32(4), 1296–1319.

[27] McDaid, A. F., T. B. Murphy, N. Friel, and N. J. Hurley (2013). Improved bayesian inference for the stochastic block model with application to large networks. Computational Statistics & Data Analysis 60, 12–31.

[28] Ren, M., S. Zhang, and J. Wang (2023). Consistent estimation of the number of communities via regularized network embedding. Biometrics 79(3), 2404–2416.

[29] Riolo, M. A., G. T. Cantwell, G. Reinert, and M. E. Newman (2017). Efficient method for estimating the number of communities in a network. Physical review E 96(3), 032310.

[30] Rohe, K. (2019). A critical threshold for design effects in network sampling. The Annals of Statistics 47(1), 556–582.

[31] Rohe, K., S. Chatterjee, and B. Yu (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39(4), 1878–1915.

[32] Rohe, K., T. Qin, and B. Yu (2016). Co-clustering directed graphs to discover asymmetries and directional communities. Proceedings of the National Academy of Sciences 113(45), 12679–12684.

[33] Saldana, D. F., Y. Yu, and Y. Feng (2017). How many communities are there? Journal of Computational and Graphical Statistics 26(1), 171–181.

[34] Salganik, M. J. and D. D. Heckathorn (2004). 5. sampling and estimation in hidden populations using respondent-driven sampling. Sociological methodology 34(1), 193–240.

[35] Sarkar, P. and P. J. Bickel (2015). Role of normalization in spectral clustering for stochastic blockmodels. The Annals of Statistics 43(3), 962–990.

[36] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing 17(4), 395–416.

[37] Wang, Y. J. and G. Y. Wong (1987). Stochastic blockmodels for directed graphs. Journal of the American Statistical Association 82(397), 8–19.

[38] Wang, Y. R. and P. J. Bickel (2017). Likelihood-based model selection for stochastic block models. The Annals of Statistics 45(2), 500–528.

[39] Yan, Y., B. Hanlon, S. Roch, and K. Rohe (2020). Asymptotic seed bias in respondent-driven sampling. Electronic Journal of Statistics 14, 1577–1610.

[40] Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of anthropological research 33(4), 452–473. Xiyue Zhu