Abstract
This paper addresses critical data-sharing issues that arise when analyzing individual-level data distributed across multiple sites, particularly under stringent privacy constraints and site heterogeneity. In many multi-site clinical trials, for example, privacy concerns restrict sharing to site-specific summary statistics rather than raw data, complicating the analysis of global effects relative to individual or site-specific effects. We propose a robust distributed framework for high-dimensional, heterogeneous data analysis that overcomes these limitations. The framework builds on a heterogeneous model that integrates both global and site-specific effects, employing nonconvex regularization via difference-of-convex programming under an ℓ0 constraint to ensure selection consistency. Although the underlying optimization problem is NP-hard in the worst case, our method converges to the global minimizer in polynomial time with high probability under realistic conditions. Moreover, by applying ℓ0 penalization exclusively to nuisance parameters while leaving the hypothesized parameters unpenalized, the approach yields valid statistical inference. This work not only advances methodological research but also directly addresses the challenges of data sharing in distributed data environments.
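As a rough illustration of the difference-of-convex idea mentioned above, the sketch below runs DC iterations for a truncated-ℓ1 surrogate of an ℓ0 constraint in a plain sparse linear model. This is a minimal sketch of the general DC scheme, not the paper's algorithm: the function names (`dc_tlp`, `ista_weighted_lasso`) and all tuning values are hypothetical, and the global/site-specific decomposition is omitted.

```python
import numpy as np

def ista_weighted_lasso(X, y, w, beta0, n_iter=500):
    """Proximal gradient (ISTA) for 0.5/n * ||y - X b||^2 + sum_j w_j |b_j|."""
    n = X.shape[0]
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1/L with L = sigma_max(X)^2 / n
    b = beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)  # soft-threshold
    return b

def dc_tlp(X, y, lam=0.2, tau=0.5, n_outer=10):
    """Difference-of-convex iterations for the truncated-L1 penalty
    lam * sum_j min(|b_j|, tau), a computable surrogate for an L0
    constraint: each outer step linearizes the concave part, leaving a
    weighted lasso that penalizes only coordinates inside [-tau, tau]."""
    b = np.zeros(X.shape[1])
    for _ in range(n_outer):
        w = lam * (np.abs(b) <= tau).astype(float)  # active coords go unpenalized
        b = ista_weighted_lasso(X, y, w, b)
    return b
```

Coordinates whose current estimate escapes the band [-τ, τ] receive zero penalty at the next DC step, which loosely mirrors the abstract's idea of leaving certain parameters unpenalized while the nuisance parameters absorb the sparsity penalty.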
Information
| Preprint No. | SS-2025-0087 |
|---|---|
| Manuscript ID | SS-2025-0087 |
| Authors | Hongru Zhao, Xiaotong Shen |
| Corresponding Author | Hongru Zhao |
| Email | zhao1118@umn.edu |
Hongru Zhao
School of Statistics, University of Minnesota, Twin Cities, MN 55455, U.S.A.
Acknowledgments
The authors thank the reviewers for their careful reading and constructive comments, which have helped improve the quality and clarity of the
manuscript. This work was supported in part by NSF grant DMS-2513668
and NIH grants R01AG069895, R01AG065636, R01AG074858, and U01AG073079.
Supplementary Materials
In the Supplementary Material, we present the threshold-based selection
procedure (Algorithm S2) and the proofs of Theorems 1, 2, and 3. In addition,
Section S1 extends the heterogeneous linear model of Remark 1 to site-specific
heteroskedastic errors and introduces a block coordinate-descent estimator for
jointly estimating the coefficients and variances.
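The block coordinate-descent idea can be sketched in a stripped-down form: a shared coefficient vector with site-specific error variances, alternating a variance-weighted least-squares update for the coefficients with closed-form per-site variance updates. This is a minimal sketch under assumed Gaussian errors, not the estimator of Section S1; the name `bcd_hetero` is hypothetical, and the site-specific mean effects and penalties are omitted.

```python
import numpy as np

def bcd_hetero(Xs, ys, n_iter=50):
    """Block coordinate descent for y_k = X_k beta + e_k, e_k ~ N(0, s2_k I):
    alternate (i) a variance-weighted least-squares update for the shared
    beta and (ii) closed-form maximum-likelihood variance updates per site."""
    p = Xs[0].shape[1]
    beta = np.zeros(p)
    s2 = np.ones(len(Xs))
    for _ in range(n_iter):
        # beta-block: solve sum_k X_k'X_k/s2_k beta = sum_k X_k'y_k/s2_k
        A = sum(X.T @ X / v for X, v in zip(Xs, s2))
        rhs = sum(X.T @ y / v for X, y, v in zip(Xs, ys, s2))
        beta = np.linalg.solve(A, rhs)
        # variance-block: per-site mean squared residual
        s2 = np.array([np.mean((y - X @ beta) ** 2) for X, y in zip(Xs, ys)])
    return beta, s2
```

Each block update has a closed form, so the objective (the negative log-likelihood) is nonincreasing across iterations; sites with smaller estimated variance automatically receive more weight in the coefficient update.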