Abstract
This paper addresses critical data-sharing issues that arise when analyzing individual-level data distributed across multiple sites, particularly under stringent privacy constraints and site heterogeneity. In many multi-site clinical trials, for example, privacy concerns restrict sharing to site-specific summary statistics rather than raw data, complicating the analysis of global effects relative to individual or site-specific effects. We propose a robust distributed framework for high-dimensional, heterogeneous data analysis that overcomes these limitations. The framework builds on a heterogeneous model that integrates both global and site-specific effects, employing nonconvex regularization via difference-of-convex programming under an ℓ0 constraint to ensure selection consistency. Although the underlying optimization problem is NP-hard in the worst case, our method converges to the global minimizer in polynomial time with high probability under realistic conditions. Moreover, by applying ℓ0 penalization exclusively to nuisance parameters while leaving the hypothesized parameters unpenalized, the approach yields valid statistical inference. This work not only advances methodological research but also directly addresses the challenges of data sharing in distributed data environments.
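As a rough illustration of the difference-of-convex idea mentioned above, the sketch below runs DC iterations for a truncated-ℓ1 surrogate of an ℓ0 constraint in a plain sparse linear model. This is a minimal sketch of the general DC scheme, not the paper's algorithm: the function names (`dc_tlp`, `ista_weighted_lasso`) and all tuning values are hypothetical, and the global/site-specific decomposition is omitted.

```python
import numpy as np

def ista_weighted_lasso(X, y, w, beta0, n_iter=500):
    """Proximal gradient (ISTA) for 0.5/n * ||y - X b||^2 + sum_j w_j |b_j|."""
    n = X.shape[0]
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1/L with L = sigma_max(X)^2 / n
    b = beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)  # soft-threshold
    return b

def dc_tlp(X, y, lam=0.2, tau=0.5, n_outer=10):
    """Difference-of-convex iterations for the truncated-L1 penalty
    lam * sum_j min(|b_j|, tau), a computable surrogate for an L0
    constraint: each outer step linearizes the concave part, leaving a
    weighted lasso that penalizes only coordinates inside [-tau, tau]."""
    b = np.zeros(X.shape[1])
    for _ in range(n_outer):
        w = lam * (np.abs(b) <= tau).astype(float)  # active coords go unpenalized
        b = ista_weighted_lasso(X, y, w, b)
    return b
```

Coordinates whose current estimate escapes the band [-τ, τ] receive zero penalty at the next DC step, which loosely mirrors the abstract's idea of leaving certain parameters unpenalized while the nuisance parameters absorb the sparsity penalty.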
Information
| Preprint No. | SS-2025-0087 |
|---|---|
| Manuscript ID | SS-2025-0087 |
| Authors | Hongru Zhao, Xiaotong Shen |
| Corresponding Author | Hongru Zhao |
| Email | zhao1118@umn.edu |
Hongru Zhao
School of Statistics, University of Minnesota, Twin Cities, MN 55455, U.S.A.
Acknowledgments
The authors thank the reviewers for their careful reading and constructive comments, which have helped improve the quality and clarity of the
manuscript. This work was supported in part by NSF grant DMS-2513668
and NIH grants R01AG069895, R01AG065636, R01AG074858, and U01AG073079.
Supplementary Materials
In the Supplementary Material, we present the threshold-based selection
procedure (Algorithm S2) and the proofs of Theorems 1, 2, and 3. In addition,
Section S1 extends the heterogeneous linear model of Remark 1 to site-specific
heteroskedastic errors and introduces a block coordinate-descent estimator for
jointly estimating the coefficients and variances.
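The block coordinate-descent idea can be sketched in a stripped-down form: a shared coefficient vector with site-specific error variances, alternating a variance-weighted least-squares update for the coefficients with closed-form per-site variance updates. This is a minimal sketch under assumed Gaussian errors, not the estimator of Section S1; the name `bcd_hetero` is hypothetical, and the site-specific mean effects and penalties are omitted.

```python
import numpy as np

def bcd_hetero(Xs, ys, n_iter=50):
    """Block coordinate descent for y_k = X_k beta + e_k, e_k ~ N(0, s2_k I):
    alternate (i) a variance-weighted least-squares update for the shared
    beta and (ii) closed-form maximum-likelihood variance updates per site."""
    p = Xs[0].shape[1]
    beta = np.zeros(p)
    s2 = np.ones(len(Xs))
    for _ in range(n_iter):
        # beta-block: solve sum_k X_k'X_k/s2_k beta = sum_k X_k'y_k/s2_k
        A = sum(X.T @ X / v for X, v in zip(Xs, s2))
        rhs = sum(X.T @ y / v for X, y, v in zip(Xs, ys, s2))
        beta = np.linalg.solve(A, rhs)
        # variance-block: per-site mean squared residual
        s2 = np.array([np.mean((y - X @ beta) ** 2) for X, y in zip(Xs, ys)])
    return beta, s2
```

Each block update has a closed form, so the objective (the negative log-likelihood) is nonincreasing across iterations; sites with smaller estimated variance automatically receive more weight in the coefficient update.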