Abstract
Data privacy is important in the AI era, and differential privacy (DP)
is one of the gold-standard solutions. However, DP is typically applicable only if the data
have a bounded underlying distribution. We address this limitation by leveraging second-moment information from a small amount of public data. We propose
Public-moment-guided Truncation (PMT), which transforms private data using
the public second-moment matrix and applies a principled truncation whose radius depends only on non-private quantities: data dimension and sample size.
This transformation yields a well-conditioned second-moment matrix, enabling
its inversion with substantially greater robustness to DP noise. Furthermore, we demonstrate the applicability of PMT on penalized and generalized linear regressions. Specifically, we design new loss functions and algorithms, ensuring that solutions in the transformed space can be mapped back to the original domain. We establish improvements in the models' DP estimation through theoretical error bounds, robustness guarantees, and convergence results, attributing the gains to the conditioning effect of PMT. Experiments on synthetic and real datasets confirm that PMT substantially improves the accuracy and stability of DP estimators.

This research was partially supported by the National Natural Science Foundation of China (No. 12326615), the Major Key Project of PCL under Grant PCL2024A06, and the Independent Research Project of the National Key Laboratory of Big Data and Decision.
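The transformation and truncation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pmt_transform` is hypothetical, the default radius is a placeholder that merely depends only on the dimension and sample size (the paper's actual radius is not stated here), and "truncation" is read as clipping each row's Euclidean norm, which is one possible interpretation.

```python
import numpy as np


def pmt_transform(X_private, X_public, radius=None):
    """Whiten private rows with the public second-moment matrix, then clip norms.

    Hypothetical helper for illustration only.
    """
    n, d = X_private.shape
    # Second-moment matrix estimated from public rows only (no privacy cost).
    M_pub = X_public.T @ X_public / X_public.shape[0]
    # Symmetric inverse square root; assumes M_pub is positive definite.
    eigval, eigvec = np.linalg.eigh(M_pub)
    W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    Z = X_private @ W
    if radius is None:
        # ASSUMPTION: placeholder radius using only d and n, not the paper's formula.
        radius = np.sqrt(d) + np.sqrt(2.0 * np.log(n))
    # Clip each row so its Euclidean norm is at most `radius`.
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return Z * scale, W, radius


# Synthetic demo: badly scaled features, small public sample, larger private sample.
rng = np.random.default_rng(0)
scales = np.diag([10.0, 1.0, 0.1])
X_pub = rng.standard_normal((200, 3)) @ scales
X_priv = rng.standard_normal((2000, 3)) @ scales

Z, W, r = pmt_transform(X_priv, X_pub)
print("condition number, raw second moment:", np.linalg.cond(X_priv.T @ X_priv / 2000))
print("condition number, after transform:  ", np.linalg.cond(Z.T @ Z / 2000))
```

Because the unclipped transformed rows satisfy `Z = X_private @ W`, a linear coefficient vector `theta` fitted in the transformed space corresponds to `W @ theta` in the original coordinates. This is the simplest case of mapping a solution back to the original domain.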
Key words and phrases: Differential privacy, Public data, Data truncation, Penalized regression, Generalized linear model
Information
| Preprint No. | SS-2026-0091 |
|---|---|
| Manuscript ID | SS-2026-0091 |
| Complete Authors | Zilong Cao, Xuan Bi, Hai Zhang |
| Corresponding Authors | Hai Zhang |
| Emails | zhanghai@nwu.edu.cn |
Zilong Cao, School of Mathematics, Northwest University, Xi’an, China
Supplementary Materials
The supplementary material provides a comparison between PMT-based
and private-data-only inverse second-moment estimation, introduces DP-RR
and DP-LR baselines, and includes the key notation, complete proofs,
additional theoretical discussions, and supplementary experiments.