Distributed Sequential Federated Estimation

Zhanfeng Wang, Xinyu Zhang and Yuan-chin Chang

doi:10.5705/ss.202024.0215

Abstract

When analyzing data stored across multiple sites, concerns about data security and communication arise. Federated learning, which avoids centralizing data, offers a promising solution to address these concerns. However, integrating information from separate local sites in a statistically sound manner is crucial, as common averaging methods may lead to information loss due to data non-homogeneity and incomparable results among sites. By applying sequential methods in federated learning, integration can be facilitated and the analysis process can be accelerated, particularly within a distributed computing framework. We propose an efficient data-driven method that maintains the principles of classical sequential adaptive design. Numerical studies and an application to COVID-19 data from 32 hospitals in Mexico, using a regression model, illustrate the effectiveness of our approach.

Key words and phrases: Adaptive sampling; Data communication; Random average; Sequential sampling

Information

Preprint No.	SS-2024-0215
Manuscript ID	SS-2024-0215
Complete Authors	Zhanfeng Wang, Xinyu Zhang, Yuan-chin Chang
Corresponding Authors	Yuan-chin Chang
Emails	ycchang@sinica.edu.tw

References

Ai, M., J. Yu, H. Zhang, and H. Wang (2021). Optimal subsampling algorithms for big data regressions. Statistica Sinica 31(2), 749–772.
Carlini, N., C. Liu, U. Erlingsson, J. Kos, and D. Song (2019). The secret sharer: Evaluating and testing unintended memorization in neural networks. In Neural Information Processing Systems (NeurIPS), pp. 267–284.
Chang, Y.-c. I. (2011). Sequential estimation in generalized linear models when covariates are subject to errors. Metrika 73, 93–120.
Chang, Y. I. (1999). Strong consistency of maximum quasi-likelihood estimate in generalized linear models via a last time. Statistics & Probability Letters 45, 237–246.
Chen, Z., Z. Wang, and Y. Chang (2020). Sequential adaptive variables and subject selection for GEE methods. Biometrics 76(2), 496–507.
Chen, Z., Z. Wang, and Y. Chang (2023, March). Distributed sequential estimation procedures. Canadian Journal of Statistics 52(1), 271–290.
Chow, Y. and H. Robbins (1965). On the asymptotic theory of fixed-width sequential confidence intervals for the mean. Ann. Math. Statist. 36, 457–462.
Damiani, A., M. Vallati, R. Gatta, N. Dinapoli, A. Jochems, T. Deist, J. v. Soest, A. Dekker, and V. Valentini (2015). Distributed learning to protect privacy in multicentric clinical studies. In Conference on Artificial Intelligence in Medicine in Europe, pp. 65–75. Springer.
Deng, X., V. Joseph, A. Sudjianto, and C. F. Wu (2009). Active learning through sequential design, with applications to detection of money laundering. Journal of the American Statistical Association 104, 969–981.
Feigenbaum, J., Y. Ishai, T. Malkin, K. Nissim, M. J. Strauss, and R. N. Wright (2001). Secure multiparty computation of approximations. In International Colloquium on Automata, Languages, and Programming, pp. 927–938. Springer.
Hassanein, W. A. and M. M. Seyam (2019). Construction of some compound criteria via A-optimality. Communications in Statistics-Theory and Methods 48(22), 5559–5570.
He, L., W. Li, D. Song, and M. Yang (2024). A systematic view of information-based optimal subdata selection: algorithm development, performance evaluation, and application in financial data. Statistica Sinica 34, 611–636.
Hernández-Garduño, E. (2020). Obesity is the comorbidity more strongly associated for COVID-19 in Mexico. A case-control study. Obes Res Clin Pract. 14(4), 375–379.
Huang, L., Y. Yin, Z. Fu, S. Zhang, H. Deng, and D. Liu (2020). LoAdaBoost: Loss-based AdaBoost federated machine learning with reduced computational complexity on IID and non-IID intensive care data. PLoS ONE 15(4), e0230706.
Jones, B., K. Allen-Moyer, and P. Goos (2021). A-optimal versus D-optimal design of screening experiments. Journal of Quality Technology 53(4), 369–382.
Jordan, M., J. Lee, and Y. Yang (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association 114(526), 668–681.
Li, C., P. Zhou, L. Xiong, Q. Wang, and T. Wang (2018). Differentially private distributed online learning. IEEE Transactions on Knowledge and Data Engineering 30(8), 1440–1453.
Li, J., Z. Chen, Z. Wang, and Y. Chang (2020). Active learning in multiple-class classification problems via individualized binary models. Computational Statistics & Data Analysis 145, 106–119.
Limmun, W., J. J. Borkowski, and B. Chomtee (2018). Weighted A-optimality criterion for generating robust mixture designs. Computers & Industrial Engineering 125, 348–356.
Lindell, Y. (2005). Secure multiparty computation for privacy preserving data mining. In Encyclopedia of Data Warehousing and Mining, pp. 1005–1009. IGI Global.
Liu, S., Y. Zhi, and S. Ying (2020). COVID-19 and asthma: Reflection during the pandemic. Clin Rev Allergy Immunol. 59(1), 78–88.
López-Fidalgo, J., M. J. Rivas-López, and B. Fernández-Garzón (2007). A-optimality standardized through the coefficient of variation. Communications in Statistics-Theory and Methods 36(4), 781–792.
Louis, R., D. Calmes, A. Frix, and F. Schleich (2020). COVID-19 and asthma. Rev Med Liege. 75(S1), 130–132.
McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models, 2nd Edition. Chapman & Hall, New York.
McMahan, B., E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017). Communication-efficient learning of deep networks from decentralized data. Artificial intelligence and statistics. PMLR 54, 1273–1282.
Memon, S. and D. Biswas (2022). COVID-19 and diabetes mellitus: from pathophysiology to clinical management. Cureus 14(11), e31895.
Montgomery, D. C. (2009). Design and Analysis of Experiments (7th ed.). Hoboken, NJ, USA: John Wiley & Sons.
Park, E. and Y. Chang (2016). Multiple-stage sampling procedure for covariate-adjusted response-adaptive designs. Statistical Methods in Medical Research 25(4), 1490–1511.
Rashedi, J., P. B. Mahdavi, V. Asgharzadeh, M. Pourostadi, K. H. Samadi, A. Vegari, H. Tayebi-Khosroshahi, and M. Asgharzadeh (2020). Risk factors for COVID-19. Infez Med. 28(4), 469–474.
Settles, B. (2010). Active learning literature survey. University of Wisconsin, Madison 52(55-66), 11.
Smucker, B., M. Krzywinski, and N. Altman (2018). Optimal experimental design. Nature Methods 15(8), 559–560.
van Sluijs, B., R. J. M. Maas, A. J. van der Linden, T. F. A. de Greef, and W. T. S. Huck (2022). A microfluidic optimal experimental design platform for forward design of cell-free genetic networks. Nature Communications 13(1), 3626.
Wang, H., M. Yang, and J. Stufken (2019). Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association 114(525), 393–405.
Wang, H., R. Zhu, and P. Ma (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113(522), 829–844.
Woodroofe, M. (1982). Nonlinear renewal theory in sequential analysis. CBMS-NSF regional conference series in applied mathematics.
Woods, D., S. Lewis, J. Eccleston, and K. Russell (2006). Designs for generalized linear models with several variables and model uncertainty. Technometrics 48(2), 284–292.
Yan, F., S. Sundaram, S. Vishwanathan, and Y. Qi (2013). Distributed autonomous online learning: Regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering 25(11), 2483–2493.
Yao, Y. and H. Wang (2021). A review on optimal subsampling methods for massive datasets. Journal of Data Science 19(1), 151–172.
Yu, J., M. Ai, and Z. Ye (2024). A review on design inspired subsampling for big data. Statistical Papers 65, 467–510.
Yu, J., H. Wang, and M. Ai (2025). A subsampling strategy for AIC-based model averaging with generalized linear models. Technometrics 67(1), 122–132.
Yu, J., H. Wang, M. Ai, and H. Zhang (2022). Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association 117(537), 265–276.
Zhou, X., D. McClish, and N. Obuchowski (2009). Statistical methods in diagnostic medicine. John Wiley & Sons.

Acknowledgments

This research is supported in part by research grants from the National Natural Science Foundation of China (Nos. 12371277 and 12231017) and the National Science and Technology Council of Taiwan (111-2118-M-001-003-MY2). Xinyu Zhang and Yuan-chin Ivan Chang are co-corresponding authors.

Supplementary Materials

The online Supplementary Material contains a detailed proof of the main results and additional numerical results.

Supplementary materials are available for download.
