Abstract

This paper investigates estimation and inference problems for support vector machine coefficients with high-dimensional streaming data. To handle the non-smooth hinge loss, the density convolution technique is adopted. We first propose an online lasso estimator obtained by optimizing a surrogate loss function. Instead of using the complete historical data, the surrogate loss function uses a renewable quadratic form to approximate the historical information. As a result, the estimation procedure requires only the newly arrived data and limited historical summary statistics, which can be updated in an online manner. We derive theoretical error bounds for the proposed online lasso estimator under mild conditions. To eliminate the inherent bias of the lasso estimator, we further propose an online debiased lasso estimator and construct a valid inference procedure, establishing the asymptotic normality of the debiased estimator. To compute the proposed online lasso estimators, we employ the proximal gradient descent algorithm, which is feasible for high-dimensional models and computationally efficient. Extensive simulation studies and a real-data analysis illustrate the finite-sample performance of the proposed methods.
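The two computational ingredients named in the abstract can be sketched in code. The snippet below is an illustrative sketch only, not the authors' exact procedure: it uses a uniform kernel as one possible choice for the density-convoluted (smoothed) hinge loss, and plain ISTA (proximal gradient descent with soft-thresholding) applied to a generic ℓ1-penalized quadratic surrogate 0.5·βᵀAβ − bᵀβ + λ‖β‖₁, where the names `prox_grad_lasso`, `A`, and `b` are hypothetical stand-ins for the paper's renewable summary statistics.

```python
import numpy as np

def smoothed_hinge(u, h=0.5):
    """Hinge loss max(0, 1 - u) convolved with a uniform density on
    [-h, h] (an illustrative kernel choice, not the paper's); the
    result is differentiable in u for any h > 0."""
    u = np.asarray(u, dtype=float)
    out = np.where(u <= 1 - h, 1.0 - u, 0.0)            # linear region
    mid = (u > 1 - h) & (u < 1 + h)                      # smoothed kink
    return np.where(mid, (1 + h - u) ** 2 / (4 * h), out)

def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_lasso(A, b, lam, n_iter=500):
    """Minimize 0.5 * beta' A beta - b' beta + lam * ||beta||_1 by ISTA.
    Each update touches only the summaries (A, b), never raw historical
    data, mirroring the online-updating idea."""
    beta = np.zeros(A.shape[0])
    step = 1.0 / np.linalg.norm(A, 2)                    # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = A @ beta - b                              # gradient of quadratic part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy check: recover a sparse vector from noise-free quadratic summaries.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
beta_true = np.zeros(10)
beta_true[:2] = [1.5, -2.0]
y = X @ beta_true
A, b = X.T @ X / 200, X.T @ y / 200
beta_hat = prox_grad_lasso(A, b, lam=0.05)
```

In a streaming setting only the (A, b)-type summaries would be accumulated batch by batch, so memory does not grow with the number of data batches.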

Information

Preprint No.: SS-2025-0083
Manuscript ID: SS-2025-0083
Complete Authors: Haochen Rao, Xu Guo, Heng Lian, Haobo Qi
Corresponding Author: Haobo Qi
Email: haobo4869@bnu.edu.cn


Acknowledgments

We would like to thank the Editor, the Associate Editor, and the two anonymous reviewers for their valuable comments and constructive suggestions, which led to significant improvements in the paper. Guo's research is supported by the National Key R&D Program of China (grant No. 2023YFA1011100), the National Natural Science Foundation of China (grant No. 12322112), and the Fundamental Research Funds for the Central Universities.

Supplementary Materials

The supplementary materials contain the proofs of the theoretical results in Section 3, as well as additional theoretical and simulation results.


Supplementary materials are available for download.