Abstract

Extraction of information from data is critical in the age of data science. The probability density function theoretically provides comprehensive information on the data. In practice, however, different probability density models, either parametric or nonparametric, often characterize only partial features of the data, e.g., owing to model bias or inefficiency in estimation. In this paper we suggest a framework to optimally combine different density models to capture the comprehensive data features by a new information criterion (IC) based unsupervised learning approach. Our information extraction is optimal in the sense that the resulting averaged or selected density minimizes the Kullback–Leibler (KL) information loss function. Differently from the usual supervised learning IC for model selection or averaging, we first need to derive an estimator of the KL loss function in our setting, which takes the Akaike and Takeuchi information criteria as two special cases. A feasible density model averaging (DMA) procedure is accordingly suggested, with the DMA estimator achieving the lowest possible KL loss asymptotically. Further, the weights of the DMA estimator are shown to converge to the optimal averaging weights minimizing the KL distance, and the convergence rate of the empirical weights is derived. Simulation studies show that the DMA performs overall better and more robustly than the commonly used parametric or nonparametric density estimation methods in the literature, including kernel, finite mixture, logarithmic scoring rule, and selection methods. The real data analysis further demonstrates the performance of the proposed method.
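As a minimal illustration of the idea behind density model averaging (not the paper's DMA criterion, which is built on an IC-based estimator of the KL loss), the sketch below combines two candidate densities, a parametric Gaussian fit and a nonparametric kernel density estimate, by choosing a simplex weight that maximizes the held-out average log-likelihood; up to a constant not depending on the weight, this is equivalent to minimizing the empirical KL loss. The data-splitting scheme and grid search here are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Toy data: a two-component mixture that neither single candidate fits well.
rng = np.random.default_rng(0)
x = rng.permutation(np.concatenate([rng.normal(-2.0, 1.0, 300),
                                    rng.normal(2.0, 0.5, 100)]))
train, hold = x[:300], x[300:]

# Candidate 1: parametric Gaussian fitted by maximum likelihood.
p1 = stats.norm(loc=train.mean(), scale=train.std(ddof=1)).pdf(hold)
# Candidate 2: nonparametric kernel density estimate.
p2 = stats.gaussian_kde(train)(hold)

# Choose the averaging weight on a grid; maximizing the held-out average
# log-likelihood is, up to a constant, minimizing the empirical KL loss.
grid = np.linspace(0.0, 1.0, 101)
scores = [np.mean(np.log(w * p1 + (1 - w) * p2)) for w in grid]
w_hat = grid[int(np.argmax(scores))]
best = max(scores)

print(f"weight on Gaussian: {w_hat:.2f}, held-out log-score: {best:.4f}")
```

Because the grid includes the endpoints 0 and 1, the averaged density can never score worse on the hold-out set than either candidate alone, which is the practical appeal of averaging over selection.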

Information

Preprint No.: SS-2022-0410
Manuscript ID: SS-2022-0410
Complete Authors: Peng Lin, Jun Liao, Zudi Lu, Kang You, Guohua Zou
Corresponding Author: Guohua Zou
Email: ghzou@amss.ac.cn

References

  1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B. N. and Csaki, F., editors, Second International Symposium on Information Theory. Akademiai Kiado, Budapest.
  2. Ando, T. and Li, K. C. (2014). A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, 109:254–265.
  3. Baimuratov, I., Shichkina, Y., Stankova, E., Zhukova, N., and Than, N. (2019). A Bayesian information criterion for unsupervised learning based on an objective prior. In Computational Science and Its Applications – ICCSA 2019. Springer International Publishing.
  4. Bickel, P. and Levina, E. (2008). Regularized estimation of large covariance matrices. Annals of Statistics, 36:199–227.
  5. Chen, J. and Khalili, A. (2008). Order selection in finite mixture models with a nonsmooth penalty. Journal of the American Statistical Association, 103:1674–1683.
  6. Chen, J., Li, D., Linton, O., and Lu, Z. (2018). Semiparametric ultra-high dimensional model averaging of nonlinear dynamic time series. Journal of the American Statistical Association, 113:919–932.
  7. Cheng, X. and Hansen, B. E. (2015). Forecasting with factor-augmented regression: A frequentist model averaging approach. Journal of Econometrics, 186:280–293.
  8. Claeskens, G. and Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press, Cambridge.
  9. Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32:928–961.
  10. Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman and Hall, London.
  11. Geweke, J. and Amisano, G. (2011). Optimal prediction pools. Journal of Econometrics, 164:130–141.
  12. Hall, S. G. and Mitchell, J. (2007). Combining density forecasts. International Journal of Forecasting, 23:1–13.
  13. Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75:1175–1189.
  14. Hansen, B. E. (2008). Least squares forecast averaging. Journal of Econometrics, 146:342–350.
  15. Hansen, B. E. (2014). Model averaging, asymptotic risk, and regressor groups. Quantitative Economics, 5:495–530.
  16. Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics, 190:115–132.
  17. Hansen, B. E. and Racine, J. (2012). Jackknife model averaging. Journal of Econometrics, 167:38–46.
  18. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
  19. Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association, 98:879–899.
  20. Ishwaran, H., James, L., and Sun, J. (2001). Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association, 96:1316–1332.
  21. Jones, M. C., Marron, J. S., and Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91:401–407.
  22. Leroux, B. (1992). Consistent estimation of a mixing distribution. Annals of Statistics, 20:1350–1360.
  23. Li, D., Linton, O., and Lu, Z. (2015). A flexible semiparametric forecasting model for time series. Journal of Econometrics, 187:345–357.
  24. Liao, J., Zong, X., Zhang, X., and Zou, G. (2019). Model averaging based on leave-subject-out cross-validation for vector autoregressions. Journal of Econometrics, 209:35–60.
  25. Liu, Q. and Okui, R. (2013). Heteroskedasticity-robust Cp model averaging. The Econometrics Journal, 16:463–472.
  26. Maggioni, M. and Murphy, J. M. (2019). Learning by unsupervised nonlinear diffusion. Journal of Machine Learning Research, 20:1–56.
  27. McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
  28. Safarinejadian, B., Menhaj, M. B., and Karrari, M. (2010). Distributed unsupervised Gaussian mixture learning for density estimation in sensor networks. IEEE Transactions on Instrumentation and Measurement, 59:2250–2260.
  29. Saumard, A. and Navarro, F. (2021). Finite sample improvement of Akaike's information criterion. IEEE Transactions on Information Theory, 67:6328–6343.
  30. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.
  31. Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53:683–690.
  32. Takeuchi, K. (1976). Distribution of informational statistics and a criterion of model fitting. Suri-Kagaku [Mathematical Sciences] (in Japanese), 153:12–18.
  33. Wan, A. T. K., Zhang, X., and Zou, G. (2010). Least squares model averaging by Mallows criterion. Journal of Econometrics, 156:277–283.
  34. White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50:1–25.
  35. Yang, Y. (2000). Mixing strategies for density estimation. The Annals of Statistics, 28:75–87.
  36. Yang, Y. (2004). Combining forecasting procedures: Some theoretical results. Econometric Theory, 20:176–222.
  37. Yuan, Z. and Yang, Y. (2005). Combining linear regression models: When and how? Journal of the American Statistical Association, 100:1202–1214.
  38. Zhang, X., Yu, D., Zou, G., and Liang, H. (2016). Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. Journal of the American Statistical Association, 111:1775–1790.
  39. Zhang, X., Zou, G., and Carroll, R. (2015). Model averaging based on Kullback–Leibler distance. Statistica Sinica, 25:1583–1598.

Acknowledgments

We thank the editor, associate editor and two referees for their helpful comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant nos. 12001534, 12426308, 12031016 and 71971131). Lu's work was partially supported by the European Research Agency's Marie Curie Career Integration Grant (Grant no. PCIG14-GA-2013-631692). Zou's work was also partially supported by the Beijing Outstanding Young Scientist Program (Grant no. JWZQ20240101027).

Supplementary Materials

The Supplementary Material contains the derivation of tr(Σ12), Lemma 1, the proofs of all theorems, some illustrative examples, explanations of the technical conditions, a discussion of the different density aggregation methods, and the numerical results (Figures 1–8).
