Abstract
Extraction of information from data is critical in the age of data science. The probability density function theoretically provides comprehensive information about the data. In practice, however, different probability density models, whether parametric or nonparametric, often characterize only partial features of the data, owing, for example, to model bias or estimation inefficiency. In this paper we suggest a framework to optimally combine different density models to capture the comprehensive data features through a new information criterion (IC) based unsupervised learning approach. Our information extraction is optimal in the sense that the resulting averaged or selected density minimizes the Kullback–Leibler (KL) information loss function.
Unlike the usual supervised-learning IC for model selection or averaging, our setting first requires deriving an estimator of the KL loss function, which takes the Akaike and Takeuchi information criteria as two special cases. A feasible density model averaging (DMA) procedure is accordingly suggested, with the DMA estimator asymptotically achieving the lowest possible KL loss. Further, we establish the consistency of the DMA weights, which tend to the optimal averaging weights minimizing the KL distance, and derive the convergence rate of the empirical weights. Simulation studies show that the DMA performs better overall and more robustly than commonly used parametric and nonparametric density models, including the kernel, finite mixture, logarithmic scoring rule and selection methods for density estimation in the literature. A real data analysis further demonstrates the performance of the proposed method.
Information
| Preprint No. | SS-2022-0410 |
|---|---|
| Manuscript ID | SS-2022-0410 |
| Authors | Peng Lin, Jun Liao, Zudi Lu, Kang You, Guohua Zou |
| Corresponding Author | Guohua Zou |
| Email | ghzou@amss.ac.cn |
References
- Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B. N. and Csaki, F., editors, Second International Symposium on Information Theory. Akademiai Kiado, Budapest.
- Ando, T. and Li, K. C. (2014). A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, 109:254–265.
- Baimuratov, I., Shichkina, Y., Stankova, E., Zhukova, N., and Than, N. (2019). A Bayesian information criterion for unsupervised learning based on an objective prior. In Computational Science and Its Applications – ICCSA 2019. Springer International Publishing.
- Bickel, P. and Levina, E. (2008). Regularized estimation of large covariance matrices. Annals of Statistics, 36:199–227.
- Chen, J. and Khalili, A. (2008). Order selection in finite mixture models with a nonsmooth penalty. Journal of the American Statistical Association, 103:1674–1683.
- Chen, J., Li, D., Linton, O., and Lu, Z. (2018). Semiparametric ultra-high dimensional model averaging of nonlinear dynamic time series. Journal of the American Statistical Association, 113:919–932.
- Cheng, X. and Hansen, B. E. (2015). Forecasting with factor-augmented regression: A frequentist model averaging approach. Journal of Econometrics, 186:280–293.
- Claeskens, G. and Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press, Cambridge.
- Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32:928–961.
- Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapman and Hall, London.
- Geweke, J. and Amisano, G. (2011). Optimal prediction pools. Journal of Econometrics, 164:130–141.
- Hall, S. G. and Mitchell, J. (2007). Combining density forecasts. International Journal of Forecasting, 23:1–13.
- Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75:1175–1189.
- Hansen, B. E. (2008). Least squares forecast averaging. Journal of Econometrics, 146:342–350.
- Hansen, B. E. (2014). Model averaging, asymptotic risk, and regressor groups. Quantitative Economics, 5:495–530.
- Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics, 190:115–132.
- Hansen, B. E. and Racine, J. (2012). Jackknife model averaging. Journal of Econometrics, 167:38–46.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
- Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association, 98:879–899.
- Ishwaran, H., James, L., and Sun, J. (2001). Bayesian model selection in finite mixtures by marginal density decompositions. Journal of the American Statistical Association, 96:1316–1332.
- Jones, M. C., Marron, J. S., and Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91:401–407.
- Leroux, B. (1992). Consistent estimation of a mixing distribution. Annals of Statistics, 20:1350–1360.
- Li, D., Linton, O., and Lu, Z. (2015). A flexible semiparametric forecasting model for time series. Journal of Econometrics, 187:345–357.
- Liao, J., Zong, X., Zhang, X., and Zou, G. (2019). Model averaging based on leave-subject-out cross-validation for vector autoregressions. Journal of Econometrics, 209:35–60.
- Liu, Q. and Okui, R. (2013). Heteroskedasticity-robust Cp model averaging. The Econometrics Journal, 16:463–472.
- Maggioni, M. and Murphy, J. M. (2019). Learning by unsupervised nonlinear diffusion. Journal of Machine Learning Research, 20:1–56.
- McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
- Safarinejadian, B., Menhaj, M. B., and Karrari, M. (2010). Distributed unsupervised Gaussian mixture learning for density estimation in sensor networks. IEEE Transactions on Instrumentation and Measurement, 59:2250–2260.
- Saumard, A. and Navarro, F. (2021). Finite sample improvement of Akaike’s information criterion. IEEE Transactions on Information Theory, 67:6328–6343.
- Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.
- Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53:683–690.
- Takeuchi, K. (1976). Distribution of informational statistics and a criterion of model fitting. Suri-Kagaku [Mathematical Sciences] (in Japanese), 153:12–18.
- Wan, A. T. K., Zhang, X., and Zou, G. (2010). Least squares model averaging by Mallows criterion. Journal of Econometrics, 156:277–283.
- White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50:1–25.
- Yang, Y. (2000). Mixing strategies for density estimation. The Annals of Statistics, 28:75–87.
- Yang, Y. (2004). Combining forecasting procedures: Some theoretical results. Econometric Theory, 20:176–222.
- Yuan, Z. and Yang, Y. (2005). Combining linear regression models: When and how? Journal of the American Statistical Association, 100:1202–1214.
- Zhang, X., Yu, D., Zou, G., and Liang, H. (2016). Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. Journal of the American Statistical Association, 111:1775–1790.
- Zhang, X., Zou, G., and Carroll, R. (2015). Model averaging based on Kullback–Leibler distance. Statistica Sinica, 25:1583–1598.
Acknowledgments
We thank the editor, associate editor and two referees for their helpful comments and suggestions. This work was partially supported by the National Natural Science Foundation of China (Grant nos. 12001534, 12426308, 12031016 and 71971131). Lu’s work was partially supported by the European Research Agency’s Marie Curie Career Integration Grant (Grant no. PCIG14-GA-2013-631692). Zou’s work was also partially supported by the Beijing Outstanding Young Scientist Program (Grant no. JWZQ20240101027).
Supplementary Materials
The Supplementary Material contains the derivation of tr(Σ12), Lemma 1, the proofs of all theorems, some illustrative examples, explanations of the technical conditions, a discussion of the different density aggregation methods, and the numerical results (Figures 1–8).