Abstract
In regression with unsupervised clustering, the explanatory variables
are first clustered, and separate regression models are then built for each cluster.
The resulting models are often evaluated using in-cluster prediction criteria, such
as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). This paper explores the usefulness of out-of-cluster prediction for
evaluating regression models, particularly for selecting the number of clusters. Specifically,
we develop a model exclusion procedure that exploits the reduced
accuracy of out-of-cluster prediction relative to in-cluster prediction, under
the assumption that the regression models differ across clusters, to exclude redundant models before model selection is applied. The model exclusion procedure is
considered within standard regression frameworks, including generalized linear
and Cox regression models. For Cox regression models, we propose a normalized
partial log-likelihood to avoid divergence issues that arise when the standard partial log-likelihood is used for model selection. We show that selecting the number
of clusters using AIC, combined with the proposed model exclusion procedure,
achieves model selection consistency. We confirm the improved performance of
the proposed exclusion procedure through extensive simulation studies involving Gaussian linear, logistic, and Cox regression models combined with K-means
clustering.
Key words and phrases: AIC, out-of-cluster prediction, normalized partial log-likelihood, regression with unsupervised clustering, model selection
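As a rough illustration of the cluster-then-regress setup the abstract describes, the following is a minimal sketch under simplifying assumptions (Gaussian linear models, K-means on the covariates, plain in-cluster AIC); all function names and the toy data are illustrative and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm on the explanatory variables."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def fit_ols(X, y):
    """OLS with intercept; returns coefficients and MLE residual variance."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return beta, max(resid @ resid / len(y), 1e-12)

def gauss_loglik(X, y, beta, sigma2):
    """Gaussian log-likelihood of (X, y) under a fitted linear model."""
    Z = np.column_stack([np.ones(len(X)), X])
    resid = y - Z @ beta
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2

def aic_for_k(X, y, k):
    """Cluster the covariates, fit one Gaussian linear model per cluster,
    and sum in-cluster log-likelihoods; AIC penalizes the total parameter count."""
    labels = kmeans(X, k)
    total_ll, n_par = 0.0, 0
    for j in range(k):
        idx = labels == j
        if idx.sum() < 3:  # guard against degenerate tiny clusters
            continue
        beta, s2 = fit_ols(X[idx], y[idx])
        total_ll += gauss_loglik(X[idx], y[idx], beta, s2)
        n_par += len(beta) + 1  # regression coefficients plus residual variance
    return -2.0 * total_ll + 2.0 * n_par

# Toy data: two well-separated covariate clusters with opposite slopes.
X = np.vstack([rng.normal(-3, 1, (100, 1)), rng.normal(3, 1, (100, 1))])
y = np.concatenate([2 * X[:100, 0], -2 * X[100:, 0]]) + rng.normal(0, 0.5, 200)
aics = {k: aic_for_k(X, y, k) for k in (1, 2, 3)}
print(aics)  # AIC favors more than one cluster when the models truly differ

# The exclusion idea rests on out-of-cluster prediction being worse than
# in-cluster prediction when per-cluster models differ: evaluate each fitted
# model on the other cluster's data and compare log-likelihoods.
labels = kmeans(X, 2)
models = [fit_ols(X[labels == j], y[labels == j]) for j in (0, 1)]
in_ll = sum(gauss_loglik(X[labels == j], y[labels == j], *models[j]) for j in (0, 1))
out_ll = sum(gauss_loglik(X[labels == j], y[labels == j], *models[1 - j]) for j in (0, 1))
print(in_ll > out_ll)
```

Here the two clusters carry slopes +2 and -2, so swapping models across clusters degrades the log-likelihood sharply; the paper's procedure uses this gap to flag and exclude redundant per-cluster models before model selection.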
Information
| Preprint No. | SS-2025-0466 |
|---|---|
| Manuscript ID | SS-2025-0466 |
| Complete Authors | Masao Ueki |
| Corresponding Authors | Masao Ueki |
| Emails | uekimrsd@nifty.com |
References
- Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory, Akadémiai Kiadó, Budapest, 267–281.
- Batool, F. and C. Hennig (2021). Clustering with the average silhouette width. Computational Statistics & Data Analysis 158, 107190.
- Chen, J. and Z. Chen (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95(3), 759–771.
- Choi, M. Y., I. Chen, A. E. Clarke, M. J. Fritzler, K. A. Buhler, M. Urowitz, J. Hanly, Y. StPierre, C. Gordon, S.-C. Bae, J. Romero-Diaz, J. Sanchez-Guerrero, S. Bernatsky, D. J. Wallace, D. A. Isenberg, A. Rahman, J. T. Merrill, P. R. Fortin, D. D. Gladman, I. N. Bruce, M. Petri, E. M. Ginzler, M. A. Dooley, R. Ramsey-Goldman, S. Manzi, A. Jönsen, G. S. Alarcón, R. F. van Vollenhoven, C. Aranow, M. Mackay, G. Ruiz-Irastorza, S. Lim, M. Inanc, K. Kalunian, S. Jacobsen, C. Peschken, D. L. Kamen, A. Askanase, J. P. Buyon, D. Sontag, and K. H. Costenbader (2023). Machine learning identifies clusters of longitudinal autoantibody profiles predictive of systemic lupus erythematosus disease outcomes. Annals of the Rheumatic Diseases 82(7), 927–936.
- Grün, B. and F. Leisch (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis 51(11), 5247–5252.
- Katahira, K. (2023). Evaluating the predictive performance of subtyping: A criterion for cluster mean-based prediction. Statistics in Medicine 42(7), 1045–1065.
- Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11(8).
- Li, R., J.-J. Ren, G. Yang, and Y. Yu (2017). Asymptotic behavior of Cox’s partial likelihood and its application to variable selection. Statistica Sinica 28(4), 2713–2731.
- Mallows, C. L. (1973). Some comments on Cp. Technometrics 15(4), 661.
- Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850.
- Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65.
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
- Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica 7(2), 221–242.
- Teng, H.-W., M.-H. Kang, I.-H. Lee, and L.-C. Bai (2024). Bridging accuracy and interpretability: A rescaled cluster-then-predict approach for enhanced credit scoring. International Review of Financial Analysis 91, 103005.
- Therneau, T. M. (2024). survival: Survival Analysis. R package version 3.8-3.
- Tibshirani, R., G. Walther, and T. Hastie (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B: Statistical Methodology 63(2), 411–423.
- Ueki, M. (2025). A deflation-adjusted Bayesian information criterion for selecting the number of clusters in K-means clustering. Computational Statistics & Data Analysis 209, 108170.
- Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92(4), 937–950.
Acknowledgments
This work was partially supported by JSPS KAKENHI Grant Numbers
23K11009 and 26K14742. During the preparation of this work the author
used ChatGPT-5 in order to improve the readability and language of the
manuscript. After using this service, the author reviewed and edited the
content as needed and takes full responsibility for the content of the published article.
Supplementary Materials
The Supplementary Materials include the Supplementary Appendix, Tables, and Figures.