Abstract
In regression with unsupervised clustering, the explanatory variables
are first clustered, and separate regression models are then built for each cluster.
The resulting models are often evaluated using in-cluster prediction criteria, such
as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). This paper explores the usefulness of out-of-cluster prediction for
evaluating regression models, particularly for selecting the number of clusters. Specifically,
we develop a model exclusion procedure that exploits the reduced
accuracy of out-of-cluster prediction relative to in-cluster prediction, under
the assumption that the regression models differ across clusters, to exclude redundant models before model selection is applied. The model exclusion procedure is
considered within standard regression frameworks, including generalized linear
and Cox regression models. For Cox regression models, we propose a normalized
partial log-likelihood to avoid divergence issues that arise when the standard partial log-likelihood is used for model selection. We show that selecting the number
of clusters using AIC, combined with the proposed model exclusion procedure,
achieves model selection consistency. We confirm the improved performance of
the proposed exclusion procedure through extensive simulation studies involving Gaussian linear, logistic, and Cox regression models combined with K-means
clustering.
Key words and phrases: AIC, out-of-cluster prediction, normalized partial log-likelihood, regression with unsupervised clustering, model selection
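As a rough illustration of the cluster-then-regress setup the abstract describes, the following is a minimal sketch under simplifying assumptions (Gaussian linear models, K-means on the covariates, plain in-cluster AIC); all function names and the toy data are illustrative and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm on the explanatory variables."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def fit_ols(X, y):
    """OLS with intercept; returns coefficients and MLE residual variance."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return beta, max(resid @ resid / len(y), 1e-12)

def gauss_loglik(X, y, beta, sigma2):
    """Gaussian log-likelihood of (X, y) under a fitted linear model."""
    Z = np.column_stack([np.ones(len(X)), X])
    resid = y - Z @ beta
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2

def aic_for_k(X, y, k):
    """Cluster the covariates, fit one Gaussian linear model per cluster,
    and sum in-cluster log-likelihoods; AIC penalizes the total parameter count."""
    labels = kmeans(X, k)
    total_ll, n_par = 0.0, 0
    for j in range(k):
        idx = labels == j
        if idx.sum() < 3:  # guard against degenerate tiny clusters
            continue
        beta, s2 = fit_ols(X[idx], y[idx])
        total_ll += gauss_loglik(X[idx], y[idx], beta, s2)
        n_par += len(beta) + 1  # regression coefficients plus residual variance
    return -2.0 * total_ll + 2.0 * n_par

# Toy data: two well-separated covariate clusters with opposite slopes.
X = np.vstack([rng.normal(-3, 1, (100, 1)), rng.normal(3, 1, (100, 1))])
y = np.concatenate([2 * X[:100, 0], -2 * X[100:, 0]]) + rng.normal(0, 0.5, 200)
aics = {k: aic_for_k(X, y, k) for k in (1, 2, 3)}
print(aics)  # AIC favors more than one cluster when the models truly differ

# The exclusion idea rests on out-of-cluster prediction being worse than
# in-cluster prediction when per-cluster models differ: evaluate each fitted
# model on the other cluster's data and compare log-likelihoods.
labels = kmeans(X, 2)
models = [fit_ols(X[labels == j], y[labels == j]) for j in (0, 1)]
in_ll = sum(gauss_loglik(X[labels == j], y[labels == j], *models[j]) for j in (0, 1))
out_ll = sum(gauss_loglik(X[labels == j], y[labels == j], *models[1 - j]) for j in (0, 1))
print(in_ll > out_ll)
```

Here the two clusters carry slopes +2 and -2, so swapping models across clusters degrades the log-likelihood sharply; the paper's procedure uses this gap to flag and exclude redundant per-cluster models before model selection.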
Information
| Preprint No. | SS-2025-0466 |
|---|---|
| Manuscript ID | SS-2025-0466 |
| Complete Authors | Masao Ueki |
| Corresponding Authors | Masao Ueki |
| Emails | uekimrsd@nifty.com |
References
- Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the 2nd International Symposium on Information Theory, Akadémiai Kiadó, Budapest, 267–281.
- Batool, F. and C. Hennig (2021). Clustering with the average silhouette width. Computational Statistics & Data Analysis 158, 107190.
- Chen, J. and Z. Chen (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95(3), 759–771.
- Choi, M. Y., I. Chen, A. E. Clarke, M. J. Fritzler, K. A. Buhler, M. Urowitz, J. Hanly, Y. StPierre, C. Gordon, S.-C. Bae, J. Romero-Diaz, J. Sanchez-Guerrero, S. Bernatsky, D. J. Wallace, D. A. Isenberg, A. Rahman, J. T. Merrill, P. R. Fortin, D. D. Gladman, I. N. Bruce, M. Petri, E. M. Ginzler, M. A. Dooley, R. Ramsey-Goldman, S. Manzi, A. Jönsen, G. S. Alarcón, R. F. van Vollenhoven, C. Aranow, M. Mackay, G. Ruiz-Irastorza, S. Lim, M. Inanc, K. Kalunian, S. Jacobsen, C. Peschken, D. L. Kamen, A. Askanase, J. P. Buyon, D. Sontag, and K. H. Costenbader (2023). Machine learning identifies clusters of longitudinal autoantibody profiles predictive of systemic lupus erythematosus disease outcomes. Annals of the Rheumatic Diseases 82(7), 927–936.
- Grün, B. and F. Leisch (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis 51(11), 5247–5252.
- Katahira, K. (2023). Evaluating the predictive performance of subtyping: A criterion for cluster mean-based prediction. Statistics in Medicine 42(7), 1045–1065.
- Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11(8).
- Li, R., J.-J. Ren, G. Yang, and Y. Yu (2017). Asymptotic behavior of Cox’s partial likelihood and its application to variable selection. Statistica Sinica 28(4), 2713–2731.
- Mallows, C. L. (1973). Some comments on Cp. Technometrics 15(4), 661.
- Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850.
- Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65.
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
- Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica 7(2), 221–242.
- Teng, H.-W., M.-H. Kang, I.-H. Lee, and L.-C. Bai (2024). Bridging accuracy and interpretability: A rescaled cluster-then-predict approach for enhanced credit scoring. International Review of Financial Analysis 91, 103005.
- Therneau, T. M. (2024). survival: Survival Analysis. R package version 3.8-3.
- Tibshirani, R., G. Walther, and T. Hastie (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B: Statistical Methodology 63(2), 411–423.
- Ueki, M. (2025). A deflation-adjusted Bayesian information criterion for selecting the number of clusters in K-means clustering. Computational Statistics & Data Analysis 209, 108170.
- Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92(4), 937–950.
Acknowledgments
This work was partially supported by JSPS KAKENHI Grant Numbers
23K11009 and 26K14742. During the preparation of this work the author
used ChatGPT-5 in order to improve the readability and language of the
manuscript. After using this service, the author reviewed and edited the
content as needed and takes full responsibility for the content of the published article.
Supplementary Materials
The Supplementary Materials include the Supplementary Appendix, Tables, and Figures.