Quotation Grün, Bettina, Malsiner-Walli, Gertraud, Frühwirth-Schnatter, Sylvia. 2022. How many data clusters are in the Galaxy data set? Bayesian cluster analysis in action. ADAC - Advances in Data Analysis and Classification. 16 (2), 325-349.




In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (Stat Model 1:287–304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach because the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and the conclusions drawn. The aim of this paper is to address Aitkin’s concerns about the Bayesian approach by shedding light on how the specified priors influence the number of estimated clusters. We perform a sensitivity analysis of different prior specifications for the mixture of finite mixtures model, i.e., the mixture model where a prior on the number of components is included. We use an extensive set of different prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. Results highlight the interaction effects of the prior specifications and provide insights into which prior specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes restraints on the use of Bayesian methods that stem from the complexity of selecting suitable priors. Also, the regularizing properties of the priors may be intentionally exploited to obtain a suitable clustering solution meeting prior expectations and needs of the application.


Status of publication Published
Affiliation WU
Type of publication Journal article
Journal ADAC - Advances in Data Analysis and Classification
WU-Journal-Rating new FIN-A
Language English
Title How many data clusters are in the Galaxy data set? Bayesian cluster analysis in action
Volume 16
Number 2
Year 2022
Page from 325
Page to 349
Reviewed? Y
URL https://link.springer.com/article/10.1007%2Fs11634-021-00461-8
DOI https://doi.org/10.1007/s11634-021-00461-8
Open Access Y
Open Access Link https://link.springer.com/article/10.1007%2Fs11634-021-00461-8


Shrinking and Regularizing Finite Mixture Models
WU project: High-Dimensional Bayesian Gaussian Mixture Modeling
Grün, Bettina
Malsiner-Walli, Gertraud
Frühwirth-Schnatter, Sylvia
Institute for Statistics and Mathematics IN
Research areas (ÖSTAT Classification 'Statistik Austria')
1105 Computer software
1113 Mathematical statistics
5701 Applied statistics
