Model-Based Clustering with HDBSCAN*

Michael Strobl*, Jörg Sander, Ricardo J.G.B. Campello, Osmar Zaïane

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

We propose an efficient model-based clustering approach for creating Gaussian Mixture Models from finite datasets. Models are extracted from HDBSCAN* hierarchies using the Classification Likelihood and the Expectation Maximization algorithm. Prior knowledge of the number of components of the model, corresponding to the number of clusters, is not necessary; the number can be determined dynamically. Due to the relatively small hierarchies created by HDBSCAN* compared to previous approaches, this can be done efficiently. The lower the number of objects in a dataset, the more difficult it is to accurately estimate the number of parameters of a fully unrestricted Gaussian Mixture Model. Therefore, more parsimonious models can be created by our algorithm, if necessary. The user has a choice of two information criteria for model selection, as well as a likelihood test using unseen data, in order to select the best-fitting model. We compare our approach to two baselines and show its superiority in two settings: recovering the original data-generating distribution and partitioning the data correctly. Furthermore, we show that our approach is robust to its hyperparameter settings. (Data and code are publicly available at: https://github.com/mjstrobl/HCEM.)
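
To illustrate why a fully unrestricted Gaussian Mixture Model is hard to estimate from few objects, the following minimal sketch counts the free parameters under three common covariance restrictions and scores a fit with two widely used information criteria. It is not taken from the authors' code: the abstract does not name its two criteria, so BIC and AIC are assumed here purely as examples, and the function names `gmm_param_count`, `bic`, and `aic` are hypothetical.

```python
import math


def gmm_param_count(k, d, covariance="full"):
    """Free parameters of a k-component Gaussian mixture in d dimensions.

    The covariance options are illustrative restrictions (an assumption),
    not necessarily the parsimonious models used in the paper.
    """
    weights = k - 1              # mixing proportions sum to one
    means = k * d                # one d-dimensional mean per component
    if covariance == "full":
        cov = k * d * (d + 1) // 2   # symmetric d x d matrix per component
    elif covariance == "diag":
        cov = k * d                  # one variance per dimension
    elif covariance == "spherical":
        cov = k                      # one shared variance per component
    else:
        raise ValueError(f"unknown covariance type: {covariance}")
    return weights + means + cov


def bic(log_likelihood, n_params, n_objects):
    """Bayesian Information Criterion (lower is better)."""
    return n_params * math.log(n_objects) - 2.0 * log_likelihood


def aic(log_likelihood, n_params):
    """Akaike Information Criterion (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


if __name__ == "__main__":
    # Parameter counts for 3 components in 10 dimensions.
    for cov_type in ("full", "diag", "spherical"):
        print(cov_type, gmm_param_count(3, 10, cov_type))
```

With 3 components in 10 dimensions, the unrestricted model has 197 free parameters, against 62 for diagonal and 35 for spherical covariances. Under criteria of this kind, a restricted model can be preferred on a small dataset even when its likelihood is somewhat lower.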

Original language: English
Title of host publication: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2020, Proceedings
Editors: Frank Hutter, Kristian Kersting, Jefrey Lijffijt, Isabel Valera
Publisher: Springer
Publication date: 2021
Pages: 364-379
ISBN (Print): 9783030676605
Publication status: Published - 2021
Externally published: Yes
Event: European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2020 - Virtual, Online
Duration: 14 Sept 2020 to 18 Sept 2020

Conference

Conference: European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2020
City: Virtual, Online
Period: 14/09/2020 to 18/09/2020
Series: Lecture Notes in Computer Science
Volume: 12458
ISSN: 0302-9743

Bibliographical note

Publisher Copyright:
© 2021, Springer Nature Switzerland AG.

Keywords

  • Expectation maximization
  • Hierarchical clustering
  • Model selection
