TY - GEN
T1 - Efficient computation of multiple density-based clustering hierarchies
AU - Neto, Antonio Cavalcante Araujo
AU - Sander, Joerg
AU - Campello, Ricardo J.G.B.
AU - Nascimento, Mario A.
N1 - Publisher Copyright:
PY - 2017/12/15
Y1 - 2017/12/15
N2 - HDBSCAN∗, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter mpts. While the performance of HDBSCAN∗ is robust w.r.t. mpts, choosing a 'good' value for it can be challenging: depending on the data distribution, a high or low value for mpts may be more appropriate, and certain data clusters may reveal themselves at different values of mpts. To explore results for a range of mpts, one has to run HDBSCAN∗ for each value in the range independently, which is computationally inefficient. In this paper we propose an efficient approach to compute all HDBSCAN∗ hierarchies for a range of mpts by replacing the graph used by HDBSCAN∗ with a much smaller graph that is guaranteed to contain the required information. Our experiments show that our approach can obtain, for example, over one hundred hierarchies for a cost equivalent to running HDBSCAN∗ about 2 times. In fact, this speedup tends to increase with the number of hierarchies to be computed.
AB - HDBSCAN∗, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter mpts. While the performance of HDBSCAN∗ is robust w.r.t. mpts, choosing a 'good' value for it can be challenging: depending on the data distribution, a high or low value for mpts may be more appropriate, and certain data clusters may reveal themselves at different values of mpts. To explore results for a range of mpts, one has to run HDBSCAN∗ for each value in the range independently, which is computationally inefficient. In this paper we propose an efficient approach to compute all HDBSCAN∗ hierarchies for a range of mpts by replacing the graph used by HDBSCAN∗ with a much smaller graph that is guaranteed to contain the required information. Our experiments show that our approach can obtain, for example, over one hundred hierarchies for a cost equivalent to running HDBSCAN∗ about 2 times. In fact, this speedup tends to increase with the number of hierarchies to be computed.
KW - Clustering
KW - HDBSCAN
KW - Hierarchical Clustering
KW - Relative Neighborhood Graph
U2 - 10.1109/ICDM.2017.127
DO - 10.1109/ICDM.2017.127
M3 - Article in proceedings
AN - SCOPUS:85043998052
T3 - Proceedings of the IEEE International Conference on Data Mining, ICDM
SP - 991
EP - 996
BT - Proceedings - 17th IEEE International Conference on Data Mining, ICDM 2017
A2 - Karypis, George
A2 - Alu, Srinivas
A2 - Raghavan, Vijay
A2 - Wu, Xindong
A2 - Miele, Lucio
PB - IEEE
T2 - 17th IEEE International Conference on Data Mining, ICDM 2017
Y2 - 18 November 2017 through 21 November 2017
ER -