Abstract
We perform an extensive experimental evaluation of clustering-based outlier detection methods. These methods offer benefits such as efficiency, the ability to capitalize on more mature evaluation measures, more developed subspace analysis for high-dimensional data, and better explainability, and yet they have so far been neglected in the literature. To our knowledge, our work is the first effort to analytically and empirically study their advantages and disadvantages. Our main goal is to evaluate whether clustering-based techniques can compete in efficiency and effectiveness against the most studied state-of-the-art algorithms in the literature. We consider the quality of the results, the resilience to different types of data and to variations in parameter configuration, the scalability, and the ability to filter out inappropriate parameter values automatically based on internal measures of clustering quality. It has recently been shown that several classic, simple, unsupervised methods surpass many deep learning approaches and hence remain at the state of the art in outlier detection. We therefore study 14 of the best classic unsupervised methods, namely 11 clustering-based methods and 3 non-clustering-based ones, using a consistent parameterization heuristic to identify the pros and cons of each approach. We consider 46 real and synthetic datasets with up to 125k points and 1.5k dimensions, so that the findings are plausible across the broadest possible diversity of real-world use cases. Our results indicate that the clustering-based methods are on par with, if not better than, the non-clustering-based ones, and we argue that clustering-based methods like KMeans−− should be included as baselines in future benchmarking studies, as they often offer competitive quality at a relatively low run time, besides several other benefits.
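To make the idea of a clustering-based outlier detector concrete, the following is a minimal sketch of a k-means-style procedure that sets aside the farthest points before each centroid update, which is our reading of the general idea behind KMeans−−. The abstract does not spell out the algorithm, so the function name `kmeans_with_outliers`, its parameters, and the toy data are all hypothetical, and the code should not be taken as the authors' implementation.

```python
import numpy as np


def kmeans_with_outliers(X, init_centers, n_outliers, n_iter=100):
    """Illustrative sketch (not the paper's code): k-means that ignores the
    n_outliers points farthest from their nearest center when updating centers."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    n = len(X)

    def assign(c):
        # Distance of every point to every center; keep the nearest one.
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        return d.argmin(axis=1), d.min(axis=1)

    for _ in range(n_iter):
        labels, nearest = assign(centers)
        # Mark the n_outliers farthest points so they do not pull the centers.
        inlier = np.ones(n, dtype=bool)
        inlier[np.argsort(nearest)[n - n_outliers:]] = False
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[inlier & (labels == j)]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break

    # Final outlier score: distance to the nearest center.
    labels, scores = assign(centers)
    outliers = np.sort(np.argsort(scores)[n - n_outliers:])
    return centers, labels, scores, outliers


# Toy data (assumed for illustration): two tight groups plus one distant point.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [5, 30]])
centers, labels, scores, outliers = kmeans_with_outliers(
    X, init_centers=[[0, 0], [10, 10]], n_outliers=1)
print(outliers)  # index of the point flagged as an outlier
```

Initial centers are passed in explicitly here so that the sketch is deterministic. A practical baseline would also need an initialization strategy and a way to choose the number of clusters and outliers, which is the kind of parameterization question the study examines.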
Original language | English |
---|---|
Article number | 13 |
Journal | Data Mining and Knowledge Discovery |
Volume | 39 |
Issue number | 2 |
ISSN | 1384-5810 |
DOIs | |
Publication status | Published - Mar 2025 |
Bibliographical note
Publisher Copyright: © The Author(s) 2025.
Keywords
- Clustering-based outlier detection
- Evaluation
- Experimental analysis and comparison