Subsampling for efficient and effective unsupervised outlier detection ensembles

Arthur Zimek, Matthew Gaudet, Ricardo J.G.B. Campello, Jörg Sander

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

Outlier detection and ensemble learning are well-established research directions in data mining, yet the application of ensemble techniques to outlier detection has rarely been studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample can, besides inducing diversity, under certain conditions already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples further improves the results. While the intuition that ensembles improve over single outlier detectors has so far merely been transferred from the classification literature, we also justify analytically why ensembles are expected to work in the unsupervised area of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than a single outlier detector on the complete data.
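
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the base detector (distance to the k-th nearest neighbor), the averaging of scores, and the names `subsample_ensemble_scores`, `sample_rate`, `n_members`, `k`, and `seed` are all assumptions chosen for illustration. Each ensemble member draws a subsample without replacement, scores every point of the full dataset against that subsample, and the member scores are then averaged.

```python
import math
import random


def subsample_ensemble_scores(data, sample_rate=0.1, n_members=25, k=3, seed=0):
    """Average kNN-distance outlier scores over several random subsamples.

    Illustrative sketch only (hypothetical names and defaults).
    `data` is a list of equal-length numeric tuples; a larger score
    means the point looks more outlying.
    """
    n = len(data)
    if n < k + 1:
        raise ValueError("need at least k + 1 points")
    rng = random.Random(seed)
    # Subsample size: at least k + 1 so that k neighbors remain
    # even when the query point itself is part of the subsample.
    m = min(n, max(k + 1, round(sample_rate * n)))
    totals = [0.0] * n
    for _ in range(n_members):
        # Draw one subsample without replacement (one ensemble member).
        sample_idx = rng.sample(range(n), m)
        for i, point in enumerate(data):
            # Distances from this point to the subsample, excluding itself.
            dists = sorted(
                math.dist(point, data[j]) for j in sample_idx if j != i
            )
            # Member score: distance to the k-th nearest sampled neighbor.
            totals[i] += dists[k - 1]
    # Ensemble score: mean of the member scores.
    return [total / n_members for total in totals]
```

Calling `subsample_ensemble_scores(points, sample_rate=0.1, n_members=10)` on a list of coordinate tuples returns one score per point. Under the assumption of brute-force neighbor search, each member costs roughly `n * sample_rate * n` distance computations instead of `n * n`, so in this sketch the ensemble is cheaper than a single run on the full data whenever `n_members * sample_rate` is below 1.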

Original language: English
Title of host publication: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Editors: Rayid Ghani, Ted E. Senator, Paul Bradley, Rajesh Parekh, Jingrui He
Publisher: Association for Computing Machinery
Publication date: 11. Aug 2013
Pages: 428-436
ISBN (Electronic): 978-1-4503-2174-7
DOIs: https://doi.org/10.1145/2487575.2487676
Publication status: Published - 11. Aug 2013
Externally published: Yes
Event: 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Chicago, United States
Duration: 11. Aug 2013 → 14. Aug 2013

Conference

Conference: 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country: United States
City: Chicago
Period: 11/08/2013 → 14/08/2013
Sponsor: ACM KDD.org, ACM SIGMOD

Fingerprint

detectors
data mining
learning

Keywords

  • Ensemble
  • Outlier detection

Cite this

Zimek, A., Gaudet, M., Campello, R. J. G. B., & Sander, J. (2013). Subsampling for efficient and effective unsupervised outlier detection ensembles. In R. Ghani, T. E. Senator, P. Bradley, R. Parekh, & J. He (Eds.), Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 428-436). Association for Computing Machinery. https://doi.org/10.1145/2487575.2487676
Zimek, Arthur ; Gaudet, Matthew ; Campello, Ricardo J.G.B. ; Sander, Jörg. / Subsampling for efficient and effective unsupervised outlier detection ensembles. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. editor / Rayid Ghani ; Ted E. Senator ; Paul Bradley ; Rajesh Parekh ; Jingrui He. Association for Computing Machinery, 2013. pp. 428-436
@inproceedings{dec85cda638646ffaa84d0af3cb109d8,
title = "Subsampling for efficient and effective unsupervised outlier detection ensembles",
abstract = "Outlier detection and ensemble learning are well established research directions in data mining yet the application of ensemble techniques to outlier detection has been rarely studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample per se, besides inducing diversity, can, under certain conditions, already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples is further improving the results. While in the literature so far the intuition that ensembles improve over single outlier detectors has just been transferred from the classification literature, here we also justify analytically why ensembles are also expected to work in the unsupervised area of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than just the single outlier detector on the complete data.",
keywords = "Ensemble, Outlier detection",
author = "Arthur Zimek and Matthew Gaudet and Campello, {Ricardo J.G.B.} and J{\"o}rg Sander",
year = "2013",
month = "8",
day = "11",
doi = "10.1145/2487575.2487676",
language = "English",
pages = "428--436",
editor = "Rayid Ghani and Senator, {Ted E.} and Paul Bradley and Rajesh Parekh and Jingrui He",
booktitle = "Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",
publisher = "Association for Computing Machinery",
address = "United States",

}

Zimek, A, Gaudet, M, Campello, RJGB & Sander, J 2013, Subsampling for efficient and effective unsupervised outlier detection ensembles. in R Ghani, TE Senator, P Bradley, R Parekh & J He (eds), Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, pp. 428-436, 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, United States, 11/08/2013. https://doi.org/10.1145/2487575.2487676

Subsampling for efficient and effective unsupervised outlier detection ensembles. / Zimek, Arthur; Gaudet, Matthew; Campello, Ricardo J.G.B.; Sander, Jörg.

Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ed. / Rayid Ghani; Ted E. Senator; Paul Bradley; Rajesh Parekh; Jingrui He. Association for Computing Machinery, 2013. p. 428-436.

TY - GEN

T1 - Subsampling for efficient and effective unsupervised outlier detection ensembles

AU - Zimek, Arthur

AU - Gaudet, Matthew

AU - Campello, Ricardo J.G.B.

AU - Sander, Jörg

PY - 2013/8/11

Y1 - 2013/8/11

AB - Outlier detection and ensemble learning are well established research directions in data mining yet the application of ensemble techniques to outlier detection has been rarely studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample per se, besides inducing diversity, can, under certain conditions, already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples is further improving the results. While in the literature so far the intuition that ensembles improve over single outlier detectors has just been transferred from the classification literature, here we also justify analytically why ensembles are also expected to work in the unsupervised area of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than just the single outlier detector on the complete data.

KW - Ensemble

KW - Outlier detection

U2 - 10.1145/2487575.2487676

DO - 10.1145/2487575.2487676

M3 - Article in proceedings

AN - SCOPUS:85015249301

SP - 428

EP - 436

BT - Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

A2 - Ghani, Rayid

A2 - Senator, Ted E.

A2 - Bradley, Paul

A2 - Parekh, Rajesh

A2 - He, Jingrui

PB - Association for Computing Machinery

ER -

Zimek A, Gaudet M, Campello RJGB, Sander J. Subsampling for efficient and effective unsupervised outlier detection ensembles. In Ghani R, Senator TE, Bradley P, Parekh R, He J, editors, Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery. 2013. p. 428-436 https://doi.org/10.1145/2487575.2487676