MDCGen: multidimensional dataset generator for clustering

Félix Iglesias*, Tanja Zseby, Daniel Ferreira, Arthur Zimek

*Corresponding author for this work

Research output: Contribution to journal, Journal article, Research, peer-review

Abstract

We present MDCGen, a tool for generating multidimensional synthetic datasets for testing, evaluating, and benchmarking unsupervised classification algorithms. Our proposal fills a gap observed in previous approaches regarding the underlying distributions used to create multidimensional clusters. As a novelty, normal and non-normal distributions can be combined either to define values independently feature by feature (i.e., multivariate distributions) or to establish overall intra-cluster distances. Highly flexible, parameterizable, and randomizable, MDCGen also implements the features classically sought in such generators: (a) customization of cluster separation, (b) overlap control, (c) addition of outliers and noise, (d) definition of correlated variables and rotations, (e) flexibility to allow or avoid isolation constraints per dimension, (f) creation of subspace clusters and subspace outliers, (g) import of arbitrary distributions for value generation, and (h) dataset quality evaluations, among others. As a result, the proposed tool offers an improved range of potential datasets for more comprehensive testing of clustering algorithms.
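
To make the idea of combining distributions feature by feature more concrete, the following is a minimal sketch in plain Python. It is not MDCGen's code or API: the names `draw` and `make_dataset`, the parameters, and the choice of Gaussian, uniform, and triangular distributions are all assumptions made for illustration. Each cluster gets a random centre, every feature is drawn from its own distribution around that centre, and uniformly scattered background points stand in for outliers and noise.

```python
import random


def draw(rng, kind, centre, spread):
    """Draw one feature value around `centre` (hypothetical helper, not MDCGen)."""
    if kind == "gauss":
        return rng.gauss(centre, spread)
    if kind == "uniform":
        return rng.uniform(centre - spread, centre + spread)
    if kind == "triangular":
        # random.triangular takes (low, high, mode)
        return rng.triangular(centre - spread, centre + spread, centre)
    raise ValueError(f"unknown distribution: {kind}")


def make_dataset(n_clusters, points_per_cluster, feature_kinds,
                 spread=0.03, n_outliers=0, seed=0):
    """Return (points, labels); background points get the label -1.

    `feature_kinds` names one distribution per feature, so normal and
    non-normal distributions can be mixed within the same cluster.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label in range(n_clusters):
        # Assumption: cluster centres are placed uniformly in the unit cube,
        # with no separation or overlap control.
        centre = [rng.random() for _ in feature_kinds]
        for _ in range(points_per_cluster):
            points.append([draw(rng, kind, c, spread)
                           for kind, c in zip(feature_kinds, centre)])
            labels.append(label)
    # Background points drawn uniformly in the unit cube; some may happen
    # to fall inside a cluster, so they are noise rather than strict outliers.
    for _ in range(n_outliers):
        points.append([rng.random() for _ in feature_kinds])
        labels.append(-1)
    return points, labels


points, labels = make_dataset(
    n_clusters=3,
    points_per_cluster=100,
    feature_kinds=["gauss", "uniform", "triangular"],
    n_outliers=10,
    seed=42,
)
```

The sketch deliberately omits most of the features listed in the abstract (separation and overlap control, correlations, rotations, subspace clusters, quality evaluation); it only shows how per-feature distributions and added noise might fit together.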

Original language: English
Journal: Journal of Classification
Number of pages: 20
ISSN: 0176-4268
DOI: 10.1007/s00357-019-9312-3
Publication status: E-pub ahead of print, 23 Apr 2019

Keywords

  • Clustering
  • Dataset generator
  • Synthetic data

Cite this

Iglesias, Félix; Zseby, Tanja; Ferreira, Daniel; Zimek, Arthur. MDCGen: multidimensional dataset generator for clustering. In: Journal of Classification. 2019.
@article{24f11d14aaf14bbdb776405c49f6f5f9,
title = "MDCGen: multidimensional dataset generator for clustering",
keywords = "Clustering, Dataset generator, Synthetic data",
author = "F{\'e}lix Iglesias and Tanja Zseby and Daniel Ferreira and Arthur Zimek",
year = "2019",
month = "4",
day = "23",
doi = "10.1007/s00357-019-9312-3",
language = "English",
journal = "Journal of Classification",
issn = "0176-4268",

}
