TY - GEN

T1 - Combining information from distributed evolutionary k-means

AU - Coelho Naldi, Murilo

AU - Campello, Ricardo Jose Gabrielli Barreto

PY - 2012

Y1 - 2012

AB - One of the challenges in clustering is dealing with huge amounts of data, which often requires large data sets to be distributed across separate repositories. However, most clustering techniques require the data to be centralized. One of them, k-means, has been ranked among the most influential data mining algorithms. Although exact distributed versions of the k-means algorithm have been proposed, the algorithm is still sensitive to the selection of the initial cluster prototypes and requires that the number of clusters be specified in advance. This work tackles the problem of generating an approximate model for distributed clustering, based on k-means, for scenarios where the number of clusters in the distributed data is unknown. We propose a collection of algorithms that generate and select k-means clusterings for each distributed subset of the data and combine them afterwards. The variants of the algorithm are compared from two perspectives: a theoretical one, through asymptotic complexity analyses, and an experimental one, through a comparative evaluation of results obtained from a collection of experiments and statistical tests.

KW - clustering

KW - distributed data sets

KW - k-means

U2 - 10.1109/SBRN.2012.11

DO - 10.1109/SBRN.2012.11

M3 - Article in proceedings

AN - SCOPUS:84873146620

SN - 9780769548234

T3 - Proceedings - Brazilian Symposium on Neural Networks, SBRN

SP - 43

EP - 48

BT - Proceedings - 2012 Brazilian Conference on Neural Networks, SBRN 2012

PB - IEEE

T2 - 2012 Brazilian Conference on Neural Networks, SBRN 2012

Y2 - 20 October 2012 through 25 October 2012

ER -