TY - GEN
T1 - Adaptive grid-based k-median clustering of streaming data with accuracy guarantee
AU - Cao, Jianneng
AU - Zhou, Yongluan
AU - Wu, Min
N1 - Best Paper Award
PY - 2015
Y1 - 2015
N2 - Data stream clustering has wide applications, such as online financial transactions, telephone records, and network monitoring. Grid-based clustering partitions stream data into cells, derives statistical information of the cells, and then applies clustering on these much smaller statistical information without referring to the input data. Therefore, grid-based clustering is efficient and very suitable for high-throughput data streams, which are continuous, time-varying, and possibly unpredictable. Various grid-based clustering schemes have been proposed. However, to the best of our knowledge, none of them provides an accuracy guarantee for their clustering output. To fill this gap, in this paper we study grid-based k-median clustering. We first develop an accuracy guarantee on the cost difference between grid-based solution and the optimum. Based on the theoretical analysis, we then propose a general and adaptive solution, which partitions stream data into cells of dynamically determined granularity and runs k-median clustering on the statistical information of cells with an accuracy guarantee. An extensive experiment over three real datasets clearly shows that our solution provides high-quality clustering outputs in an efficient way.
AB - Data stream clustering has wide applications, such as online financial transactions, telephone records, and network monitoring. Grid-based clustering partitions stream data into cells, derives statistical information of the cells, and then applies clustering on these much smaller statistical information without referring to the input data. Therefore, grid-based clustering is efficient and very suitable for high-throughput data streams, which are continuous, time-varying, and possibly unpredictable. Various grid-based clustering schemes have been proposed. However, to the best of our knowledge, none of them provides an accuracy guarantee for their clustering output. To fill this gap, in this paper we study grid-based k-median clustering. We first develop an accuracy guarantee on the cost difference between grid-based solution and the optimum. Based on the theoretical analysis, we then propose a general and adaptive solution, which partitions stream data into cells of dynamically determined granularity and runs k-median clustering on the statistical information of cells with an accuracy guarantee. An extensive experiment over three real datasets clearly shows that our solution provides high-quality clustering outputs in an efficient way.
U2 - 10.1007/978-3-319-18120-2_5
DO - 10.1007/978-3-319-18120-2_5
M3 - Article in proceedings
SN - 978-3-319-18119-6
VL - 1
T3 - Lecture Notes in Computer Science
SP - 75
EP - 91
BT - Database Systems for Advanced Applications
A2 - Renz et al., Matthias
PB - Springer
T2 - 20th International Conference on Databse Systems for Advanced Applications
Y2 - 20 April 2015 through 23 April 2015
ER -