A unified view of density-based methods for semi-supervised clustering and classification

Jadson Castro Gertrudes*, Arthur Zimek, Jörg Sander, Ricardo J.G.B. Campello

*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach both for semi-supervised classification as well as for semi-supervised clustering.
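
The link the abstract draws between density-based clustering and graph-based transductive classification can be illustrated with a small sketch. This is not the authors' framework. It is only a minimal illustration, assuming Euclidean distance, of propagating a few known labels over the mutual reachability distance that HDBSCAN* is built on. The function names, the `min_pts` parameter name, and the greedy Prim-style propagation rule are assumptions chosen for this example.

```python
import math


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbour (the point itself counts)."""
    core = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        core.append(dists[min_pts - 1])
    return core


def mutual_reachability(points, min_pts):
    """Pairwise mutual reachability distance: max(core(p), core(q), d(p, q))."""
    core = core_distances(points, min_pts)
    n = len(points)
    return [
        [0.0 if i == j else max(core[i], core[j], math.dist(points[i], points[j]))
         for j in range(n)]
        for i in range(n)
    ]


def propagate_labels(points, labels, min_pts=3):
    """Grow a spanning forest from the labeled points (illustrative rule, an assumption).

    At each step the unlabeled point with the smallest mutual reachability
    distance to any already-labeled point is attached and inherits that label.
    Entries of `labels` that are None mark unlabeled points.
    """
    n = len(points)
    mrd = mutual_reachability(points, min_pts)
    out = list(labels)
    best = [math.inf] * n   # cheapest known connection to a labeled point
    source = [None] * n     # index of the labeled point giving that connection
    remaining = {i for i in range(n) if out[i] is None}

    for i in remaining:
        for j in range(n):
            if labels[j] is not None and mrd[i][j] < best[i]:
                best[i], source[i] = mrd[i][j], j

    while remaining:
        i = min(remaining, key=lambda k: best[k])
        if source[i] is None:   # no labeled point exists at all
            break
        out[i] = out[source[i]]
        remaining.remove(i)
        for k in remaining:     # the newly labeled point may offer cheaper links
            if mrd[i][k] < best[k]:
                best[k], source[k] = mrd[i][k], i
    return out


# Two well-separated groups with one labeled point each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", None, None, None, "b", None, None, None]
print(propagate_labels(points, labels, min_pts=3))
# → ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
```

Because labels travel along the cheapest density-aware connections, each dense group is filled in from its own labeled point before any label crosses the sparse gap between the groups.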

Original language: English
Journal: Data Mining and Knowledge Discovery
Volume: 33
Issue number: 6
Pages (from-to): 1894-1952
Number of pages: 59
ISSN: 1384-5810
DOI: 10.1007/s10618-019-00651-1
Publication status: Published - Nov 2019

Keywords

  • Density-based clustering
  • Semi-supervised classification
  • Semi-supervised clustering

Cite this

Castro Gertrudes, Jadson ; Zimek, Arthur ; Sander, Jörg ; Campello, Ricardo J.G.B. / A unified view of density-based methods for semi-supervised clustering and classification. In: Data Mining and Knowledge Discovery. 2019 ; Vol. 33, No. 6. pp. 1894-1952.
@article{3a5b5fa101b44c33b4251adce77f0794,
title = "A unified view of density-based methods for semi-supervised clustering and classification",
abstract = "Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach both for semi-supervised classification as well as for semi-supervised clustering.",
keywords = "Density-based clustering, Semi-supervised classification, Semi-supervised clustering",
author = "{Castro Gertrudes}, Jadson and Arthur Zimek and J{\"o}rg Sander and Campello, {Ricardo J.G.B.}",
year = "2019",
month = "11",
doi = "10.1007/s10618-019-00651-1",
language = "English",
volume = "33",
pages = "1894--1952",
journal = "Data Mining and Knowledge Discovery",
issn = "1384-5810",
publisher = "Springer",
number = "6",
}

TY - JOUR

T1 - A unified view of density-based methods for semi-supervised clustering and classification

AU - Castro Gertrudes, Jadson

AU - Zimek, Arthur

AU - Sander, Jörg

AU - Campello, Ricardo J.G.B.

PY - 2019/11

Y1 - 2019/11

AB - Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach both for semi-supervised classification as well as for semi-supervised clustering.

KW - Density-based clustering

KW - Semi-supervised classification

KW - Semi-supervised clustering

U2 - 10.1007/s10618-019-00651-1

DO - 10.1007/s10618-019-00651-1

M3 - Journal article

VL - 33

SP - 1894

EP - 1952

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

SN - 1384-5810

IS - 6

ER -