Guiding biomedical clustering with ClustEval

Christian Wiwie, Jan Baumbach, Richard Röttger

Research output: Contribution to journal > Journal article > Research > peer-review

Abstract

Clustering is a popular technique for discovering groups of similar objects in large datasets. It is nowadays applied in all areas of life sciences, from biomedicine to physics. However, designing high-quality cluster analyses is a tedious and complicated task with manifold choices along the way. As a cluster analysis is often the first step of a succeeding downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses involving a large number of clustering methods applied to many, large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings in order to gain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes ~4 h to complete.

Original language: English
Journal: Nature Protocols
Volume: 13
Issue number: 6
Pages (from-to): 1429-1444
ISSN: 1754-2189
DOI: 10.1038/nprot.2018.038
Publication status: Published - 1 Jun 2018

Fingerprint

Cluster Analysis
Research Personnel
Physics
Proteins
Datasets

Cite this

Wiwie, Christian ; Baumbach, Jan ; Röttger, Richard. / Guiding biomedical clustering with ClustEval. In: Nature Protocols. 2018 ; Vol. 13, No. 6. pp. 1429-1444.
@article{4423ea728fd04821a84377d73c705628,
title = "Guiding biomedical clustering with ClustEval",
abstract = "Clustering is a popular technique for discovering groups of similar objects in large datasets. It is nowadays applied in all areas of life sciences, from biomedicine to physics. However, designing high-quality cluster analyses is a tedious and complicated task with manifold choices along the way. As a cluster analysis is often the first step of a succeeding downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses involving a large number of clustering methods applied to many, large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings in order to gain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes $\sim$4 h to complete.",
author = "Christian Wiwie and Jan Baumbach and Richard R{\"o}ttger",
year = "2018",
month = "6",
day = "1",
doi = "10.1038/nprot.2018.038",
language = "English",
volume = "13",
pages = "1429--1444",
journal = "Nature Protocols (Print)",
issn = "1754-2189",
publisher = "Nature Publishing Group",
number = "6",

}

TY - JOUR

T1 - Guiding biomedical clustering with ClustEval

AU - Wiwie, Christian

AU - Baumbach, Jan

AU - Röttger, Richard

PY - 2018/6/1

Y1 - 2018/6/1

AB - Clustering is a popular technique for discovering groups of similar objects in large datasets. It is nowadays applied in all areas of life sciences, from biomedicine to physics. However, designing high-quality cluster analyses is a tedious and complicated task with manifold choices along the way. As a cluster analysis is often the first step of a succeeding downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses involving a large number of clustering methods applied to many, large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings in order to gain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes ~4 h to complete.

U2 - 10.1038/nprot.2018.038

DO - 10.1038/nprot.2018.038

M3 - Journal article

VL - 13

SP - 1429

EP - 1444

JO - Nature Protocols (Print)

JF - Nature Protocols (Print)

SN - 1754-2189

IS - 6

ER -