Online transitivity clustering of biological data with missing values

Richard Röttger, C. Kreutzer, T.D. Vu, T. Wittkop, Jan Baumbach

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

Motivation: Equipped with sophisticated biochemical measurement techniques, we generate massive amounts of biomedical data that need to be analyzed computationally. One long-standing challenge in automatic knowledge extraction is clustering: we seek to partition a set of objects into groups such that the objects within each cluster share common traits. Usually, we are given a similarity matrix computed from a pairwise similarity function. While many approaches for biomedical data clustering exist, most methods neglect two important problems: (1) Computing the similarity matrix may be non-trivial and resource-intensive. (2) A clustering algorithm alone is not sufficient for the biologist, who needs an integrated online system that also performs preparative and follow-up tasks.

Results: Here, we present a significantly extended version of Transitivity Clustering. Our first main contribution is its capability of dealing with missing values in the similarity matrix, which saves time and memory. Hence, we reduce one main bottleneck: computing all pairwise similarity values. We integrated this functionality into the Weighted Graph Cluster Editing model underlying Transitivity Clustering. By identifying protein (super)families from incomplete all-vs-all BLAST results, we demonstrate the robustness of our approach. While most tools concentrate on the partitioning process itself, we present a new, intuitive web interface that supports all important steps of a cluster analysis: (1) computing and post-processing of a similarity matrix, (2) estimation of a meaningful density parameter, (3) clustering, (4) comparison with given gold standards, and (5) fine-tuning of the clustering by varying the parameters.

Availability: Transitivity Clustering, the new Cost Matrix Creator, all data sets used, and online documentation are available at http://transclust.mmci.uni-saarland.de/.
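To illustrate how missing values can fit into a Weighted Graph Cluster Editing cost, here is a minimal sketch. It is not the authors' implementation. The function name `editing_cost`, the `threshold` argument (standing in for the density parameter), and the convention that a missing pair contributes no cost are all assumptions made for illustration; the abstract does not specify how missing values are scored.

```python
from itertools import combinations


def editing_cost(similarity, clusters, threshold):
    """Cost of editing the thresholded similarity graph into `clusters`.

    similarity: dict mapping (u, v) pairs to a similarity value; pairs that
                are absent are treated as missing values.
    clusters:   iterable of disjoint sets of objects (a candidate partition).
    threshold:  similarities above it count as edges, below it as non-edges.
    """
    # Map every object to the index of the cluster it belongs to.
    label = {obj: i for i, members in enumerate(clusters) for obj in members}
    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        s = similarity.get((u, v), similarity.get((v, u)))
        if s is None:
            # ASSUMPTION: a missing similarity value adds no cost either way.
            continue
        weight = s - threshold
        same_cluster = label[u] == label[v]
        if weight > 0 and not same_cluster:
            cost += weight   # an existing edge has to be deleted
        elif weight < 0 and same_cluster:
            cost += -weight  # an absent edge has to be inserted
    return cost


# Hypothetical toy data: only 3 of the 6 possible pairs were computed.
sparse = {("a", "b"): 9.0, ("b", "c"): 8.0, ("c", "d"): 2.0}
print(editing_cost(sparse, [{"a", "b", "c"}, {"d"}], threshold=5.0))
print(editing_cost(sparse, [{"a", "b"}, {"c", "d"}], threshold=5.0))
```

Under these assumptions, a partition is charged only for the pairs whose similarity was actually computed, so skipping part of the all-vs-all comparison does not by itself penalize any clustering.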
Original language: English
Title of host publication: German Conference on Bioinformatics 2012, GCB 2012
Editors: S. Böcker, F. Hufsky, K. Scheubert, J. Schleicher, S. Schuster
Publisher: Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
Publication date: 1 Jan 2012
Pages: 57-68
ISBN (Print): 978-3-939897-44-6
DOIs: https://doi.org/10.4230/OASIcs.GCB.2012.57
Publication status: Published - 1 Jan 2012
Externally published: Yes
Event: German Conference on Bioinformatics - Jena, Germany
Duration: 20 Sep 2012 to 22 Sep 2012

Conference

Conference: German Conference on Bioinformatics
Country: Germany
City: Jena
Period: 20/09/2012 to 22/09/2012
Series: OpenAccess Series in Informatics
Volume: 26
ISSN: 2190-6807

Fingerprint

Online systems
Cluster analysis
Clustering algorithms
Tuning
Availability
Proteins
Data storage equipment
Processing
Costs

Keywords

  • Transitivity Clustering, Large Scale clustering, Missing Values, Web Interface

Cite this

Röttger, R., Kreutzer, C., Vu, T. D., Wittkop, T., & Baumbach, J. (2012). Online transitivity clustering of biological data with missing values. In S. Böcker, F. Hufsky, K. Scheubert, J. Schleicher, & S. Schuster (Eds.), German Conference on Bioinformatics 2012, GCB 2012 (pp. 57-68). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. OpenAccess Series in Informatics, Vol. 26. https://doi.org/10.4230/OASIcs.GCB.2012.57
@inproceedings{70b9ce8bc0ea46a8bf6a8dba6b343f9e,
title = "Online transitivity clustering of biological data with missing values",
keywords = "Transitivity Clustering, Large Scale clustering, Missing Values, Web Interface",
author = "Richard R{\"o}ttger and C. Kreutzer and T.D. Vu and T. Wittkop and Jan Baumbach",
year = "2012",
month = "1",
day = "1",
doi = "10.4230/OASIcs.GCB.2012.57",
language = "English",
isbn = "978-3-939897-44-6",
pages = "57--68",
editor = "S. B{\"o}cker and F. Hufsky and K. Scheubert and J. Schleicher and S. Schuster",
booktitle = "German Conference on Bioinformatics 2012, GCB 2012",
publisher = "Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik",
}
