TiCoNE 2: A Composite Clustering Model for Robust Cluster Analyses on Noisy Data

Christian Wiwie, Richard Röttger, Jan Baumbach

Research output: Contribution to journal › Journal article › Research

Abstract

Identifying groups of similar objects using clustering approaches is one of the most frequently employed first steps in exploratory biomedical data analysis. Many clustering methods have been developed that pursue different strategies to identify the optimal clustering for a data set. We previously published TiCoNE, an interactive clustering approach coupled with de-novo network enrichment of identified clusters. However, in this first version, time-series and network analysis remained two separate steps: only time-series data were clustered, and the identified clusters were mapped to and enriched within a network in a second, separate step. In this work, we present TiCoNE 2, an extension that can now seamlessly incorporate multiple data types within its composite clustering model. Systematic evaluation on 50 random data sets, as well as on 2,400 data sets containing enriched cluster structure and varying levels of noise, shows that our approach successfully recovers cluster patterns embedded in random data and that, when applied to two data types simultaneously, it is more robust towards noise than non-composite models using only one data type. Each data set was clustered using five different similarity functions into k=10/30 clusters, resulting in ~5,000 clusterings in total. We evaluated the quality of each derived clustering with the Jaccard index and an internal validity score. We used TiCoNE to calculate empirical p-values for all generated clusters with different permutation functions, resulting in ~80,000 cluster p-values. We show that the derived p-values can be used to reliably distinguish between foreground and background clusters. TiCoNE 2 allows researchers to seamlessly analyze time-series data together with biological interaction networks in an intuitive way and thereby provides more robust results than single-data-type cluster analyses.
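
The abstract names two evaluation quantities without defining them: the Jaccard index of a clustering and empirical p-values obtained from permutations. The following is a minimal sketch of how such quantities are commonly computed, not the TiCoNE implementation. It assumes the pair-counting form of the Jaccard index (comparing a derived clustering against a reference clustering) and the add-one corrected permutation p-value. The function names `pair_jaccard` and `empirical_p_value` and the toy inputs are hypothetical.

```python
from itertools import combinations


def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the same objects.

    Assumption: a pair of objects counts as 'co-clustered' in a clustering
    when both objects carry the same label in that clustering.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have equal length")
    both = 0    # pairs co-clustered in both clusterings
    either = 0  # pairs co-clustered in at least one clustering
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += int(same_a and same_b)
        either += int(same_a or same_b)
    # If no pair is co-clustered anywhere, the clusterings agree trivially.
    return both / either if either else 1.0


def empirical_p_value(observed, null_scores):
    """Fraction of permutation scores at least as large as the observed score.

    Assumption: larger scores mean a 'better' cluster, and the add-one
    correction is used so the estimate is never exactly zero.
    """
    hits = sum(1 for score in null_scores if score >= observed)
    return (hits + 1) / (len(null_scores) + 1)


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    derived = [0, 0, 1, 2]
    print(pair_jaccard(reference, derived))
    print(empirical_p_value(0.9, [0.1, 0.2, 0.95, 0.3]))
```

In this sketch, a cluster with a small empirical p-value scores better than most of its permuted counterparts, which is the sense in which p-values could separate foreground from background clusters.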
Original language: English
Journal: arxiv.org
Publication status: Published - 28 Apr 2019

Fingerprint

Time series
Time series analysis
Composite materials
Electric network analysis

Keywords

  • q-bio.QM

Cite this

@article{be1ffc5997254a7983da13baa8c359d9,
title = "TiCoNE 2: A Composite Clustering Model for Robust Cluster Analyses on Noisy Data",
abstract = "Identifying groups of similar objects using clustering approaches is one of the most frequently employed first steps in exploratory biomedical data analysis. Many clustering methods have been developed that pursue different strategies to identify the optimal clustering for a data set. We previously published TiCoNE, an interactive clustering approach coupled with de-novo network enrichment of identified clusters. However, in this first version, time-series and network analysis remained two separate steps: only time-series data were clustered, and the identified clusters were mapped to and enriched within a network in a second, separate step. In this work, we present TiCoNE 2, an extension that can now seamlessly incorporate multiple data types within its composite clustering model. Systematic evaluation on 50 random data sets, as well as on 2,400 data sets containing enriched cluster structure and varying levels of noise, shows that our approach successfully recovers cluster patterns embedded in random data and that, when applied to two data types simultaneously, it is more robust towards noise than non-composite models using only one data type. Each data set was clustered using five different similarity functions into k=10/30 clusters, resulting in ~5,000 clusterings in total. We evaluated the quality of each derived clustering with the Jaccard index and an internal validity score. We used TiCoNE to calculate empirical p-values for all generated clusters with different permutation functions, resulting in ~80,000 cluster p-values. We show that the derived p-values can be used to reliably distinguish between foreground and background clusters. TiCoNE 2 allows researchers to seamlessly analyze time-series data together with biological interaction networks in an intuitive way and thereby provides more robust results than single-data-type cluster analyses.",
keywords = "q-bio.QM",
author = "Christian Wiwie and Richard R{\"o}ttger and Jan Baumbach",
year = "2019",
month = "4",
day = "28",
language = "English",
journal = "arxiv.org",

}

TiCoNE 2: A Composite Clustering Model for Robust Cluster Analyses on Noisy Data. / Wiwie, Christian; Röttger, Richard; Baumbach, Jan.

In: arxiv.org, 28.04.2019.

TY - JOUR

T1 - TiCoNE 2: A Composite Clustering Model for Robust Cluster Analyses on Noisy Data

AU - Wiwie, Christian

AU - Röttger, Richard

AU - Baumbach, Jan

PY - 2019/4/28

Y1 - 2019/4/28

AB - Identifying groups of similar objects using clustering approaches is one of the most frequently employed first steps in exploratory biomedical data analysis. Many clustering methods have been developed that pursue different strategies to identify the optimal clustering for a data set. We previously published TiCoNE, an interactive clustering approach coupled with de-novo network enrichment of identified clusters. However, in this first version, time-series and network analysis remained two separate steps: only time-series data were clustered, and the identified clusters were mapped to and enriched within a network in a second, separate step. In this work, we present TiCoNE 2, an extension that can now seamlessly incorporate multiple data types within its composite clustering model. Systematic evaluation on 50 random data sets, as well as on 2,400 data sets containing enriched cluster structure and varying levels of noise, shows that our approach successfully recovers cluster patterns embedded in random data and that, when applied to two data types simultaneously, it is more robust towards noise than non-composite models using only one data type. Each data set was clustered using five different similarity functions into k=10/30 clusters, resulting in ~5,000 clusterings in total. We evaluated the quality of each derived clustering with the Jaccard index and an internal validity score. We used TiCoNE to calculate empirical p-values for all generated clusters with different permutation functions, resulting in ~80,000 cluster p-values. We show that the derived p-values can be used to reliably distinguish between foreground and background clusters. TiCoNE 2 allows researchers to seamlessly analyze time-series data together with biological interaction networks in an intuitive way and thereby provides more robust results than single-data-type cluster analyses.

KW - q-bio.QM

M3 - Journal article

JO - arxiv.org

JF - arxiv.org

ER -