On the Power and Limits of Sequence Similarity Based Clustering of Proteins Into Families

Christian Wiwie, Richard Röttger

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

Abstract

Over the last decades, we have observed an ongoing, tremendous growth of available sequencing data, fueled by advancements in wet-lab technology. Sequencing information is only the beginning of actually understanding how organisms survive and prosper. It is, for instance, equally important to unravel the proteomic repertoire of an organism. A classical computational approach for detecting protein families is a sequence-based similarity calculation coupled with a subsequent cluster analysis. In this work, we have intensively analyzed various clustering tools on a large scale. We used the data to investigate the behavior of the tools' parameters, underlining the diversity of the protein families. Furthermore, we trained regression models for predicting the expected performance of a clustering tool on an unknown data set and aimed to also suggest optimal parameters in an automated fashion. Our analysis demonstrates the benefits and limitations of clustering proteins with low sequence similarity, indicating that each protein family requires its own distinct set of tools and parameters. All results, a tool prediction service, and additional supporting material are available online at http://proteinclustering.compbio.sdu.dk.
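
The "similarity calculation coupled with a subsequent cluster analysis" mentioned in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the similarity is a simple Jaccard index over 3-mers (real pipelines typically use alignment-based scores), the clustering is plain single linkage via connected components, and the sequences, the function names, and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(seqs, threshold=0.5, k=3):
    """Single-linkage clustering: link pairs whose similarity is at least
    `threshold`, then return the connected components as sorted name lists."""
    profiles = {name: kmers(s, k) for name, s in seqs.items()}
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if jaccard(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy sequences: two pairs that differ only in the last residue.
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQK",
    "p3": "GGSWLLPHHE",
    "p4": "GGSWLLPHHD",
}
families = cluster(toy, threshold=0.5)
print(families)
```

Raising `threshold` to 0.9 in this toy setting splits every sequence into its own cluster, a small-scale reminder of why the choice of parameters matters for the resulting families.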

Original language: English
Title of host publication: Biocomputing 2017: Proceedings of the Pacific Symposium
Editors: Russ B Altman, A Keith Dunker, Lawrence Hunter, Marylyn Ritchie, Tiffany Murray, Teri Klein
Publisher: World Scientific
Publication date: 2017
Pages: 39-50
ISBN (Print): 978-981-3207-80-6
ISBN (Electronic): 978-981-3207-82-0
DOI: https://doi.org/10.1142/9789813207813_0005
Publication status: Published - 2017
Event: Pacific Symposium on Biocomputing 2017 - Hawaii, United States
Duration: 3 Jan 2017 to 7 Jan 2017
Conference number: 22

Conference

Conference: Pacific Symposium on Biocomputing 2017
Number: 22
Country: United States
City: Hawaii
Period: 03/01/2017 to 07/01/2017

Fingerprint

Cluster Analysis
Proteins
Growth

Cite this

Wiwie, C., & Röttger, R. (2017). On the Power and Limits of Sequence Similarity Based Clustering of Proteins Into Families. In R. B. Altman, A. K. Dunker, L. Hunter, M. Ritchie, T. Murray, & T. Klein (Eds.), Biocomputing 2017: Proceedings of the Pacific Symposium (pp. 39-50). World Scientific. https://doi.org/10.1142/9789813207813_0005
@inproceedings{ba9e3f9382834320b790d4cd7edff7bc,
title = "On the Power and Limits of Sequence Similarity Based Clustering of Proteins Into Families",
abstract = "Over the last decades, we have observed an ongoing, tremendous growth of available sequencing data, fueled by advancements in wet-lab technology. Sequencing information is only the beginning of actually understanding how organisms survive and prosper. It is, for instance, equally important to unravel the proteomic repertoire of an organism. A classical computational approach for detecting protein families is a sequence-based similarity calculation coupled with a subsequent cluster analysis. In this work, we have intensively analyzed various clustering tools on a large scale. We used the data to investigate the behavior of the tools' parameters, underlining the diversity of the protein families. Furthermore, we trained regression models for predicting the expected performance of a clustering tool on an unknown data set and aimed to also suggest optimal parameters in an automated fashion. Our analysis demonstrates the benefits and limitations of clustering proteins with low sequence similarity, indicating that each protein family requires its own distinct set of tools and parameters. All results, a tool prediction service, and additional supporting material are available online at http://proteinclustering.compbio.sdu.dk.",
author = "Christian Wiwie and Richard R{\"o}ttger",
year = "2017",
doi = "10.1142/9789813207813_0005",
language = "English",
isbn = "978-981-3207-80-6",
pages = "39--50",
editor = "Altman, {Russ B} and Dunker, {A Keith} and Lawrence Hunter and Marylyn Ritchie and Tiffany Murray and Teri Klein",
booktitle = "Biocomputing 2017",
publisher = "World Scientific",

}

TY - GEN

T1 - On the Power and Limits of Sequence Similarity Based Clustering of Proteins Into Families

AU - Wiwie, Christian

AU - Röttger, Richard

PY - 2017

Y1 - 2017

N2 - Over the last decades, we have observed an ongoing, tremendous growth of available sequencing data, fueled by advancements in wet-lab technology. Sequencing information is only the beginning of actually understanding how organisms survive and prosper. It is, for instance, equally important to unravel the proteomic repertoire of an organism. A classical computational approach for detecting protein families is a sequence-based similarity calculation coupled with a subsequent cluster analysis. In this work, we have intensively analyzed various clustering tools on a large scale. We used the data to investigate the behavior of the tools' parameters, underlining the diversity of the protein families. Furthermore, we trained regression models for predicting the expected performance of a clustering tool on an unknown data set and aimed to also suggest optimal parameters in an automated fashion. Our analysis demonstrates the benefits and limitations of clustering proteins with low sequence similarity, indicating that each protein family requires its own distinct set of tools and parameters. All results, a tool prediction service, and additional supporting material are available online at http://proteinclustering.compbio.sdu.dk.

U2 - 10.1142/9789813207813_0005

DO - 10.1142/9789813207813_0005

M3 - Article in proceedings

SN - 978-981-3207-80-6

SP - 39

EP - 50

BT - Biocomputing 2017

A2 - Altman, Russ B

A2 - Dunker, A Keith

A2 - Hunter, Lawrence

A2 - Ritchie, Marylyn

A2 - Murray, Tiffany

A2 - Klein, Teri

PB - World Scientific

ER -