Abstract
Over the last decades, we have observed an ongoing tremendous growth of available sequencing data fueled by the advancements in wet-lab technology. The sequencing information is only the beginning of the actual understanding of how organisms survive and prosper. It is, for instance, equally important to also unravel the proteomic repertoire of an organism. A classical computational approach for detecting protein families is a sequence-based similarity calculation coupled with a subsequent cluster analysis. In this work we have intensively analyzed various clustering tools on a large scale. We used the data to investigate the behavior of the tools' parameters underlining the diversity of the protein families. Furthermore, we trained regression models for predicting the expected performance of a clustering tool for an unknown data set and aimed to also suggest optimal parameters in an automated fashion. Our analysis demonstrates the benefits and limitations of the clustering of proteins with low sequence similarity, indicating that each protein family requires its own distinct set of tools and parameters. All results, a tool prediction service, and additional supporting material are also available online at http://proteinclustering.compbio.sdu.dk.
Original language | English |
---|---|
Title | Biocomputing 2017 : Proceedings of the Pacific Symposium |
Editors | Russ B Altman, A Keith Dunker, Lawrence Hunter, Marylyn Ritchie, Tiffany Murray, Teri Klein |
Publisher | World Scientific |
Publication date | 2017 |
Pages | 39-50 |
ISBN (Print) | 978-981-3207-80-6 |
ISBN (Electronic) | 978-981-3207-82-0 |
DOI | |
Status | Published - 2017 |
Event | Pacific Symposium on Biocomputing 2017 - Hawaii, USA. Duration: 3 Jan 2017 → 7 Jan 2017. Conference number: 22 |
Conference
Conference | Pacific Symposium on Biocomputing 2017 |
---|---|
Number | 22 |
Country/Territory | USA |
City | Hawaii |
Period | 03/01/2017 → 07/01/2017 |