On the selection of appropriate distances for gene expression data clustering

Pablo A. Jaskowiak*, Ricardo J.G.B. Campello, Ivan G. Costa

*Corresponding author

Publication: Contribution to journal › Journal article › Research › Peer-reviewed

Abstract

Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers gain insights and formulate new hypotheses about biological data from microarrays. Across the different settings of microarray experiments, clustering proves to be a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance to employ between data objects is probably one of the most difficult decisions.

Results and conclusions: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression (i.e., microarray) data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated in a number of different scenarios, including clustering of cancer tissues and clustering of genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.
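As a rough illustration of why the choice of distance matters, the sketch below compares two widely used measures, Euclidean distance and Pearson correlation distance, on three made-up expression profiles. The abstract does not list the 15 distances studied, so these two are chosen purely for illustration; the gene names, values, and helper functions are all hypothetical and are not taken from the paper.

```python
import math

def euclidean(x, y):
    """Euclidean distance: sensitive to differences in magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: sensitive to the shape (trend) of the profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

# Hypothetical expression profiles (illustrative values only).
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [10.0, 20.0, 30.0, 40.0],  # same trend as gene_a, larger magnitude
    "gene_c": [4.0, 3.0, 2.0, 1.0],      # opposite trend, similar magnitude
}

def nearest(query, distance):
    """Return the name of the profile closest to `query` under `distance`."""
    others = [name for name in profiles if name != query]
    return min(others, key=lambda name: distance(profiles[query], profiles[name]))

nearest_euclidean = nearest("gene_a", euclidean)
nearest_pearson = nearest("gene_a", pearson_distance)
print(nearest_euclidean, nearest_pearson)
```

Under Euclidean distance the closest profile to `gene_a` is the one with similar magnitude, while under correlation distance it is the one with the same trend. Any clustering method built on these distances would therefore tend to group the same data differently.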

Original language: English
Article number: S2
Journal: BMC Bioinformatics
Volume: 15
Issue number: Suppl 2
ISSN: 1471-2105
DOI
Status: Published - 2014
Published externally: Yes

Bibliographical note

Funding Information:
The authors would like to thank Brazilian research agencies CAPES, CNPq, FACEPE and FAPESP (Processes #2011/04247-5 and #2012/15751-9). IGC was partially funded by the Excellence Initiative of the German federal and state governments and the German Research Foundation through Grant GSC 111 and IZKF Aachen (Interdisciplinary Centre for Clinical Research within the faculty of Medicine at RWTH Aachen University).

Funding Information:
The publication costs for this article were funded by Brazilian Research Agencies CAPES, CNPq, FACEPE and FAPESP (Processes #2011/04247-5 and #2012/15751-9). It was also partially funded by the Excellence Initiative of the German federal and state governments and the German Research Foundation through Grant GSC 111 and IZKF Aachen (Interdisciplinary Centre for Clinical Research within the faculty of Medicine at RWTH Aachen University). This article has been published as part of BMC Bioinformatics Volume 15 Supplement 2, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S2.

Publisher Copyright:
© 2014 Jaskowiak et al.
