TY - GEN
T1 - Evaluating correlation coefficients for clustering gene expression profiles of cancer
AU - Jaskowiak, Pablo A.
AU - Campello, Ricardo J.G.B.
AU - Costa, Ivan G.
PY - 2012
Y1 - 2012
AB - Cluster analysis is usually the first step adopted to unveil information from gene expression data. One of its common applications is the clustering of cancer samples, associated with the detection of previously unknown cancer subtypes. Although guidelines have been established concerning the choice of appropriate clustering algorithms, little attention has been given to the subject of proximity measures. Whereas the Pearson correlation coefficient appears as the de facto proximity measure in this scenario, no comprehensive study analyzing other correlation coefficients as alternatives to it has been conducted. Considering such facts, we evaluated five correlation coefficients (along with Euclidean distance) regarding the clustering of cancer samples. Our evaluation was conducted on 35 publicly available datasets covering both (i) intrinsic separation ability and (ii) clustering predictive ability of the correlation coefficients. Our results support that correlation coefficients rarely considered in the gene expression literature may provide competitive results to more generally employed ones. Finally, we show that a recently introduced measure arises as a promising alternative to the commonly employed Pearson, providing competitive and even superior results to it.
KW - clustering
KW - correlation
KW - gene expression
KW - proximity measure
UR - http://www.scopus.com/inward/record.url?scp=84865578190&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-31927-3_11
DO - 10.1007/978-3-642-31927-3_11
M3 - Article in proceedings
AN - SCOPUS:84865578190
SN - 9783642319266
T3 - Lecture Notes in Computer Science
SP - 120
EP - 131
BT - Advances in Bioinformatics and Computational Biology - 7th Brazilian Symposium on Bioinformatics, BSB 2012, Proceedings
PB - Springer
T2 - 7th Brazilian Symposium on Bioinformatics, BSB 2012
Y2 - 15 August 2012 through 17 August 2012
ER -