TY - JOUR
T1 - Clustering of RNA-Seq samples
T2 - Comparison study on cancer data
AU - Jaskowiak, Pablo Andretta
AU - Costa, Ivan G.
AU - Campello, Ricardo J.G.B.
N1 - Funding Information:
This project was partially funded by Brazilian research agencies FAPESP (Process 2011/04247-5), CNPq (Processes 304137/2013-8, 400772/2014-0, and 164595/2015-5), and by the Interdisciplinary Center for Clinical Research (IZKF) within the faculty of Medicine at the RWTH Aachen University.
Publisher Copyright:
© 2017 Elsevier Inc.
PY - 2018/1/1
Y1 - 2018/1/1
AB - RNA-Seq is becoming the standard technology for large-scale gene expression level measurements, as it offers a number of advantages over microarrays. Standards for RNA-Seq data analysis are, however, in their infancy when compared to those of microarrays. Clustering, which is essential for understanding gene expression data, has been widely investigated with respect to microarrays. Regarding the clustering of RNA-Seq data, however, a number of questions remain open, resulting in a lack of guidelines for practitioners. Here we evaluate computational steps relevant for clustering cancer samples via an empirical analysis of 15 mRNA-seq datasets. Our evaluation considers strategies regarding expression estimates, number of genes after non-specific filtering, and data transformations. We evaluate the performance of four clustering algorithms and twelve distance measures, which are commonly used for gene expression analysis. Results support that clustering cancer samples based on gene quantification should be preferred. The use of non-specific filtering leading to a small number of features (1,000) presents, in general, superior results. Data should be log-transformed prior to cluster analysis. Regarding the choice of clustering algorithms, Average-Linkage and k-medoids provide, in general, superior recoveries. Although specific cases can benefit from a careful selection of a distance measure, Symmetric Rank-Magnitude correlation provides consistent and sound results in different scenarios.
KW - Cancer
KW - Cluster analysis
KW - Clustering
KW - Gene expression
KW - RNA-Seq
U2 - 10.1016/j.ymeth.2017.07.023
DO - 10.1016/j.ymeth.2017.07.023
M3 - Journal article
C2 - 28778489
AN - SCOPUS:85027220790
SN - 1046-2023
VL - 132
SP - 42
EP - 49
JO - Methods
JF - Methods
ER -