Classification of Breast Cancer Subtypes by combining Gene Expression and DNA Methylation Data

Markus List, Anne-Christin Hauschild, Qihua Tan, Torben A Kruse, Jan Mollenhauer, Jan Baumbach, Richa Batra

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Abstract

Selecting the most promising treatment strategy for breast cancer depends crucially on determining the correct subtype. In recent years, gene expression profiling has been investigated as an alternative to histochemical methods. Since databases like TCGA provide easy and unrestricted access to gene expression data for hundreds of patients, the challenge is to extract a minimal optimal set of genes with good prognostic properties from a large bulk of genes that make a moderate contribution to classification. Several studies have successfully applied machine learning algorithms to this so-called gene selection problem. However, more diverse data from other omics technologies, including methylation, are also available. We hypothesize that combining methylation and gene expression data could lead to a substantially improved classification model, since the resulting model would reflect differences not only at the transcriptomic but also at the epigenetic level. We compared random forest-derived classification models based on gene expression data alone and on methylation data alone with a model based on the combined features and with a model based on the gold standard PAM50. We obtained bootstrap errors of 10-20% and classification errors of 1-50%, depending on breast cancer subtype and model. The gene expression model was clearly superior to the methylation model; this was also reflected in the combined model, which mainly selected features from the gene expression data. However, the methylation model identified unique features that the gene expression model did not consider relevant, which might provide deeper insights into breast cancer subtype differentiation at the epigenetic level.
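The comparison described in the abstract (one model per data type, one on the combined features, each scored by a bootstrap error) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic, the function names are hypothetical, and a nearest-centroid classifier stands in for the random forest so that the example needs only the Python standard library. The "bootstrap error" is read as an out-of-bag error, which is one common interpretation and may not match the authors' exact procedure.

```python
import random
from statistics import mean


def make_synthetic(n_per_class=30, n_expr=5, n_meth=5, seed=0):
    """Hypothetical data: two subtypes. By construction (an assumption),
    expression features separate them more strongly than methylation."""
    rng = random.Random(seed)
    samples = []
    for label in (0, 1):
        for _ in range(n_per_class):
            expr = [rng.gauss(2.0 * label, 1.0) for _ in range(n_expr)]
            meth = [rng.gauss(0.5 * label, 1.0) for _ in range(n_meth)]
            samples.append((expr, meth, label))
    return samples


def centroid_fit(X, y):
    """Fit a nearest-centroid model: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def centroid_predict(centroids, x):
    """Predict the class whose centroid is closest in squared distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def oob_bootstrap_error(X, y, n_rounds=50, seed=1):
    """Average error on samples left out of each bootstrap resample."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        model = centroid_fit([X[i] for i in idx], [y[i] for i in idx])
        wrong = sum(centroid_predict(model, X[i]) != y[i] for i in oob)
        errors.append(wrong / len(oob))
    return mean(errors)


def compare_models(samples):
    """Score expression-only, methylation-only, and combined feature sets."""
    y = [label for _, _, label in samples]
    views = {
        "expression": [e for e, _, _ in samples],
        "methylation": [m for _, m, _ in samples],
        "combined": [e + m for e, m, _ in samples],  # concatenate features
    }
    return {name: oob_bootstrap_error(X, y) for name, X in views.items()}


results = compare_models(make_synthetic())
for name, err in results.items():
    print(f"{name}: out-of-bag error = {err:.3f}")
```

Because the synthetic expression features are generated with a larger class separation, the expression-only error comes out lower than the methylation-only error here; that is a property of the toy data, not a reproduction of the reported results.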
Original language: English
Journal: Journal of Integrative Bioinformatics
Volume: 11
Issue number: 2
Pages (from-to): 236
ISSN: 1613-4516
DOI: 10.2390/biecoll-jib-2014-236
Status: Published - Jun. 2014

Fingerprint

DNA Methylation
Epigenomics
Gene Expression Profiling
Databases

Cite this

@article{23842ecafeb3491c8e114d7b60177a57,
title = "Classification of Breast Cancer Subtypes by combining Gene Expression and DNA Methylation Data",
abstract = "Selecting the most promising treatment strategy for breast cancer crucially depends on determining the correct subtype. In recent years, gene expression profiling has been investigated as an alternative to histochemical methods. Since databases like TCGA provide easy and unrestricted access to gene expression data for hundreds of patients, the challenge is to extract a minimal optimal set of genes with good prognostic properties from a large bulk of genes making a moderate contribution to classification. Several studies have successfully applied machine learning algorithms to solve this so-called gene selection problem. However, more diverse data from other OMICS technologies are available, including methylation. We hypothesize that combining methylation and gene expression data could already lead to a largely improved classification model, since the resulting model will reflect differences not only on the transcriptomic, but also on an epigenetic level. We compared so-called random forest derived classification models based on gene expression and methylation data alone, to a model based on the combined features and to a model based on the gold standard PAM50. We obtained bootstrap errors of 10-20{\%} and classification error of 1-50{\%}, depending on breast cancer subtype and model. The gene expression model was clearly superior to the methylation model, which was also reflected in the combined model, which mainly selected features from gene expression data. However, the methylation model was able to identify unique features not considered as relevant by the gene expression model, which might provide deeper insights into breast cancer subtype differentiation on an epigenetic level.",
author = "Markus List and Anne-Christin Hauschild and Qihua Tan and Kruse, {Torben A} and Jan Mollenhauer and Jan Baumbach and Richa Batra",
year = "2014",
month = "6",
doi = "10.2390/biecoll-jib-2014-236",
language = "English",
volume = "11",
pages = "236",
journal = "Journal of Integrative Bioinformatics",
issn = "1613-4516",
publisher = "IMBIO e.V.",
number = "2",

}

Classification of Breast Cancer Subtypes by combining Gene Expression and DNA Methylation Data. / List, Markus; Hauschild, Anne-Christin; Tan, Qihua; Kruse, Torben A; Mollenhauer, Jan; Baumbach, Jan; Batra, Richa.

In: Journal of Integrative Bioinformatics, Vol. 11, No. 2, 06.2014, p. 236.

TY - JOUR

T1 - Classification of Breast Cancer Subtypes by combining Gene Expression and DNA Methylation Data

AU - List, Markus

AU - Hauschild, Anne-Christin

AU - Tan, Qihua

AU - Kruse, Torben A

AU - Mollenhauer, Jan

AU - Baumbach, Jan

AU - Batra, Richa

PY - 2014/6

Y1 - 2014/6

AB - Selecting the most promising treatment strategy for breast cancer crucially depends on determining the correct subtype. In recent years, gene expression profiling has been investigated as an alternative to histochemical methods. Since databases like TCGA provide easy and unrestricted access to gene expression data for hundreds of patients, the challenge is to extract a minimal optimal set of genes with good prognostic properties from a large bulk of genes making a moderate contribution to classification. Several studies have successfully applied machine learning algorithms to solve this so-called gene selection problem. However, more diverse data from other OMICS technologies are available, including methylation. We hypothesize that combining methylation and gene expression data could already lead to a largely improved classification model, since the resulting model will reflect differences not only on the transcriptomic, but also on an epigenetic level. We compared so-called random forest derived classification models based on gene expression and methylation data alone, to a model based on the combined features and to a model based on the gold standard PAM50. We obtained bootstrap errors of 10-20% and classification error of 1-50%, depending on breast cancer subtype and model. The gene expression model was clearly superior to the methylation model, which was also reflected in the combined model, which mainly selected features from gene expression data. However, the methylation model was able to identify unique features not considered as relevant by the gene expression model, which might provide deeper insights into breast cancer subtype differentiation on an epigenetic level.

U2 - 10.2390/biecoll-jib-2014-236

DO - 10.2390/biecoll-jib-2014-236

M3 - Journal article

C2 - 24953305

VL - 11

SP - 236

JO - Journal of Integrative Bioinformatics

JF - Journal of Integrative Bioinformatics

SN - 1613-4516

IS - 2

ER -