Massive fungal biodiversity data re-annotation with multi-level clustering

D. Vu, S. Szoke, Christian Wiwie, Jan Baumbach, G. Cardinali, Richard Röttger, V. Robert

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches on sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons, and therefore, significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank utilizing only a normal desktop computer within 22 CPU-hours whereas the greedy clustering method took up to 242 CPU-hours.
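
To make the memory argument concrete, and to illustrate how clustering in several passes can avoid comparisons, here is a minimal Python sketch. It is not the authors' MLC implementation: the 4-byte matrix entry, the position-matching `similarity` function, the toy sequences, and the thresholds 0.45 and 0.85 are all assumptions chosen for illustration, and the greedy routine is a generic representative-based variant.

```python
def pairwise_matrix_bytes(n, bytes_per_entry=4):
    """Bytes needed for the upper triangle of an n x n similarity matrix.

    bytes_per_entry=4 (one 32-bit float per pair) is an assumption.
    """
    return n * (n - 1) // 2 * bytes_per_entry


def similarity(a, b):
    """Toy similarity: fraction of matching positions.

    A stand-in for a real alignment score, used only for illustration.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose representative
    (its first member) is similar enough; otherwise start a new cluster.
    Returns the clusters and the number of similarity evaluations."""
    clusters, comparisons = [], 0
    for s in seqs:
        for c in clusters:
            comparisons += 1
            if similarity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters, comparisons


def multilevel_cluster(seqs, thresholds):
    """Cluster coarsely first, then re-cluster each group at the next,
    stricter threshold. Sequences in different coarse groups are never
    compared again at later levels."""
    groups, total = [list(seqs)], 0
    for t in thresholds:
        next_groups = []
        for g in groups:
            subs, n = greedy_cluster(g, t)
            next_groups.extend(subs)
            total += n
        groups = next_groups
    return groups, total


def toy_sequences():
    """Hypothetical data: 4 coarse groups x 4 families x 3 members."""
    seqs = []
    for g in "ACGT":
        for f in "ACGT":
            base = g * 6 + f * 4
            seqs.append(base)
            seqs.append("N" + base[1:])   # variant differing at the first position
            seqs.append(base[:-1] + "N")  # variant differing at the last position
    return seqs


if __name__ == "__main__":
    print(pairwise_matrix_bytes(344_239) / 1e9, "GB for the full matrix")
    data = toy_sequences()
    flat, n_flat = greedy_cluster(data, 0.85)
    multi, n_multi = multilevel_cluster(data, [0.45, 0.85])
    print(len(flat), "clusters,", n_flat, "comparisons (single pass)")
    print(len(multi), "clusters,", n_multi, "comparisons (two levels)")
```

For the 344,239 sequences mentioned in the abstract, the full matrix would cover about 59 billion pairs, or roughly 237 GB at 4 bytes per entry. On the toy data, both routines return the same 16 clusters, while the two-level run performs fewer similarity evaluations because each sequence is compared only against representatives inside its coarse group.
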
Original language: English
Article number: 6837
Journal: Scientific Reports
Volume: 4
Number of pages: 9
ISSN: 2045-2322
DOI: 10.1038/srep06837
Publication status: Published - 2014

Keywords

  • INTERNAL TRANSCRIBED SPACER PROTEIN SEQUENCES DNA SEARCH GENERATION ALGORITHM FAMILIES HOMOLOGY BLAST

Cite this

Vu, D.; Szoke, S.; Wiwie, Christian; Baumbach, Jan; Cardinali, G.; Röttger, Richard; Robert, V. / Massive fungal biodiversity data re-annotation with multi-level clustering. In: Scientific Reports. 2014; Vol. 4.
@article{6d1634ebf7f149b28146d345d78779af,
title = "Massive fungal biodiversity data re-annotation with multi-level clustering",
abstract = "With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches on sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons, and therefore, significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank utilizing only a normal desktop computer within 22 CPU-hours whereas the greedy clustering method took up to 242 CPU-hours.",
keywords = "INTERNAL TRANSCRIBED SPACER PROTEIN SEQUENCES DNA SEARCH GENERATION ALGORITHM FAMILIES HOMOLOGY BLAST",
author = "D. Vu and S. Szoke and Christian Wiwie and Jan Baumbach and G. Cardinali and Richard R{\"o}ttger and V. Robert",
note = "ISI Document Delivery No.: AS0RW Times Cited: 0 Cited Reference Count: 31 Duong Vu Szoke, Szaniszlo Wiwie, Christian Baumbach, Jan Cardinali, Gianluigi Rottger, Richard Robert, Vincent 0 NATURE PUBLISHING GROUP LONDON SCI REP-UK",
year = "2014",
doi = "10.1038/srep06837",
language = "English",
volume = "4",
journal = "Scientific Reports",
issn = "2045-2322",
publisher = "Nature Publishing Group",
}

TY - JOUR
T1 - Massive fungal biodiversity data re-annotation with multi-level clustering
AU - Vu, D.
AU - Szoke, S.
AU - Wiwie, Christian
AU - Baumbach, Jan
AU - Cardinali, G.
AU - Röttger, Richard
AU - Robert, V.
N1 - ISI Document Delivery No.: AS0RW Times Cited: 0 Cited Reference Count: 31 Duong Vu Szoke, Szaniszlo Wiwie, Christian Baumbach, Jan Cardinali, Gianluigi Rottger, Richard Robert, Vincent 0 NATURE PUBLISHING GROUP LONDON SCI REP-UK
PY - 2014
Y1 - 2014
AB - With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches on sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons, and therefore, significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank utilizing only a normal desktop computer within 22 CPU-hours whereas the greedy clustering method took up to 242 CPU-hours.
KW - INTERNAL TRANSCRIBED SPACER PROTEIN SEQUENCES DNA SEARCH GENERATION ALGORITHM FAMILIES HOMOLOGY BLAST
U2 - 10.1038/srep06837
DO - 10.1038/srep06837
M3 - Journal article
VL - 4
JO - Scientific Reports
JF - Scientific Reports
SN - 2045-2322
M1 - 6837
ER -