Combining semantic and term frequency similarities for text clustering

Victor Hugo Andrade Soares*, Ricardo J.G.B. Campello, Seyednaser Nourashrafeddin, Evangelos Milios, Murilo Coelho Naldi

*Corresponding author

Publication: Contribution to journal, journal article, research, peer-reviewed

Abstract

A key challenge in document clustering is finding a similarity measure for text documents that yields cohesive groups. Measures based on the classic bag-of-words model consider only the presence (and frequency) of words in documents. As a result, semantically similar documents that use different vocabularies may end up in different clusters. For this reason, semantic similarity measures that use external knowledge, such as word n-gram corpora or thesauri, have been proposed in the literature. In this paper, the Frequency Google Tri-gram Measure is proposed to assess similarity between documents based on the frequencies of terms in the compared documents, with the Google n-gram corpus serving as an additional source of semantic similarity. Clustering algorithms are applied to several real datasets to experimentally evaluate the quality of the clusters obtained with the proposed measure and to compare it with a number of state-of-the-art measures from the literature. The experimental results, supported by statistical tests, show that the proposed measure significantly improves the quality of document clustering. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.
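The general idea of combining a term-frequency similarity with a semantic one can be illustrated with a minimal sketch. This is not the Frequency Google Tri-gram Measure itself, whose definition is not given on this page. The relatedness table, the best-match averaging, the linear weighting, and all names here (RELATEDNESS, tf_cosine, semantic_sim, combined_sim, alpha) are assumptions made for illustration only; a real system would derive word relatedness from an external corpus such as Google n-grams.

```python
import math
from collections import Counter

# Hypothetical word-relatedness scores in [0, 1]. These values are invented
# for illustration; they are not taken from the Google n-gram corpus.
RELATEDNESS = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("road", "highway")): 0.8,
}


def relatedness(a, b):
    """Relatedness of two words: 1.0 if identical, else a table lookup."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def tf_cosine(doc_a, doc_b):
    """Cosine similarity between raw term-frequency vectors (bag-of-words)."""
    ta, tb = Counter(doc_a), Counter(doc_b)
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(c * c for c in ta.values()))
    norm_b = math.sqrt(sum(c * c for c in tb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_sim(doc_a, doc_b):
    """Symmetric average of each word's best relatedness to the other document."""
    if not doc_a or not doc_b:
        return 0.0

    def directed(src, dst):
        return sum(max(relatedness(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(doc_a, doc_b) + directed(doc_b, doc_a))


def combined_sim(doc_a, doc_b, alpha=0.5):
    """Linear mix of both similarities; alpha is an assumed weighting parameter."""
    return alpha * tf_cosine(doc_a, doc_b) + (1 - alpha) * semantic_sim(doc_a, doc_b)


# Two tokenized documents with related meaning but no shared vocabulary.
doc1 = ["car", "road"]
doc2 = ["automobile", "highway"]

print("term frequency:", tf_cosine(doc1, doc2))
print("semantic:      ", semantic_sim(doc1, doc2))
print("combined:      ", combined_sim(doc1, doc2))
```

In this toy case the bag-of-words similarity is zero because the documents share no words, while the semantic component still registers that they are related, which is the situation the abstract describes.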

Original language: English
Journal: Knowledge and Information Systems
Volume: 61
Issue number: 3
Pages (from-to): 1485-1516
ISSN: 0219-1377
Status: Published - 1 Dec 2019
Externally published: Yes

Bibliographical note

Funding Information:
The authors acknowledge the Brazilian Research Agencies Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)—Finance Code 001, CNPq, FAPEMIG and FAPESP, the Natural Sciences and Engineering Research Council of Canada, the Boeing Company, CALDO, and the International Development Research Centre, Ottawa, Canada, for their financial support to this work.

Publisher Copyright:
© 2019, Springer-Verlag London Ltd., part of Springer Nature.
