Informal-to-formal word conversion for persian language using natural language processing techniques

Amin Naemi*, Marjan Mansourvar, Mostafa Naemi, Bahman Damirchilu, Ali Ebrahimi, Uffe Kock Wiil

*Corresponding author for this work

Publication: Chapter in book/report/conference proceeding › Conference contribution in proceedings › Research › peer-review

Abstract

A vast amount of text data is available on the Internet today due to the extensive use of social media. Valuable information can be extracted from this data through natural language processing. However, information extraction can be difficult because these texts are often written informally. This paper addresses this challenge by converting Persian informal words to formal words using a spell-checking approach. For this purpose, two datasets, one of formal words and one of informal words, were extracted from the four most visited Persian-language news websites. Persian informal words were then divided into multiple categories based on the level of change required to build their formal equivalents, and were converted to formal forms according to their features. Statistical analyses combined with correction rules were used to produce a "candidate list" from which the best formal equivalents are selected. The performance of the conversion system was evaluated on readers' comments extracted from the four most visited Persian (Farsi) news agencies. Results show that the proposed system can detect approximately 94% of the Persian informal words and can identify 85% of the best equivalent formal words. In addition, a comparison with two well-known Persian spell-checkers, Virastyar and Vafa, shows that the proposed system significantly outperforms both in terms of detection and correction. Further analysis shows that the time complexity of the proposed system is linear.
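The candidate-list idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the frequency table `FORMAL_FREQ`, the single rewrite rule in `RULES`, and the function names `candidates` and `is_informal` are all hypothetical, chosen only to show how correction rules and word statistics might be combined to rank formal equivalents.

```python
from collections import Counter

# Hypothetical toy frequency table of formal words (assumption, not the paper's data).
FORMAL_FREQ = Counter({"خانه": 120, "نان": 80, "تهران": 200, "خون": 40})

# Hypothetical correction rule: colloquial "ون" often corresponds to formal "ان".
RULES = [("ون", "ان")]


def candidates(word):
    """Return known formal words reachable by one rule application, most frequent first."""
    found = set()
    for informal, formal in RULES:
        start = word.find(informal)
        while start != -1:
            cand = word[:start] + formal + word[start + len(informal):]
            if cand in FORMAL_FREQ:  # keep only candidates attested in the formal list
                found.add(cand)
            start = word.find(informal, start + 1)
    # Rank by descending frequency; break ties alphabetically for determinism.
    return sorted(found, key=lambda w: (-FORMAL_FREQ[w], w))


def is_informal(word):
    """Flag a word as informal if it is not a known formal word but has a formal candidate."""
    return word not in FORMAL_FREQ and bool(candidates(word))


print(candidates("خونه"))   # candidate list for the informal form of "خانه"
print(is_informal("خون"))   # "خون" is itself in the formal list
```

In this sketch, detection and correction are separate steps: a word is flagged only when it is missing from the formal list, and the highest-frequency candidate is treated as the best equivalent. A realistic system would need many more rules and a far larger lexicon.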

Original language: English
Title: Proceedings of 2021 2nd International Conference on Computing, Networks and Internet of Things
Number of pages: 7
Publisher: Association for Computing Machinery
Publication date: 20 May 2021
Article number: 19
ISBN (Electronic): 9781450389693
DOI
Status: Published - 20 May 2021
Event: 2nd International Conference on Computing, Networks and Internet of Things, CNIOT 2021 - Beijing, China
Duration: 20 May 2021 to 22 May 2021

Conference

Conference: 2nd International Conference on Computing, Networks and Internet of Things, CNIOT 2021
Country/Territory: China
City: Beijing
Period: 20/05/2021 to 22/05/2021

Bibliographical note

Publisher Copyright:
© 2021 ACM.
