A vast amount of text data is available on the Internet today due to the extensive use of social media. Valuable information can be extracted from this data through natural language processing. However, information extraction can be difficult because these texts are often written informally. This paper addresses this challenge by converting Persian informal words to formal words using a spell-checking approach. For this purpose, two datasets, one of formal words and one of informal words, were extracted from the four most visited Persian news websites. Persian informal words were then divided into multiple categories based on the level of change required to build their formal equivalents, and were converted to formal forms according to their features. Statistical analyses combined with correction rules were used to produce a "candidate list" from which the best formal equivalent is selected. The performance of the conversion system was evaluated on user comments extracted from the four most visited Persian (Farsi) news agencies. Results show that the proposed system detects approximately 94% of Persian informal words and finds 85% of the best equivalent formal words. In addition, a comparison with two well-known Persian spell-checkers, Virastyar and Vafa, shows that the proposed system significantly outperforms both in detection and correction. Further analysis shows that the time complexity of the proposed system is linear.
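The candidate-list idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny lexicon, its frequency counts, the single rewrite rule, and all function names are hypothetical and chosen only for illustration. The sketch assumes that a word absent from the formal lexicon is treated as informal, that candidates are produced by applying a correction rule, and that candidates are ranked by their frequency in a formal corpus.

```python
from collections import Counter

# Hypothetical formal lexicon with made-up frequencies. A real system
# would count these from a formal corpus such as news articles.
FORMAL_FREQ = Counter({"خانه": 120, "نان": 45, "تهران": 300})

# One illustrative correction rule (informal substring, formal substring).
# Colloquial Persian often uses "ون" where the formal form has "ان".
RULES = [("ون", "ان")]


def is_informal(word):
    """Assumption: a word missing from the formal lexicon is informal."""
    return word not in FORMAL_FREQ


def candidate_list(word):
    """Apply each rule at each matching position; keep formal results."""
    found = set()
    for src, dst in RULES:
        start = word.find(src)
        while start != -1:
            candidate = word[:start] + dst + word[start + len(src):]
            if candidate in FORMAL_FREQ:
                found.add(candidate)
            start = word.find(src, start + 1)
    # Most frequent formal word first; alphabetical order breaks ties.
    return sorted(found, key=lambda w: (-FORMAL_FREQ[w], w))


def formalize(word):
    """Return the top-ranked candidate, or the word itself if none exists."""
    if not is_informal(word):
        return word
    ranked = candidate_list(word)
    return ranked[0] if ranked else word


if __name__ == "__main__":
    for token in ["خونه", "نون", "خانه"]:
        print(token, "->", formalize(token))
```

Each word is handled with a fixed number of rule applications and dictionary lookups, so a pass over a text of this kind grows linearly with the number of words, given a fixed rule set and bounded word length.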
|Title||Proceedings of 2021 2nd International Conference on Computing, Networks and Internet of Things|
|Publisher||Association for Computing Machinery|
|Publication date||20 May 2021|
|Status||Published - 20 May 2021|
|Event||2nd International Conference on Computing, Networks and Internet of Things, CNIOT 2021 - Beijing, China|
Duration: 20 May 2021 → 22 May 2021
|Conference||2nd International Conference on Computing, Networks and Internet of Things, CNIOT 2021|
|Period||20/05/2021 → 22/05/2021|
Bibliographical note
Publisher Copyright:
© 2021 ACM.