
A Hybrid Approach for Normalization of Non-Standard Persian Texts

Rostami, Ramtin | 2016

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 49482 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Sameti, Hossein; Ghasem-Sani, Gholamreza
  7. Abstract: As internet usage and the volume of available data grow, so does the need for data mining and text processing. A common obstacle to applying these methods is the use of colloquial and non-standard language in writing. Because of this, and because NLP tasks in Persian have long suffered from a shortage of data, this thesis first collects and constructs a parallel data set consisting of colloquial texts used in social media. Then, after examining various text normalization methods used for other languages, we propose a combination of new hybrid methods, involving Statistical Machine Translation methodology with some modifications, to normalize these texts. Finally, we compare the results of these methods with each other and with older rule-based methods used for Persian, using the common measures in the fields of normalization and translation. The best proposed method achieves a BLEU score of 0.9284, an improvement of 0.063 points over the previously proposed rule-based method.
  8. Keywords: Statistical Machine Translation; Normalization; Text Normalization; Informal Text; Colloquial Texts; Colloquial Data
