Doc2vec Natural Language Model of Farsi

Fazeli, Mohammad | 2019

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 56651 (02)
  4. University: Sharif University of Technology
  5. Department: Mathematical Sciences
  6. Advisor(s): Moghadasi, Reza
  7. Abstract: Due to the immense increase in the availability of text data, interest in using machine learning models to solve problems that were previously prohibitively costly has grown significantly. The first step is to represent natural language in a form that machine learning algorithms can work with easily. Recent advances in learned representations of text using simple neural networks (e.g. word2vec and doc2vec) have improved performance on downstream natural language processing tasks. Here we show that methods such as doc2vec, which have been examined mostly on English, can be applied to Persian (Farsi) with little modification. To demonstrate this, we use text classification tasks: we train different models on features extracted with TF-IDF and the many classical preprocessing steps customary for Persian, and compare them with models trained on doc2vec features learned from the corpus with little preprocessing. We show that for sufficiently large datasets, doc2vec features surpass this strong baseline on the classification tasks, whereas on smaller datasets the classical methods still perform better than doc2vec.
  8. Keywords: Natural Language Processing; Machine Learning; Language Model; word2vec Model; doc2vec Model; Text Categorization
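
The TF-IDF baseline side of the comparison described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the toy English documents and labels are invented, the smoothed IDF formula is one common variant, and the nearest-centroid classifier is a simple stand-in for whatever models the author trained. The Persian-specific preprocessing and the doc2vec features (which would normally come from a library such as gensim) are not shown.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization only; Persian normalization, stemming and
    # stop-word removal are deliberately left out of this sketch.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def unit(vec):
    # Scale a sparse vector (dict) to unit L2 length.
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def tfidf_vector(doc, idf):
    # Term frequency times IDF; terms unseen during fitting are dropped.
    tf = Counter(t for t in tokenize(doc) if t in idf)
    return unit({t: c * idf[t] for t, c in tf.items()})


def fit_centroids(docs, labels, idf):
    # One unit-length mean TF-IDF vector per class.
    sums = {}
    for doc, label in zip(docs, labels):
        acc = sums.setdefault(label, Counter())
        for t, v in tfidf_vector(doc, idf).items():
            acc[t] += v
    return {label: unit(acc) for label, acc in sums.items()}


def predict(doc, idf, centroids):
    # Pick the class whose centroid has the highest cosine similarity;
    # all vectors are unit length, so the dot product is the cosine.
    vec = tfidf_vector(doc, idf)
    score = lambda c: sum(v * c.get(t, 0.0) for t, v in vec.items())
    return max(centroids, key=lambda label: score(centroids[label]))


# Hypothetical toy corpus, invented for this example.
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the central bank raised interest rates",
    "stock markets fell after the rate decision",
]
train_labels = ["sports", "sports", "economy", "economy"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
print(predict("the goal decided the match", idf, centroids))
```

A doc2vec comparison would swap `tfidf_vector` for a learned document embedding and keep the rest of the pipeline unchanged, which is what makes the two feature types directly comparable on the same classification task.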
