Persian word embedding evaluation benchmarks

Zahedi, M. S ; Sharif University of Technology | 2018

  1. Type of Document: Article
  2. DOI: 10.1109/ICEE.2018.8472549
  3. Publisher: Institute of Electrical and Electronics Engineers Inc., 2018
  4. Abstract:
  5. Recently, there has been renewed interest in semantic word representation, also called word embedding, in a wide variety of natural language processing tasks requiring sophisticated semantic and syntactic information. The quality of word embedding methods is usually evaluated on English-language benchmarks. Nevertheless, only a few studies analyze word embedding for low-resource languages such as Persian. In this paper, we perform an extensive word embedding evaluation for the Persian language based on a set of lexical semantics tasks: analogy, concept categorization, and word semantic relatedness. For these evaluation tasks, we provide three benchmark data sets to show the strengths and weaknesses of five well-known embedding models trained on a Wikipedia corpus. The experimental results indicate that FastText (sg) and Word2Vec (cbow) outperform the other models. © 2018 IEEE
  6. Keywords:
  7. Evaluation Benchmark ; FastText ; Word2Vec ; Benchmarking ; Natural language processing systems ; Petroleum reservoir evaluation ; Semantics ; GloVe ; Low resource languages ; Semantic relatedness ; Syntactic information ; Word embedding ; Word representations ; Quality control
  8. Source: 26th Iranian Conference on Electrical Engineering, ICEE 2018, 8 May 2018 through 10 May 2018 ; 2018, Pages 1583-1588 ; 9781538649169 (ISBN)
  9. URL: https://ieeexplore.ieee.org/document/8472549