Utilizing Latent Topic Models for Persian Document Classification and Providing Appropriate Solutions to Improve It

Khaki Ardekani, Basira; Bahrani, Mohammad; Vazirnezhad, Bahram

Khaki Ardekani, Basira | 2014

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 45526 (31)
University: Sharif University of Technology
Department: Language and Linguistics Center
Advisor(s): Bahrani, Mohammad; Vazirnezhad, Bahram
Abstract:
High-precision text classification has become a challenging problem in computational linguistics and natural language processing. Access to a suitable data set, the choice of method, and the selection of salient linguistic features have always been the central concerns of this task. This study, which relies on the Bijan Khan Corpus, represents the keyword vectors of documents using tf-idf weighting. These vectors serve as input to latent topic model algorithms such as probabilistic latent semantic analysis (PLSA). The output of the algorithm is a feature vector for each document, which is then used to train different classifiers: K-nearest neighbor, naïve Bayes, and support vector machine. New documents are finally classified by these trained classifiers. To improve the classification system, linguistic features such as bigrams and noun phrases are introduced as keywords. Tests on 574 documents covering nine different subjects from the Bijan Khan Corpus show that the F-measure reaches 88% when noun phrases are used as keywords and a support vector machine is used as the classifier.
Keywords:
Probabilistic Latent Semantic Analysis (PLSA) ; K-Nearest Neighbor Method ; Naive Bayes Nearest Neighbor (NBNN) ; Support Vector Machine (SVM) ; Text Classification ; Bigram ; Noun Phrases
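
The pipeline outlined in the abstract (tf-idf keyword vectors, then PLSA topic features, then a classifier) can be illustrated with a minimal sketch. It is not the thesis implementation: the toy English documents, the labels, the function names, the smoothed idf formula, and the choice of two topics are all assumptions made for illustration. As a simplification, PLSA is fitted on labelled and unlabelled documents together rather than folding new documents in afterwards, tf-idf weights are treated as pseudo-counts in the EM updates, and only a K-nearest-neighbor classifier is shown.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return (vocab, matrix); matrix[d][w] is a smoothed tf-idf weight."""
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption) keeps every weight non-negative.
        matrix.append(
            [tf[w] * (math.log((1 + n) / (1 + df[w])) + 1.0) for w in vocab]
        )
    return vocab, matrix


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def plsa(matrix, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM; returns P(z|d) per document and P(w|z) per topic."""
    rng = random.Random(seed)
    n_docs, n_words = len(matrix), len(matrix[0])
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        # Small floor avoids exact zeros in the re-estimated distributions.
        new_zd = [[1e-12] * n_topics for _ in range(n_docs)]
        new_wz = [[1e-12] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                weight = matrix[d][w]
                if weight == 0:
                    continue
                # E-step: posterior P(z | d, w) up to normalization.
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(post)
                # M-step accumulation, weighted by the tf-idf pseudo-count.
                for z in range(n_topics):
                    r = weight * post[z] / total
                    new_zd[d][z] += r
                    new_wz[z][w] += r
        p_z_d = [normalize(row) for row in new_zd]
        p_w_z = [normalize(row) for row in new_wz]
    return p_z_d, p_w_z


def knn_predict(train_vecs, train_labels, query, k=1):
    """Majority vote among the k training vectors closest to the query."""
    dists = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vecs, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: four labelled documents and one unlabelled one.
docs = [
    "match team goal player".split(),
    "team player coach match".split(),
    "market stock price bank".split(),
    "bank price inflation market".split(),
    "goal coach team stock".split(),
]
labels = ["sport", "sport", "economy", "economy"]

vocab, matrix = tfidf(docs)
doc_topics, topic_words = plsa(matrix, n_topics=2)
predicted = knn_predict(doc_topics[:4], labels, doc_topics[4], k=3)
print(predicted)
```

To approximate the improvement reported in the abstract, the token lists passed to `tfidf` could hold bigrams or noun phrases instead of single words; extracting those from Persian text needs a tokenizer and chunker that this sketch does not provide.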
