Utilizing Latent Topic Models for Persian Document Classification and Providing Appropriate Solutions to Improve It
Khaki Ardekani, Basira | 2014
- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 45526 (31)
- University: Sharif University of Technology
- Department: Language and Linguistics Center
- Advisor(s): Bahrani, Mohammad; Vazirnezhad, Bahram
- Abstract:
- High-precision text classification is a challenging problem in computational linguistics and natural language processing. Access to a suitable data set, the choice of method, and the selection of salient linguistic features have always been the basic concerns of this task. This study, based on the Bijan Khan Corpus, represents each document as a vector of keywords weighted by tf-idf. These vectors serve as input to latent topic model algorithms, including probabilistic latent semantic analysis (PLSA). The output of the algorithm is a feature vector for each document, which is then used to train classifiers such as k-nearest neighbor, naïve Bayes, and support vector machine. New documents are finally classified by these classifiers. To improve the classification system, linguistic features such as bigrams and noun phrases are introduced as keywords. Tests on 574 documents covering 9 subjects from the Bijan Khan Corpus show that the F-measure reaches 88% when noun phrases are used as keywords and a support vector machine is used as the classifier.
- Keywords:
- Probabilistic Latent Semantic Analysis (PLSA) ; K-Nearest Neighbor Method ; Naive Bayes Nearest Neighbor (NBNN) ; Support Vector Machine (SVM) ; Text Classification ; Bigram ; Noun Phrases
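The pipeline outlined in the abstract (tf-idf keyword vectors, then PLSA, then a classifier trained on the resulting document feature vectors) can be illustrated with a minimal sketch. This is not the thesis implementation: the toy English tokens, the labels, the function names `tfidf`, `plsa`, and `nearest_label`, the smoothed idf formula, and the number of topics are all assumptions made for illustration. For brevity, a 1-nearest-neighbor rule stands in for the classifiers named in the abstract, and the unlabeled document is included when fitting PLSA instead of being folded in afterwards.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows); rows[d][w] is the tf-idf weight of word w in doc d."""
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption; the abstract does not give the exact formula).
        rows.append([tf[w] * (math.log((1 + n) / (1 + df[w])) + 1.0) for w in vocab])
    return vocab, rows

def normalize(values, eps=1e-12):
    """Scale non-negative values to sum to 1, with a tiny floor to avoid zeros."""
    total = sum(values) + eps * len(values)
    return [(v + eps) / total for v in values]

def plsa(x, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a weighted doc-word matrix; return P(z|d) and P(w|z)."""
    rng = random.Random(seed)
    n_docs, n_words = len(x), len(x[0])
    p_z_d = [normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_words)]) for _ in range(n_topics)]
    for _ in range(n_iter):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if x[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(post)
                # M-step accumulators, weighted by the tf-idf value.
                for z in range(n_topics):
                    r = x[d][w] * post[z] / total
                    new_zd[d][z] += r
                    new_wz[z][w] += r
        p_z_d = [normalize(row) for row in new_zd]
        p_w_z = [normalize(row) for row in new_wz]
    return p_z_d, p_w_z

def nearest_label(query, vectors, labels):
    """1-nearest-neighbor by squared Euclidean distance in topic space."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(query, vectors[i]))
    return labels[min(range(len(vectors)), key=dist)]

# Hypothetical toy data: four labeled documents and one unlabeled document.
docs = [
    ["football", "team", "goal", "match"],
    ["team", "match", "coach", "goal"],
    ["bank", "market", "price", "inflation"],
    ["market", "price", "bank", "loan"],
    ["goal", "coach", "football"],
]
labels = ["sports", "sports", "economy", "economy"]

vocab, x = tfidf(docs)
p_z_d, p_w_z = plsa(x, n_topics=2)
predicted = nearest_label(p_z_d[4], p_z_d[:4], labels)
print(predicted)
```

Each row of `p_z_d` is the topic distribution of one document and plays the role of the document feature vector described in the abstract. Swapping unigram tokens for bigrams or noun phrases would only change what goes into `docs`; the rest of the sketch stays the same.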