- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 44196 (19)
- University: Sharif University of Technology
- Department: Computer Engineering
- Advisor(s): Beigy, Hamid
- Abstract:
- Content-based spam filtering is the problem of classifying incoming emails as spam or legitimate, so it is treated as an application of supervised learning. Supervised learning methods often require a large training set of labelled emails to attain good accuracy, which means users must label a huge number of emails. In practice, it is not reasonable to expect users to do this. To address this issue and reduce the number of labelling requests made to the user, active learning techniques can be used. The goal of active learning algorithms is to achieve adequate accuracy with less labelled data than conventional supervised learning methods need. In this thesis, two active learning methods are proposed for spam filtering. Two criteria are widely used for query selection in active learning: correlation and uncertainty. Both proposed methods combine these two criteria to select query instances using graph-based active learning. The first proposed method is an exploration-based strategy that uses spectral clustering. A harmonic Gaussian random field classifier, which is a graph-based semi-supervised classifier, is built for each cluster, and this classifier uses both the uncertainty and correlation criteria for querying within each cluster. The second proposed method is based on uncertainty and uses the data correlation criterion to avoid selecting outliers. To incorporate the correlation criterion into instance selection, this method uses harmonic active learning, which is a Gaussian-based active learning strategy. Experiments on standard benchmark spam datasets show that the first proposed strategy reaches a desired decision boundary with fewer samples than random sampling and hierarchical sampling, which are also exploration-based strategies, and that the second proposed method performs better than uncertainty sampling and than a method that combines correlation and uncertainty.
- Keywords:
- Spam Filtering ; Active Learning ; Spectral Clustering ; Uncertainty Criterion ; Correlation Criterion ; Graph-Based Active Learning ; Harmonic Gaussian Field Classifier
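
The abstract names a harmonic Gaussian random field classifier and an uncertainty criterion without showing how they fit together. The sketch below is a generic illustration and not the thesis's implementation. It assumes a dense RBF similarity graph, binary labels (1 = spam, 0 = legitimate), and a query rule that picks the unlabeled point whose harmonic score is closest to 0.5. The function names `rbf_weights`, `harmonic_scores`, and `most_uncertain`, and the parameter `sigma`, are hypothetical. The correlation criterion described in the abstract is not modeled here.

```python
import numpy as np


def rbf_weights(X, sigma=1.0):
    """Dense RBF similarity matrix with a zero diagonal (assumed graph)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W


def harmonic_scores(W, labeled_idx, labels):
    """Solve the harmonic function on the unlabeled nodes.

    Returns the unlabeled indices and their scores f_u, obtained from
    (D_uu - W_uu) f_u = W_ul f_l, where f_l holds the known 0/1 labels.
    """
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    laplacian = np.diag(W.sum(axis=1)) - W
    L_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_l = np.asarray(labels, dtype=float)
    f_u = np.linalg.solve(L_uu, W_ul @ f_l)
    return unlabeled_idx, f_u


def most_uncertain(unlabeled_idx, f_u):
    """Pick the unlabeled node whose score is closest to 0.5."""
    return int(unlabeled_idx[np.argmin(np.abs(f_u - 0.5))])
```

In an active learning loop under these assumptions, the index returned by `most_uncertain` would be sent to the user for labelling, added to `labeled_idx`, and the scores recomputed.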