Machine Learning in Automated Spam Detection

Famil Saeedian, Mehrnoush | 2008

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 39076 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Beigy, Hamid
  7. Abstract:
  8. Spam has become a universal problem familiar to every email user. Studies show that a large proportion of sent email is spam, which wastes a wide range of resources. There are different ways to fight spam, each with its own strengths and weaknesses. The most common filtering technique is content-based filtering, which treats spam detection as a text classification problem. Two main defects of spam filtering techniques are the manual definition of rules and the circumvention of those rules; one way to overcome these defects is to apply machine learning algorithms. Spam classification using machine learning techniques has been very successful and has attracted considerable attention. Among machine learning algorithms, ensembles have shown high accuracy. In this thesis, we use ensembles for spam filtering and propose two algorithms to improve ensemble performance on this task. The first algorithm is dynamic classifier selection, which is based on the clustering-and-selection technique. It consists of two steps: clustering and selection. In the first step, clustering is used for sub-sampling, and a classifier is trained on each cluster. In the second step, when a new email arrives, the most relevant cluster is identified and its corresponding classifier is selected to classify the email. The evaluation shows that this algorithm outperforms a single classifier. The second algorithm is dynamic weighted voting. Its first step is similar to that of the previous algorithm; in the second step, the predictions of all base classifiers are used for decision making. When a new email arrives, each classifier receives a weight according to the similarity between its cluster and the new email, and that weight determines its influence on the final decision of the ensemble. Experimental results show that the latter algorithm outperforms both majority voting and the former algorithm.
  9. Keywords:
  10. Spam ; Machine Learning ; Text Categorization ; Ensemble Learning ; Refining ; Dynamic Classifier Selection ; Dynamic Weighted Voting
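
The two algorithms summarized in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not a detail taken from the thesis: the bag-of-words features, the cosine similarity, the small k-means-style clustering with a fixed initialization, the nearest-class-centroid base classifier, the toy emails, and all names (`ClusterEnsemble`, `select`, `weighted_vote`). The thesis may use different features, clustering, base classifiers, and weighting.

```python
from collections import Counter
import math

SPAM, HAM = 1, 0

def vec(text):
    """Bag-of-words term counts for one email (assumed feature choice)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Sum of term counts; cosine ignores scale, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def kmeans(vectors, k, iters=10):
    """Tiny cosine k-means; the first k documents seed the centroids."""
    cents = [Counter(v) for v in vectors[:k]]
    assign = []
    for _ in range(iters):
        assign = [max(range(k), key=lambda j: cosine(v, cents[j])) for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                cents[j] = centroid(members)
    return cents, assign

class CentroidClassifier:
    """Stand-in base classifier: predicts the label of the nearest class centroid."""
    def __init__(self, vectors, labels):
        self.cents = {lab: centroid([v for v, l in zip(vectors, labels) if l == lab])
                      for lab in set(labels)}

    def predict(self, v):
        return max(sorted(self.cents), key=lambda lab: cosine(v, self.cents[lab]))

class ClusterEnsemble:
    def __init__(self, texts, labels, k=2):
        # Step 1 (shared by both algorithms): cluster, then train one classifier per cluster.
        vectors = [vec(t) for t in texts]
        cents, assign = kmeans(vectors, k)
        self.members = []  # (cluster centroid, classifier) pairs for non-empty clusters
        for j in range(k):
            idx = [i for i, a in enumerate(assign) if a == j]
            if idx:
                clf = CentroidClassifier([vectors[i] for i in idx], [labels[i] for i in idx])
                self.members.append((cents[j], clf))

    def select(self, text):
        """Dynamic classifier selection: use only the most similar cluster's classifier."""
        v = vec(text)
        _, clf = max(self.members, key=lambda m: cosine(v, m[0]))
        return clf.predict(v)

    def weighted_vote(self, text):
        """Dynamic weighted voting: every classifier votes, weighted by cluster similarity."""
        v = vec(text)
        score = 0.0
        for cent, clf in self.members:
            w = cosine(v, cent)
            score += w if clf.predict(v) == SPAM else -w
        return SPAM if score > 0 else HAM

# Toy training data (invented for this sketch): two topics, each with ham and spam.
texts = [
    "bank account statement monthly report",
    "pharmacy prescription refill reminder doctor",
    "bank account verify password urgent click",
    "pharmacy pills cheap discount click",
    "bank account transfer receipt report",
    "pharmacy prescription pickup doctor appointment",
    "bank account suspended urgent click verify",
    "pharmacy pills discount cheap buy",
]
labels = [HAM, HAM, SPAM, SPAM, HAM, HAM, SPAM, SPAM]
ens = ClusterEnsemble(texts, labels, k=2)
```

Calling `ens.select(email)` follows the first algorithm (one classifier decides), while `ens.weighted_vote(email)` follows the second (all classifiers decide, with per-email weights). In this sketch the weight is simply the cosine similarity between the email and each cluster centroid, which is one plausible reading of "similarity between corresponding cluster and new email".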
