Search for: text-classification

    Spam Detection using Dynamic Weighted Voting based on Clustering

    Article 2008 2nd International Symposium on Intelligent Information Technology Application, IITA 2008, Shanghai, 21 December 2008 through 22 December 2008 ; Volume 2, January, 2008, Pages 122-126 ; 9780769534978 (ISBN) Famil Saeedian, M ; Beigy, H ; Sharif University of Technology
    In the last decade, spam detection has been addressed as a text classification or categorization problem. In this paper we propose a new dynamic weighted voting method based on the combination of clustering and weighted voting, and apply it to the task of spam filtering. To classify a new sample, the method first compares it with all cluster centroids and determines its similarity to each cluster; classifiers in the vicinity of the input sample receive greater weight in the final decision of the ensemble. The evaluation shows that the algorithm outperforms pure SVM. © 2008 IEEE
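
The voting step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, the weighting formula, or how classifiers are tied to clusters, so the sketch assumes cosine similarity, weights proportional to similarity, and one base classifier per cluster. The names `cosine` and `dynamic_weighted_vote` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dynamic_weighted_vote(sample, centroids, classifiers):
    """Combine per-cluster classifiers, weighting each by the sample's
    similarity to that classifier's cluster centroid (an assumed scheme).

    Each classifier is a callable returning +1 (spam) or -1 (not spam).
    Returns the ensemble label and the weights that were used.
    """
    # Negative similarities are clipped so that weights stay non-negative.
    sims = [max(cosine(sample, c), 0.0) for c in centroids]
    total = sum(sims)
    if total == 0.0:
        # No cluster is similar to the sample: fall back to uniform weights.
        weights = [1.0 / len(centroids)] * len(centroids)
    else:
        weights = [s / total for s in sims]
    score = sum(w * clf(sample) for w, clf in zip(weights, classifiers))
    label = 1 if score >= 0 else -1
    return label, weights


# Toy usage: two clusters, each with a constant (hypothetical) classifier.
centroids = [[1.0, 0.0], [0.0, 1.0]]
classifiers = [lambda x: 1, lambda x: -1]
label, weights = dynamic_weighted_vote([0.9, 0.1], centroids, classifiers)
```

In this toy case the sample lies close to the first centroid, so the first classifier dominates the vote. The weights are recomputed for every input, which is what makes the voting dynamic rather than fixed.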

    Extracting Cultural Similarities from Social Networks Data Using Topic Detection Techniques

    M.Sc. Thesis Sharif University of Technology Annamoradnejad, Issa (Author) ; Habibi, Jafar (Supervisor)
    With the widespread use of the internet among all layers of society and the fast-growing impact of social networks, researchers have found a new source for studying people’s habits, interests and culture. In order to combine these two important aspects of social networks, we used data from social networks to perform a cross-cultural study. The proposed method includes the steps of gathering data from Twitter, automatically classifying tweets into news categories, and calculating cultural distance and cultural similarities from the overall distribution of tweets among the selected classes. By applying the proposed method to a sample of tweets from 2016, we examined the overall tendencies of users of... 

    Adversarial Robustness of Deep Neural Networks in Text Domain

    M.Sc. Thesis Sharif University of Technology Behjati, Melika (Author) ; Soleymani Baghshah, Mahdieh (Supervisor)
    In recent years, neural networks have been widely used in most machine learning domains. However, it has been shown that these networks are vulnerable to adversarial examples. Adversarial examples are inputs with small, imperceptible perturbations that lead the network to produce wrong output, thus fooling it. This becomes an important issue in security-related applications of deep neural networks, such as self-driving cars and medical diagnostics, since in the worst-case scenario even human lives could be threatened. Although many works have focused on crafting adversarial examples for image data, only a few studies have been done on textual data due to the existing... 

    Utilizing Latent Topic Models for Persian Document Classification and Providing Appropriate Solutions to Improve It

    M.Sc. Thesis Sharif University of Technology Khaki Ardekani, Basira (Author) ; Bahrani, Mohammad (Supervisor) ; Vazirnezhad, Bahram (Co-Advisor)
    Text classification with high precision has become a challenging issue in computational linguistics and natural language processing. Access to a proper data set, use of the best method, and prominent linguistic features have always been regarded as the basic concerns of this process. This study, relying on the Bijan Khan Corpus, attempts to represent the keyword vectors of different documents using tf-idf. These vectors serve as input to latent topic model algorithms, including probabilistic latent semantic analysis. The output of this algorithm is the document feature vectors, which are later used to train different classifiers like K... 

    Deep Semi-Supervised Text Classification

    M.Sc. Thesis Sharif University of Technology Karimi, Ali (Author) ; Semati, Hossein (Supervisor)
    Large, expert-labeled data sources, which are costly to produce, are essential for the success of deep learning in various domains. However, when labeling is expensive and labeled data is scarce, deep learning generally does not perform well. The goal of semi-supervised learning is to leverage abundant unlabeled data that one can easily collect. New semi-supervised algorithms based on data augmentation techniques have achieved new advances in this field. In this work, by studying different textual augmentation techniques, a new approach is proposed that can obtain effective information signals from unlabeled data. The method encourages the model to generate the same representation vectors for different augmented versions... 

    A new ensemble method for feature ranking in text mining

    Article International Journal on Artificial Intelligence Tools ; Volume 22, Issue 3, June, 2013 ; 02182130 (ISSN) Sadeghi, S ; Beigy, H ; Sharif University of Technology
    Dimensionality reduction is a necessary task in data mining when working with high-dimensional data. One type of dimensionality reduction is feature selection. Feature selection based on feature ranking has received much attention from researchers. The major reasons are its scalability, ease of use, and fast computation. Feature ranking methods can be divided into different categories and may use different measures for ranking features. Recently, ensemble methods have entered the field of ranking and achieved higher accuracy than other methods. Accordingly, in this paper a heterogeneous ensemble-based algorithm for feature ranking is proposed. The base ranking methods in this ensemble structure are... 

    Persian text classification based on topic models

    Article 24th Iranian Conference on Electrical Engineering, ICEE 2016, 10 May 2016 through 12 May 2016 ; 2016, Pages 86-91 ; 9781467387897 (ISBN) Ahmadi, P ; Tabandeh, M ; Gholampour, I ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2016
    With the extensive growth in information, text classification, as one of the text mining methods, plays a vital role in organizing and managing information. Most text classification methods represent a document collection as a Bag of Words (BOW) model and then use the histogram of words as the classification features. But in this way, the number of features is very large; therefore text classification faces serious computational cost problems. Moreover, the BOW representation is unable to recognize semantic relations between words. Recently, topic-model approaches have been successfully applied to text classification to overcome the problems of BOW. Our main goal in this paper...
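
To make the Bag of Words representation mentioned in this abstract concrete, the following is a minimal sketch of building word-histogram features. It is illustrative only and is not taken from the paper: the whitespace tokenization, the helper name `bag_of_words`, and the two sample documents are all assumptions.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents.

    Tokenization is plain lowercasing and whitespace splitting, an
    assumption made to keep the sketch short.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One feature per distinct word in the whole collection.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Hypothetical two-document collection.
docs = ["cheap loans cheap rates", "meeting moved to monday"]
vocab, vectors = bag_of_words(docs)
```

Each distinct word becomes its own feature, so the vector length grows with the vocabulary of the collection, and two related words such as "car" and "automobile" would occupy unrelated columns. These are the two limitations of BOW that the abstract points out.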