Efficient stochastic algorithms for document clustering

Forsati, R ; Sharif University of Technology | 2013

  1. Type of Document: Article
  2. DOI: 10.1016/j.ins.2012.07.025
  3. Publisher: 2013
  4. Abstract:
  5. Clustering has become an increasingly important and highly complicated research area for targeting useful and relevant information in modern application domains such as the World Wide Web. Recent studies have shown that the most commonly used partitioning-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm may converge to a locally optimal clustering. In this paper, we present novel document clustering algorithms based on the Harmony Search (HS) optimization method. By modeling clustering as an optimization problem, we first propose a pure HS-based clustering algorithm that finds near-optimal clusters within a reasonable time. Then, harmony clustering is integrated with the K-means algorithm in three ways to achieve better clustering by combining the explorative power of HS with the refining power of K-means. In contrast to the localized search of the K-means algorithm, the proposed algorithms perform a globalized search in the entire solution space. Additionally, the proposed algorithms improve K-means by making it less dependent on initial parameters such as randomly chosen initial cluster centers, thereby making it more stable. The behavior of the proposed algorithm is theoretically analyzed by modeling its population variance as a Markov chain. We also conduct an empirical study to determine the impacts of various parameters on the quality of clusters and the convergence behavior of the algorithms. In the experiments, we apply the proposed algorithms along with K-means and a Genetic Algorithm (GA) based clustering algorithm on five different document datasets. Experimental results reveal that the proposed algorithms can find better clusters and the quality of clusters is comparable based on F-measure, Entropy, Purity, and Average Distance of Documents to the Cluster Centroid (ADDC).
  6. Keywords:
  7. Hybridization ; Stochastic optimization ; Document Clustering ; Harmony search ; K-means ; Stochastic optimizations ; Cluster analysis ; Genetic algorithms ; Information retrieval ; Markov processes ; Optimization ; Parameter estimation ; Stochastic systems ; World Wide Web ; Clustering algorithms
  8. Source: Information Sciences ; Volume 220 , 2013 , Pages 269-291 ; 00200255 (ISSN)
  9. URL: http://www.sciencedirect.com/science/article/pii/S0020025512004975
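
The abstract describes an HS-based search over clusterings followed by K-means refinement. The following is a minimal sketch of that general idea, not the authors' algorithm. It assumes each harmony is a vector of cluster labels (one per document), uses mean squared Euclidean distance to the centroid as a simplified stand-in for ADDC, and treats pitch adjustment as "move the document to the nearest centroid of the current best harmony". All function names and the parameters `hms`, `hmcr`, `par`, `iters`, and `seed` are illustrative assumptions.

```python
import random

def centroids(points, labels, k):
    """Mean vector of each cluster; an empty cluster gets None."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    return [[s / counts[c] for s in sums[c]] if counts[c] else None
            for c in range(k)]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cost(points, labels, k):
    """Mean squared distance to the own-cluster centroid (assumed objective)."""
    cents = centroids(points, labels, k)
    return sum(sqdist(p, cents[c]) for p, c in zip(points, labels)) / len(points)

def nearest(p, cents):
    """Index of the closest non-empty centroid."""
    return min((c for c in range(len(cents)) if cents[c] is not None),
               key=lambda c: sqdist(p, cents[c]))

def harmony_search(points, k, hms=10, hmcr=0.9, par=0.3, iters=500, seed=0):
    rng = random.Random(seed)
    n = len(points)
    # Harmony memory: hms random label vectors and their costs.
    memory = [[rng.randrange(k) for _ in range(n)] for _ in range(hms)]
    scores = [cost(points, h, k) for h in memory]
    for _ in range(iters):
        best = memory[scores.index(min(scores))]
        best_cents = centroids(points, best, k)
        new = []
        for i in range(n):
            if rng.random() < hmcr:
                label = rng.choice(memory)[i]       # memory consideration
                if rng.random() < par:              # pitch adjustment (assumed form)
                    label = nearest(points[i], best_cents)
            else:
                label = rng.randrange(k)            # random selection
            new.append(label)
        new_score = cost(points, new, k)
        worst = scores.index(max(scores))
        if new_score < scores[worst]:               # replace the worst harmony
            memory[worst], scores[worst] = new, new_score
    i = scores.index(min(scores))
    return memory[i], scores[i]

def kmeans_refine(points, labels, k, max_iters=20):
    """Standard K-means iterations started from a given labeling."""
    labels = list(labels)
    for _ in range(max_iters):
        cents = centroids(points, labels, k)
        new = [nearest(p, cents) for p in points]
        if new == labels:
            break
        labels = new
    return labels

# Usage: toy 2-D points stand in for document vectors (hypothetical data).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
labels, score = harmony_search(points, k=2, iters=200, seed=1)
refined = kmeans_refine(points, labels, k=2)
print(refined, cost(points, refined, 2))
```

Running K-means only after HS finishes is just one possible way to combine the two; the abstract mentions three integration schemes without detailing them here. Real document clustering would typically use term-weight vectors and a cosine-based similarity rather than the toy Euclidean setup assumed above.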