Improving the Scalability of the DBSCAN Clustering Algorithm using Intelligent Histogram-Based Partitioning

Nouradini, Mahdi | 2025

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 58482 (05)
  4. University: Sharif University of Technology
  5. Department: Electrical Engineering
  6. Advisor(s): Bayat, Siavash
  7. Abstract:
  8. The density-based clustering algorithm, DBSCAN, is widely recognized for its unique ability to identify arbitrarily shaped clusters and handle noise. However, with the exponential growth of data in modern applications, the algorithm's high computational complexity has become a significant bottleneck, limiting its scalability for large datasets. This research addresses this challenge by introducing and evaluating a novel hybrid clustering algorithm, named HB-DBSCAN. The proposed method is based on a "divide and conquer" strategy, with its core innovation lying in a fast and non-iterative partitioning process driven by one-dimensional histogram analysis. In this algorithm, the natural boundaries between dense regions are first identified by analyzing the structure of a key feature's histogram. These boundaries are then used to intelligently partition the entire dataset into smaller subsets, upon which the standard DBSCAN algorithm is executed locally. The theoretical foundation of this approach is based on the assumption that the histogram serves as a non-parametric estimate of an underlying Gaussian Mixture Model. For performance evaluation, the HB-DBSCAN algorithm was implemented and tested on six standard datasets from the UCI repository, and its results were compared against standard DBSCAN and K-DBSCAN. The experimental results demonstrate that the proposed algorithm is significantly faster than both alternatives, achieving a runtime improvement of up to 96% over standard DBSCAN on large datasets. Furthermore, this increase in efficiency did not come at the cost of quality; in most cases, the clustering quality, measured by internal validation indices such as the Silhouette score, was either maintained or substantially improved. Ultimately, this research demonstrates that intelligent partitioning based on histogram density analysis is an effective, efficient, and powerful strategy for solving the scalability problem of density-based algorithms, making it a viable pre-processing step in big data analysis pipelines.
  9. Keywords:
  10. Clustering ; Density-based Spatial Clustering of Applications with Noise (DBSCAN) ; Big Data Processing ; Hybrid Algorithm ; Scalability ; Histogram Analysis ; Big Data Analytics ; Partitioning
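
The partition-then-cluster idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not state how histogram boundaries are detected or how points near a boundary are handled, so the valley rule used here (cut at the center of each interior run of near-empty bins) is an assumption, as are the names `histogram_cuts`, `hb_dbscan`, `bins`, and `valley_fraction`. The DBSCAN routine is a plain brute-force version written only to keep the example self-contained.

```python
import bisect
import math
from collections import deque


def histogram_cuts(values, bins=20, valley_fraction=0.1):
    """Return cut points at the centers of interior low-count histogram valleys.

    Assumed rule (not from the thesis): a bin is a valley bin when its count
    is at most valley_fraction * (largest bin count).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return []
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    threshold = valley_fraction * max(counts)
    cuts, i = [], 0
    while i < bins:
        if counts[i] <= threshold:
            j = i
            while j + 1 < bins and counts[j + 1] <= threshold:
                j += 1
            if i > 0 and j < bins - 1:  # ignore valleys touching either edge
                cuts.append(lo + width * (i + j + 1) / 2)
            i = j + 1
        else:
            i += 1
    return cuts


def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # only core points expand the cluster
                queue.extend(reach)
        cluster += 1
    return labels


def hb_dbscan(points, feature, eps, min_pts, bins=20, valley_fraction=0.1):
    """Partition on one feature's histogram, then run DBSCAN per partition."""
    cuts = histogram_cuts([p[feature] for p in points], bins, valley_fraction)
    parts = {}
    for idx, p in enumerate(points):
        parts.setdefault(bisect.bisect_right(cuts, p[feature]), []).append(idx)
    labels = [-1] * len(points)
    offset = 0
    for key in sorted(parts):
        idxs = parts[key]
        local = dbscan([points[i] for i in idxs], eps, min_pts)
        for i, lab in zip(idxs, local):
            if lab != -1:
                labels[i] = lab + offset  # keep cluster ids globally unique
        offset += max(local, default=-1) + 1
    return labels


# Two well-separated 5x5 grids of points along the x axis (synthetic data).
points = [(x * 0.2, y * 0.2) for x in range(5) for y in range(5)]
points += [(10 + x * 0.2, y * 0.2) for x in range(5) for y in range(5)]
labels = hb_dbscan(points, feature=0, eps=0.3, min_pts=3)
```

Because each partition is clustered independently, the quadratic neighbor search runs over smaller subsets, which is where a speed-up of this kind of scheme would come from. In this sketch a cluster that straddles a cut point would be split into two; any boundary handling used in the thesis is not described in the abstract and is not reproduced here.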
