Constraint Clustering for High Dimensional Data

Keramatian, Amir; Beigy, Hamid

Keramatian, Amir | 2015

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 47609 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Beigy, Hamid
Abstract:
Genome sequences, high-dimensional digital images and online news text are all examples of high-dimensional data sets. As technology keeps advancing, new challenges arise from applications of such data sets. Among these challenges, the problem of constrained clustering for high-dimensional data is of great importance. This problem involves two major challenges.
The first challenge is the concentration effect of Lp norms: as the dimensionality increases, the ratio of the distance to the nearest point to the distance to the farthest point approaches 1. This makes the concept of nearest neighbour meaningless and reduces the discriminative power of such distance measures, so the performance of many classical data analysis tools drops as the dimensionality increases.
The second challenge is the appropriate use of must-links and cannot-links as side information made available to the algorithm. This side information is meant to help the clustering algorithm better unravel the intrinsic properties of clusters and of the samples belonging to them, in order to make clustering more accurate.
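The concentration effect described above can be observed numerically. The following is a minimal sketch, not taken from the thesis: it assumes points drawn uniformly from the unit hypercube, and the function name `min_max_distance_ratio` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def min_max_distance_ratio(dim, n_points=200, p=2.0, seed=0):
    """Return (nearest Lp distance) / (farthest Lp distance) from one
    random query point to n_points random points in [0, 1]^dim.

    Illustrative assumption: uniform data in the unit hypercube.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Lp distance: (sum_i |a_i - b_i|^p)^(1/p)
        total = sum(abs(a - b) ** p for a, b in zip(query, point))
        distances.append(total ** (1.0 / p))
    return min(distances) / max(distances)


if __name__ == "__main__":
    # The ratio should move toward 1 as the dimensionality grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(min_max_distance_ratio(dim), 3))
```

Under these assumptions the printed ratio grows toward 1 with the dimension, which is the sense in which the nearest and farthest neighbours become hard to tell apart.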
Because clusters differ in their intrinsic and topological properties, we propose learning a separate transformation for each cluster. To make this well defined, we establish a set of goals and formulate them as an optimization problem. To solve the optimization problem, we use an iterative method that updates the clustering and the transformations alternately: each clustering update feeds into the next transformation update, and vice versa.
To update the clustering in each iteration, we propose three different methods. The first method is too greedy, which makes it ineffective in practice. The second method is an improved version of the first. The third method takes a probabilistic approach and finds more accurate clusters on average; its main drawback is its convergence problem. In some experiments we use the third method in the early iterations of the optimization scheme, followed by the second method in the later iterations.
We model the sub-problem of updating the transformations as another constrained optimization problem. We state and prove a theorem that transforms this constrained optimization problem into an unconstrained one, which we then solve using gradient descent.
Our experiments show that the idea of learning a separate transformation for each cluster works. Moreover, employing constraints yields more accurate clusterings and makes the algorithm converge faster. Our method has outperformed existing methods on the high-dimensional clustering problem.
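The must-link and cannot-link side information referred to in this abstract is commonly represented as pairs of sample indices. The sketch below is only an illustration of that representation, not the thesis's algorithm; the function `count_violations` and the toy data are hypothetical.

```python
def count_violations(labels, must_link, cannot_link):
    """Count constraints that a given cluster assignment violates.

    labels      -- labels[i] is the cluster index of sample i
    must_link   -- pairs (i, j) that should share a cluster
    cannot_link -- pairs (i, j) that should be in different clusters
    """
    must_violated = sum(1 for i, j in must_link if labels[i] != labels[j])
    cannot_violated = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return must_violated, cannot_violated


if __name__ == "__main__":
    # Toy example with five samples in three clusters (made-up data).
    labels = [0, 0, 1, 1, 2]
    must_link = [(0, 1), (2, 4)]
    cannot_link = [(0, 2), (2, 3)]
    print(count_violations(labels, must_link, cannot_link))  # (1, 1)
```

A count of this kind is one simple way to check how well a clustering respects the supplied constraints.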
Keywords:
Clustering ; Constrained Clustering ; High Dimensional Data ; Must-Link Constraint ; Cannot-Link Constraint
