- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 47609 (19)
- University: Sharif University of Technology
- Department: Computer Engineering
- Advisor(s): Beigy, Hamid
- Abstract:
- Genome sequences, high-dimensional digital images, and online news text are all examples of high-dimensional data sets. As technology advances, new challenges arise from applications of high-dimensional data sets. Among these challenges, the problem of constrained clustering for high-dimensional data is of great importance. This problem involves two major challenges. The first is the concentration effect of Lp norms: as the dimensionality increases, the ratio of the distance between the closest points to the distance between the farthest points approaches 1. This in turn makes the concept of nearest neighbour meaningless. It also means that the discriminative power of such distance measures drops as the dimensionality increases, and with it the performance of many classical data analysis tools. The second challenge is the appropriate use of must-link and cannot-link constraints, which are made available to the algorithm as side information. This side information is meant to help the clustering algorithm better uncover the intrinsic properties of clusters and of the samples belonging to them, in order to make the clustering more accurate.
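The concentration effect described above can be observed numerically. The following is a minimal sketch, not taken from the thesis: it assumes points drawn uniformly from the unit cube and uses the Euclidean (L2) norm, and the function name `distance_ratio` and its parameters are hypothetical choices made for illustration.

```python
import math
import random


def distance_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from one random
    query point to n_points points drawn uniformly from [0, 1]^dim.

    Assumption for illustration only: uniform data and the L2 norm.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


if __name__ == "__main__":
    # The printed ratio moves toward 1 as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(distance_ratio(dim), 3))
```

In low dimensions the nearest point is far closer than the farthest one, so the ratio is small. In high dimensions the two distances become comparable, which is why nearest-neighbour queries lose their discriminative value.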
Because clusters differ in their intrinsic and topological properties, we propose learning a separate transformation for each cluster. To make this well defined, we set out a number of goals and formulate them as an optimization problem. To solve the optimization problem, we use an iterative method that updates the clustering and the transformations alternately: each clustering update feeds into the next update of the transformations, and vice versa.
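The alternating structure of such a scheme can be sketched as follows. This is only an illustration of the general pattern, not the method of the thesis: the abstract does not specify the objective function or the family of transformations, so the sketch assumes a simple per-cluster diagonal scaling (inverse standard deviation per dimension) and ignores must-link and cannot-link constraints entirely. All function names are hypothetical.

```python
import math


def fit_cluster(points, eps=1e-6):
    """Stand-in 'transformation' for one cluster: its mean and a
    per-dimension inverse standard deviation (diagonal scaling)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dim)]
    scales = [1.0 / math.sqrt(v + eps) for v in var]
    return mean, scales


def transformed_dist(x, mean, scales):
    """Distance from x to a cluster mean, measured after that cluster's scaling."""
    return math.sqrt(sum((s * (xi - mi)) ** 2
                         for xi, mi, s in zip(x, mean, scales)))


def alternate_updates(points, labels, n_clusters, max_iter=20):
    """Alternate between updating transformations and updating the clustering."""
    labels = list(labels)
    for _ in range(max_iter):
        # Step 1: update each cluster's transformation from its current members.
        models = []
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Fall back to all points if a cluster is empty.
            models.append(fit_cluster(members or points))
        # Step 2: update the clustering using the new transformations.
        new_labels = [
            min(range(n_clusters),
                key=lambda k: transformed_dist(p, *models[k]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: stop
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    start = [0, 0, 0, 1, 1, 1, 1, 1]  # one point deliberately mislabeled
    print(alternate_updates(pts, start, 2))
```

Each pass feeds the latest assignments into the transformation update and the latest transformations into the assignment update, which is the feedback loop the abstract describes. A purely scale-based distance like this one tends to favour high-variance clusters, so a practical objective would need additional terms.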
For updating the clustering in each iteration, we propose three different methods. The first method is too greedy, which makes it ineffective in practice. The second is an improved version of the first. The third takes a probabilistic approach and, on average, finds more accurate clusters. Its main drawback is its convergence problem. In some experiments we use the third method in the early iterations of the optimization scheme and the second method in later iterations. We model the sub-problem of updating the transformations as another constrained optimization problem. We state and prove a theorem that transforms this constrained optimization problem into an unconstrained one, which we then solve with gradient descent. Our experiments show that the idea of learning a unique transformation for each cluster works. Moreover, employing constraints yields more accurate clusterings and makes the algorithm converge faster. Our method outperformed existing methods on the high-dimensional clustering problem.
- Keywords:
- Clustering ; Constrained Clustering ; High Dimensional Data ; Must-Link Constraint ; Cannot-Link Constraint
