
Clustering based on the Structure of the Data and Side Information

Soleymani Baghshah, Mahdieh | 2010

  1. Type of Document: Ph.D. Dissertation
  2. Language: Farsi
  3. Document No: 40926 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Bagheri Shouraki, Saeed
  7. Abstract:
  8. Clustering is one of the important problems in machine learning, data mining, and pattern recognition. When the feature space chosen for data representation is not suitable for discriminating the data groups, clustering may become a difficult problem that cannot be solved properly. In other words, when the Euclidean distance cannot describe the dissimilarity of data pairs appropriately, common clustering algorithms may not be helpful, since clusters take arbitrary shapes and spreads in such spaces. Although several algorithms have been proposed since the late 1990s for finding clusters of arbitrary structure, these algorithms generally cannot yield desirable performance (independent of the domain) in the absence of side information. Therefore, in the last few years, attention to such algorithms has been considerably reduced compared with a decade ago. In the last decade, semi-supervised clustering methods, which usually use side information in the form of must-link and cannot-link constraints, have received much attention. Many of the earlier methods incorporate side information into model-based clustering techniques to improve clustering performance. Recently, distance metric learning has been considered as a more powerful approach. However, many of the existing methods using this approach either learn only a Mahalanobis distance metric (equivalently, a linear transformation) or consider only must-link constraints. In this thesis, distance metric learning is used as the most flexible approach for capturing the complexities of cluster structures in the feature space. In the proposed methods, the structure of the data is combined with the side information to find representations that preserve the topological structure of the data while being more compatible with the side information. These methods can learn flexible distance metrics and find spaces in which the Euclidean distance works well.
For this purpose, a general framework is introduced for learning linear and non-linear transformations or, equivalently, for learning distance metrics and kernels. In this framework, the topological structure of the data and the side information are used to find a more desirable space. Dimensionality reduction and low-rank kernel (or metric) learning are also addressed. The proposed methods are formulated as optimization problems, and these problems are solved properly. Within the proposed framework, we can learn i) linear transformations in the data space, ii) non-linear transformations that are equivalent to linear transformations in a specified kernel space, iii) non-parametric kernel matrices, and iv) low-rank kernel matrices (and hence transformations with dimensionality-reduction ability). Experiments have been conducted on synthetic and real-world data sets to evaluate the performance of the proposed methods. Results of our methods are compared with those of state-of-the-art methods that learn distance metrics for semi-supervised clustering tasks. Our methods show superior performance on many data sets.
  9. Keywords:
  10. Semi-Supervised Clustering ; Side Information ; Kernel Learning ; Data Structure ; Metric Learning
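The equivalence noted in the abstract between learning a Mahalanobis metric and learning a linear transformation, guided by must-link/cannot-link constraints, can be illustrated with a minimal sketch. The toy data, constraint pairs, and hand-picked transformation below are hypothetical illustrations, not material from the thesis:

```python
import numpy as np

def mahalanobis_dist(x, y, M):
    """Mahalanobis distance induced by a PSD matrix M = L^T L."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

# Hypothetical toy data: points 0 and 1 belong together despite a large
# gap in the second coordinate; points 0 and 2 must be kept apart.
X = np.array([[0.0, 0.0], [0.1, 2.0], [3.0, 0.1], [3.1, 2.1]])
must_link = [(0, 1)]
cannot_link = [(0, 2)]

# Learning a Mahalanobis metric is equivalent to learning a linear map L,
# since d_M(x, y) = ||L x - L y||_2 with M = L^T L.  Here L is simply
# chosen by hand to down-weight the feature that separates the must-link
# pair; a real method would optimize L against the constraints.
L = np.diag([1.0, 0.1])
M = L.T @ L

d_ml = mahalanobis_dist(X[0], X[1], M)  # must-link pair: small distance
d_cl = mahalanobis_dist(X[0], X[2], M)  # cannot-link pair: large distance
```

Under the learned (here, hand-picked) metric the must-link pair becomes much closer than the cannot-link pair, which is exactly what a Euclidean-distance clustering algorithm needs in the transformed space.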
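The abstract's point iv) — low-rank kernel matrices as transformations with dimensionality-reduction ability — can also be sketched briefly. The snippet below truncates the eigendecomposition of an RBF kernel to its top r eigenpairs, yielding both a rank-r kernel and an explicit r-dimensional embedding; the kernel choice, data, and rank are assumptions for illustration, not the thesis's learned kernels:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # hypothetical data
K = rbf_kernel(X)

# Keep the top-r eigenpairs of the PSD kernel.  The rows of Z are an
# explicit r-dimensional embedding whose inner products reproduce the
# low-rank kernel: K_r = Z Z^T, rank(K_r) <= r.
r = 3
w, V = np.linalg.eigh(K)               # eigenvalues in ascending order
top = np.argsort(w)[::-1][:r]          # indices of the r largest
Z = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
K_r = Z @ Z.T
```

The embedding Z makes the dimensionality-reduction view concrete: Euclidean distances among its rows are the kernel distances of the rank-r kernel, so standard clustering can run directly in the reduced space.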
