ERIC Number: ED555178
Record Type: Non-Journal
Publication Date: 2010
Pages: 165
Abstractor: As Provided
ISBN: 978-1-3033-0148-3
Semi-Supervised Clustering for High-Dimensional and Sparse Features
Yan, Su
ProQuest LLC, Ph.D. Dissertation, The Pennsylvania State University
Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised, with class labels unknown a priori. In real application domains, however, some "weak" form of side information about the domain or data sets is often available or derivable. In particular, information in the form of instance-level pairwise constraints is general and relatively easy to derive. The problem with traditional clustering techniques is that they cannot benefit from side information even when it is available. I study the problem of semi-supervised clustering, which aims to partition a set of unlabeled data items into coherent groups given a collection of constraints. Because semi-supervised clustering promises higher quality with little extra human effort, it is of great interest both in theory and in practice. Semi-supervised clustering shares a difficulty with a large number of other learning methods in the data mining literature: they lose their algorithmic effectiveness on high-dimensional data. I focus on data with high-dimensional sparse features and present a series of novel semi-supervised clustering approaches that are both effective and efficient in learning from high-dimensional data. The proposed approaches are based on the idea of dimensionality reduction. High-dimensional input data are embedded into an optimal low-dimensional subspace determined with the help of side information. The clustering structure of the data is more evident in the subspace than in the original input space, which enables higher quality clustering solutions. The proposed clustering approaches explore both a small set of constraints and the large amount of unlabeled data, and thus perform robustly even with limited side information. I also study how to automatically generate constraints based on domain knowledge. 
Since automatically generated constraints are inevitably noisy, I propose a semi-supervised approach that is able to use noisy side information to improve clustering accuracy. Moreover, the non-linear separability problem is studied in the semi-supervised clustering setting. I propose a solution that is computationally as simple as a linear-transformation-based method, but is still able to separate non-linearly separable data effectively. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 800-521-0600. Web page:]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site:
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A