**ERIC Number:**ED516772

**Record Type:**Non-Journal

**Publication Date:**2010

**Pages:**160

**Abstractor:**As Provided

**Reference Count:**0

**ISBN:**ISBN-978-1-1240-2199-7

**ISSN:**N/A

Knowledge Discovery from Relations

Guo, Zhen

ProQuest LLC, Ph.D. Dissertation, State University of New York at Binghamton

A basic, classical assumption in machine learning research is the "randomness assumption" (also known as the i.i.d. assumption), which states that data are independently and identically generated by some known or unknown distribution. This assumption, the foundation of most existing approaches in the literature, simplifies the complex conditions of real-world problems and makes it tractable to obtain solutions. Real-world problems, however, very often violate this assumption in the sense that the examples are related to each other in certain ways. For example, in a collection of scientific articles, the articles are related to each other through citations and so are not independent of each other. Therefore, learning approaches based on the i.i.d. assumption perform effectively only on problems that (approximately) satisfy it, and their effectiveness depends on the goodness of the approximation. These existing approaches ignore the relations among data entirely, so the underlying factors responsible for generating the data cannot be fully captured and discovered. The problem of learning from relational data has been receiving more and more attention recently because the rapid development of the Internet has made huge repositories (such as digital libraries) available online, where one of the most important properties is that the objects in the repositories are interdependent. In order to accurately capture the intrinsic characteristics of real-world problems, one needs to incorporate the relations into the learning process. The relations among data can be categorized into two types: homogeneous relations between objects of the same type and heterogeneous relations between objects of different types.
For example, in an image database in which each image has a few words given as annotation, the relations between the words are homogeneous relations and the relations between the images and the words are heterogeneous relations. Moreover, homogeneous relations can be generalized to relations between two groups of objects of the same type, called general homogeneous relations. For example, given two subsets of the whole data between which no explicit relations are observed, implicit relations still exist in the sense that both subsets follow the same distribution. In other words, the existence of one subset implies a high probability of the existence of the other. Thus, this kind of homogeneous relation represents the dependence between the probability densities of two groups of data. These various explicit and implicit relations present huge challenges to the classical i.i.d. assumption; at the same time, incorporating the relations into learning processes makes potential benefits possible. This dissertation is dedicated to the problem of incorporating the above relations into learning processes in order to better approximate the underlying characteristics of problems. Specifically, the focus of this dissertation is on developing systematic machine learning approaches for the different kinds of relational data available in various data mining tasks, including supervised learning, unsupervised learning, and semi-supervised learning. The proposed approaches have been applied to developing advanced data mining and knowledge discovery tools for data mining and information retrieval, and extensive experimental comparisons with state-of-the-art methods demonstrate very promising knowledge discovery capabilities in practice. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission.
Copies of dissertations may be obtained by telephone at (800) 521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]

Descriptors: Artificial Intelligence, Man Machine Systems, Probability, Data, Relationship, Problem Solving, Data Analysis, Pattern Recognition, Classification, Information Retrieval, Program Development, Supervision, Learning Processes, Electronic Libraries, Epistemology

ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml

**Publication Type:**Dissertations/Theses - Doctoral Dissertations

**Education Level:**Higher Education

**Audience:**N/A

**Language:**English

**Sponsor:**N/A

**Authoring Institution:**N/A