A Tale of Two Paradigms: Disambiguating Extracted Entities with Applications to a Digital Library and the Web.

Huang, Jian

With the increasing wealth of information on the Web, information integration has become ubiquitous, because the same real-world entity may appear in a variety of forms extracted from different sources. This dissertation proposes supervised and unsupervised algorithms, integrated in a scalable framework, to solve the entity resolution problem that lies at the heart of the information integration process. It focuses on two incarnations of the entity resolution problem that arise in data mining and natural language processing. First, "name disambiguation" arises when one seeks the list of publications of an author in a digital library, where the author has used different name variations and multiple other authors share the same name. We present an efficient integrative framework that disambiguates author metadata extracted from paper headers in a divide-and-conquer fashion: a blocking method retrieves candidate classes of authors with similar names, and a density-based clustering method, DBSCAN, clusters the records by author. The distance metric between papers used for clustering is computed by LASVM, an online Support Vector Machine algorithm with active selection. We prove that by recasting transitivity as density connectivity in DBSCAN, transitivity is guaranteed for core points. The method achieves high accuracy on a manually labeled dataset and readily disambiguates about a million author metadata records in CiteSeer, which paves the way for the fielded search by author name feature in CiteSeerX. Second, as a key step towards document understanding in natural language processing, we investigate the problem of "cross document coreference" (CDC), which aims to decipher the true reference of a named entity across the boundary of documents.
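The name disambiguation pipeline described above (blocking by similar names, then density-based clustering within each block) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the blocking key (last name plus first initial), the Jaccard distance used in place of the LASVM-learned distance, the `eps` and `min_pts` values, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def blocking_key(name):
    """Hypothetical blocking key: lowercase last name plus first initial."""
    parts = name.lower().replace(".", "").split()
    return f"{parts[-1]}_{parts[0][0]}"


def jaccard_distance(a, b):
    """Stand-in for the learned pairwise distance (the abstract names LASVM)."""
    fa, fb = a["features"], b["features"]
    union = fa | fb
    return 1.0 - len(fa & fb) / len(union) if union else 1.0


def dbscan(points, dist, eps, min_pts):
    """Plain DBSCAN over a list of points; label -1 marks noise."""
    n = len(points)
    # Each point counts as its own neighbor (distance 0).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                frontier.extend(neighbors[j])
    return labels


def disambiguate(records, eps=0.6, min_pts=2):
    """Block records by name, then cluster each block independently."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec["name"])].append(rec)
    assignments = {}
    for key, block in blocks.items():
        for rec, label in zip(block, dbscan(block, jaccard_distance, eps, min_pts)):
            assignments[rec["id"]] = (key, label)
    return assignments


# Toy records (made up for illustration only).
records = [
    {"id": "r1", "name": "John Smith", "features": {"search", "svm", "coauthor_a"}},
    {"id": "r2", "name": "J. Smith", "features": {"svm", "coauthor_a", "dbscan"}},
    {"id": "r3", "name": "John Smith", "features": {"dbscan", "coauthor_a", "clustering"}},
    {"id": "r4", "name": "J. Smith", "features": {"protein", "folding", "biology"}},
    {"id": "r5", "name": "Jane Smith", "features": {"protein", "biology", "genome"}},
    {"id": "r6", "name": "Wei Li", "features": {"optics", "laser"}},
]
assignments = disambiguate(records)
for rec_id in sorted(assignments):
    print(rec_id, assignments[rec_id])
```

In this toy run, r1 and r3 are farther apart than `eps` (Jaccard distance 0.8), yet they land in the same cluster because both are density-connected through r2. That is the sense in which density connectivity stands in for transitivity among core points.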
This dissertation presents a novel cross document coreference approach that leverages entity profiles, which are constructed by information extraction tools and reconciled using a within-document coreference module. We propose to match the profiles using a learned ensemble distance function composed of a suite of similarity specialists. We develop a kernelized soft relational clustering algorithm that uses the learned distance function to partition the entities into fuzzy sets of identities. Evaluation on a large benchmark collection shows that the proposed methods achieve competitive coreference results. We further discuss the implementation details of the CDC and web person search system. This dissertation surveys the literature on author name disambiguation in citations and paper headers, citation matching, and cross document coreference. Additionally, we explore the social networks of the disambiguated authors, performing a comprehensive study of the network-level and community-level characteristics and proposing a stochastic model to predict collaborations between individuals. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]