ERIC Number: ED550324
Record Type: Non-Journal
Publication Date: 2012
Pages: 118
Abstractor: As Provided
Reference Count: N/A
ISBN: 978-1-2677-7376-0
Multi-Filter String Matching and Human-Centric Entity Matching for Information Extraction
Sun, Chong
ProQuest LLC, Ph.D. Dissertation, The University of Wisconsin - Madison
More and more information is being generated in text documents, such as Web pages, emails and blogs. To manage this unstructured information effectively, one widely used approach is to locate relevant content in documents, extract structured information, and integrate the extracted information for querying, mining or further analysis. In this thesis, we consider two ubiquitous problems: approximate string membership checking and entity matching. The approximate string membership checking problem is to find all the strings in the documents that approximately match some string in a given dictionary. A filter-verification based approach is well recognized as a good way to solve this problem. We propose a new string filter, the token distribution filter, and we use both synthetic and real data sets to verify empirically that it performs well. However, we observe that the token distribution filter is not superior to other filters in all cases, and we suspect that no single filter is optimal across all problem instances. Accordingly, we propose to view approximate string membership checking as an optimization problem, and we propose a multi-filter, optimization-based approach that uses all the available string filters to get the best performance. Entity matching is the task of identifying the data records that refer to the same entity. Through entity matching, we can accurately integrate all the information on the same entity, or compare the information about the same entity from different sources. We design a human-centric, two-phase entity matching approach, in which users can iteratively check the data records or the intermediate results, propose rules and apply rules to achieve high accuracy. We also propose techniques to make users more efficient and effective during the entity matching process.
[The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page:]
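For readers unfamiliar with the filter-verification approach named in the abstract, the following is a minimal sketch of the general idea, not the dissertation's method. It assumes edit distance as the similarity measure and uses two simple, well-known filters (a length filter and a character-count filter) in place of the token distribution filter, whose definition the abstract does not give. All function names and the sample data are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def length_filter(candidate: str, entry: str, k: int) -> bool:
    # Strings whose lengths differ by more than k cannot be within distance k.
    return abs(len(candidate) - len(entry)) <= k


def char_count_filter(candidate: str, entry: str, k: int) -> bool:
    # One edit changes the character multiset by at most 2 (a substitution
    # removes one character and adds another), so k edits change it by <= 2k.
    diff = Counter(candidate)
    diff.subtract(entry)
    return sum(abs(v) for v in diff.values()) <= 2 * k


def approximate_matches(candidates, dictionary, k,
                        filters=(length_filter, char_count_filter)):
    """Return (candidate, entry) pairs with edit distance <= k.

    Cheap filters prune pairs first; the expensive verification step
    (edit distance) runs only on pairs that survive every filter.
    """
    matches = []
    for cand in candidates:
        for entry in dictionary:
            if all(f(cand, entry, k) for f in filters):
                if edit_distance(cand, entry) <= k:
                    matches.append((cand, entry))
    return matches


# Hypothetical data: candidate strings taken from a document, and a dictionary.
dictionary = ["madison", "wisconsin"]
candidates = ["madisen", "wisconson", "chicago"]
result = approximate_matches(candidates, dictionary, k=1)
# result == [("madisen", "madison"), ("wisconson", "wisconsin")]
```

Both filters here are safe: they never reject a pair that is truly within distance k, so filtering affects only running time, not the result. In this sketch the filters run in a fixed order; choosing which filters to apply, and in what order, for a given problem instance is the kind of decision the abstract's multi-filter, optimization-based approach addresses.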
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site:
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A