ERIC Number: ED526606
Record Type: Non-Journal
Publication Date: 2009
Pages: 129
Abstractor: As Provided
ISBN: 978-1-1095-8004-4
ISSN: N/A
EISSN: N/A
Challenges in Managing Information Extraction
Shen, Warren H.
ProQuest LLC, Ph.D. Dissertation, University of Illinois at Urbana-Champaign
This dissertation studies information extraction (IE), the problem of extracting structured information from unstructured data. Example IE tasks include extracting person names from news articles, product information from e-commerce Web pages, street addresses from emails, and names of emerging music bands from blogs. IE is an increasingly important problem in a broad range of applications that seek to utilize the growing amount of unstructured data available today. Such applications include structured community Web portals, data integration systems, and data mining applications over text data. However, despite significant progress, managing IE and building end-to-end IE applications still involve many difficult challenges, including writing complex IE programs and optimizing them, deciding how to store and process the large amounts of data the IE applications manage, and executing and obtaining meaningful results from partially specified or approximate IE programs (e.g., during the development process, or in scenarios where an approximate result may already be sufficient). In this dissertation, we develop solutions to the key challenges mentioned above. First, we develop a declarative framework that can help make it easier for developers to write and understand IE programs, and show how to automatically optimize IE programs written in this framework to reduce runtime. Next, given that relational database systems (RDBMSs) were designed to store and process large data sets, we study the benefits and limitations of employing RDBMSs for storing and processing data in IE applications. Finally, we extend our declarative framework to enable "best-effort IE," allowing developers to more easily write and refine approximate IE programs. A key idea underlying these solutions is that many of the principles behind RDBMSs for managing structured data can be extended to IE for managing unstructured data.
[The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A