ERIC Number: ED515826
Record Type: Non-Journal
Publication Date: 2009
Pages: 172
Abstractor: As Provided
Reference Count: 0
ISBN: 978-1-1096-5886-6
ISSN: N/A
TableSeer: Automatic Table Extraction, Search, and Understanding
Liu, Ying
ProQuest LLC, Ph.D. Dissertation, The Pennsylvania State University
Tables are ubiquitous, with a history that predates that of sentential text. Authors often summarize their most important findings in tabular form. For example, scientists widely use tables to present the latest experimental results or statistical data in a condensed fashion. With the explosive growth of digital libraries and the Internet, tables have become a valuable source for information seeking and data analysis. Interest in and use of table data necessitate table indexing and search. However, current search engines do not support table search. The difficulty of automatically extracting tables from untagged documents, the lack of a universal table metadata specification, and the limitations of existing ranking schemes make the table search problem challenging. Searching table data effectively and efficiently has become an urgent demand. In this dissertation, we present an automatic table extraction and search engine, TableSeer. TableSeer crawls the web and digital libraries, detects tables in documents using heuristic-based and machine-learning-based methods, represents tables using an extensive set of medium-independent table metadata that others can reuse, indexes table metadata files, ranks tables, and provides a user-friendly search interface. To improve the performance of table boundary detection, a novel page-box-cutting method and a sparse-line detection method are proposed. Given a keyword-based table search query, TableSeer ranks the matched tables and returns the most relevant ones using a novel table ranking algorithm, TableRank. TableRank tailors the classic vector space model and adopts an innovative term weighting scheme that aggregates multiple features from three levels: the term, table, and document levels. Although tables are widely used, there is no standard for table structure design. Many table design issues that impair the readability, accessibility, and reusability of table data are ignored. To gain a deeper understanding of table characteristics and to improve table extraction and search performance, we also conduct the first large-scale quantitative study of the nature of tables in digital libraries. We demonstrate the value of TableSeer with empirical studies on scientific documents. The experimental results show that our table search engine outperforms existing search engines on table search. Overall, TableSeer eliminates the burden of manually extracting table data from digital libraries and enables users to automatically examine tables. TableSeer is successfully deployed and in current use in several scientific digital libraries, for example CiteSeerX. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
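The abstract does not give the TableRank formula, so the following is only a minimal sketch of the general idea it names: vector-space matching combined with signals from the term, table, and document levels. It is not the dissertation's algorithm. The function names, the `table_score` and `doc_score` fields, the smoothed idf, the level weights, and the choice to combine the levels at the score level (rather than inside each term weight) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Build tf-idf vectors with a smoothed idf; returns (vectors, idf)."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [
        {t: c * idf[t] for t, c in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tables(query, tables, w_term=0.6, w_table=0.3, w_doc=0.1):
    """Rank matching tables by a weighted sum of three hypothetical signals.

    term level:     tf-idf cosine between the query and the table text
    table level:    table["table_score"], an assumed feature in [0, 1]
    document level: table["doc_score"], an assumed feature in [0, 1]
    """
    vectors, idf = tfidf_vectors([tokenize(t["text"]) for t in tables])
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scored = []
    for table, vec in zip(tables, vectors):
        term_score = cosine(q_vec, vec)
        if term_score == 0.0:
            continue  # keep only tables that share at least one query term
        score = (w_term * term_score
                 + w_table * table["table_score"]
                 + w_doc * table["doc_score"])
        scored.append((score, table["id"]))
    scored.sort(reverse=True)
    return [table_id for _, table_id in scored]


# Toy data; the texts and scores are invented for illustration only.
tables = [
    {"id": "t1", "text": "melting point of alloys", "table_score": 0.2, "doc_score": 0.1},
    {"id": "t2", "text": "melting point melting curve", "table_score": 0.9, "doc_score": 0.8},
    {"id": "t3", "text": "protein binding affinity", "table_score": 1.0, "doc_score": 1.0},
]
print(rank_tables("melting point", tables))
```

In this sketch a table with no query term in common is dropped before the table-level and document-level signals are considered, so high-quality but irrelevant tables (such as `t3` above) never appear in the results.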
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A