ERIC Number: ED546592
Record Type: Non-Journal
Publication Date: 2012
Pages: 159
Abstractor: As Provided
ISBN: 978-1-2675-7128-1
ISSN: N/A
EISSN: N/A
A Nugget-Based Test Collection Construction Paradigm
Rajput, Shahzad K.
ProQuest LLC, Ph.D. Dissertation, Northeastern University
The problem of building test collections is central to the development of information retrieval systems such as search engines. The primary use of test collections is the evaluation of IR systems. The widely employed "Cranfield paradigm" dictates that the information relevant to a topic be encoded at the level of documents, thereby requiring effectively complete document relevance assessments. As this is no longer practical for modern corpora, numerous problems arise, including "scalability," "reusability," and "applicability." We propose a new method for relevance assessment based on relevant "information," not relevant "documents." Once the relevant information is collected, any document can be assessed for relevance, and any retrieved list of documents can be assessed for performance. Starting with a few relevant "nuggets" of information manually extracted from existing TREC corpora, we implement and test a method that finds and correctly assesses the vast majority of relevant documents found by TREC assessors, as well as many relevant documents not found by those assessors. We then show how these inferred relevance assessments can be used to perform IR system evaluation. We also demonstrate a highly efficient algorithm for simultaneously obtaining both relevant "documents" and relevant "information." Our technique exploits the mutually reinforcing relationship between relevant documents and relevant information, yielding test collections whose efficiency and efficacy exceed those of typical Cranfield-style collection construction methodologies. Using TREC assessments as feedback, we further demonstrate that relevant nuggets automatically extracted from documents, when used as features for learning-to-rank algorithms, significantly outperform standard learning-to-rank features. Our main contribution is a methodology for producing test collections that are highly accurate, scalable, reusable, and have great potential for future applications. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
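To make the nugget-based idea concrete: once relevant nuggets are in hand, any document can be judged by whether it contains one of those nuggets, so relevance judgments extend to documents no human assessor ever saw. The sketch below is a minimal illustration of that inference step, not the dissertation's actual implementation; the function names (nugget_matches, assess), the word-overlap matcher, and the 0.8 threshold are all hypothetical choices made for this example.

```python
from typing import List, Set

def _terms(text: str) -> Set[str]:
    """Lowercased word set; a stand-in for real tokenization and stemming."""
    return set(text.lower().split())

def nugget_matches(nugget: str, document: str, threshold: float = 0.8) -> bool:
    """Treat a nugget as matched if most of its terms appear in the document.
    The overlap threshold is an illustrative choice, not taken from the thesis."""
    n, d = _terms(nugget), _terms(document)
    return bool(n) and len(n & d) / len(n) >= threshold

def assess(document: str, nuggets: List[str]) -> bool:
    """Infer document relevance: relevant iff it contains at least one
    relevant nugget of information."""
    return any(nugget_matches(n, document) for n in nuggets)

# Usage: score any retrieved ranking without exhaustive human judgments.
nuggets = ["complete document relevance assessments for every topic"]
ranked_list = ["...text of retrieved document one...",
               "complete document relevance assessments for every topic are impractical"]
inferred_qrels = [assess(doc, nuggets) for doc in ranked_list]
precision_at_k = sum(inferred_qrels) / len(inferred_qrels)  # 0.5 here
```

The key property this illustrates is reusability: because relevance lives in the nuggets rather than in a fixed judged document pool, new systems retrieving previously unjudged documents can still be evaluated against the same topic.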
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A