ERIC Number: ED640531
Record Type: Non-Journal
Publication Date: 2023
Pages: 144
Abstractor: As Provided
ISBN: 979-8-3807-3577-3
ISSN: N/A
EISSN: N/A
Design and Data Mining Techniques for Large-Scale Scholarly Digital Libraries and Search Engines
Shaurya Rohatgi
ProQuest LLC, Ph.D. Dissertation, The Pennsylvania State University
The exponential growth of digital libraries and the proliferation of scholarly content in electronic formats have made data mining and information retrieval essential tools for effectively managing, organizing, and disseminating knowledge. This thesis provides a comprehensive analysis of the advancements and challenges in these fields, with a focus on mathematical information retrieval from scholarly documents, figure captioning and classification of scientific images, and searching and re-ranking techniques for large-scale scholarly documents. We also explore the future of scholarly search, considering the potential roles of generative artificial intelligence and scientific question-answering systems in these domains. In the initial section of this thesis, we delve into the complex design and implementation aspects involved in building a large-scale digital library. We discuss various challenges and critical design decisions that were made to ensure the library's long-term sustainability and ease of maintenance. Furthermore, we provide an in-depth analysis of the unique and robust components of CiteSeerX, including its advanced crawling, extraction, ingestion, and production-ready capabilities. Our focus then shifts to the investigation of search and re-ranking techniques specifically tailored for large-scale scholarly documents. We delve into various approaches for indexing, searching, and ranking vast collections of scientific literature, proposing inventive methods for optimizing their performance and scalability. After successfully implementing our system and achieving an impressive index of over 15 million academic papers, we explore the numerous potential applications and opportunities that can arise from this extensive collection of scholarly articles. Following this, we conduct a thorough examination of cutting-edge mathematical information retrieval techniques for extracting and processing mathematical expressions from scholarly documents. 
We present an exhaustive review of existing approaches, shedding light on their strengths and weaknesses, and propose innovative methods that significantly enhance the accuracy and efficiency of mathematical information retrieval systems. Subsequently, we discuss a subset of CiteSeerX data that is focused on Computational Linguistics (CL): the ACL Anthology Corpus. We provide the metadata, full-text, and citation graph for the CL domain. This dataset is then analyzed for deeper insights into the evolving direction of the field and the potential applications that can be developed from it. One such application is addressing the challenges of figure captioning and classification of scientific images. We analyze state-of-the-art methods for extracting and processing image data from scholarly documents and propose a groundbreaking approach that effectively combines advanced image processing techniques with cutting-edge machine learning algorithms for highly accurate and reliable figure captioning and classification. Lastly, we discuss the future of scholarly search and the role of generative AI in scientific question answering. We envision a question-answering system that looks at the relevant literature and formulates an answer for the researcher's information need. To this end, we investigate the potential of large language models and search for enabling such capabilities and outline the challenges and opportunities that lie ahead in this exciting domain. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: National Science Foundation (NSF), Division of Information and Intelligent Systems (IIS); National Science Foundation (NSF)
Authoring Institution: N/A
Grant or Contract Numbers: 1717997; 1823288