Automatic Extraction of Metadata from Scientific Publications for CRIS Systems.

Kovacevic, Aleksandar; Ivanovic, Dragan; Milosavljevic, Branko; Konjovic, Zora; Surla, Dusan

Peer reviewed

ERIC Number: EJ941449

Record Type: Journal

Publication Date: 2011

Pages: 21

Abstractor: As Provided

ISBN: N/A

ISSN: 0033-0337

EISSN: N/A

Program: Electronic Library and Information Systems, v45 n4 p376-396 2011

Purpose: This paper aims to develop a system that automatically extracts metadata from scientific papers in PDF format for CRIS UNS, the information system that monitors scientific research activity at the University of Novi Sad.

Design/methodology/approach: The system is based on machine learning and performs automatic extraction and classification of metadata in eight pre-defined categories. The extraction task is realised as a classification process. For the purpose of classification, each row of text is represented by a vector of features: formatting, position, characteristics of the words, etc. Experiments were performed with standard classification models. Both a single classifier covering all eight categories and eight individual classifiers were tested. Classifiers were evaluated using five-fold cross-validation on a manually annotated corpus of 100 scientific papers in PDF format, collected from various conferences, journals and authors' personal web pages.

Findings: Based on the performance obtained in the classification experiments, eight separate support vector machine (SVM) models, each of which recognises its corresponding category, were chosen. All eight models were found to perform well. The F-measure was over 85 per cent for almost all of the classifiers and over 90 per cent for most of them.

Research limitations/implications: Automatically extracted metadata cannot be entered directly into CRIS UNS; it must first be checked by curators.

Practical implications: The proposed system for automatic metadata extraction using SVM models was integrated into the CRIS UNS software system. Metadata extraction has been tested on the publications of researchers from the Department of Mathematics and Informatics of the Faculty of Sciences in Novi Sad. Analysis of the metadata extracted from these publications showed that the system's performance on previously unseen data is in accordance with that obtained by cross-validation of the eight separate SVM classifiers. This system will help in the process of synchronising metadata from CRIS UNS with other institutional repositories.

Originality/value: The paper documents a fully automated system for metadata extraction from scientific papers. The system is based on the SVM classifier and open source tools, and is capable of extracting eight types of metadata from scientific articles in any format that can be converted to PDF. Although developed as part of CRIS UNS, the proposed system can be integrated into other CRIS systems, as well as institutional repositories and library management systems. (Contains 6 tables and 1 figure.)
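
The abstract describes each row of text as a feature vector (formatting, position, word characteristics) and reports results as an F-measure per category classifier. The following is a minimal Python sketch of those two ideas only, not the authors' implementation. The `TextLine` structure, the seven features chosen, and the function names are all hypothetical, since the abstract does not list the actual features or the eight categories.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """One row of text taken from a PDF page (hypothetical structure)."""
    text: str
    font_size: float   # formatting feature
    is_bold: bool      # formatting feature
    line_index: int    # 0-based position of the row on the page
    total_lines: int   # number of rows on the page


def line_features(line: TextLine) -> list[float]:
    """Map a row of text to a numeric feature vector (illustrative features)."""
    words = line.text.split()
    n_words = len(words)
    capitalized = sum(1 for w in words if w[0].isupper())
    return [
        line.font_size,
        1.0 if line.is_bold else 0.0,
        line.line_index / max(line.total_lines - 1, 1),  # relative position
        float(n_words),
        capitalized / n_words if n_words else 0.0,  # share of capitalized words
        1.0 if any(ch.isdigit() for ch in line.text) else 0.0,
        1.0 if "@" in line.text else 0.0,
    ]


def f_measure(gold: list[bool], predicted: list[bool]) -> float:
    """F-measure (F1) of one binary classifier: category vs. everything else."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Usage: one row becomes one vector; one score is computed per category model.
row = TextLine("Automatic Extraction of Metadata", 18.0, True, 0, 40)
vector = line_features(row)
score = f_measure([True, True, False, False], [True, False, True, False])
```

In a setup like the one the abstract describes, vectors of this kind would be fed to eight separate binary classifiers, and `f_measure` would be computed once per classifier on each held-out fold.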

Descriptors: Scientific Research, Library Administration, Classification, Information Science, Metadata, Information Retrieval, Scientific and Technical Information, Models, Library Automation, Library Development, Library Materials, Cataloging, Database Management Systems, Computer System Design, Program Descriptions, Data Processing, Programming Languages, Programming

Emerald. One Mifflin Place Suite 400, Harvard Square, Cambridge, MA 02138. Tel: 617-576-5782; e-mail: america@emeraldinsight.com; Web site: http://www.emeraldinsight.com

Publication Type: Journal Articles; Reports - Evaluative

Education Level: Higher Education

Audience: N/A

Language: English

Sponsor: N/A

Authoring Institution: N/A

Grant or Contract Numbers: N/A
