A Sequential Method for Automatic Document Classification.

White, Lee J.; And Others

ERIC Number: ED160040

Record Type: Non-Journal

Publication Date: 1975-Aug

Pages: 140

Abstractor: N/A

ISBN: N/A

ISSN: N/A

EISSN: N/A

The major advantage of sequential classification, a technique for automatically classifying documents into previously selected categories, is that the entire document need not be processed before it is classified. This method assumes the availability of a priori categories, a selection of keywords representative of these categories, and the a priori probabilities of the keywords within each category. In practice, these categories and keyword probabilities are constructed from a randomly selected sample set of documents. The performance of the sequential technique has been evaluated by classifying diverse data bases. The sequential technique was compared directly with Williams' discriminant analysis method, and its performance compared very favorably. A series of experiments was also conducted involving a full-text data base and a hierarchical data base of epilepsy abstracts. A technique, called the Bayesian distance criterion, has been developed for detecting those keywords which are "noisy" and adversely affect classification; it is also useful for obtaining multiple classes associated with a document. It is anticipated that the results of this research will find application in the classification and retrieval of library documents and in interactive document retrieval. (Author/CMV)
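
The general idea described in the abstract can be illustrated with a minimal Python sketch. It is not the report's algorithm: the abstract does not state the stopping rule, so the sketch assumes a simple one (stop as soon as one category's posterior probability reaches a threshold) together with a naive independence assumption across keywords. The function names, the `threshold` and `floor` parameters, and the toy categories and probabilities are all hypothetical.

```python
import math


def _normalize(log_scores):
    """Convert per-category log scores into posterior probabilities."""
    peak = max(log_scores.values())
    exp_scores = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}


def sequential_classify(tokens, priors, keyword_probs, threshold=0.95, floor=1e-6):
    """Read tokens in order and stop once one category is confident enough.

    Assumed stopping rule (not taken from the report): stop when the largest
    posterior reaches `threshold`. `floor` is an assumed small probability
    for a keyword that is missing from a category's table.
    Returns (category, posterior, number_of_tokens_read).
    """
    keywords = set().union(*(probs.keys() for probs in keyword_probs.values()))
    log_scores = {c: math.log(p) for c, p in priors.items()}
    n_read = 0
    for token in tokens:
        n_read += 1
        if token not in keywords:
            continue  # only the selected keywords carry evidence
        for category in log_scores:
            p = keyword_probs[category].get(token, floor)
            log_scores[category] += math.log(p)
        posterior = _normalize(log_scores)
        best = max(posterior, key=posterior.get)
        if posterior[best] >= threshold:
            return best, posterior[best], n_read  # classified before the end
    posterior = _normalize(log_scores)
    best = max(posterior, key=posterior.get)
    return best, posterior[best], n_read


# Toy data, invented purely for illustration.
priors = {"medicine": 0.5, "computing": 0.5}
keyword_probs = {
    "medicine": {"seizure": 0.30, "patient": 0.25, "algorithm": 0.02},
    "computing": {"seizure": 0.01, "patient": 0.02, "algorithm": 0.30},
}
document = "the patient had a seizure before the algorithm was run".split()

label, confidence, n_read = sequential_classify(document, priors, keyword_probs)
print(label, round(confidence, 4), f"{n_read} of {len(document)} tokens read")
```

With these toy numbers the decision is reached partway through the document, which is the property the abstract highlights: the remaining text is never examined.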

Descriptors: Algorithms, Automatic Indexing, Bayesian Statistics, Classification, Cluster Grouping, Databases, Documentation, Flow Charts, Mathematical Models, Probability, Sequential Approach, Statistical Analysis

Publication Type: Reports - Research

Education Level: N/A

Audience: N/A

Language: English

Sponsor: National Science Foundation, Washington, DC. Div. of Science Information.

Authoring Institution: Ohio State Univ., Columbus. Computer and Information Science Research Center.

Grant or Contract Numbers: N/A
