ERIC Number: ED160040
Record Type: Non-Journal
Publication Date: 1975-Aug
Reference Count: N/A
A Sequential Method for Automatic Document Classification.
White, Lee J.; And Others
The major advantage of sequential classification, a technique for automatically classifying documents into previously selected categories, is that the entire document need not be processed before it is classified. This method assumes the availability of a priori categories, a selection of keywords representative of these categories, and the a priori probabilities of the keywords within each category. In practice, these categories and keyword probabilities are constructed from a randomly selected document sample set. The performance of the sequential technique has been evaluated by classifying diverse data bases. The sequential technique was compared directly with Williams' discriminant analysis method, and its performance compared very favorably. A series of experiments was also conducted involving a full-text data base and a hierarchical data base of epilepsy abstracts. A technique has been developed for detecting those keywords which are "noisy" and adversely affect classification. This technique, called the Bayesian distance criterion, is also useful for obtaining multiple classes associated with a document. It is anticipated that the results of this research will find application in the classification and retrieval of library documents and in interactive document retrieval. (Author/CMV)
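The sequential idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the report's actual decision rule, so everything below is an assumption made for illustration: a naive Bayes style posterior update per keyword, a fixed posterior threshold as the stopping rule, a small floor probability for keywords unseen in a category, and the hypothetical names `sequential_classify`, `priors`, and `keyword_probs` with made-up probabilities.

```python
import math

def sequential_classify(words, priors, keyword_probs, threshold=0.95, floor=1e-6):
    """Read words in order, updating category posteriors at each keyword.

    Stops as soon as one category's posterior reaches `threshold`
    (an assumed stopping rule, not taken from the report).
    Returns (best_category, posterior_of_best, words_read).
    """
    # Work in log space; normalize priors so they form a distribution.
    prior_total = sum(priors.values())
    log_post = {c: math.log(p / prior_total) for c, p in priors.items()}
    posterior = {c: p / prior_total for c, p in priors.items()}
    # Only words in the keyword vocabulary influence the decision.
    keywords = set().union(*(kp.keys() for kp in keyword_probs.values()))

    read = 0
    for read, word in enumerate(words, start=1):
        if word not in keywords:
            continue
        for c in log_post:
            # `floor` stands in for keywords with no estimate in category c.
            log_post[c] += math.log(keyword_probs[c].get(word, floor))
        # Renormalize to obtain posteriors (subtract the max for stability).
        m = max(log_post.values())
        total = sum(math.exp(v - m) for v in log_post.values())
        posterior = {c: math.exp(v - m) / total for c, v in log_post.items()}
        best = max(posterior, key=posterior.get)
        if posterior[best] >= threshold:
            return best, posterior[best], read  # early decision
    best = max(posterior, key=posterior.get)
    return best, posterior[best], read

# Illustrative (invented) categories and keyword probabilities.
priors = {"medicine": 0.5, "computing": 0.5}
keyword_probs = {
    "medicine": {"seizure": 0.30, "drug": 0.20, "algorithm": 0.01},
    "computing": {"seizure": 0.01, "drug": 0.02, "algorithm": 0.30},
}
words = "the patient had a seizure after the drug was withdrawn and more text follows".split()
label, confidence, words_read = sequential_classify(words, priors, keyword_probs)
```

With these invented numbers the first keyword encountered, "seizure" (the fifth word), already pushes the posterior for "medicine" past the threshold, so the remaining words are never read. This is the advantage the abstract describes: a decision can be reached before the whole document is processed.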
Publication Type: Reports - Research
Education Level: N/A
Sponsor: National Science Foundation, Washington, DC. Div. of Science Information.
Authoring Institution: Ohio State Univ., Columbus. Computer and Information Science Research Center.