ERIC Number: ED027917
Record Type: RIE
Publication Date: 1966-Feb
Reference Count: N/A
Discriminant Analysis for Content Classification.
Williams, John H., Jr.
A series of experiments was performed to investigate the effectiveness and utility of automatically classifying documents through the use of multiple discriminant functions. Classification is accomplished by computing the distance from the mean vector of each category to the vector of observed frequencies of a document and assigning the document to the category having the highest probability. Data concerning the effect of the principal classification parameters on classification performance is reported, based on a data base of approximately 2700 abstracts from the solid state physics field. The parameters studied were the number of sample documents required to define a category, the length of documents, the inter-relationship of the number of sample documents and their lengths, the relation of the number of word types in a document to the number of categories, and performance measures. A higher performance level was obtained when samples of 140 documents were used to define each category than with samples of 35 and 70 documents. Classification results obtained on independent test sets of documents ranged from 73 to 92 per cent. The test sets contained 419 and 1333 documents. Results are also reported in terms of Swets' effectiveness measure and Cleverdon's ratios of relevance, recall and precision. (Author)
Descriptors: Automation, Classification, Content Analysis, Discriminant Analysis, Documentation, Indexing, Information Storage, Statistical Analysis
Available from: Clearinghouse for Federal Scientific and Technical Information, Springfield, Va. 22151 (AD 630 127, MF $0.65, HC $3.00).
Publication Type: N/A
Education Level: N/A
Sponsor: Rome Air Development Center, Griffiss AFB, NY.
Authoring Institution: International Business Machines Corp., Bethesda, MD. Federal Systems Div.
Identifiers: Probabilistic Indexing
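
Illustrative note: the classification rule summarized in the abstract (measure the distance from each category's mean vector to a document's vector of observed word frequencies, then assign the document to the most probable category) can be sketched roughly as follows. This is a minimal sketch and not the report's implementation. It assumes equal prior probabilities for all categories and a pooled, diagonal within-category variance, which is a simplification of a full multiple discriminant analysis. The function names, category labels, and toy frequency vectors are all hypothetical.

```python
import math

def category_means(samples):
    """Mean frequency vector per category; samples maps category -> list of vectors."""
    return {cat: [sum(col) / len(docs) for col in zip(*docs)]
            for cat, docs in samples.items()}

def pooled_variances(samples, means, floor=1e-6):
    """Pooled within-category variance per word type (diagonal assumption)."""
    dims = len(next(iter(means.values())))
    squares = [0.0] * dims
    dof = 0
    for cat, docs in samples.items():
        for doc in docs:
            for j, x in enumerate(doc):
                squares[j] += (x - means[cat][j]) ** 2
        dof += len(docs) - 1
    # The floor avoids division by zero for word types that never vary.
    return [max(s / max(dof, 1), floor) for s in squares]

def classify(doc, means, variances):
    """Return (best category, probabilities), assuming equal priors."""
    # Variance-scaled squared distance from the document to each category mean.
    d2 = {cat: sum((x - m) ** 2 / v for x, m, v in zip(doc, mu, variances))
          for cat, mu in means.items()}
    # Convert distances to normalized probabilities; subtracting the smallest
    # distance first keeps the exponentials numerically stable.
    best = min(d2.values())
    weights = {cat: math.exp(-0.5 * (d - best)) for cat, d in d2.items()}
    total = sum(weights.values())
    probs = {cat: w / total for cat, w in weights.items()}
    return max(probs, key=probs.get), probs

# Hypothetical toy data: three word types, three sample documents per category.
samples = {
    "semiconductors": [[5, 0, 1], [4, 1, 0], [6, 0, 1]],
    "magnetism": [[0, 5, 1], [1, 4, 0], [0, 6, 1]],
}
means = category_means(samples)
variances = pooled_variances(samples, means)
label, probs = classify([5, 1, 1], means, variances)
```

With equal priors, the category at the smallest distance is also the one with the highest probability, which is why the abstract can describe the rule in either terms. The experiments summarized above used far larger samples (35, 70, and 140 documents per category) than this toy data.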