ERIC Number: ED096987
Record Type: Non-Journal
Publication Date: 1974
Reference Count: N/A
A Theory of Term Importance in Automatic Text Analysis.
Salton, G.; And Others
Most existing automatic content analysis and indexing techniques are based on word frequency characteristics applied largely in an ad hoc manner. Contradictory requirements arise in this connection, in that terms exhibiting high occurrence frequencies in individual documents are often useful for high recall performance (to retrieve many relevant items), whereas terms with low frequency in the whole collection are useful for high precision (to reject nonrelevant items). A new technique known as discrimination value analysis ranks the text words in accordance with how well they are able to discriminate the documents of a collection from each other; that is, the value of a term depends on how much the average separation between individual documents changes when the given term is assigned for content identification. The best words are those which achieve the greatest separation. The discrimination value analysis accounts for a number of important phenomena in the content analysis of natural language texts: (a) the role and importance of single words; (b) the role of juxtaposed words (phrases); (c) the role of word groups or classes, as specified in a thesaurus. Effective criteria can be given for assigning each term to one of these three classes, and for constructing optimal indexing vocabularies. (Author)
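The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not give a formula, so everything concrete here is an assumption: separation is measured through cosine similarity, the "density" of a collection is taken as the mean pairwise similarity, and a term's discrimination value is the density without the term minus the density with it. The function names (`cosine`, `density`, `discrimination_values`) and the toy term-frequency data are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two term-frequency vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def density(docs):
    """Mean pairwise similarity; a lower value means the documents are better separated."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discrimination_values(docs, terms):
    """Assumed definition: density with term k removed minus density with it present.

    A positive value means that assigning the term spreads the documents apart.
    """
    base = density(docs)
    values = {}
    for k, term in enumerate(terms):
        reduced = [d[:k] + d[k + 1:] for d in docs]  # drop column k from every document
        values[term] = density(reduced) - base
    return values


# Toy collection (hypothetical data): rows are documents, columns follow `terms`.
terms = ["the", "retrieval", "thesaurus"]
docs = [
    [5, 3, 0],
    [4, 0, 2],
    [6, 1, 0],
    [5, 0, 3],
]

dv = discrimination_values(docs, terms)
for term in sorted(dv, key=dv.get, reverse=True):
    print(f"{term:10s} {dv[term]:+.3f}")
```

Under these assumptions, a word that occurs heavily in every document ("the") receives a negative value, because assigning it pulls the documents together, while the rarer content words receive positive values.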
Publication Type: Reports - Research
Education Level: N/A
Authoring Institution: Cornell Univ., Ithaca, NY. Dept. of Computer Science.