ERIC Number: ED160041
Record Type: Non-Journal
Publication Date: 1975-Aug
Reference Count: N/A
A Distance Measure for Automatic Sequential Document Classification.
Kar, B. Gautam; White, Lee J.
The feasibility of using a distance measure, called the Bayesian distance, for automatic sequential document classification was studied. Results indicate that the occurrence of noisy keywords may be detected by observing how this distance measure varies as keywords are extracted sequentially from a document. This property has been used to design a sequential classification algorithm that works in two phases. In the first phase, keywords extracted from a document are partitioned into two groups: the good keyword group and the noisy keyword group. In the second phase, these two groups are analyzed separately to assign primary and secondary classes to the document. The algorithm has been applied to the SPIN data base, and very encouraging results have been obtained. Appendices include descriptions and mathematical models of (1) Bayesian distance and classification error, (2) Bayesian distance and alpha-j values, (3) Bayesian distance and keyword vectors, and (4) the classification algorithm. (Author/CMV)
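The two-phase procedure described in the abstract can be illustrated with a minimal sketch. It is not the report's algorithm, whose details are in the appendices and are not reproduced in this record. The sketch rests on several assumptions: the Bayesian distance is taken to be the sum of squared class posterior probabilities (a definition common in the pattern recognition literature), posteriors are updated with a naive Bayes rule, a keyword is treated as noisy when it lowers the distance, and the secondary class comes from the noisy group alone. All function names, the likelihood table, and the keywords are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def bayesian_distance(posterior: List[float]) -> float:
    """Assumed definition: sum of squared class posterior probabilities."""
    return sum(p * p for p in posterior)


def update_posterior(posterior: List[float], likelihoods: List[float]) -> List[float]:
    """Naive Bayes update: multiply by P(keyword | class) and renormalize."""
    unnormalized = [p * l for p, l in zip(posterior, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        # The keyword carries no usable evidence; leave the posterior unchanged.
        return list(posterior)
    return [u / total for u in unnormalized]


def partition_keywords(
    keywords: List[str],
    likelihood_table: Dict[str, List[float]],
    prior: List[float],
) -> Tuple[List[str], List[str], List[float]]:
    """Phase 1 (assumed rule): a keyword that lowers the distance is noisy."""
    posterior = list(prior)
    distance = bayesian_distance(posterior)
    good: List[str] = []
    noisy: List[str] = []
    for keyword in keywords:
        candidate = update_posterior(posterior, likelihood_table[keyword])
        candidate_distance = bayesian_distance(candidate)
        if candidate_distance >= distance:
            good.append(keyword)
            posterior, distance = candidate, candidate_distance
        else:
            noisy.append(keyword)  # rejected: the posterior is not updated
    return good, noisy, posterior


def assign_classes(
    keywords: List[str],
    likelihood_table: Dict[str, List[float]],
    prior: List[float],
) -> Tuple[int, Optional[int]]:
    """Phase 2 (assumed rule): primary from good group, secondary from noisy group."""
    good, noisy, good_posterior = partition_keywords(keywords, likelihood_table, prior)
    primary = max(range(len(prior)), key=lambda c: good_posterior[c])
    if not noisy:
        return primary, None
    noisy_posterior = list(prior)
    for keyword in noisy:
        noisy_posterior = update_posterior(noisy_posterior, likelihood_table[keyword])
    secondary = max(range(len(prior)), key=lambda c: noisy_posterior[c])
    return primary, secondary


# Hypothetical two-class example: P(keyword | class 0), P(keyword | class 1).
example_table = {
    "laser": [0.8, 0.2],
    "optics": [0.7, 0.3],
    "plasma": [0.1, 0.9],
}
example_result = assign_classes(["laser", "optics", "plasma"], example_table, [0.5, 0.5])
```

In this toy example the first two keywords sharpen the posterior toward class 0, so the distance rises; the third pulls the posterior back toward uniform, the distance falls, and the keyword is set aside for the secondary-class analysis.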
Publication Type: Reports - Research
Education Level: N/A
Sponsor: National Science Foundation, Washington, DC. Div. of Science Information.
Authoring Institution: Ohio State Univ., Columbus. Computer and Information Science Research Center.