Extracting Useful Semantic Information from Large Scale Corpora of Text.

Mendoza, Ray Padilla, Jr.

Extracting and representing semantic information from large-scale corpora is at the crux of computer-assisted knowledge generation. The semantic information obtained depends on the collocation extraction method, the mathematical model used to represent distributional information, and the weighting functions that transform the space. This dissertation provides a solution to the problem of extracting useful collocations, improves the standard vector space model, and posits non-frequency-based transformations on the vector space. First, several collocation extraction methods exist, based either on linear proximity or on syntactic structure. Syntactic structure can be generated by parsers or provided by treebanks. However, using syntactic structure is computationally expensive and does not scale well to large corpora. Two algorithms are proposed that approximate the collocations extracted by parser-based methods. They are computationally inexpensive, exclude semantically irrelevant collocations, produce collocations that are more statistically significant than linear proximity-based collocations, and produce tighter, better-separated clusters. Second, the problem of embedding collocations into a useful mathematical model is commonly addressed with the Vector Space Model (Salton, 1975). However, this model implicitly assumes an orthonormal basis, which contradicts the reality that the words associated with the basis dimensions can be related to one another. A general solution is provided that partially relaxes the assumption of orthonormality. The generalized vector space shows improved separation for known semantic categories. Lastly, weighting functions used on vector spaces are generally frequency based. Such weighting is necessary because relationships between points in the vector space do not immediately reflect their distributional relatedness, though they should (Harris, 1954); the discrepancy is due to frequency effects in language use (Zipf, 1932).
The correlation between word frequency and relevant features for specific semantic categories (Overschelde, 2004) is explored. This dissertation proposes a weighting function based on a quantitative measure of confidence that the asymptotic limit of the collected distribution has been reached. When combined with frequency-based weighting, cluster separation improves for known semantic classes. In summary, this work provides an efficient and accurate collocation extraction method, generalizes the vector space model, and offers alternative non-frequency-based weighting functions for vector space transformation. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
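
The abstract does not say how the orthonormality assumption is relaxed, so the following is only an illustrative sketch of the general idea, not the dissertation's method. It assumes that relatedness between basis dimensions can be encoded in a symmetric positive-definite Gram matrix G, so that the inner product becomes x^T G y instead of the plain dot product. The function name, the three example dimensions, and the value 0.8 are all hypothetical.

```python
import math

def generalized_cosine(x, y, gram):
    """Cosine similarity under the inner product <x, y> = x^T G y.

    With gram equal to the identity matrix this reduces to the ordinary
    cosine of the standard vector space model (orthonormal basis).
    """
    def inner(a, b):
        return sum(a[i] * gram[i][j] * b[j]
                   for i in range(len(a))
                   for j in range(len(b)))

    denom = math.sqrt(inner(x, x) * inner(y, y))
    return inner(x, y) / denom if denom else 0.0

# Hypothetical 3-dimensional space whose dimensions stand for the
# context words "dog", "puppy", and "car" (illustrative only).
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Assumed relatedness of 0.8 between the "dog" and "puppy" dimensions.
related = [[1.0, 0.8, 0.0],
           [0.8, 1.0, 0.0],
           [0.0, 0.0, 1.0]]

x = [1.0, 0.0, 0.0]  # a target word seen only in the context of "dog"
y = [0.0, 1.0, 0.0]  # a target word seen only in the context of "puppy"

standard = generalized_cosine(x, y, identity)   # orthonormal basis
generalized = generalized_cosine(x, y, related)  # relaxed basis
```

Under the orthonormal basis the two target words have zero similarity because they share no context dimension. Under the assumed Gram matrix their similarity equals the relatedness assigned to the two dimensions, which is the kind of effect a relaxed basis is meant to capture.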