Linguistic Extensions of Topic Models.

Boyd-Graber, Jordan

Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling has been fruitfully applied to problems in social science, biology, and computer vision, it has been most widely used to model datasets where documents are modeled as exchangeable groups of words. In this context, topic models discover topics, distributions over words that express a coherent theme like "business" or "politics." While one of the strengths of topic models is that they make few assumptions about the underlying data, such a general approach sometimes limits the type of problems topic models can solve. When we restrict our focus to natural language datasets, we can use insights from linguistics to create models that understand and discover richer language patterns. In this thesis, we extend LDA in three different ways: adding knowledge of word meaning, modeling multiple languages, and incorporating local syntactic context. These extensions apply topic models to new problems, such as discovering the meaning of ambiguous words, extend topic models to new datasets, such as unaligned multilingual corpora, and combine topic models with other sources of information about documents' context. In Chapter 2, we present latent Dirichlet allocation with WordNet (LDAWN), an unsupervised probabilistic topic model that includes word sense as a hidden variable. LDAWN replaces the multinomial topics of LDA with Abney and Light's distribution over meanings. Thus, posterior inference in this model discovers not only the topical domain of each token, as in LDA, but also the meaning associated with each token. We show that considering more topics improves word sense disambiguation. LDAWN allows us to separate the representation of meaning from how that meaning is expressed as word forms.
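As a point of reference for the extensions described in this thesis, the generative story of standard LDA (documents as exchangeable groups of words, topics as distributions over words) can be sketched as follows. This is only an illustrative sketch: the two toy topics, the Dirichlet parameter, and the function names sample_dirichlet and generate_document are assumptions made for the example and do not come from the thesis.

```python
import random

def sample_dirichlet(alpha, rng):
    # Draw from a Dirichlet distribution by normalizing independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative story.

    topics: list of dicts mapping word -> probability (toy topics, assumed).
    alpha:  Dirichlet parameter for the per-document topic proportions.
    """
    # Per-document topic proportions.
    theta = sample_dirichlet(alpha, rng)
    tokens = []
    for _ in range(length):
        # Choose a topic for this token from the document's proportions.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose a word from the selected topic's distribution over words.
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        tokens.append((z, w))
    return theta, tokens

rng = random.Random(0)
# Hypothetical "business" and "politics" topics.
topics = [
    {"market": 0.5, "stock": 0.4, "vote": 0.1},
    {"vote": 0.5, "senate": 0.4, "market": 0.1},
]
theta, doc = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```

Posterior inference runs this story in reverse: given only the words, it recovers the hidden topic assignments and topic distributions. LDAWN, as summarized above, additionally treats word sense as a hidden variable.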
In Chapter 3, we extend LDAWN to allow meanings to be expressed using different word forms in different languages. In addition to the disambiguation provided by LDAWN, this offers a new method of using topic models on corpora with multiple languages. In Chapter 4, we relax the assumptions of multilingual LDAWN. We present the multilingual topic model for unaligned text (MuTo). Like multilingual LDAWN, it is a probabilistic model of text that is designed to analyze corpora composed of documents in multiple languages. Unlike multilingual LDAWN, which requires the correspondence between languages to be painstakingly annotated, MuTo uses stochastic EM to discover a matching between the languages while simultaneously learning multilingual topics. We demonstrate that MuTo allows the meaning of similar documents to be recovered across languages. In Chapter 5, we address a recurring problem that hindered the performance of the models presented in the previous chapters: the lack of a local context. We develop the syntactic topic model (STM), a non-parametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, combining the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents.
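To make concrete the idea of a per-word distribution that combines document-specific topic weights with parse-tree-specific transitions, here is a minimal sketch. The abstract does not state the exact form of the combination; this example assumes an elementwise product followed by renormalization, and the names combine, theta_d, and pi_parent, along with the numbers, are hypothetical.

```python
def combine(theta_d, pi_parent):
    # Assumed combination rule: multiply the document's topic weights by the
    # transition weights given the parent's topic, then renormalize.
    product = [t * p for t, p in zip(theta_d, pi_parent)]
    total = sum(product)
    return [x / total for x in product]

# Hypothetical document-level topic weights (thematic constraint).
theta_d = [0.7, 0.2, 0.1]
# Hypothetical transition weights from the parent node's topic in the
# parse tree (syntactic constraint).
pi_parent = [0.1, 0.1, 0.8]

weights = combine(theta_d, pi_parent)
print(weights)
```

Under this assumed rule, a topic receives high weight only if it is plausible for both the document and the syntactic position, which is one way to read "both thematically and syntactically constrained."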
In Chapter 6, we conclude with a discussion of how the models presented in this thesis can be applied in real-world applications such as sentiment analysis and how the models can be extended to capture even richer linguistic information from text. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]