ERIC Number: ED563236
Record Type: Non-Journal
Publication Date: 2013
Pages: 135
Abstractor: As Provided
ISBN: 978-1-3034-9435-2
ISSN: N/A
EISSN: N/A
Rank and Sparsity in Language Processing
Hutchinson, Brian
ProQuest LLC, Ph.D. Dissertation, University of Washington
Language modeling is one of many problems in language processing that have to grapple with naturally high ambient dimensions. Even in large datasets, the number of unseen sequences is overwhelmingly larger than the number of observed ones, posing clear challenges for estimation. Although existing methods for building smooth language models tend to work well in general, they make assumptions that are not well suited to training with limited data. This thesis introduces a new approach to language modeling that makes different assumptions about how best to smooth the distributions, aimed at better handling the limited data scenario. Among these, it assumes that some words and word sequences behave similarly to others, and that these similarities can be learned by parameterizing a model with matrices or tensors and controlling the matrix or tensor rank. This thesis also demonstrates that sparsity acts as a complement to the low rank parameters: a low rank component learns the regularities that exist in language, while a sparse one captures the exceptional sequence phenomena. The sparse component not only improves the quality of the model, but the exceptions it identifies prove to be meaningful for other language processing tasks, making the models useful not only for computing probabilities but also as tools for the analysis of language. Three new language models are introduced in this thesis. The first uses a factored low rank tensor to encode joint probabilities. It can be interpreted as a "mixture of unigrams" model and is evaluated on an English genre-adaptation task. The second is an exponential model parameterized by two matrices: one sparse and one low rank. This "Sparse Plus Low Rank Language Model" (SLR-LM) is evaluated on data from six languages, showing consistent gains over the standard baseline. Its ability to exploit features of words is used to incorporate morphological information in a Turkish language modeling experiment, yielding some improvements over a word-only model. Lastly, the model is used to discover words in an unsupervised fashion from sub-word-segmented data, showing good performance in finding dictionary words. The third model extends the SLR-LM to capture diverse and overlapping influences on text (e.g., topic, genre, speaker) using additive sparse matrices. This "Multi-Factor SLR-LM" is evaluated on three corpora with different factoring structures, showing improvements in perplexity and the ability to find high-quality factor-dependent keywords. Finally, models and training algorithms are presented that extend the low rank ideas of the thesis to sequence tagging and acoustic modeling. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
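As a minimal sketch of the sparse plus low rank idea described in the abstract (using assumed notation, not the dissertation's exact formulation): an exponential (log-linear) language model can be parameterized as P(w | h) ∝ exp( f(h)^T (L + S) g(w) ), with rank(L) kept small and S encouraged to be sparse, where f(h) and g(w) are feature vectors for the history and the predicted word. The low rank matrix L ties parameters across words and histories that behave similarly, while the sparse matrix S adds corrections for the exceptional sequences that the shared low rank structure cannot explain.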
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A