An NLP Framework for Non-Topical Text Analysis in Urdu--A Resource Poor Language.

Mukund, Smruthi

Language plays a very important role in understanding the culture and mindset of people. Given the abundance of electronic multilingual data, it is interesting to see what insight can be gained by automatic analysis of text. This calls for text analysis focused on non-topical information, such as the emotions being expressed, in contrast to topical text analysis, which is designed to elicit factual information or classify documents into subject categories. Non-topical tasks such as sentiment analysis or emotion detection depend on identifying several useful linguistic cues or indicators and go beyond the bag-of-words model. Performing such tasks is additionally challenging when the text is written in a language such as Urdu, owing to (i) the paucity of annotated Urdu data and (ii) the lack of natural language processing tools to preprocess text and extract useful features. The tasks of interest in Urdu NLP include analyzing data sources such as blogs and comments on news articles, which in turn provide insight into social and human behavior. All of this requires a robust NLP system. The first objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. Novel techniques based on bootstrap learning and resource sharing are developed to augment the available annotated Urdu data needed to train the learning models. A unique Urdu-to-English named-entity transliteration method based on phoneme alignments is also provided to enable faceted search using entities keyed in Latin script. Each of the new Urdu text processing modules is further integrated into a general text-mining platform for future ease of use. The second objective of this work is to detect emotions in Urdu newswire data.
In the process, interesting socio-cultural aspects of language usage are exposed, such as the marked use of formal Arabic words when expressing intense emotions and the correlation between gender and the emotion being expressed. To facilitate such discoveries, we provide an annotated Urdu newswire corpus for emotion detection, built using newly developed language-specific non-topical annotation guidelines. Language-specific features, resources borrowed from other languages, and co-training techniques are leveraged to generate the modules needed to quantify subjective cues. Novel methods that identify opinion entities, the intensity of the opinions, and the contexts in which the opinions are expressed are also illustrated. Our analyses provide valuable insights into how language usage frames the reporting of news and thereby influences readers. The work is not limited to Urdu newswire data: novel techniques to generate part-of-speech information and sentiment polarity in blog data exhibiting code-mixing and code-switching behavior are also illustrated. The work reported here advances the state of the art in both Urdu NLP and non-topical analysis; much of the newly developed framework can be extended to other Indic languages as well. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]