ERIC Number: ED575776
Record Type: Non-Journal
Publication Date: 2016
Pages: 126
Abstractor: As Provided
ISBN: 978-1-3696-3848-6
ISSN: N/A
EISSN: N/A
Morphosyntactic Neural Analysis for Generalized Lexical Normalization
Leeman-Munk, Samuel Paul
ProQuest LLC, Ph.D. Dissertation, North Carolina State University
The phenomenal growth of social media, web forums, and online reviews has spurred a growing interest in automated analysis of user-generated text. At the same time, a proliferation of voice recordings and efforts to archive cultural heritage documents are fueling demand for effective automatic speech recognition (ASR) and optical character recognition (OCR). These sources of text all have two qualities in common: they are high in volume, and they frequently diverge from standard language in their surface forms, making them difficult to analyze using conventional methods. To address these challenges, we must either make our analysis methods robust to noisy text or design a technique that converts such text into a predetermined standard form, i.e., "normalizes" it. This document introduces an instance of the latter approach. Many techniques have been proposed to normalize ASR, OCR, and Twitter data, but they have always been treated as separate tasks despite having much in common. To our knowledge, the work presented here is the first to unite these tasks under the single umbrella task of generalized lexical normalization and to develop an approach to this task based on deep learning. We introduce two architectures for this purpose. The first uses a simple feed-forward neural network to perform Twitter normalization. This approach is context-insensitive and achieved third place in the Lexical Normalization of English Tweets Challenge conducted with the ACL Workshop on Noisy User Text at the 2015 Annual Meeting of the Association for Computational Linguistics. Our second architecture extends the first: drawing on concepts from neural machine translation, it adds a gated bidirectional recurrent neural network so that both the context in which a word appears and the characters of the word itself inform normalization of Twitter and other sources of noisy text.
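The gated bidirectional recurrent encoder described above can be illustrated with a minimal sketch. The code below is not the dissertation's model; it is a toy, randomly initialized GRU run forward and backward over the characters of a noisy token (here the hypothetical example "tmrw"), concatenating the two hidden states at each position as a bidirectional encoder would. Dimensions, vocabulary, and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal gated recurrent unit (illustrative; untrained random weights)."""
    def __init__(self, in_dim, hid_dim, rng):
        shape = (hid_dim, in_dim + hid_dim)
        self.Wz = rng.normal(0, 0.1, shape)  # update-gate weights
        self.Wr = rng.normal(0, 0.1, shape)  # reset-gate weights
        self.Wh = rng.normal(0, 0.1, shape)  # candidate-state weights

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                 # update gate
        r = sigmoid(self.Wr @ xh)                 # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde          # gated interpolation

def encode_bidirectional(chars, vocab, emb_dim=8, hid_dim=16, seed=0):
    """Run a forward and a backward GRU over a character sequence and
    concatenate their states position-wise, as a bidirectional encoder does."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(0, 0.1, (len(vocab), emb_dim))   # character embeddings
    fwd, bwd = GRUCell(emb_dim, hid_dim, rng), GRUCell(emb_dim, hid_dim, rng)
    xs = [emb[vocab[c]] for c in chars]

    h, fwd_states = np.zeros(hid_dim), []
    for x in xs:                                      # left-to-right pass
        h = fwd.step(x, h)
        fwd_states.append(h)

    h, bwd_states = np.zeros(hid_dim), []
    for x in reversed(xs):                            # right-to-left pass
        h = bwd.step(x, h)
        bwd_states.append(h)
    bwd_states.reverse()

    return np.stack([np.concatenate([f, b])
                     for f, b in zip(fwd_states, bwd_states)])

vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
states = encode_bidirectional("tmrw", vocab)  # one 32-dim state per character
```

In a full normalization model, these per-character states would feed a decoder that emits the standard-form spelling; here the sketch stops at the encoder to show the bidirectional gating mechanism itself.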
We evaluate this second architecture on optical character recognition post-processing, automatic speech recognition post-processing, and Twitter text normalization. In comparison with specialized tools for OCR post-processing and Twitter normalization, we find that our model performs comparably on each of these tasks to the competing model specialized for it and significantly outperforms the model specialized for the other task. This indicates our model's ability to learn to normalize different types of noise from data, and suggests that it could similarly learn to be effective on other, unseen types of noise without the need for expensive feature engineering. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by Telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A