ERIC Number: ED529745
Record Type: Non-Journal
Publication Date: 2010
Pages: 258
Abstractor: As Provided
ISBN: 978-1-1246-3955-0
ISSN: N/A
EISSN: N/A
Prosody Production and Perception with Conversational Speech
Mo, Yoonsook
ProQuest LLC, Ph.D. Dissertation, University of Illinois at Urbana-Champaign
Speech utterances are more than the linear concatenation of individual phonemes or words. They are organized by prosodic structures comprising phonological units of different sizes (e.g., syllable, foot, word, and phrase) and the prominence relations among them. As the linguistic structure of spoken languages, prosody serves an important function in speech communication: prosodic phrasing groups words into pragmatically and semantically coherent chunks, and prosodic prominence encodes the discourse-level status and rhythmic structure of a word within a phrase. In speech communication, speakers shape spoken language through the modulation of multiple acoustic parameters related to tempo, pitch, loudness, vocal effort, and strength of articulation in order to signal prosodic structures. Prosody is therefore a major source of phonetic variation in speech; in particular, elements at the edges of prosodic units and elements assigned prominence are phonetically distinct from similar elements in other prosodic contexts. A listener, in turn, must attend to this phonetic variation, and more specifically to acoustic variation, in order to reconstruct the prosodic context and to understand the meaning of an utterance as intended by the speaker. This thesis concerns the communication of prosody in everyday speech, with a primary focus on acoustic variation arising from prosodic context and its interaction with other factors, including syntactic, semantic, and pragmatic structure and word predictability. More specifically, the goal of the thesis is to understand prosody in terms of the mechanisms of speech production, to identify the cues that guide listeners' interpretation of prosodic structure, and to establish statistical models of the acoustic encoding of prosody in everyday conversation.
This thesis introduces a new method of prosody annotation, "Rapid Prosody Transcription (RPT)", which provides reliable and consistent prosody annotations comparable to those of highly trained expert listeners and better approximates prosody perception in everyday speech communication. In RPT, prosody annotations are obtained through a real-time transcription task performed by a large group of "ordinary" listeners (untrained and non-expert, and thus naive with respect to the phonetics and phonology of prosody annotation) on the basis of auditory impression alone. Using sets of prosodically annotated speech excerpts extracted via RPT from the Buckeye Corpus of spontaneous conversational American English, the rest of this thesis reports findings on prosody production and perception in everyday speech communication. Using statistical methods including non-parametric Spearman correlation and multiple linear regression analysis, this thesis demonstrates that, rather than relying on an invariant set of acoustic parameters, prosodic prominence is signaled through a combination of multiple acoustic parameters from which each speaker may choose any subset, whereas prosodic boundaries are cued by a single acoustic parameter relating to speech tempo, suggesting that the production mechanisms of prosodic prominence are underlyingly different from those of boundary production. This difference in the acoustic encoding of prosodic features is further evidenced in the temporal structure of the subsyllabic components of monosyllabic CVC words. Evaluating the roles of speakers and listeners in the communication of prosody, the thesis reveals speaker-dependent variability in the acoustic encoding of prosody and listener-dependent variability in its decoding.
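The analysis pipeline described above can be illustrated with a minimal sketch: per-word acoustic cues are correlated (non-parametrically) with a crowd-sourced prominence score, then entered jointly into a multiple linear regression. All data and cue names below are synthetic placeholders, not the dissertation's actual measurements.

```python
import numpy as np

def rankdata(x):
    # Simple ranking (tie handling is unnecessary for continuous synthetic data).
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks

def spearman(a, b):
    # Spearman's rho = Pearson correlation of the ranks.
    return np.corrcoef(rankdata(a), rankdata(b))[0, 1]

rng = np.random.default_rng(0)
n_words = 200

# Hypothetical per-word acoustic cues (z-scored): duration, f0, intensity.
cues = {name: rng.normal(size=n_words) for name in ("duration", "f0", "intensity")}

# Hypothetical RPT-style prominence score: the fraction of naive listeners
# who marked each word as prominent, simulated as a noisy cue combination.
p_score = (0.3 * cues["duration"] + 0.2 * cues["f0"]
           + 0.1 * cues["intensity"] + rng.normal(scale=0.5, size=n_words))

# Non-parametric correlation of each cue with the prominence score.
rhos = {name: spearman(cue, p_score) for name, cue in cues.items()}
for name, rho in rhos.items():
    print(f"{name}: rho = {rho:.2f}")

# Multiple linear regression: how well the cues jointly predict the scores.
X = np.column_stack([np.ones(n_words)] + list(cues.values()))
beta, *_ = np.linalg.lstsq(X, p_score, rcond=None)
resid = p_score - X @ beta
r2 = 1.0 - resid.var() / p_score.var()
print(f"R^2 = {r2:.2f}")
```

The same two-step pattern (cue-by-cue correlation, then a joint regression) extends naturally to per-speaker models, which is how speaker-dependent cue selection can be compared.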
These findings suggest that, given the multiplicity of acoustic parameters, speakers may select any subset of them to signal prosodic structure, and that, depending on the nature of the acoustic parameters, listeners attend to acoustic variation in particular forms (raw vs. normalized) and within particular comparison domains (syntagmatic vs. paradigmatic) in order to correctly interpret prosodic structure. Confirming that acoustic variation in the speech signal guides a listener to perceive prosodic structure as produced by a speaker, this work further shows that other factors (syntactic and semantic expectation, and word predictability in discourse and in the language) interact with acoustic variation in prosody perception. This research contributes both to large-scale prosody research, by introducing a new method for prosody annotation, and to our understanding of the communication of prosody in everyday speech, by highlighting variation in the acoustic encoding of prosody that depends on prosodic features as well as on speaker identity, and the nature of prosody as an interface phenomenon relating factors including phonology, syntax, discourse structure, and lexical entropy. Taking into account speaker-dependent variability in the implementation of prosody and the large role listeners play in normalizing such variability, this thesis proposes models that best capture the acoustic encoding of prosody in everyday speech communication. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A