Prosody and Landmarks in Automatic Speech Recognition

This project is a collaboration between faculty in the University of Illinois Departments of ECE, Linguistics, and Computer Science. The research described on this page comprises three separate projects spanning a ten-year period, funded by NSF grants 0703624, 0414117, and 0132900.

Prosody (προσῳδία) is the music of speech: its phrasing and prominence.

  • Phrasing is the way in which syllables are chunked, either consciously (for communicative effect) or epiphenomenally (because short-term memory only allows us to plan a limited number of syllables in advance). Phrasing is communicated primarily by lengthening phonemes in the rhyme of the phrase-final syllable.
  • Prominence is the emphasis placed upon particular syllables, either consciously (for communicative effect) or epiphenomenally (because prominence on certain syllables helps to convey phrase structure). Prominence is communicated by increased duration and energy, both of every articulator movement in the prominent syllable and of the acoustic signal itself.
  • Both phrasing and prominence may be signalled by pitch movements (the "singing" of natural speech), but these pitch movements seem to be very much under the control of the speaker: it is possible to communicate prosody with or without them.
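
As a rough illustration of the lengthening cue for phrasing described above, the sketch below scores how much the phones in a syllable rhyme are lengthened relative to their typical durations. It is not the project's model. The function names, the z-score normalization, the per-phone statistics, and the threshold of 1.0 are all assumptions made for the example, and the durations are invented numbers.

```python
def normalized_durations(phones, stats):
    """Z-score each phone duration against per-phone (mean, std) statistics.

    phones: list of (label, duration_in_seconds)
    stats:  dict mapping label -> (mean_seconds, std_seconds); hypothetical values
    """
    return [(dur - stats[label][0]) / stats[label][1] for label, dur in phones]


def rhyme_lengthening(rhyme_phones, stats):
    """Average normalized duration over the phones of one syllable rhyme."""
    z = normalized_durations(rhyme_phones, stats)
    return sum(z) / len(z)


def is_phrase_final_candidate(rhyme_phones, stats, threshold=1.0):
    """Flag a rhyme whose mean lengthening exceeds an (arbitrary) threshold."""
    return rhyme_lengthening(rhyme_phones, stats) > threshold


# Illustrative, made-up duration statistics for two phones.
example_stats = {"ae": (0.10, 0.02), "t": (0.06, 0.02)}

# The same rhyme spoken with typical durations and with lengthened durations.
typical_rhyme = [("ae", 0.10), ("t", 0.06)]
lengthened_rhyme = [("ae", 0.14), ("t", 0.10)]

typical_flag = is_phrase_final_candidate(typical_rhyme, example_stats)
lengthened_flag = is_phrase_final_candidate(lengthened_rhyme, example_stats)
```

A real system would estimate the duration statistics from data and would combine this cue with others, since speaking rate and phone identity also affect duration.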

Landmarks are salient instantaneous acoustic events. Landmarks carry information: if the listener can decode the landmarks, then the listener can understand the signal. The syllable structure of speech is conveyed by the Stevens landmarks: consonant releases, consonant closures, syllable nucleus peaks, and intersyllabic dips. The acoustic events that occur in and around a landmark are shaped by the articulator movements that produced it. The articulator movements, in turn, are controlled by a master gestural score governed by (1) the word being spoken, and (2) the prosody with which it is spoken.
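
To make the idea of syllable nucleus peaks and intersyllabic dips concrete, here is a minimal sketch that marks local maxima and minima of a short-time energy contour as candidate landmarks. This is a simple stand-in, not the landmark detector used in this research: the function names and frame parameters are assumptions, and a practical detector would smooth the contour and also look for consonant closures and releases.

```python
import math


def short_time_energy(samples, frame_len, hop):
    """Return the log energy (dB) of each full frame of `samples`."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on silent frames.
        energies.append(10.0 * math.log10(power + 1e-10))
    return energies


def find_peaks_and_dips(energy):
    """Label strict local maxima as 'peak' and strict local minima as 'dip'.

    Returns a list of (frame_index, label) pairs in time order.
    """
    landmarks = []
    for i in range(1, len(energy) - 1):
        if energy[i] > energy[i - 1] and energy[i] > energy[i + 1]:
            landmarks.append((i, "peak"))
        elif energy[i] < energy[i - 1] and energy[i] < energy[i + 1]:
            landmarks.append((i, "dip"))
    return landmarks


# A hand-written energy contour with two bumps (two "syllables").
example_contour = [0, 3, 5, 2, 1, 4, 6, 3]
example_landmarks = find_peaks_and_dips(example_contour)
```

On real speech the raw contour is noisy, so strict local extrema would over-generate; smoothing or a minimum peak-to-dip depth would be a natural refinement.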

The goal of this research is to implement statistical models of phrasing and prominence, of the articulatory plans and actions that implement them, and of the acoustic landmarks that communicate them, in order to improve the accuracy of automatic speech recognition.