Statistical Speech Technology Group
The Statistical Speech Technology group conducts fundamental and applied research on the automatic recognition of speech and audio events.
Inventions of this group include the technique of landmark-based audio understanding: a method in which binary detectors, observing heterogeneous acoustic features, are each trained to detect a relatively rare event. Each detector outputs a measure of detector surprise, and the surprises of all detectors are then evaluated by a Bayesian fusion model in order to compute a minimum-expected-error transcription of the acoustic signal.
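The sketch below, in Python with NumPy, is a deliberately simplified illustration of that pipeline: the detector posteriors are random placeholders, surprise is computed as a log posterior, and the fusion step is a toy frame-wise combination of independent detector outputs rather than any of the sequence models listed further down this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 binary landmark detectors observing 20 frames of
# audio. Each detector outputs, per frame, the posterior probability that
# its landmark is present, given its own heterogeneous acoustic features.
n_detectors, n_frames = 3, 20
detector_posteriors = rng.uniform(0.01, 0.99, size=(n_detectors, n_frames))

# Surprise of detector k at frame t: the log posterior probability of the
# detected event given the observation, log p(Y|X,A).
log_p = np.log(detector_posteriors)        # log p(event k present | X)
log_q = np.log1p(-detector_posteriors)     # log p(event k absent  | X)

# Toy fusion step: treat the detectors as independent and score K+1
# mutually exclusive frame labels: "only landmark k is present" for each
# detector k, plus "no landmark".
total_absent = log_q.sum(axis=0)                 # log prob that every event is absent
only_k = log_p + (total_absent - log_q)          # log prob that only event k is present
log_score = np.vstack([only_k, total_absent[np.newaxis, :]])

# Normalize to per-frame label posteriors and pick the most probable label
# in each frame, the decision that minimizes the expected number of frame
# errors under this simplified model.
log_posterior = log_score - np.logaddexp.reduce(log_score, axis=0, keepdims=True)
transcription = log_posterior.argmax(axis=0)
print(transcription)   # label value n_detectors means "no landmark in this frame"
```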
- Heterogeneous acoustic features: The features observed by each binary detector may be pre-specified based on phonetic or auditory expert knowledge; specified using infograms, which are automatically generated by training optimal linear or nonlinear symplectic feature transformation methods; or automatically selected from a large pool of candidate features so as to maximize mutual information or to minimize Bayes risk (a feature-selection sketch appears after this list).
- Training: Because landmarks are, by definition, rare events, their detectors must be trained using an algorithm that explicitly represents the generalization error (the expected difference between training-corpus and test-corpus error). We have trained landmark detectors using support vector machines (Stevens speech landmarks), feedforward neural networks (Livescu speech landmarks), kernel metric learning (phoneme detection), both feedforward and recursive neural networks (prosodic landmarks), and AdaBoost-style minimum-probability-of-error detectors (non-speech acoustic events); an SVM training sketch appears after this list.
- Rare Events: We have studied three types of landmarks: phonetic landmarks, prosodic landmarks, and non-speech landmarks. Phonetic landmarks include the Stevens landmarks (consonant release, consonant closure, syllable nucleus, and inter-syllabic dip) and the Livescu landmarks (change in place of articulation, manner, vowel, voicing, or nasality). A consonant-vowel-consonant syllable contains four Stevens landmarks, so the landmarks themselves are not especially rare; but training a detector for a particular type of consonant release (e.g., a palatal affricate release) requires seeking out much rarer events. Prosodic events may be even rarer: an intonational phrase boundary occurs roughly once every nine or ten words in spontaneous English. Non-speech acoustic events are perhaps the rarest of all; it is not uncommon for a user to want to find a particular type of acoustic event (e.g., keys dropped on a table) that occurs only once in the available training data.
- Surprise is defined to be the log probability of an event Y given an observation X and a set of modeling assumptions A, i.e., log p(Y|X,A). In turbo coding this quantity is called "extrinsic information"; in psychophysics it is sometimes called "salience." We sometimes find it useful for detectors to output a log likelihood or a log joint probability instead of a log posterior (a worked sketch appears after this list).
- Bayesian fusion methods used in our experiments include Gaussian mixture models, hidden Markov models, variable-parameter hidden Markov models, finite state transducers, dynamic Bayesian networks, Kalman smoothers, switching-state Kalman filters, stochastic segment models, and decoder-oriented ideal binary mask estimation.
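The feature-selection sketch referenced in the Heterogeneous acoustic features item above might look like the following, using scikit-learn's mutual-information estimator; the candidate-feature matrix and landmark labels are synthetic placeholders, and the top-k rule is only one of many possible selection criteria.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Hypothetical candidate pool: 200 frames, 50 candidate acoustic features,
# with binary landmark labels (1 = landmark present in this frame).
n_frames, n_candidates = 200, 50
X = rng.normal(size=(n_frames, n_candidates))
y = rng.integers(0, 2, size=n_frames)

# Make a few features genuinely informative about the label, so the
# selection step has something to find.
X[:, [3, 17, 42]] += 2.0 * y[:, np.newaxis]

# Estimate the mutual information between each candidate feature and the
# landmark label, then keep the k highest-scoring candidates.
mi = mutual_info_classif(X, y, random_state=0)
k = 5
selected = np.argsort(mi)[::-1][:k]
print("selected feature indices:", selected)
```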
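The SVM training sketch referenced in the Training item above: support vector machines maximize a margin, which bounds generalization error, and that property matters when positive examples are rare. The features and labels below are synthetic placeholders, and the balanced class weighting is one simple way to cope with rare landmark frames, not necessarily the group's exact recipe.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical data: 2000 frames x 40 acoustic features, with roughly 5%
# positive frames (the landmark is, by definition, a rare event).
n_frames, n_features = 2000, 40
X = rng.normal(size=(n_frames, n_features))
y = (rng.uniform(size=n_frames) < 0.05).astype(int)
X[y == 1] += 0.8  # give positive frames a detectable shift

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# RBF-kernel SVM; class_weight="balanced" up-weights the rare positive
# class, and probability=True lets the detector output a posterior that
# can later be converted to a surprise score.
detector = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, class_weight="balanced", probability=True))
detector.fit(X_train, y_train)

# Per-frame surprise of the "landmark present" hypothesis on held-out data.
posterior = detector.predict_proba(X_test)[:, 1]
surprise = np.log(np.clip(posterior, 1e-12, None))
print("held-out accuracy:", detector.score(X_test, y_test))
```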
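The worked sketch referenced in the Surprise item above converts a log likelihood and a prior into the log joint and the log posterior (the surprise); the numbers are invented for illustration.

```python
import numpy as np

# Made-up example: a detector models p(X | Y, A) for the event Y
# ("landmark present") and its complement, and we assume a prior p(Y | A).
log_lik_event = np.log(0.30)      # log p(X | Y=1, A)
log_lik_no_event = np.log(0.05)   # log p(X | Y=0, A)
prior_event = 0.10                # p(Y=1 | A); landmarks are rare

# Log joint probabilities log p(X, Y | A) for both hypotheses.
log_joint_event = log_lik_event + np.log(prior_event)
log_joint_no_event = log_lik_no_event + np.log(1.0 - prior_event)

# Surprise = log posterior = log p(Y=1 | X, A), obtained by normalizing
# the joint over both hypotheses.
log_evidence = np.logaddexp(log_joint_event, log_joint_no_event)
surprise = log_joint_event - log_evidence
print("log likelihood:          ", log_lik_event)
print("log joint:               ", log_joint_event)
print("surprise (log posterior):", surprise)
```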
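Several of the fusion models in the preceding item impose temporal structure on the detector outputs. As a small illustration, the sketch below fuses one detector's per-frame scores with a two-state hidden Markov model decoded by the Viterbi algorithm; the transition probabilities are invented, and feeding detector posteriors in place of true emission likelihoods is a hybrid-style simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame detector posteriors p(landmark | frame) for a
# single detector over 15 frames.
p = rng.uniform(0.05, 0.95, size=15)

# Per-state frame scores derived from the detector posterior (a true HMM
# would use emission likelihoods p(X | state) here).
log_emit = np.vstack([np.log1p(-p), np.log(p)])   # state 0 = no landmark, state 1 = landmark

# Invented two-state transition model: landmarks are rare and short.
log_init = np.log([0.9, 0.1])
log_trans = np.log([[0.95, 0.05],
                    [0.60, 0.40]])

# Viterbi decoding: the most probable state sequence given the fused scores.
n_states, n_frames = log_emit.shape
delta = np.full((n_frames, n_states), -np.inf)
back = np.zeros((n_frames, n_states), dtype=int)
delta[0] = log_init + log_emit[:, 0]
for t in range(1, n_frames):
    for s in range(n_states):
        scores = delta[t - 1] + log_trans[:, s]
        back[t, s] = scores.argmax()
        delta[t, s] = scores.max() + log_emit[s, t]

# Trace back the best path.
path = np.zeros(n_frames, dtype=int)
path[-1] = delta[-1].argmax()
for t in range(n_frames - 1, 0, -1):
    path[t - 1] = back[t, path[t]]
print(path)
```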