Essential Readings in Speech Recognition
Foundations in Information Theory and Speech Perception
- Claude Shannon, "A Mathematical Theory of Communication," Bell System
Technical Journal 27:379-423 and 623-656, July and October 1948.
Entropy, differential entropy,
information, channel capacity, and N-gram language models. Weaver's
introduction (available in the 1949 book edition) shows
the connection between Shannon's mathematical definition of
"information" and the word "information" as used in common-sense daily
- G. A. Miller and P. E. Nicely, "Analysis of Perceptual Confusions
Among Some English Consonants," Journal of the Acoustical Society of
America 27:338-352, 1955. Analysis of human speech recognition errors
in terms of Shannon's information theory.
- H. W. Sorenson and D. L. Alspach, "Recursive Bayesian Estimation
Using Gaussian Sums," Automatica 7:465-479, 1971. Mixture Gaussian
PDF proposed. Proof and examples show that the mixture Gaussian is a
universal approximator. Application to non-Gaussian and nonlinear
dynamic systems. N-best beam search.
- Kenneth N. Stevens, "The quantal nature of speech: evidence from
articulatory-acoustic data," pp. 51-56 in "Human Communication: A
Unified View," edited by Edward E. David, McGraw-Hill, 1972.
Phonological distinctive features are defined by natural
nonlinearities in the mapping from articulation to acoustics, and from
acoustics to perception.
- J. Makhoul, "Linear prediction: A tutorial review," Proceedings of
the IEEE 63:561-580, 1975.
- Steven Davis and Paul Mermelstein, "Comparison of Parametric
Representations for Monosyllabic Word Recognition in Continuously
Spoken Sentences," IEEE Transactions on Acoustics, Speech and Signal
Processing 28(4):357-366, 1980. Mel-frequency cepstral coefficients
- Robert M. Gray, "Vector Quantization," IEEE ASSP Magazine
- Hynek Hermansky, "Perceptual linear predictive (PLP) analysis of
speech," Journal of the Acoustical Society of America 87(4):1738-1752,
- Frederick Jelinek, "Continuous Speech Recognition by Statistical
Methods," Proceedings of the IEEE 64:532-556, 1976. The 3-state
left-to-right HMM (here called the Bakis model), forward-backward
algorithm, and Viterbi algorithm.
- Biing-Hwang Juang, Stephen E. Levinson, and Man Mohan Sondhi, "Maximum
Likelihood Estimation for Multivariate Mixture Observations of Markov
Chains," IEEE Transactions on Information Theory 32(2):307-309, 1986.
Training algorithm for the mixture Gaussian HMM.
- Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro
Shikano and Kevin J. Lang, "Phoneme Recognition Using Time-Delay
Neural Networks," IEEE Transactions on Acoustics, Speech, and Signal
Processing 37:328-339, 1989. One of the first successful uses of ANN
in speech recognition; stop classification accuracy has never been
- Yoshua Bengio, Renato De Mori, Giovanni Flammia, and Ralf Kompe,
"Global optimization of a neural network - hidden Markov model
hybrid," IEEE Transactions on Neural Networks 3(2):252-259, 1992.
Training and testing procedures for the hybrid HMM-ANN.
- Mari Ostendorf, Vassilios V. Digalakis, and Owen A. Kimball, "From
HMM's to Segment Models: A Unified View of Stochastic Modeling for
Speech Recognition," IEEE Transactions on Speech and Audio Processing
4(5):360-378, 1996. Demonstrates the existence of a continuum of
possible approaches between "segmental" and "frame-based" speech
- Shigeru Katagiri, Biing-Hwang Juang, and Chin-Hui Lee, "Pattern
Recognition Using a Family of Design Algorithms Based Upon the
Generalized Probabilistic Descent Method," Proceedings of the IEEE
86(11):2345-2373, 1998. Minimum-classification error (MCE) training,
and the relationship between HMMs and dynamic time warping (DTW).
- Madeleine Bates, "The Use of Syntax in a Speech Understanding
System," IEEE Transactions on Acoustics, Speech, and Signal Processing
23(1):112-117, 1975. Recursive transition networks (RTN).
- Stephen E. Levinson, "Structural methods in automatic speech
recognition," Proceedings of the IEEE 73:1625-1650, 1985. Reviews
training algorithms for Markov models, regular grammars, and
stochastic context-free grammars.
- Stephanie Seneff, "TINA: A Natural Language System for Spoken
Language Applications," Computational Linguistics 18(1):61-86, 1992.
- Martin Oerder and Hermann Ney, "Word Graphs: An Efficient
Interface Between Continuous-Speech Recognition and Language
Understanding," ICASSP (International Conference on Acoustics, Speech,
and Signal Processing), 119-123, 1993.
- James F. Allen, Lenhart K. Schubert, George Ferguson, Peter
Heeman, Chung Hee Hwang, Tsuneaki Kato, Marc Light, Nathaniel Martin,
Bradford Miller, Massimo Poesio and David R. Traum, "The TRAINS
project: a case study in building a conversational planning agent,"
Journal of Experimental and Theoretical Artificial Intelligence 7:7-48,
New Machine Learning Methods
- Partha Niyogi and Chris Burges, "Detecting and Interpreting
Acoustic Features by Support Vector Machines," University of Chicago
Computer Science Department Technical Report 2002-02. Kernel-based
support vector machines as a model for landmark detection and the
perceptual magnet effect.
- Paul Viola and Michael Jones, "Robust Real-time Object Detection,"
Workshop on Statistical and Computational Theories of
Vision - Modeling, Learning, Computing, and Sampling, 2001. AdaBoost
combination of fast, simple classifiers results in high-accuracy image