Essential Readings in Speech Recognition

Foundations in Information Theory and Speech Perception

  • Claude Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal 27:379-423 and 623-656, July and October 1948. Entropy, differential entropy, information, channel capacity, and N-gram language models. Weaver's introduction (available in the book edition, "The Mathematical Theory of Communication") shows the connection between Shannon's mathematical definition of "information" and the word "information" as used in everyday conversation.
  • G. A. Miller and P. E. Nicely, "Analysis of Perceptual Confusions Among Some English Consonants," Journal of the Acoustical Society of America 27:338-352, 1955. Analysis of human speech recognition errors in terms of Shannon's information theory.
  • H. W. Sorenson and D. L. Alspach, "Recursive Bayesian Estimation Using Gaussian Sums," Automatica 7:465-479, 1971. Mixture Gaussian PDF proposed. Proof and examples show that the mixture Gaussian is a universal approximator. Application to non-Gaussian and nonlinear dynamic systems. N-best beam search.
  • Kenneth N. Stevens, "The quantal nature of speech: evidence from articulatory-acoustic data," pp. 51-66 in "Human Communication: A Unified View," edited by Edward E. David and Peter B. Denes, McGraw-Hill, 1972. Phonological distinctive features are defined by natural nonlinearities in the mapping from articulation to acoustics, and from acoustics to perception.
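
The information-theoretic analysis of perceptual confusions mentioned in the Miller and Nicely entry can be illustrated with a minimal sketch. It estimates the information transmitted from spoken to perceived category, in bits, as the mutual information of a confusion-count matrix. The function name and the two 2x2 count matrices are hypothetical, chosen only for illustration; they are not data from the paper.

```python
import math

def transmitted_information(confusions):
    """Mutual information I(X;Y) in bits, where confusions[x][y] counts
    how often stimulus x was heard as response y."""
    total = sum(sum(row) for row in confusions)
    row_sums = [sum(row) for row in confusions]
    col_sums = [sum(col) for col in zip(*confusions)]
    bits = 0.0
    for x, row in enumerate(confusions):
        for y, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing to the sum
                p_xy = count / total
                ratio = p_xy * total * total / (row_sums[x] * col_sums[y])
                bits += p_xy * math.log2(ratio)
    return bits

# Hypothetical counts: two consonants, 50 presentations each.
perfect = [[50, 0], [0, 50]]    # never confused
chance = [[25, 25], [25, 25]]   # responses unrelated to the stimulus

print(transmitted_information(perfect))  # 1.0 bit: the full binary distinction
print(transmitted_information(chance))   # 0.0 bits: nothing transmitted
```

Values between these two extremes indicate how much of a distinction survives a given listening condition.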

Acoustic Features

  • J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE 63:561-580, 1975.
  • Steven Davis and Paul Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Transactions on Acoustics, Speech and Signal Processing 28(4):357-366, 1980. Mel-frequency cepstral coefficients (MFCC).
  • Robert M. Gray, "Vector Quantization," IEEE ASSP Magazine 1(2):4-29, 1984.
  • Hynek Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America 87(4):1738-1752, 1990.

Acoustic Models

  • Frederick Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE 64:532-556, 1976. The 3-state left-to-right HMM (here called the Bakis model), forward-backward algorithm, and Viterbi algorithm.
  • Biing-Hwang Juang, Stephen E. Levinson and Man Mohan Sondhi, "Maximum Likelihood Estimation for Multivariate Mixture Observations of Markov Chains," IEEE Transactions on Information Theory 32(2):307-309, 1986. Training algorithm for the mixture Gaussian HMM.
  • Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano and Kevin J. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Transactions on Acoustics, Speech, and Signal Processing 37:328-339, 1989. One of the first successful uses of artificial neural networks (ANNs) in speech recognition; its stop-consonant classification accuracy remained unbeaten for many years.
  • Yoshua Bengio, Renato De Mori, Giovanni Flammia, and Ralf Kompe, "Global optimization of a neural network - hidden Markov model hybrid," IEEE Transactions on Neural Networks 3(2):252-259, 1992. Training and testing of the hybrid HMM-ANN.
  • Mari Ostendorf, Vassilios V. Digalakis, and Owen A. Kimball, "From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition," IEEE Transactions on Speech and Audio Processing 4(5):360-378, 1996. Demonstrates the existence of a continuum of possible approaches between "segmental" and "frame-based" speech recognition.
  • Shigeru Katagiri, Biing-Hwang Juang, and Chin-Hui Lee, "Pattern Recognition Using a Family of Design Algorithms Based Upon the Generalized Probabilistic Descent Method," Proceedings of the IEEE 86(11):2345-2373, 1998. Minimum classification error (MCE) training, and the relationship between HMMs and dynamic time warping (DTW).
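
The Viterbi algorithm and the 3-state left-to-right HMM named in the Jelinek entry above can be sketched as follows. This is a minimal log-domain illustration, not code from any of the papers listed; the helper names, the transition probabilities, and the per-frame state likelihoods are all hypothetical.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(log_init, log_trans, log_emit):
    """Return (best state path, its log probability).
    log_trans[r][s] is the log probability of moving from state r to s;
    log_emit[t][s] is the log likelihood of frame t under state s."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []
    for frame in log_emit[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + frame[s])
        backptrs.append(ptr)
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for ptr in reversed(backptrs):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_score

# Hypothetical 3-state left-to-right model: each state may repeat or advance.
init = [1.0, 0.0, 0.0]
trans = [[0.5, 0.5, 0.0],
         [0.0, 0.5, 0.5],
         [0.0, 0.0, 1.0]]
# Hypothetical likelihoods of five frames under each of the three states.
emit = [[0.9, 0.05, 0.05],
        [0.8, 0.1, 0.1],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8],
        [0.05, 0.05, 0.9]]

path, log_score = viterbi([safe_log(p) for p in init],
                          [[safe_log(p) for p in row] for row in trans],
                          [[safe_log(p) for p in row] for row in emit])
print(path)  # [0, 0, 1, 2, 2]
```

Working in the log domain avoids numerical underflow on long utterances, and the zero entries in the transition matrix are what enforce the left-to-right topology.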

Speech Understanding

  • Madeleine Bates, "The Use of Syntax in a Speech Understanding System," IEEE Transactions on Acoustics, Speech, and Signal Processing 23(1):112-117, 1975. Recursive transition networks (RTN).
  • Stephen E. Levinson, "Structural methods in automatic speech recognition," Proceedings of the IEEE 73:1625-1650, 1985. Reviews training algorithms for Markov models, regular grammars, and stochastic context-free grammars.
  • Stephanie Seneff, "TINA: A Natural Language System for Spoken Language Applications," Computational Linguistics 18(1):61-86, 1992.
  • Martin Oerder and Hermann Ney, "Word Graphs: An Efficient Interface Between Continuous-Speech Recognition and Language Understanding," ICASSP (International Conference on Acoustics, Speech, and Signal Processing), 119-123, 1993.
  • James F. Allen, Lenhart K. Schubert, George Ferguson, Peter Heeman, Chung Hee Hwang, Tsuneaki Kato, Marc Light, Nathaniel Martin, Bradford Miller, Massimo Poesio and David R. Traum, "The TRAINS project: a case study in building a conversational planning agent," Journal of Experimental and Theoretical Artificial Intelligence 7:7-48, 1995.

New Machine Learning Methods

  • Partha Niyogi and Chris Burges, "Detecting and Interpreting Acoustic Features by Support Vector Machines," University of Chicago Computer Science Department Technical Report 2002-02. Kernel-based support vector machines as a model for landmark detection and the perceptual magnet effect.
  • Paul Viola and Michael Jones, "Robust Real-time Object Detection," Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing, and Sampling, 2001. An AdaBoost combination of fast, simple classifiers results in high-accuracy image recognition.
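
The idea in the Viola and Jones entry, that boosting combines fast, simple classifiers into an accurate one, can be sketched with discrete AdaBoost over one-dimensional threshold stumps. This is only an illustration under stated assumptions: it is not the cascade detector from the paper, and the function names and the six-point toy data set are hypothetical.

```python
import math

def stump(x, threshold, polarity):
    """Weak classifier: answers `polarity` at or above the threshold."""
    return polarity if x >= threshold else -polarity

def train_adaboost(xs, ys, rounds):
    """Discrete AdaBoost; labels in ys must be -1 or +1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for thr in sorted(set(xs)):
            for pol in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump(x, thr, pol) != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Raise the weight of misclassified points, lower the rest.
        weights = [w * math.exp(-alpha * y * stump(x, thr, pol))
                   for x, y, w in zip(xs, ys, weights)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    vote = sum(alpha * stump(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if vote >= 0 else -1

# Hypothetical toy data: no single threshold separates the two classes.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

one_round = train_adaboost(xs, ys, rounds=1)
three_rounds = train_adaboost(xs, ys, rounds=3)
print([predict(one_round, x) for x in xs])     # one stump still makes errors
print([predict(three_rounds, x) for x in xs])  # three stumps fit the data
```

Each round reweights the training points so that the next stump concentrates on the examples the current ensemble gets wrong.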