Speech Tools Minicourse 2009
Acoustic Features, Acoustic Modeling, and Language Modeling (HTK, bash, and ruby) Video
- makemmf.rb - make HTK master model files
- makemlf.rb - make HTK master label files
- isledict2htk.rb - Create an HTK dictionary by discarding the information in ISLEX that HTK can't use. Ten-line sample dictionaries: islesample.txt and dictsample.txt.
- train_spid.sh - Train a UBM/MAP speaker identification system - not finished yet!
- Get a training database of speech data. If you are working on a project for your own research, use that data; if not, download the AVICAR phonetically balanced sentences, since that is what I will use for most of the rest of the course.
- View the waveforms in Praat; convince yourself that you can view the spectrogram and the pitch track, and that you can listen to the waveform.
- If you're not confident in your Matlab skills, work through the Matlab on Athena tutorial.
- Load one of the 55D (55 mph with the windows rolled down) waveforms in Matlab and perform spectral subtraction. Listen to the waveform before and after spectral subtraction; how well did it work? (One possible implementation is sketched after this list.)
- Try VAD on a relatively quiet waveform (IDL condition) and a relatively noisy waveform (55D); you should find that it works perfectly in the IDL condition, but perhaps not perfectly in noise.
- Finally, try writing a Matlab function that accepts, as arguments, the names of an input and an output waveform file. The function should read the input waveform, perform spectral subtraction, perform VAD, chop the waveform into multiple sub-files (each containing no more than 300 ms of initial silence and 300 ms of final silence, as estimated by the VAD), then save each sub-file under a filename constructed from the given output name (for example, if the output name was foo.wav, and VAD found speech segments starting at sample numbers 1109, 44038, and 140932, then you could save the output to files named foo001109.wav, foo044038.wav, and foo140932.wav). A sketch of one way to put these pieces together appears below.
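The following Matlab sketch shows one way the spectral subtraction, VAD, and chopping exercises might fit together. It is not part of the course scripts: the function names (specsub, chop_by_vad), the 25 ms / 10 ms framing, the 12 dB energy threshold, and the assumption that roughly the first 250 ms of each file are noise-only are all illustrative choices, not requirements of the assignment. On Matlab releases older than R2012b, substitute wavread/wavwrite for audioread/audiowrite.

    function y = specsub(x, fs)
    % SPECSUB  Magnitude spectral subtraction with overlap-add resynthesis.
    % The noise spectrum is estimated from roughly the first 250 ms of the
    % signal, which are assumed to contain no speech.
      x = x(:);
      nwin = round(0.025*fs);                 % 25 ms analysis window
      hop  = round(0.010*fs);                 % 10 ms frame advance
      nfft = 2^nextpow2(nwin);
      % Hamming window computed directly (avoids the Signal Processing Toolbox)
      win  = 0.54 - 0.46*cos(2*pi*(0:nwin-1)'/(nwin-1));
      nframes = 1 + floor((length(x)-nwin)/hop);

      % Average the magnitude spectra of the leading noise-only frames.
      nnoise = min(nframes, max(1, floor(0.250*fs/hop)));
      N = zeros(nfft,1);
      for k = 1:nnoise
        N = N + abs(fft(x((k-1)*hop+(1:nwin)).*win, nfft));
      end
      N = N/nnoise;

      % Subtract the noise magnitude, keep the noisy phase, overlap-add.
      y = zeros(size(x));  wsum = zeros(size(x));
      for k = 1:nframes
        idx = (k-1)*hop + (1:nwin);
        S   = fft(x(idx).*win, nfft);
        mag = max(abs(S) - N, 0.05*abs(S));   % spectral floor limits musical noise
        seg = real(ifft(mag.*exp(1i*angle(S)), nfft));
        y(idx)    = y(idx)    + seg(1:nwin).*win;
        wsum(idx) = wsum(idx) + win.^2;
      end
      y = y./max(wsum, 1e-2);
    end

    function chop_by_vad(infile, outname)
    % CHOP_BY_VAD  Read a waveform, denoise it with SPECSUB, run a simple
    % frame-energy VAD, and write each speech segment to its own file with
    % at most 300 ms of leading and trailing silence.  Assumes outname ends
    % in ".wav" (e.g. foo.wav -> foo001109.wav).
      [x, fs] = audioread(infile);            % wavread(infile) on older Matlab
      y = specsub(x(:,1), fs);

      % Frame log-energies; call a frame "speech" if it sits well above
      % the quietest frames in the file.
      nwin = round(0.025*fs);  hop = round(0.010*fs);
      nframes = 1 + floor((length(y)-nwin)/hop);
      e = zeros(nframes,1);
      for k = 1:nframes
        e(k) = 10*log10(sum(y((k-1)*hop+(1:nwin)).^2) + eps);
      end
      speech = e > min(e) + 12;               % 12 dB above the noise floor (tune this)

      % Group consecutive speech frames into segments.
      d = diff([0; speech; 0]);
      starts = find(d == 1);
      stops  = find(d == -1) - 1;

      pad = round(0.300*fs);                  % keep at most 300 ms of context
      for k = 1:length(starts)
        firstsamp = (starts(k)-1)*hop + 1;    % first speech sample, used in the name
        s = max(1, firstsamp - pad);
        t = min(length(y), (stops(k)-1)*hop + nwin + pad);
        outfile = strrep(outname, '.wav', sprintf('%06d.wav', firstsamp));
        audiowrite(outfile, y(s:t), fs);      % wavwrite(y(s:t), fs, outfile) on older Matlab
      end
    end

Save the two functions as specsub.m and chop_by_vad.m, then call, for instance, chop_by_vad('avicar_55D_example.wav', 'foo.wav') (the file names here are placeholders). Listening to y before and after specsub, and checking in Praat where the VAD boundaries fall on the spectrogram, is a good sanity check.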
- Using the material presented in Lectures 2 and 3, train a monophone recognition system. Sarah recommends doing this as follows:
- Download and install the HTK tools, if they are not already installed on your system, and download some of the AVICAR data.
- Follow the tutorial overview presented in the HTK book.
- Contact Mark or Sarah if you have problems or questions.
- Use train.pl to train an HMM speech recognizer on AVICAR, UASpeech, Buckeye, or the AMI Corpus. Some additional Perl or Ruby processing may be required if your original corpus transcriptions are in a format not supported by train.pl. Contact Sarah (email@example.com, 2013 Beckman) or Mark (firstname.lastname@example.org, 2011 Beckman) if you get stuck or if you find bugs in train.pl.
- Using the auxiliary files provided in crypto.tgz, create an FSM decoder similar to the one Mark trained in lecture.
- Next, create an FSM encoder. Send Sarah (email@example.com) an encoded email, then use your decoder to read the reply!
- Use GMTK, together with the parameter files and training scripts in the demo archive, to train a set of monophone acoustic models. Note that if you are not on IFP, you may need to adjust your PATH so that the GMTK binaries can be found.
Tools Specifically Designed for Speech and Language
- NIST Multimodal Information Group
- Cognitive Computation Group Software
- Statistical Speech Technology Group Software
- Voicebox Toolbox for Matlab
- MVA (Mean, Variance and ARMA) Normalization
- AT&T FSM Library
- AT&T FSM-based speech recognition decoder
- ISIP Recognizer
Machine Learning Tools (neural nets, SVMs, etc.)
Speech Corpora, Transcriptions, and Dictionaries
- Buckeye Corpus
- AMI Meeting Corpus
- Linguistic Data Consortium
- ICSI Switchboard Phonetic Transcriptions
- ISIP Switchboard Orthographic Transcriptions
- The ISLEX dictionary is created by merging Moby, ISIP, and CMUDICT, and converting the phone symbols to Worldbet.
- CMUDICT uses ARPABET phone symbols, similar to those used in TIMIT and ICSI Switchboard, but with a few small differences.