Practice Problem 2: Use any or all of bash, sed, awk, perl, python, or ruby to create auxiliary HTK files for one or more of the available corpora mentioned in Lecture 1, or for a corpus of your choice.

For example, if you are working with the AVICAR Corpus, create:

1. The master label file (mlf) without time alignments for files containing TIMIT sentences and digits. You will need to create these files using the file AVICAR.txt. This file contains two kinds of entries:

    D##_###_P0_C1 (9 2 6) 4 8 7 - 4 0 8 4

and

    D##_##_S4_C2 That dog chases cats mercilessly.

The first letter in the filename specifies which block of utterances the audio files are from. There are 10 blocks, labeled A-J. The second letter in the filename (M|F) indicates the gender of the speaker. The third character in the filename indicates the speaker number. There are a total of 10 speakers in each block, 5 male and 5 female. The middle block of characters indicates the noise condition in the file. There are 5 noise conditions:

    IDL  car idling
    35U  car going 35 MPH with the windows up
    35D  car going 35 MPH with the windows down
    55U  car going 55 MPH with the windows up
    55D  car going 55 MPH with the windows down

In addition, each utterance can come from any of 8 microphone channels, so each filename needs a suffix _M1 through _M8 to specify the channel. An example of an mlf file created from the two utterances above is:

    #!MLF!#
    "*/DM1_IDL_P0_C1_M1.lab"
    nine
    two
    six
    four
    eight
    seven
    four
    zero
    eight
    four
    .
    "*/DM1_IDL_P0_C1_M2.lab"
    nine
    two
    six
    four
    eight
    seven
    four
    zero
    eight
    four
    .
    "*/DF3_55D_S4_C2_M5.lab"
    That
    dog
    chases
    cats
    mercilessly
    .
    "*/DM4_35U_S4_C2_M8.lab"
    That
    dog
    chases
    cats
    mercilessly
    .

2. The script file that HTK will use to locate the acoustic feature files.
This file should contain entries of the form

    /homes/sborys/AVICAR/htk_mfcc/DM1_IDL_P0_C1_M1.mfc
    /homes/sborys/AVICAR/htk_mfcc/DM1_IDL_P0_C1_M2.mfc
    /homes/sborys/AVICAR/htk_mfcc/DF3_55D_S4_C2_M5.mfc
    /homes/sborys/AVICAR/htk_mfcc/DM4_35U_S4_C2_M8.mfc

3. The script file that will tell HCopy what names to use when transforming audio files into acoustic feature files. Each entry gives the source audio file followed by the target feature file:

    /homes/sborys/AVICAR/audio/DM1_IDL_P0_C1_M1.wav /homes/sborys/AVICAR/htk_mfcc/DM1_IDL_P0_C1_M1.mfc
    /homes/sborys/AVICAR/audio/DM1_IDL_P0_C1_M2.wav /homes/sborys/AVICAR/htk_mfcc/DM1_IDL_P0_C1_M2.mfc
    /homes/sborys/AVICAR/audio/DF3_55D_S4_C2_M5.wav /homes/sborys/AVICAR/htk_mfcc/DF3_55D_S4_C2_M5.mfc
    /homes/sborys/AVICAR/audio/DM4_35U_S4_C2_M8.wav /homes/sborys/AVICAR/htk_mfcc/DM4_35U_S4_C2_M8.mfc

4. Lists of words in the corpus. Each word need only be listed once. For example:

    cat
    chased
    dog
    four
    one
    the
    three
    two

For the UASpeech corpus, create the four kinds of files listed above. Audio filenames in the UASpeech corpus are in the format

    M16_B2_CW8_M5.wav

The first three characters in the filename indicate which speaker produced the utterance. M16 means that this recording was produced by male speaker number 16. The next two characters indicate the block number. There are three blocks: B1, B2, and B3. The next part of the filename specifies what kind of utterance is in the file. CW8 indicates that this file contains speaker M16's production of Common Word number 8. The last group of characters in the filename indicates which microphone channel was used. There were 7 microphone channels.

To create mlf files for this corpus, you will need the auxiliary file UASpeech.txt. This file contains entries like:

    oil CW86
    equilibrium B1_UW35

The first column is the word spoken in the audio file. The second column can be used to reconstruct the UASpeech audio filenames. Entries in column 2 that do not contain a B?
indicator are included in all three blocks, so you will need to create a label file for each block. Create label files for speakers M01, M04, M05, M06, M07, M08, M09, M10, M12, M13, M14, M15, M16, F02, F03, and F04 for all seven microphone channels for all three blocks of data. Both script files will have the same format as those shown above, as will the list of words.

Create the same four files for the Buckeye and AMI corpora. For these corpora, you will need to get the documentation available online.

Feel free to email or visit Sarah (sborys@uiuc.edu, 2013 Beckman) or HJ (jhasegaw@uiuc.edu, 2011 Beckman) with questions, comments, concerns, complaints, or compliments regarding this "assignment".
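
If you are unsure where to start, the following Python sketch shows one possible way to produce all four AVICAR files from entries like the two quoted earlier. It is only an illustration under several assumptions: AVICAR.txt is assumed to hold one entry per line, with the filename template first and the transcript after it; the "#" placeholders are assumed to stand for the speaker and noise-condition codes, which you supply yourself; and the function names (transcript_to_words, expand_entry, build_files) are made up for this example. Digits are spelled out and punctuation is dropped, as in the mlf example.

```python
import re

# Digit-to-word map; "0" becomes "zero", matching the mlf example in the handout.
DIGIT_WORDS = dict(zip("0123456789",
                       "zero one two three four five six seven eight nine".split()))

def transcript_to_words(text):
    """Split a transcript into words: spell out each digit, drop punctuation."""
    tokens = re.findall(r"[A-Za-z']+|\d", text)
    return [DIGIT_WORDS.get(tok, tok) for tok in tokens]

def expand_entry(template, speakers, noises, channels=range(1, 9)):
    """Turn a template such as 'D##_###_P0_C1' into full file stems.

    Assumption: the first character is the block letter and everything
    after the second underscore is the script id; the '#' fields are
    replaced by the speaker and noise codes passed in by the caller.
    """
    block = template[0]
    script_id = template.split("_", 2)[2]
    return [f"{block}{spk}_{noise}_{script_id}_M{ch}"
            for spk in speakers for noise in noises for ch in channels]

def build_files(entries, speakers, noises, audio_dir, mfcc_dir):
    """Return (mlf lines, script lines, HCopy script lines, word list)."""
    mlf, scp, hcopy, vocab = ["#!MLF!#"], [], [], set()
    for line in entries:
        template, transcript = line.split(None, 1)
        words = transcript_to_words(transcript)
        vocab.update(words)
        for stem in expand_entry(template, speakers, noises):
            mlf += [f'"*/{stem}.lab"', *words, "."]   # one word per line, "." ends the label
            scp.append(f"{mfcc_dir}/{stem}.mfc")
            hcopy.append(f"{audio_dir}/{stem}.wav {mfcc_dir}/{stem}.mfc")
    return mlf, scp, hcopy, sorted(vocab)

# Demo on the two entries quoted in the handout, for one speaker and one noise condition.
entries = [
    "D##_###_P0_C1 (9 2 6) 4 8 7 - 4 0 8 4",
    "D##_##_S4_C2 That dog chases cats mercilessly.",
]
mlf, scp, hcopy, vocab = build_files(
    entries, speakers=["M1"], noises=["IDL"],
    audio_dir="/homes/sborys/AVICAR/audio",
    mfcc_dir="/homes/sborys/AVICAR/htk_mfcc",
)
print("\n".join(mlf[:13]))   # header plus the first label entry
```

In a real run you would read the entries from AVICAR.txt, pass in the full lists of speaker and noise codes, and write each returned list to its own file with one item per line. The word list here keeps the capitalization found in the transcripts; lowercasing it first is a reasonable alternative if you want a case-insensitive vocabulary.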