\documentclass{article}
\setlength{\oddsidemargin}{0in}
\setlength{\topmargin}{0in}
\setlength{\textheight}{9in}
\setlength{\textwidth}{6.5in}
\title{Lecture 3: Acoustic Features}
\author{Lecturer: Mark Hasegawa-Johnson (jhasegaw@uiuc.edu)\\TA: Sarah
Borys (sborys@uiuc.edu)\\Web Page: http://www.ifp.uiuc.edu/speech/courses/minicourse/}
\begin{document}
\maketitle
\section{Where to get Software}
Which software should you use for your speech recognizer development?
It depends on the amount of customization you want to do. This
section will talk about four types of code: commercial off-the-shelf,
academic off-the-shelf, command line toolkit, and object libraries.
Most of the lectures in this course will focus on the ``command-line
toolkit'' category, and most Illinois speech recognition software to
date has been designed around these toolkits. That may soon change.
If you spend time modifying somebody else's source code to achieve
your own research goals, you want to know that you will be able to
redistribute your work. This section will describe the licensing
conditions of each package. A package is considered ``Open Source''
if the license permits the user to modify the source code, and to
redistribute the result, either for free or in the form of a
commercial software system. Open Source software includes Sphinx,
ISIP, MIT FST, LAPACK, and STL. Some authors distribute source, but
do not grant you the right to modify and redistribute: these include
QuickNet, HTK, LibSVM, and SVM-light.

{\bf Commercial Off-the-Shelf:} There are a number of commercial
off-the-shelf systems: some are specialized for dictation, some for
call-center applications, some for information retrieval. I have only
personally tested one commercial dictation system. After adapting the
system to individual voices, we achieved 95\% word recognition
accuracy on really unusual texts. The system adapts to your voice,
and to your language usage; it is possible to teach it special
vocabulary items. Most other customization is impossible: even
scripting (to do batch-mode recognition of multiple files) is only
possible using the ``professional edition.''

{\bf Academic Off-the-Shelf:} These systems are released with trained
recognizer parameters, so that you can use them out of the box to do
speech recognition, if your application uses exactly the same speaking
style and recording hardware as those used to train the recognizer.
All of these also include utilities for re-training the recognizer
using new speech data. Two of these are open source, and Berkeley's
is semi-open. Any open source system may be used as a ``template'' if
you want to modify the system, or create your own.
\begin{enumerate}
\item
Sphinx (http://cmusphinx.sourceforge.net/). License: Open Source,
similar to Apache. Language: Sphinx 3.5 is in C, Sphinx 4 is in Java.
Distributed recognition models: trained using DARPA Hub4, i.e.,
Broadcast News.
\item
ISIP (http://www.isip.msstate.edu). License: Open Source, all uses
explicitly permitted. Language: C. Distributed recognition models:
Resource Management (DARPA small-vocabulary dialog system).
\item
SPRACH/QuickNet
(http://www.icsi.berkeley.edu/\~{}dpwe/projects/sprach/sprachcore.html).
License: Not quite open source --- redistribution and non-commercial
use allowed, commercial use and redistribution of modified code not
allowed. Code: C. Distributed recognition models: demo is available
for the Berkeley Restaurant Project (BeRP) dialogue system.
\item
GMTK (http://ssli.ee.washington.edu/\~{}bilmes/gmtk/). License: release
1.3.12 is binaries only; release 2 will be open source. Distributed
models: Aurora digit recognition in noise.
\item
SONIC
(http://cslr.colorado.edu/beginweb/speech\_recognition/sonic.html).
License: binaries/libraries are downloadable; no source except for the
client/server example code. Tutorials are available on-line for
TI-Digits and the Resource Management corpus, both of which are
available at UIUC on the IFP network.
\end{enumerate}

{\bf Command Line Toolkits:}
Most of the lectures in this course will focus on these software
packages.
\begin{enumerate}
\item
HTK (http://htk.eng.cam.ac.uk/) is designed to facilitate customized
training of your own speech recognition models. Use: (1) compile the
toolkit, (2) create configuration files specifying your desired
architecture, (3) create bash or perl scripts to run training
programs. Advantages: HTK is perhaps the best-documented of all
academic recognizers
(http://htk.eng.cam.ac.uk/docs/docs.shtml,~\cite{YouEve02}); unlike
other recognizers, many modifications are possible without changing
any source code. Disadvantages: if you need to modify the source
code, you do not want to use HTK, because (1) the source code is
quirky, and violates POSIX style standards, (2) HTK is not open
source; the license allows modification but not redistribution.
\item
GMTK is designed to be used in the same way as HTK, i.e., system
architecture is specified using extensive configuration files.
Customization in GMTK is much, much more flexible than HTK.
\item
AT\&T FSM Library (http://www.research.att.com/sw/tools/fsm/) and decoder
Distribution: binaries only, controlled by configuration files and
scripting. More on this later.
\item
Support vector machines are usually trained using similar
configuration + scripting methods. LibSVM
(http://www.csie.ntu.edu.tw/\~{}cjlin/libsvm/) and SVM-light
(http://svmlight.joachims.org/) have complementary functionality,
described in the next lecture. Neither is open source! Source code
is available, but the licenses of both toolkits prohibit
redistribution and commercial use.
\item
PVTK (Periodic Vector Toolkit:
http://www.ifp.uiuc.edu/speech/software). License: Open Source,
Apache license --- but HTK must be installed in order for PVTK to
compile. This toolkit contains three command line programs useful for
concatenating and transforming spectral vectors, extracting spectral
vectors to train a neural net or SVM, and applying a bank of SVMs to a
huge speech database in batch mode. Code is messy and nonstandard,
and will change soon (if you'd like to volunteer to help, please do!),
but we will try to maintain the same command-line interface.
\end{enumerate}

{\bf Object Libraries:} These are libraries of functions designed to
be used inside of your own program. The two most important object
libraries are not speech libraries at all, but general purpose
time-savers: LAPACK and STL. In the speech domain, any open-source
recognizer is a potential ``object library,'' but (to my knowledge)
only three have been designed that way on purpose.
\begin{enumerate}
\item
LAPACK (http://www.netlib.org/lapack95, http://www.netlib.org/clapack,
http://www.netlib.org/lapack++, http://www.cs.utk.edu/java/f2j/).
License: open source, similar to LaTeX, meaning that if you modify and
redistribute any function, you should change the filename. Use:
LAPACK is a standard for efficient linear algebra routines (matrix
inverse, minimum-norm pseudo inverse, eigenvalues, SVD, et cetera).
Parallel implementations exist.
\item
Standard Template Library (STL: http://www.sgi.com/tech/stl/).
Provides templates (header files) that enrich the C++ type system with
hash tables, iterators, vectors, strings, et cetera.
\item
MIT Finite State Toolkit (alpha release available from Lee
Hetherington). License: Open Source. Advantage: an STL-compatible
finite state transducer template. Disadvantage: no documentation yet.
\item
SpeechLib (http://www.ifp.uiuc.edu/speech/software). License: Open
Source, Apache. Use: header files and functions implement Graphs
(graphs in general, and Annotation Graph transcription type in
particular), PDFs, neural networks, and some linear algebra. Code is
clean but nonstandard, and will change soon to better interface with
LAPACK and STL (if you'd like to volunteer to help, please do!).
\end{enumerate}
\section{Acoustic Features}
\label{sec:features}
All acoustic features commonly used in automatic speech recognition
are based on the spectrogram. A ``spectrogram'' is a time-frequency
representation of the signal, $X(t,f)$, where $t$ may be in seconds
and $f$ may be in Hertz. Specifically, $X(t_0,f)$ is the log
magnitude of the Fourier transform, at frequency $f$, of the signal
multiplied by a short window $w(t)$ ($w(t)$ is usually 20-30ms in
length):
\begin{eqnarray}
X(t_0,f) &=& \left| {\mathcal F}\left\{x(t+t_0)w(t)\right\} \right| \\
\mbox{SPEC}(t_0,f) &=& \log X(t_0,f)
\end{eqnarray}
Phonetically readable spectrogram displays compute $X(t_0,f)$ once per
2ms, and humans can hear acoustic events with a temporal precision of
about 2ms, but in order to save memory and computational complexity,
speech recognition algorithms usually only compute $X(t_0,f)$ once per
10ms. The only speech events not well represented by a 10ms skip are
stop bursts. Perhaps more importantly, a 10ms skip time makes it
difficult to deal with impulsive background noise (clicks).
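The windowed log-magnitude computation above can be sketched in a few
lines of numpy (a minimal sketch: the 25ms Hamming window and 10ms
skip are typical choices, and the sampling rate and test tone are
invented for illustration):

```python
import numpy as np

def spectrogram(x, fs, win_dur=0.025, skip_dur=0.010):
    """Log magnitude of the windowed Fourier transform: one frame every
    skip_dur seconds, each frame a win_dur-second Hamming-windowed FFT."""
    win, skip = int(win_dur * fs), int(skip_dur * fs)
    w = np.hamming(win)
    n_frames = 1 + (len(x) - win) // skip
    frames = np.stack([x[t*skip : t*skip + win] * w for t in range(n_frames)])
    X = np.abs(np.fft.rfft(frames, axis=1))   # X(t0, f)
    return np.log(X + 1e-10)                  # SPEC(t0, f)

fs = 16000                                          # assumed sampling rate
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)   # 1 s of a 1 kHz tone
S = spectrogram(x, fs)                              # 98 frames x 201 bins
```

A phonetically readable display, as described above, would use
{\tt skip\_dur=0.002} instead.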
\subsection{Mel Spectra}
\label{sec:MFSC}
Human hearing is characterized by two different semi-logarithmic
frequency scales: the Bark scale, and the mel scale. First, the inner
ear integrates spectral energy in bands of width equal to one Bark;
for this reason, if the total noise power within 1/6 octave on either
side of a tone is more than about 4dB above the power of the tone, the
tone will be masked. One Bark is equal to about 1/6 octave (two
semitones, or 12\%), with a minimum value of about 312Hz. Second, the
just-noticeable-difference between the frequencies of two pure tones
is about 3 mels. One mel is equal to about 0.1\% (about 1/700
octave), with a minimum value of about 0.6Hz.

Both of these scales suggest that the ability of humans to
discriminate two sounds is determined by the spectral energy as a
function of semilog-frequency (Bark frequency or mel frequency), not
linear frequency (Hertz). Most speech recognition features average
the spectral energy within bandwidths of 1 Bark, or about 90-120 mels,
in order to eliminate distinctions that depend on the difference
between two frequencies that are within 1 Bark of each other. This
frequency-dependent smearing or smoothing process substantially
increases the accuracy of a speech recognizer.

Frequency-dependent smearing can be done using sub-band filters or
wavelets~\cite{ZhaHas05a}, but is most often done by, literally,
adding up the Fourier transform coefficients in bands of width equal
to about 90 mels (resulting in roughly 32 bands between 0Hz and
8000Hz). The resulting coefficients are called mel-frequency spectral
coefficients, or MFSC:
\begin{eqnarray}
\tilde{X}(t_0,k) &=& \sum_{f=0}^{F_s/2} X(t_0,f) H_k(f)\\
\mbox{MFSC}(t_0,k) &=& \log \tilde{X}(t_0,k)
\end{eqnarray}
The weighting functions, $H_k(f)$, are usually triangular, centered at
frequencies $f_k$ that are about 90 mels apart~\cite{DavMer80}.
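A sketch of the triangular weighting functions $H_k(f)$, using the
common $2595\log_{10}(1+f/700)$ approximation to the mel scale (an
assumption; the text gives only rough figures), with an illustrative
filter count and FFT size:

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # common mel-scale formula

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=32, n_bins=257, fs=16000):
    """Triangular weights H_k(f): center frequencies f_k are uniformly
    spaced on the mel scale between 0 Hz and fs/2."""
    n_fft = 2 * (n_bins - 1)
    edges = inv_mel(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    c = np.round(edges * n_fft / fs).astype(int)   # band edges in FFT bins
    H = np.zeros((n_filters, n_bins))
    for k in range(n_filters):
        lo, mid, hi = c[k], c[k + 1], c[k + 2]
        H[k, lo:mid] = (np.arange(lo, mid) - lo) / (mid - lo)  # rising edge
        H[k, mid:hi] = (hi - np.arange(mid, hi)) / (hi - mid)  # falling edge
    return H

H = mel_filterbank()   # MFSCs are then log(H @ X) for a magnitude frame X
```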
\subsection{Mel Cepstra}
\label{sec:MFCC}
Neighboring spectral coefficients are highly correlated: for example,
if there is a spectral peak near center frequency $f_k$, then
$\mbox{MFSC}(t,k)$ is often large, but usually, so is
$\mbox{MFSC}(t,k+1)$. Probabilistic models such as
diagonal-covariance Gaussians work best if the features are
uncorrelated. It has been shown that the MFSCs can be pretty well
decorrelated by transforming them using a discrete cosine transform or
DCT:
\begin{equation}
\mbox{MFCC}(t_0,m) = {\mathcal C}\left\{\mbox{MFSC}(t_0,k)\right\}
\label{eq:mfcc}
\end{equation}
The DCT, ${\mathcal C}$, is just the real part of a Fourier transform.
Thus, in effect, the MFCCs are the inverse Fourier transform of the
log of the frequency-warped spectrum. $\mbox{MFCC}(t,0)$ is similar
to the average spectral energy; $\mbox{MFCC}(t,1)$ is similar to the
average spectral tilt. The peak corresponding to F1 occurs at
$\mbox{MFCC}(t,F_s/F_1)$.

Eq.~\ref{eq:mfcc} is useful for decorrelating the MFSCs, but it
actually doesn't change the perceptual distance between two sounds.
Because the DCT is a unitary transform, the distance between two MFSC
vectors is exactly the same as the distance between two MFCC vectors:
\begin{equation}
\sum_{k} \left(\mbox{MFSC}_1(k)-\mbox{MFSC}_2(k)\right)^2 =
\sum_{m} \left(\mbox{MFCC}_1(m)-\mbox{MFCC}_2(m)\right)^2
\end{equation}
Thus it's possible to think of Eq.~\ref{eq:mfcc} as a meaningless
mathematical convenience --- it simplifies the probabilistic model,
but doesn't change the representation of human perception.
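Both claims are easy to check numerically. A sketch using an
orthonormal DCT-II matrix, with random vectors standing in for real
MFSCs:

```python
import numpy as np

def dct_matrix(K):
    """Orthonormal DCT-II matrix; C @ v is then a unitary transform of v."""
    C = np.cos(np.pi / K * np.outer(np.arange(K), np.arange(K) + 0.5))
    C[0] *= 1.0 / np.sqrt(2.0)
    return C * np.sqrt(2.0 / K)

K = 32
C = dct_matrix(K)
rng = np.random.default_rng(0)
mfsc1, mfsc2 = rng.standard_normal(K), rng.standard_normal(K)
mfcc1, mfcc2 = C @ mfsc1, C @ mfsc2   # DCT of the mel spectra

d_spec = np.sum((mfsc1 - mfsc2) ** 2)
d_ceps = np.sum((mfcc1 - mfcc2) ** 2)   # identical: C is unitary
```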
\subsection{LPCC}
\label{sec:LPCC}
Humans are most sensitive to the frequencies and amplitudes of
spectral peaks. There is some evidence that humans completely ignore
zeros in the spectrum, except to the extent that the zeros change the
amplitudes of their surrounding peaks. Thus it may be reasonable to
smooth the spectrum in some way that carefully represents the
frequencies and amplitudes of peaks, while ignoring inter-phoneme
differences in the amplitudes of spectral valleys. It has been
shown~\cite{RabSch78} that focusing on the peak can be accomplished by
using a parameterized spectrum $A(t_0,f)$, and by choosing its
parameters according to the following constrained minimization:
\begin{equation}
A(t_0,f) = \arg\min \sum_f \left(\frac{X^2(t_0,f)}{A^2(t_0,f)}
\right)~~~ \mbox{subject to}~\sum_f X^2(t_0,f)=\sum_f A^2(t_0,f)
\label{eq:min_log_err}
\end{equation}

When the vocal folds clap together, the resulting impulse-like sound
excites the resonance of the vocal tract (formants); the sound
pressure in the vocal tract continues to ring like a bell for a few
milliseconds afterward. During this time, the waveform looks like a
sum of exponentially decaying sine waves. Once you have figured out
the frequencies and decay rates of these sine waves (formants), you
can predict each sample of the speech signal from its $2N$ previous
samples ($N$ is the number of formants; $2N$ because you need to know
both the frequency and decay rate). The prediction coefficients are
called linear prediction coefficients, or LPC~\cite{AtaHan71}. The
LPC coefficients $a_i$ specify a model of the spectrum:
\begin{equation}
A(t_0,f) = \left| \frac{1}{1-\sum_{i=1}^{2N} a_i e^{-j2\pi i f/F_s}}\right|
\label{eq:lpspectrum}
\end{equation}
where $F_s$ is the sampling frequency. If $a_i$ are chosen according
to Eq.~\ref{eq:min_log_err} (as they usually are), then $A(t,f)$ is a
smoothed spectrum that accurately represents the amplitudes and
frequencies of spectral peaks, possibly at the cost of larger errors
in the spectral valleys.

The linear predictive cepstral coefficients (LPCC) are computed by
taking the DCT of $\log A(t_0,f)$, in order to decorrelate it:
\begin{equation}
\mbox{LPCC}(t_0,m) = {\mathcal C}\left\{\log A(t_0,f)\right\}
\label{eq:LPCC}
\end{equation}
The LPC smoothing process (Eq.~\ref{eq:lpspectrum}) changes the
implied perceptual distance between two sounds: sounds with different
spectral valleys become more similar, while sounds with different
spectral peaks become less similar. The LPC$\rightarrow$LPCC
transformation (Eq.~\ref{eq:LPCC}) doesn't change the distance between
two sounds; it just decorrelates the features.
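A minimal numerical sketch of the autocorrelation method (the test
signal, a single decaying resonance, and all parameter values are
invented for illustration):

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC: solve the Yule-Walker equations
    R a = r for the prediction coefficients a_i."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

# one decaying resonance ("formant") at 500 Hz; fs and bandwidth invented
fs, f1, bw = 8000, 500.0, 60.0
n = np.arange(400)
x = np.exp(-np.pi * bw * n / fs) * np.sin(2 * np.pi * f1 * n / fs)
a = lpc(x, 2)        # 2N = 2 coefficients for N = 1 formant

# model spectrum A(f) = |1 / (1 - sum_i a_i e^{-j 2 pi i f / F_s})|
f = np.arange(fs // 2)
E = 1.0 - sum(a[i] * np.exp(-2j * np.pi * (i + 1) * f / fs) for i in range(2))
peak = f[np.argmax(np.abs(1.0 / E))]   # lands near the 500 Hz resonance
```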
\subsection{Perceptual LPC (PLP)}
\label{sec:PLP}
Perceptual LPC combines together the frequency-dependent smoothing of
MFSC with the peak-focused smoothing of LPC~\cite{Her90}. The
spectral model $\tilde{A}(t_0,k)$ is chosen according to
\begin{equation}
\tilde{A}(t_0,k) = \arg\min
\sum_k \left(\frac{\tilde{X}^2(t_0,k)}{\tilde{A}^2(t_0,k)}\right)~~~
\mbox{subject to}~\sum_k \tilde{X}^2(t_0,k)=\sum_k \tilde{A}^2(t_0,k)
\label{eq:plp_error}
\end{equation}
PLP coefficients are never used directly, except for the purpose of
displaying a pretty smoothed spectrum. In speech recognition, they
are always DCT'd:
\begin{equation}
\mbox{PLP}(t_0,m) = {\mathcal C}\left\{\log\tilde{A}(t_0,k)\right\}
\end{equation}
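A sketch of the core idea (not Hermansky's full recipe, which also
includes equal-loudness weighting and intensity-to-loudness
compression): the inverse DFT of the symmetrically extended band
power spectrum gives autocorrelation coefficients, and the
Yule-Walker equations then give an all-pole fit. The band spectrum
here is synthetic.

```python
import numpy as np

def allpole_from_bands(P, order):
    """All-pole fit to K band energies P[k] spanning 0..fs/2: irfft of
    the power spectrum gives the autocorrelation, then Yule-Walker
    gives the predictor coefficients."""
    r = np.fft.irfft(P)[: order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

k = np.arange(32.0)
P = 1.0 + 10.0 * np.exp(-0.5 * ((k - 10.0) / 2.0) ** 2)  # one broad peak
a = allpole_from_bands(P, 8)

# evaluate the model power spectrum at the 32 band frequencies (DFT size 62)
w = 2.0 * np.pi * k / 62.0
E = 1.0 - sum(a[i] * np.exp(-1j * (i + 1) * w) for i in range(8))
A_model = 1.0 / np.abs(E) ** 2   # smooth, peaked near band 10
```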
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Computing Acoustic Features with HTK}
The HTK program {\tt HCopy} can convert a waveform into any of the
acoustic feature types described in Sec.~\ref{sec:features}.
If you are on a machine that doesn't have HTK already installed,
download the source code and/or binaries from
http://htk.eng.cam.ac.uk. You will have to register (free). While
you're there, download the HTK book. If you plan to do anything
serious, you will need chapters 4-17.
\subsection{Using HCopy}
The {\tt HCopy} program copies a speech file. In the process of
copying, it can (1) convert from one feature type to another, (2)
convert from other file formats to HTK file format, (3) extract
subsegments corresponding to labeled phonemes and/or specified start
and end times, or (4) concatenate multiple files. {\tt HCopy} is
usually called (from the unix or DOS command line) as
\begin{verbatim}
HCopy -S scriptfile.scp -T 1 -C config.txt
\end{verbatim}
The {\tt -T 1} option tells any HTK program to print trace (debugging)
information to standard error. Higher trace settings print more
information; {\tt -T 1} will just print the name of each speech file
as it is converted.
An HTK ``script file'' is a list of the files to be processed. All
HTK tools can take a script file using the {\tt -S} option. The
content of a script file is literally appended to the command line;
thus, for example, the command shown above could also be implemented
by typing
\begin{verbatim}
HCopy -S scriptfile.scp
\end{verbatim}
and by including, as the first line in the script file, the characters
\verb:-T 1 -C config.txt:. In general, it is considered good form to
specify options like {\tt -T} and {\tt -C} on the command line, and to
use the script file only to specify the filename arguments.
The filename arguments of {\tt HCopy} come in inputfile-outputfile
pairs, e.g., the file might include
\begin{verbatim}
timit/TRAIN/DR1/FCJF0/SA1.WAV data/fcjf0sa1.plp
timit/TRAIN/DR1/FCJF0/SA2.WAV data/fcjf0sa2.plp
...
\end{verbatim}
The line breaks are optional; you can specify all of {\tt HCopy}'s
filename arguments on the same line, but your script file will be less
readable that way. A script file like the one above could be
generated using the following bash script:
\begin{verbatim}
#!/bin/bash
for d in `ls timit/TRAIN`; do
  for s in `ls timit/TRAIN/$d`; do
    for f in timit/TRAIN/$d/$s/*.WAV; do
      output=`echo $f | sed 's!.*/\([^/]*\)/\([^.]*\)\.WAV!data/\1\2.plp!' | tr A-Z a-z`;
      echo $f $output;
    done
  done
done
\end{verbatim}
All of the processing to be performed is specified in the
configuration file, which could be called {\tt config.txt}.
Here is an example:
\begin{verbatim}
SOURCEFORMAT = NIST
SOURCEKIND = WAVEFORM
TARGETKIND = PLP_E_D_A_Z_C
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
NUMCHANS = 32
CEPLIFTER = 27
NUMCEPS = 12
\end{verbatim}
Here are what the lines of the configuration file mean:
\begin{itemize}
\item
{\tt SOURCEFORMAT} specifies the file format of the input. This can
be {\tt NIST} (for NIST or LDC distributed SPHERE files), or {\tt WAV}
(for Microsoft WAV files), or {\tt HTK} (for HTK-format files).
\item
{\tt TARGETFORMAT}, if specified, would declare the desired format of
the output. Only waveform data can be output in any form other than
HTK. Since HTK is the default, {\tt TARGETFORMAT} doesn't need to be
specified.
\item
{\tt SOURCEKIND} specifies that the input is waveform data. This line
is optional, since WAVEFORM is the default input data kind.
\item
{\tt TARGETKIND} specifies that the outputs are PLP cepstral
coefficients (Sec.~\ref{sec:PLP}). Other useful options include {\tt
MELSPEC} ($\tilde{X}(t_0,k)$, Sec.~\ref{sec:MFSC}), {\tt FBANK}
(MFSCs, Sec.~\ref{sec:MFSC}), {\tt MFCC} (Sec.~\ref{sec:MFCC}), and
{\tt LPCEPSTRA} (Sec.~\ref{sec:LPCC}). The modifiers have the
following meanings:
\begin{itemize}
\item {\tt \_E}: Append a normalized energy to each PLP vector. The
``normalized energy'' is the log Energy, normalized so that the
highest value in each utterance file is exactly 1.0.
\item {\tt \_D}: Append the delta-PLPs and delta-energy to each PLP
vector.
\item {\tt \_A}: Append delta-delta-PLPs and delta-delta-energy to
each PLP vector.
\item {\tt \_Z}: Perform cepstral mean subtraction: from each cepstral
vector, subtract the average cepstral vector, where the average is
computed over the entire file. This option is most useful when the
recognizer will be tested with a different microphone, or in a
different kind of room, than was used to train the recognizer.
Under other conditions, it may hurt recognition performance.
\item {\tt \_C}: Compress the coefficients. Compressed HTK files are
half as large as uncompressed files, and compressed files may
actually have {\em better} precision than uncompressed files.
\end{itemize}
\item {\tt WINDOWSIZE} specifies the length of the window $w(t)$, in
100ns units. The example specifies a 25ms window ($25ms = 25000\mu
s = 250000\times 100ns$).
\item {\tt TARGETRATE} specifies the frame skip parameter, in 100ns
units. The example specifies one frame per 10ms ($10ms = 10000\mu s
= 100000\times 100ns$).
\item {\tt NUMCHANS} specifies the number of mel-frequency filterbanks
to use. Thus, in Eq.~\ref{eq:plp_error}, $k$ goes from 1 to 32.
\item {\tt NUMCEPS} specifies the number of cepstra to keep. Thus the
total acoustic feature vector will have dimension 39: 12 cepstra,
then the energy, then the deltas of all 13 coefficients, then their
delta-deltas.
\item {\tt CEPLIFTER} specifies that the cepstrum should be
``liftered,'' in order to slightly de-emphasize the lower-order
cepstra. Liftering is useful because the low-order cepstra (small
$m$) are usually much larger in amplitude than the high-order cepstra
(large $m$). Without liftering, speech recognition distance measures
would completely ignore the high-order cepstra. Liftering multiplies
the cepstra by a lifter of length $L=27$:
\begin{equation}
\mbox{PLP}'(t,m) = \mbox{PLP}(t,m) \times
\left(1+\frac{L}{2}\sin\frac{\pi m}{L}\right)
\end{equation}
\end{itemize}
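Plugging the example configuration into the liftering equation above
($L=27$ from {\tt CEPLIFTER}, cepstral indices $m=1\ldots 12$ from
{\tt NUMCEPS}), the weights can be tabulated directly:

```python
import numpy as np

L = 27                        # CEPLIFTER
m = np.arange(1, 13)          # cepstral indices 1..NUMCEPS
lifter = 1.0 + (L / 2.0) * np.sin(np.pi * m / L)
# weights grow from about 2.6 at m=1 to about 14.3 at m=12,
# boosting the (otherwise tiny) high-order cepstra
```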
\subsection{HTK File Format}
\label{sec:HTKfile}
The HTK file format is one of the best available formats for the
compact but precise storage of periodic real-valued spectral or
cepstral vectors (e.g., one vector every 10ms for 2 seconds, or one
vector per day for 365 days, or any other such file). The main
advantage of HTK format is the fast, fixed-length, high-accuracy
data compression scheme. The main disadvantage is the unnecessarily
restrictive type checking.
An HTK file is a 12-byte header, followed optionally by compression
information, followed by sample vectors stored in either 4-byte
floating point or 2-byte short integer format. All data are stored in
big-endian order (high-order byte stored first), with floating-point
values in IEEE standard format; thus, if you are writing C code to read
or write an HTK file on a little-endian machine (e.g., Intel or AMD),
you will need to implement byte swapping. The header contains the following
variables:
\centerline{
\begin{tabular}{ll}
nSamples & Number of sample vectors in the file (4 byte integer)\\
sampPeriod & Sample period in 100ns units (4 byte integer)\\
sampSize & Number of bytes in each sample vector (2 byte
integer)\\
parmKind & Two-byte code specifying feature kind
\end{tabular}}
The {\tt parmKind} code specifies the base kind (PLP, MFCC, etc), plus
all modifiers (\_E\_D\_A\ldots) using a code specified on p. 66 of the
HTK book. The {\tt sampSize} specifies the number of {\em bytes} per
vector, not the number of {\em dimensions}: thus if vectors are
compressed, {\tt sampSize} is twice the number of dimensions,
otherwise four times.
If the file is uncompressed, the rest of the file is a series of
parameter vectors, stored in 4-byte floating point format (IEEE
standard big-endian).
If the file is compressed, the next $8N_d$ bytes contain $2N_d$
floating point numbers: $A[1]\ldots A[N_d]$, and $B[1]\ldots B[N_d]$,
where $N_d$ is the dimension of each sample vector. After the
compression information, the sample vectors themselves are stored as
2-byte big-endian integers, $D[t,m]$. The value of the real features
are computed as:
\begin{equation}
\mbox{PLP}(t,m) = \frac{1}{A[m]} \left(D[t,m] + B[m]\right)
\end{equation}
The values of $A[m]$ and $B[m]$ are chosen as specified on p. 67 of the
HTK book, in order to make sure that $\max_t D[t,m]=32767$ and
$\min_t D[t,m]=-32767$. Since the coefficients are scaled up to take full
advantage of the 16-bit short integer, the signal to quantization
noise ratio (SQNR) of features encoded this way is truly $2^{16}$, or
96dB --- better, in some cases, than the SQNR of a naive
floating-point encoding.
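The layout above is simple enough to read with a few lines of code.
Here is a hedged sketch in Python: the base-kind code 11 is assumed to
be PLP (check the table in the HTK book), and the demo file contents
are invented. Note the big-endian ({\tt >}) dtypes:

```python
import struct
import numpy as np

def read_htk(filename):
    """Read an HTK parameter file; returns (data, sampPeriod, parmKind).
    Compressed files (bit 11 of parmKind, mask 0x400) are expanded
    using PLP(t,m) = (D[t,m] + B[m]) / A[m]."""
    with open(filename, "rb") as fh:
        raw = fh.read()
    n_samp, period, samp_size, kind = struct.unpack(">iihh", raw[:12])
    if kind & 0x400:                      # compressed: 2-byte shorts
        n_dim = samp_size // 2
        A = np.frombuffer(raw[12 : 12 + 4*n_dim], dtype=">f4")
        B = np.frombuffer(raw[12 + 4*n_dim : 12 + 8*n_dim], dtype=">f4")
        D = np.frombuffer(raw[12 + 8*n_dim :], dtype=">i2").reshape(-1, n_dim)
        data = (D + B) / A
    else:                                 # uncompressed: 4-byte floats
        data = np.frombuffer(raw[12:], dtype=">f4").reshape(-1, samp_size // 4)
    return data, period, kind

# round trip with a tiny hand-built compressed file (2 frames, 3 dims)
A = np.array([100.0, 50.0, 25.0], dtype=">f4")
B = np.array([0.0, 10.0, -10.0], dtype=">f4")
D = np.array([[100, 200, 300], [-100, -200, -300]], dtype=">i2")
header = struct.pack(">iihh", 2, 100000, 6, 11 | 0x400)  # PLP + _C (assumed)
with open("demo.htk", "wb") as fh:
    fh.write(header + A.tobytes() + B.tobytes() + D.tobytes())
feats, period, kind = read_htk("demo.htk")
```

The reader infers the frame count from the file length rather than
from {\tt nSamples}, since {\tt HCopy} may count the compression
vectors in {\tt nSamples}.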
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Manipulating HTK Files with matlab, C, or PVTK}
If you are doing research on the acoustic or phonetic features of a
speech recognizer, you will want to read and write HTK-format files.
Using the information in Sec.~\ref{sec:HTKfile}, together with
pp. 66-67 of the HTK book, you can easily write your own code to read
and write HTK files. There are a few oddities to remember: (1) the
file should be compressed if and only if bit 11 of the {\tt parmKind}
is set (see p. 66 of HTK Book), (2) when writing files, clear bit 13
of the {\tt parmKind}, otherwise HTK tools will expect a checksum at
the end of the file, (3) when writing files that do not contain delta
or delta-delta coefficients, be sure to clear bits 9 and 10 of the
{\tt parmKind}; otherwise HTK will insert zeros on the end of each
feature vector in order to ensure that the feature vector dimension is
compatible with the presence of delta and/or delta-delta coefficients
(i.e., to make sure the feature dimension is a multiple of 3).
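The bit bookkeeping above can be captured with a few masks (values
taken from the qualifier table on p. 66 of the HTK book; treat them as
assumptions and verify against your copy):

```python
# qualifier bits in parmKind (1-indexed bit positions as in the text)
HASENERGY = 0x0040   # _E  (bit 7)
HASDELTA  = 0x0100   # _D  (bit 9)
HASACCS   = 0x0200   # _A  (bit 10)
HASCOMPX  = 0x0400   # _C  (bit 11, compressed)
HASCRCC   = 0x1000   # _K  (bit 13, checksum)

def clean_parm_kind(kind, has_deltas=False):
    """Prepare a parmKind for writing: no checksum bit, and no
    delta/delta-delta bits unless the vectors really contain them."""
    kind &= ~HASCRCC                      # (2) clear bit 13: no checksum
    if not has_deltas:
        kind &= ~(HASDELTA | HASACCS)     # (3) clear bits 9 and 10
    return kind
```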
Of course, you don't need to write your own: matlab code is available
in the {\tt speechfileformats.tgz} file available on the course web
page. There are three different C implementations available. (1)
{\tt PVTK} reads an HTK file as a matrix. The code is in file {\tt
PVTKLib.c}, functions {\tt ReadHParm} and {\tt WriteHParm}; the matrix
definition is specified in {\tt HTKLib/HMem.h}. (2) {\tt SpeechLib}
includes an ``observation'' data type (file {\tt Observation.h}), and
functions that read and write observations from HTK files. SpeechLib
is completely self-contained, and does not depend on the HTK libraries.
(3) HTK's code for reading and writing files is contained in the file
{\tt HTKLib/HParm.c}. HTK's code is extremely complex, so that it can
handle streaming audio input --- if you don't need streaming audio,
use either the {\tt PVTK} or the {\tt SpeechLib} versions.
\subsection{Using VTransform to Transform HTK Vectors}
In order to compile PVTK, you must first have HTK compiled. Once HTK
is compiled, download the PVTK source. Edit the following lines in the
PVTK Makefile, in order to specify the location of the HTK library on
your system:
\begin{verbatim}
# Where are the HTK libraries?
hlib = htk
HLIBS = $(hlib)/HTKLib.$(CPU).a
\end{verbatim}
Then type {\tt make}.
PVTK contains three programs: {\tt VTransform}, {\tt VExtract}, and
{\tt VApplySvms}. In order to see a complete manual page, type the
name of the program without any arguments, and hit return.
{\tt VTransform} can be used to concatenate multiple HTK vectors
together into a single vector, or to implement arbitrary linear or
nonlinear transformations of the vectors in an HTK file. {\tt
VTransform} could be used to apply a neural network or SVM to an HTK
file --- but not to train the neural network or SVM!!
For example, consider the following (rather arbitrary) transformation.
Suppose you wish to accept pairs of input files, of type MFCC and PLP.
You want to concatenate the feature vectors from each input file, and
then concatenate together 21 successive frames of the result, creating
a very large data matrix $A_0$. Finally, you want to apply the
following transformation, in order to create data matrix $A_5$:
\begin{equation}
A_5 = \sin(A_0 B + b) - \tanh(A_0 C + c)
\end{equation}
Applying this transform consists of the following steps. First,
create the matrix files Bb.txt and Cc.txt, containing the matrices B
and C and the row vectors b and c in the format given below. Matrices
are loaded from text files (e.g. Bb.txt, Cc.txt) with the following
format. The first line must contain the number of rows (including the
offset vector) and the number of columns. The second line contains
the offset vector. Remaining lines contain the rows of the transform
matrix. Comments are not allowed. A non-square transform matrix will
result in a change in the dimension of the feature vector. Here is a
short example matrix file, for converting from a 5-dimensional input
feature vector to a 4-dimensional output feature vector. The offset
vector is here set to zero:
\begin{verbatim}
6 4
0 0 0 0
1 1 1 1
1 1 1 -1
1 1 -1 -1
1 -1 -1 -1
1 1 -0.5 0.5
\end{verbatim}
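The format is easy to parse. A sketch (the row-vector convention
$y = xB + b$ is my reading of the description above, not something the
toolkit documents here):

```python
import numpy as np

def read_vtransform_matrix(filename):
    """Parse a VTransform matrix file: first line 'rows cols' (rows
    includes the offset vector), second line the offset vector b,
    remaining lines the rows of the transform matrix B."""
    with open(filename) as fh:
        tokens = fh.read().split()
    rows, cols = int(tokens[0]), int(tokens[1])
    vals = np.array([float(t) for t in tokens[2:]]).reshape(rows, cols)
    return vals[1:], vals[0]      # (B, b): y = x @ B + b

# the 5-in / 4-out example from the text
with open("Bb.txt", "w") as fh:
    fh.write("""6 4
0 0 0 0
1 1 1 1
1 1 1 -1
1 1 -1 -1
1 -1 -1 -1
1 1 -0.5 0.5
""")
B, b = read_vtransform_matrix("Bb.txt")
```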
Once you have created the files Bb.txt and Cc.txt, then call
\begin{verbatim}
VTransform -S TRAIN.scp -c 21 -m Bb.txt 0 -n sin 1 \
    -m Cc.txt 0 -n tanh 3 -n subtract 2 4
\end{verbatim}
The {\tt -S} option works just as in HTK tools. If there were only
one inputfile per outputfile, then the file {\tt TRAIN.scp} would
contain inputfile-outputfile pairs, e.g.,
\begin{verbatim}
data/fcjf0sa1.plp transformed/fcjf0sa1.vtt
data/fcjf0sa2.plp transformed/fcjf0sa2.vtt
...
\end{verbatim}
In this example, however, we want to concatenate feature vectors from
two different input files: an MFCC file, and a PLP file. This is done
by separating the two input files using the {\tt -h} option
(specifying ``horizontal'' concatenation of the two data matrices).
The file {\tt TRAIN.scp} therefore contains:
\begin{verbatim}
data/fcjf0sa1.mfc -h data/fcjf0sa1.plp transformed/fcjf0sa1.vtt
data/fcjf0sa2.mfc -h data/fcjf0sa2.plp transformed/fcjf0sa2.vtt
...
\end{verbatim}
The {\tt -c} option specifies that, at this point, {\tt VTransform}
should create a new data matrix that is 21 times as wide as the
existing matrix. This is done by concatenating frames. If the MFCC
and PLP input files each contained $N_f$ frames and $N_d$ dimensions,
then the new data matrix $A_0$ will be of size $N_f\times 42N_d$:
\begin{equation}
A_0(t,:) = [\mbox{MFCC}(t-10,:),
\mbox{PLP}(t-10,:),\cdots,\mbox{MFCC}(t+10,:), \mbox{PLP}(t+10,:)]
\label{eq:concatenation}
\end{equation}
The output matrix has exactly as many rows as the input matrix;
Eq.~\ref{eq:concatenation} assumes that frames $t\le 0$ are identical
to frame $t=1$, and that frames $t>N_f$ are identical to frame
$t=N_f$.
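The frame-concatenation step can be sketched with numpy for a single
input feature matrix ({\tt VTransform} interleaves the MFCC and PLP
streams first), replicating the edge frames exactly as described
above:

```python
import numpy as np

def stack_context(feats, context=10):
    """Concatenate 2*context+1 successive frames of an (N_f x N_d)
    feature matrix; frames before the start (after the end) are
    replicated from the first (last) frame."""
    n_f = feats.shape[0]
    idx = np.clip(np.arange(-context, n_f + context), 0, n_f - 1)
    rows = [feats[idx[t : t + 2 * context + 1]].ravel() for t in range(n_f)]
    return np.stack(rows)

feats = np.arange(12.0).reshape(6, 2)    # 6 frames, 2 dims (toy data)
A0 = stack_context(feats, context=10)    # 6 frames, 2 * 21 = 42 dims
```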
The remaining five command line options specify a series of five
linear and nonlinear transformations of the data. Linear
transformations are specified using {\tt -m}; nonlinear options are
specified using {\tt -n}. Data is transformed in the following
stages:
\begin{enumerate}
\item $A_1 = A_0 B + b$ --- linear transformation.
\item $A_2 = \sin(A_1)$ --- element-wise nonlinear transformation.
\item $A_3 = A_0 C + c$ --- linear transformation. Notice that the
option specification ({\tt -m Cc.txt 0}) specifies that the input to
stage 3 should come from $A_0$.
\item $A_4 = \tanh(A_3)$
\item $A_5 = A_2 - A_4$.
\end{enumerate}
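In numpy, the five stages amount to the following (random matrices
stand in for real data and transforms; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A0 = rng.standard_normal((6, 42))         # data matrix after concatenation
B, b = rng.standard_normal((42, 8)), rng.standard_normal(8)
C, c = rng.standard_normal((42, 8)), rng.standard_normal(8)

A1 = A0 @ B + b        # -m Bb.txt 0     (input: stage 0)
A2 = np.sin(A1)        # -n sin 1        (input: stage 1)
A3 = A0 @ C + c        # -m Cc.txt 0     (input: stage 0 again)
A4 = np.tanh(A3)       # -n tanh 3       (input: stage 3)
A5 = A2 - A4           # -n subtract 2 4 (inputs: stages 2 and 4)
```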
If {\tt VTransform} does not contain the nonlinear transform that you
are looking for, it is easy to add it. Open the file {\tt
VTransform.c}, and find the function {\tt FunctionToDVectors}. Here
there is a long list of string comparisons followed by arithmetic
operations:
\begin{verbatim}
...
else if (!strcmp(func,"exp")) z[i] = exp(x[i]);
else if (!strcmp(func,"cos")) z[i] = cos(x[i]);
else if (!strcmp(func,"sin")) z[i] = sin(x[i]);
...
\end{verbatim}
add your favorite nonlinear function to this list, and give it a
name. Now, at the command line, type {\tt make}; then when the
program has compiled, you can specify your new function with the {\tt
-n} option to {\tt VTransform}.
\bibliography{/home/hasegawa/ref/references}
\bibliographystyle{plain}
\end{document}