\documentclass[]{article}
\usepackage{psfig}
\setlength{\oddsidemargin}{0in}
\setlength{\topmargin}{0in}
\setlength{\textheight}{9in}
\setlength{\textwidth}{6.5in}
\newcommand{\nop}{\rule{0in}{0in}}
\title{Lecture 4: Phonetic Features, Neural Nets, and Support Vector Machines}
\author{Lecturer: Mark Hasegawa-Johnson (jhasegaw@uiuc.edu)\\TA: Sarah
Borys (sborys@uiuc.edu)\\Web Page: http://www.ifp.uiuc.edu/speech/courses/minicourse/}
\begin{document}
\maketitle
\section{Software Tools}
In order to do the labs in this section, you will need to install HTK,
PVTK, libSVM, and svm-light. If you are interested in training and
testing neural networks (instead of SVMs), you may also want to
install and compile QuickNet, or you may want to use the Matlab
Neural Network Toolbox.

LibSVM~\cite{FanChe05} and svm\_light~\cite{Joa98} are distributed in
binary form, so you won't need to compile them. If you're serious
about SVMs, you will want both, because each has minor limitations
(unless I have just missed these features in the documentation):
libSVM, but not svm\_light, can train multi-class SVMs, while
svm\_light, but not libSVM, can output the value of the SVM
discriminant function for each classified vector. Both libSVM and
svm\_light are
built around two main command-line programs: one to train SVMs (called
{\tt svm\_learn} or {\tt svm-train}), and one to test SVMs (called
{\tt svm\_classify} or {\tt svm-predict}). LibSVM also includes a
useful pre-processing tool called {\tt svm-scale}, as well as a useful
directory full of python tools that automatically search for the best
SVM hyperparameters, and plot the result. Both LibSVM and svm\_light
use similar (but not identical) command line syntax, and similar (but
not identical) SVM definition file format. Both use the same file
format for data: data must be stored in text files, with one vector
per row. The first element in the row is the class label (-1 or 1 in
svm\_light, any integer in libSVM). Remaining columns are of the form
dimension:value, where dimension is the feature number, and value is
its value; dimensions not specified are assumed to have a value of
zero. For example, the following svm\_light input data file stores
one positive token $[4,0,1.2]$ and one negative token $[0,-3,0.6]$:
\begin{verbatim}
1 1:4 3:1.2
-1 2:-3 3:0.6
\end{verbatim}
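The sparse format is easy to generate programmatically. The following
is a minimal Python sketch (not part of either toolkit) that encodes a
labeled vector in this format, omitting zero-valued dimensions:

```python
def to_sparse_line(label, vec):
    """Encode one labeled vector in libSVM/svm_light text format.
    Dimensions are 1-based; zero-valued dimensions are omitted."""
    pairs = ["%d:%g" % (i + 1, v) for i, v in enumerate(vec) if v != 0]
    return " ".join([str(label)] + pairs)

# The two tokens from the example file above:
print(to_sparse_line(1, [4, 0, 1.2]))    # -> "1 1:4 3:1.2"
print(to_sparse_line(-1, [0, -3, 0.6]))  # -> "-1 2:-3 3:0.6"
```

Note that dimension indices start at 1, not 0, in both toolkits.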
In order to use PVTK and HTK: create a directory called ``programs,''
or ``src,'' or ``apps,'' or something like that. Create
subdirectories ``apps/htk'' and ``apps/pvtk.'' If you are on a system
that already has \verb:libHTK.a: available somewhere, e.g., in the
directory \verb:/workspace/HTK_V3.1/HTKLib/libHTK.linux.a:, then link
the existing HTK directories into \verb:apps/htk: with commands like
\begin{verbatim}
ln -s /workspace/HTK_V3.1/HTKLib apps/htk
ln -s /workspace/HTK_V3.1/bin.linux apps/htk
\end{verbatim}
If your system does not have HTK installed, then install it: unpack
the HTK archive to get directories ``apps/htk/HTKLib'' and
``apps/htk/HTKTools.'' Create the directory apps/htk/bin.linux or
bin.win32, depending on which machine you're using. If you're on a
Windows machine with Visual C installed, you may follow the
instructions in htk/HTKLib/htk\_htklib\_nt.mkf and
htk/HTKTools/htk\_htktools\_nt.mkf --- or alternatively, you should be
able to install Cygwin (be sure to install all of the development
tools, including at least gcc and make), and then you could follow the
linux instructions in your cygwin window. If you're on a linux
machine, edit HTKLib/Makefile and HTKTools/Makefile so that they
contain the following lines:
\begin{verbatim}
CPU = linux
HTKCC = gcc
HTKCF = -ansi
\end{verbatim}
Then type {\tt cd HTKLib; make; cd ../HTKTools; make}.

Now unpack the PVTK archive in apps/pvtk. You should be able to type
{\tt cd apps/pvtk; make}. If that doesn't work, check the Makefile to
make sure that the HLIBS variable is pointing to the true location of
the HTK library archive.

In order to use all of these tools, make sure that these directories
are in your path:
\begin{verbatim}
export PATH=${PATH}:~/apps/htk/bin.linux:~/apps/pvtk:~/apps/svm_light:~/apps/libsvm-2.7
\end{verbatim}
\section{Binary Classifiers: LDA, NN, SVM}
\subsection{Definition of Terms}
\begin{eqnarray*}
\mbox{Observation:} &&
\vec{x} \in \Re^K,~~k^{th}~\mbox{element is}~x_k\\
\mbox{Label:} &&
y \in \left\{-1,1\right\}\\
\mbox{Trainable parameters:} &&
\vec\theta \in \Re^D\\
\mbox{Training tokens:} &&
(\vec{x}_m,y_m),~~\vec{x}_m=[x_{m1},\ldots,x_{mK}]'
\end{eqnarray*}
An observation can be a spectral slice, or it can be a whole
spectrogram. For example, by concatenating 27 MFSC vectors (from
$t-13$ to $t+13$), each with 32 dimensions, we get a total observation
vector $\vec{x}_t$ with a dimension of $32\times 27=864$:
\centerline{\psfig{figure=melgram.ps,width=3in}\psfig{figure=vectorized.ps,width=3in}\nop}
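The concatenation step can be sketched in a few lines of Python. The
dummy frame values, and the choice to repeat edge frames at utterance
boundaries, are illustrative assumptions, not part of any toolkit:

```python
def stack_frames(frames, t, radius=13):
    """Concatenate frames t-radius .. t+radius into one observation
    vector; edge frames are repeated at the ends of the utterance."""
    T = len(frames)
    out = []
    for tau in range(t - radius, t + radius + 1):
        tau = min(max(tau, 0), T - 1)   # clamp at utterance edges
        out.extend(frames[tau])
    return out

# 100 frames of 32-dim MFSCs (dummy values); 27 frames x 32 dims = 864
frames = [[0.0] * 32 for _ in range(100)]
x_t = stack_frames(frames, t=50)
print(len(x_t))  # -> 864
```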
A ``binary classifier'' is a function $h(\vec{x})$ that observes a
vector $\vec{x}$ (e.g., a spectrum or spectrogram), and computes a
binary output (e.g., the value of a phonological distinctive feature
at some specified time):
\begin{equation}
h(\vec{x}|\vec\theta) \in \left\{-1,1\right\}
\end{equation}
Assume that labels and observations obey some constant joint
probability distribution, $p(\vec{x},y)$. The quality of a classifier
is its ``expected risk'' (also called ``risk'' or ``expected test
corpus error rate''):
\begin{equation}
R(\vec\theta) = E |y-h(\vec{x}|\vec\theta)|
= \sum_y \int |y-h(\vec{x}|\vec\theta)| p(\vec{x},y) dx
\end{equation}
In most practical cases, $p(\vec{x},y)$ is unknown, therefore the
expected risk is unknown. Instead, all that we have available are $M$
labeled training tokens, $(y_m,\vec{x}_m)$. Given the training
tokens, all that we can compute is the ``empirical risk'' or
``training corpus error:''
\begin{equation}
R_{emp}(\vec\theta) = \frac{1}{M} \sum_{m=1}^M |y_m-h(\vec{x}_m|\vec\theta)|
\end{equation}
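As a concrete check of this formula, here is a Python sketch of the
empirical risk of a linear classifier on a toy two-token corpus (the
tokens and parameters are made up for illustration). Note that each
misclassified token contributes $|y_m-h|=2$ to the sum, so $R_{emp}$ is
twice the error rate:

```python
def h(x, v, b):
    """Linear classifier h(x|v,b) = sign(v'x - b); sign(0) taken as +1."""
    s = sum(vi * xi for vi, xi in zip(v, x)) - b
    return 1 if s >= 0 else -1

def empirical_risk(tokens, v, b):
    """R_emp = (1/M) sum_m |y_m - h(x_m|v,b)|."""
    return sum(abs(y - h(x, v, b)) for x, y in tokens) / len(tokens)

# the two tokens from the svm_light example file earlier:
toks = [([4.0, 0.0, 1.2], 1), ([0.0, -3.0, 0.6], -1)]
print(empirical_risk(toks, v=[0.0, 1.0, 0.0], b=0.0))  # -> 0.0
```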
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Maximum A Posteriori Classification}
If the joint probability density $p(\vec{x},y)$ is known, then the
classifier risk can be explicitly minimized. The optimal classifier
is the classifier that chooses a label, $y$, in order to maximize the
{\em a posteriori} probability $p(y|\vec{x})$:
\begin{equation}
E |y-h(\vec{x})| = 2\int_{h(\vec{x})=-1} p(y=1,\vec{x})d\vec{x} +
2\int_{h(\vec{x})=1}p(y=-1,\vec{x})d\vec{x}
\end{equation}
\begin{equation}
h_{opt}(\vec{x}) = \arg\max_y p(y|\vec{x}) = \left\{\begin{array}{ll}
1 & p(y=1|\vec{x})~\mbox{is larger}\\
-1 & p(y=-1|\vec{x})~\mbox{is larger}
\end{array}\right.
\end{equation}
The MAP classifier can be written as a threshold operation, applied to
a real-valued discriminant function ${\mathcal L}(\vec{x})$:
\begin{equation}
h(\vec{x}) = \mbox{sign}\left({\mathcal L}(\vec{x}) - b\right)
\end{equation}
\begin{equation}
{\mathcal L}(\vec{x}) = \log\left(\frac{p(\vec{x}|y=1)}{p(\vec{x}|y=-1)}\right)
\end{equation}
\begin{equation}
b=\log\left(\frac{p(y=-1)}{p(y=1)}\right)
\end{equation}
\subsubsection{MAP Example: Gaussians with Equal Covariance}
Suppose we assume that $p(\vec{x}|y)$ is Gaussian, with a covariance
matrix $R$ that is independent of the value of $y$:
\begin{equation}
p(\vec{x}|y) = \frac{1}{(2\pi)^{K/2}|R|^{1/2}}e^{-0.5(\vec{x}-
\vec\mu_y)'R^{-1}(\vec{x}-\vec\mu_y)}
\end{equation}
Then the MAP classifier is a Linear Discriminant:
\begin{equation}
h(\vec{x}) = \mbox{sign}\left(\vec{v}^T\vec{x} - b\right)
\label{eq:lda}
\end{equation}
where the Linear Discriminant vector $\vec{v}$ and threshold $b$ are:
\begin{equation}
\vec{v} = R^{-1}(\vec\mu_1-\vec\mu_{-1})
\end{equation}
\begin{equation}
b = \log\left(\frac{p(y=-1)}{p(y=1)}\right) + 0.5\vec{v}^T(\vec\mu_1+
\vec\mu_{-1})
\end{equation}
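These formulas can be checked numerically. The following Python sketch
computes $\vec{v}$ and $b$ for a hand-built two-dimensional example;
the means, covariance, and priors are made-up illustrative values:

```python
import math

def lda(mu_pos, mu_neg, R, p_pos=0.5):
    """v = R^{-1}(mu_1 - mu_{-1}); b = log(p(-1)/p(1)) + 0.5 v'(mu_1 + mu_{-1}).
    R is a 2x2 shared covariance matrix, inverted by hand."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    Rinv = [[R[1][1] / det, -R[0][1] / det],
            [-R[1][0] / det, R[0][0] / det]]
    d = [mu_pos[0] - mu_neg[0], mu_pos[1] - mu_neg[1]]
    v = [Rinv[0][0] * d[0] + Rinv[0][1] * d[1],
         Rinv[1][0] * d[0] + Rinv[1][1] * d[1]]
    s = [mu_pos[0] + mu_neg[0], mu_pos[1] + mu_neg[1]]
    b = math.log((1.0 - p_pos) / p_pos) + 0.5 * (v[0] * s[0] + v[1] * s[1])
    return v, b

# equal priors, identity covariance: the boundary v'x = b passes
# through the midpoint of the two means, as expected
v, b = lda([2.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(v, b)  # -> [2.0, 0.0] 2.0
```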
For example, consider the problem of classifying stop place of
articulation (alveolar vs. non-alveolar) based on 864-dimensional
spectrogram observation vectors. The linear discriminant projection
of all samples extracted from the TIMIT/TRAIN directory looks like
this:
\centerline{\psfig{figure=lda25d.ps,width=3in}\nop}
\subsubsection{Other Linear Classifiers: Empirical Risk Minimization (ERM)}
\label{sec:erm}
It is possible to choose a linear discriminant vector, $\vec{v}$,
according to any optimality criterion. For example, consider a
one-node ``neural network;'' a one-node neural network computes the
function shown in Eq.~\ref{eq:lda}. A neural network is usually
trained by choosing some initial value of the coefficient vector
$\vec{v}$, and then iteratively adjusting the coefficients in order to
reduce classification error on the training corpus. If training is
successful, the neural network will learn a vector $\vec{v}$ that
minimizes the training-corpus error $R_{emp}(\vec{v},b)$:
\begin{equation}
\mbox{Training Rule: Choose}~(\vec{v},b) = \arg\min R_{emp}(\vec{v},b)
\label{eq:empirical_risk}
\end{equation}
\begin{equation}
\mbox{...where}~R_{emp}(\vec{v},b) =
\frac{1}{M}\sum_{m=1}^M |y_m-h(\vec{x}_m|\vec{v},b)|,~~~
\mbox{...and}~h(\vec{x}|\vec{v},b) =
\mbox{sign}\left(\vec{v}^T\vec{x} - b\right)
\end{equation}
\centerline{\psfig{figure=mcelda_train.ps,width=3in}\nop}
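One concrete instance of this iterative recipe is the perceptron update
rule, sketched below in Python; the learning rate, epoch count, and toy
corpus are arbitrary illustrative choices:

```python
def train_linear(tokens, epochs=20, lr=0.1):
    """Iteratively adjust (v, b) to reduce training-corpus error:
    whenever a token is misclassified, nudge the hyperplane toward it."""
    K = len(tokens[0][0])
    v, b = [0.0] * K, 0.0
    for _ in range(epochs):
        for x, y in tokens:
            s = sum(vi * xi for vi, xi in zip(v, x)) - b
            if y * s <= 0:                  # token on the wrong side
                for k in range(K):
                    v[k] += lr * y * x[k]
                b -= lr * y
    return v, b

toks = [([2.0, 0.0], 1), ([-2.0, 0.0], -1),
        ([1.5, 1.0], 1), ([-1.5, -1.0], -1)]
v, b = train_linear(toks)
print(all((sum(vi * xi for vi, xi in zip(v, x)) - b) * y > 0
          for x, y in toks))  # -> True
```

On linearly separable data like this, the rule converges to zero
training error; on non-separable data it merely oscillates near a
low-error solution.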
\subsubsection{Over-Training}
If the true probability distribution $p(\vec{x},y)$ is unknown, then,
in order to design a MAP classifier, it must be estimated from
training tokens. A PDF model learned from training tokens may be
incorrect. If the PDF is represented by a family of functions with
too many degrees of freedom, the learned PDF will capture accidental
details of the distribution of training tokens that are not
characteristic of the true underlying PDF. The resulting
classification function $h(\vec{x})$ may achieve very low error on
the training database, but the low training error may not generalize
to novel test tokens drawn from the same underlying probability
distribution $p(\vec{x},y)$.

An ERM classifier (Sec.~\ref{sec:erm}) suffers similar problems. The
vector $\vec{v}$ that is optimal for minimizing training corpus error
may not also be optimal for minimizing the expected test corpus
error. For example:
\centerline{\psfig{figure=mcelda_train.ps,width=3in}
\psfig{figure=mcelda_test.ps,width=3in}\nop}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Structural Risk Minimization}
\subsubsection{Generalization Error and VC Dimension}
The objective of machine learning is to minimize the structural risk,
not the empirical risk:
\begin{equation}
(\vec{v},b) = \arg\min R(\vec{v},b)
\end{equation}
where
\begin{equation}
R(\vec{v},b) = E_{p(\vec{x},y)} |y-h(\vec{x}|\vec{v},b)|,~~~
\end{equation}
for the unknown true distribution $p(\vec{x},y)$.
The difference between expected and empirical risk is called the
``generalization error:''
\begin{equation}
R(\vec{v},b) = R_{emp}(\vec{v},b) + \mbox{Generalization Error},~~~
R_{emp}=\mbox{Known},~\mbox{Generalization Error=Unknown}
\end{equation}
Vapnik and Chervonenkis~\cite{Vap98,Bur98} showed that generalization
error is bounded by a nearly-linear function of the VC Dimension,
$D_{VC}$:
\begin{equation}
{\mathcal P}\left\{R(\vec{v},b) \le R_{emp}(\vec{v},b) +
f\left(\frac{D_{VC}-\log\delta}{M}\right)\right\} \ge 1-\delta
\label{eq:vc_theorem}
\end{equation}
where $f(\cdot)$ is monotonically increasing and roughly linear. The
VC dimension is a measure of the flexibility of the classifier
function $h(\vec{x}|\theta)$. If, by varying the parameter vector
$\theta$, it is possible to label an arbitrarily large training corpus
in a large number of different ways, then the classifier has a high VC
dimension. If, on the other hand, the classifier function is limited
to only a few different possible ``shatterings'' of the training data,
then the VC dimension is small.
The VC dimension of a linear classifier is at most the number of
trainable parameters:
\begin{equation}
\mbox{Linear Classifier:}~~D_{VC} \le K+1,~~~
K=\mbox{length}(\vec{v})=\mbox{length}(\vec{x})
\label{eq:vc_equals_K}
\end{equation}
Eq.~\ref{eq:vc_equals_K}, combined with Eq.~\ref{eq:vc_theorem},
suggests the following rule: in order to minimize $R(\vec\theta)$, we
should choose a classifier that has (1) the lowest possible empirical
risk, but also (2) the smallest possible parameter dimension. The
tradeoff between the parameter dimension of the classifier, and its
empirical risk, is captured by the Bayesian Information Criterion
(BIC)~\cite{Sch78}. By optimizing the BIC, it is possible to choose
among classifiers with many different levels of complexity.

Vapnik demonstrated that, in many cases, the upper bound in
Eq.~\ref{eq:vc_equals_K} is too large; that the true VC dimension of a
linear classifier may be much lower. Consider the following
situation. For any given training corpus $X$, normalize $\vec{v},b$ so that
\begin{equation}
\min_m |\vec{v}^T\vec{x}_m-b|=1
\label{eq:h_normalization}
\end{equation}
Eq.~\ref{eq:h_normalization} says that the minimum distance between
the hyperplane and any individual data point is $r=1/|\vec{v}|$. Let
$R$ be the radius of a ball containing all of the data points $\vec{x}_m$.
Then
\begin{equation}
D_{VC} \le \left(\frac{R}{r}\right)^2 = \left(R|\vec{v}|\right)^2
\label{eq:ball_size}
\end{equation}
According to Eq.~\ref{eq:ball_size}, it is possible to control the
generalization error of the classifier by controlling the magnitude of
$\vec{v}$, subject to the normalization in Eq.~\ref{eq:h_normalization}:
\begin{equation}
R(\vec{v},b) \le R_{emp}(\vec{v},b) + R^2|\vec{v}|^2
\label{eq:Remp_fv}
\end{equation}
Schematically, $|\vec{v}|$ controls the expressiveness of the
classifier, and a less expressive classifier is less prone to
over-training:
\centerline{\psfig{figure=svm_smallradius.ps,width=3in}
\psfig{figure=svm_largeradius.ps,width=3in}\nop}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5
\subsubsection{Linear Support Vector Machine}
Notice that, in any linear classifier, a classification error occurs
if $y_m(\vec{v}^T\vec{x}_m-b) < 0$. Define a ``partial error'' as
$y_m(\vec{v}^T\vec{x}_m-b) < 1$, and define the ``slack variable''
$\xi_m$ as:
\begin{equation}
\xi_m = \max\left(0, 1 - y_m(\vec{v}^T\vec{x}_m-b)\right)
\label{eq:slackvariable}
\end{equation}
Then
the training corpus error is bounded by
\begin{equation}
R_{emp}(\vec{v},b) \le \sum_{m=1}^M \xi_m
\label{eq:Remp_slackvariables}
\end{equation}
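The slack variables are cheap to compute. The Python sketch below
evaluates Eq.~\ref{eq:slackvariable} on a made-up one-dimensional
corpus: a correctly classified token inside the margin gets
$0<\xi_m<1$, while an outright error gets $\xi_m>1$:

```python
def slack(tokens, v, b):
    """xi_m = max(0, 1 - y_m (v'x_m - b)) for each training token."""
    xis = []
    for x, y in tokens:
        s = sum(vi * xi for vi, xi in zip(v, x)) - b
        xis.append(max(0.0, 1.0 - y * s))
    return xis

# well-classified, margin violation, well-classified, outright error:
toks = [([3.0], 1), ([0.5], 1), ([-2.0], -1), ([1.0], -1)]
print(slack(toks, v=[1.0], b=0.0))  # -> [0.0, 0.5, 0.0, 2.0]
```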
Combining Eq.~\ref{eq:Remp_slackvariables} with Eq.~\ref{eq:Remp_fv},
we find that the structural risk of a linear classifier is bounded,
with high probability, by
\begin{equation}
R(\vec{v},b) \le R^2|\vec{v}|^2 + \sum_{m=1}^M \xi_m
\label{eq:svm_criterion}
\end{equation}
The right-hand side of Eq.~\ref{eq:svm_criterion} is the optimality
criterion minimized by a support vector machine. The goal of training
an SVM, thus, is to minimize the right-hand side of
Eq.~\ref{eq:svm_criterion}, subject to the definition of the slack
variables in Eq.~\ref{eq:slackvariable}. This particular constrained
optimization problem is called a ``quadratic programming'' problem.
\begin{eqnarray}
\mbox{Minimize:} &&
(\vec{v},b) = \arg\min \frac{1}{2}|\vec{v}|^2 + C\sum_{m=1}^M \xi_m
\label{eq:qpv1p1}\\
\mbox{Subject to:} && \xi_m = \max\left(0,1-y_m(\vec{v}^T\vec{x}_m-b)\right)
\label{eq:qpv1p2}
\end{eqnarray}
Equations~\ref{eq:qpv1p1} and~\ref{eq:qpv1p2} can be transformed into
the following equivalent quadratic programming problem: find the
coefficients $\alpha_m$ such that
\begin{equation}
\vec{v} = \sum_{m=1}^M \alpha_m \vec{x}_m
\label{eq:svm_v}
\end{equation}
where
\begin{eqnarray}
&& \alpha_m = \arg\min \frac{1}{2}\sum_m\sum_n\alpha_m \vec{x}_m'\vec{x}_n \alpha_n -
\sum_m y_m\alpha_m \\
\mbox{subject to...} &&\sum_{m=1}^M \alpha_m = 0,~~ 0 \le y_m\alpha_m \le C
\end{eqnarray}
By minimizing the structural risk (Eq.~\ref{eq:svm_criterion}) instead
of the empirical risk (Eq.~\ref{eq:empirical_risk}), the SVM avoids
over-training in situations where either (1) there are too few
training tokens, or (2) the observation vector has too many dimensions
for a standard classifier (LDA or neural network) to work well.
\centerline{\psfig{figure=linsvm_test.ps,width=2.5in}
\psfig{figure=linsvm_hist.ps,width=2.5in}\nop}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{From Linear to Nonlinear SVM}
Notice that, according to Eq.~\ref{eq:svm_v}, the optimum discriminant
vector $\vec{v}$ is a weighted combination of the training tokens.
Therefore the classifier function $h(\vec{x})$ can be written as
\begin{equation}
h(\vec{x}) = \mbox{sign}(\vec{v}^T\vec{x}-b) =
\mbox{sign}\left(-b+\sum_{m=1}^M \alpha_m \vec{x}_m^T\vec{x}\right)
\label{eq:svm_as_nonparametric}
\end{equation}
Eq.~\ref{eq:svm_as_nonparametric} shows that $h(\vec{x})$ depends only
on the dot-products between training vectors $\vec{x}_m$ and the
unknown test vector $\vec{x}$. But there is no reason why the
dot-product needs to be the only allowable way that we can combine
pairs of vectors. In fact, it is possible to use any symmetric,
positive-definite function $K(\vec{x}_m,\vec{x})$:
\begin{equation}
h(\vec{x}) =
\mbox{sign}\left(-b+\sum_{m=1}^M \alpha_m K(\vec{x}_m,\vec{x})\right)
\end{equation}
A common, flexible, and extremely useful case is the RBF support
vector machine, defined by the kernel function
\begin{equation}
K(\vec{x}_m,\vec{x}) = e^{-\gamma |\vec{x}_m-\vec{x}|^2}
\end{equation}
The resulting classifier function is
\begin{equation}
h(\vec{x}) = \mbox{sign}(g(\vec{x})-b)
\end{equation}
where the nonlinear discriminant function $g(\vec{x})$ is defined as
\begin{equation}
g(\vec{x}) = \sum_{m=1}^M \alpha_m K(\vec{x}_m,\vec{x})
\end{equation}
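A minimal Python sketch of this discriminant follows; the support
vectors, weights $\alpha_m$, and $\gamma$ are made-up illustrative
values, not the output of any real training run:

```python
import math

def rbf_kernel(xm, x, gamma):
    """K(x_m, x) = exp(-gamma |x_m - x|^2)."""
    d2 = sum((a - c) ** 2 for a, c in zip(xm, x))
    return math.exp(-gamma * d2)

def g(x, svs, alphas, gamma):
    """Nonlinear discriminant g(x) = sum_m alpha_m K(x_m, x)."""
    return sum(a * rbf_kernel(xm, x, gamma) for a, xm in zip(alphas, svs))

def h(x, svs, alphas, gamma, b=0.0):
    return 1 if g(x, svs, alphas, gamma) - b >= 0 else -1

# one positive and one negative support vector on a line: test points
# are labeled according to the nearer support vector
svs, alphas = [[0.0], [2.0]], [1.0, -1.0]
print(h([0.1], svs, alphas, gamma=1.0))  # -> 1
print(h([1.9], svs, alphas, gamma=1.0))  # -> -1
```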
A general RBF classifier is infinitely flexible: it is possible for an
RBF classifier to ``shatter'' a training corpus in an infinite number
of ways, therefore there is no theoretical upper bound on its VC
dimension. An RBF classifier trained using the SVM criterion
(Eq.~\ref{eq:ball_size}), however, has a constrained VC dimension, and
therefore it is possible to use the SVM training method to learn an
RBF classifier with very low generalization error:
\begin{equation}
D_{VC} \le f(|\vec{v}|^2) =
f\left(\sum_m\sum_n\alpha_m\alpha_n K(\vec{x}_m,\vec{x}_n)\right)
\end{equation}
Here is an example of the classification boundary computed by an RBF
classifier:
\centerline{\psfig{figure=rbf_boundary.ps,width=3in}\nop}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Practical Training of SVMs}
In practice, most people just use svm\_light or libSVM to train their
SVMs. One needs to choose the ``hyper-parameters'' $\gamma$ and $C$.
According to Eq.~\ref{eq:ball_size}, $C$ should be roughly $1/R$,
where $R$ is the radius of a ball containing all of the training
tokens. If the training tokens are first normalized using {\tt
svm-scale}, then the best value of $C$ should be somewhere around
$C\approx 1$. The RBF kernel function is $e^{-\gamma
|\vec{x}_m-\vec{x}|^2}$, so $\gamma$ should be on the order of one over
the typical value of $|\vec{x}_m-\vec{x}|^2$; if the data are first
normalized to unit standard deviation, or scaled to unit maximum
amplitude, then $\gamma$ should be on the order of $\gamma\sim 1/K$,
where $K$ is the dimension of the data vector. In practice, the best
values of $C$ and $\gamma$ may differ from $1$ and $1/K$ by one or two
orders of magnitude, therefore it is often useful to find the optimum
hyper-parameters by experimenting using held-out ``development test''
data.
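A coarse logarithmic grid search is the usual recipe (libSVM's python
tools automate exactly this). The sketch below is schematic only:
{\tt dev\_error} is a hypothetical stand-in for a real train-and-test
cycle on held-out data, invented here so the loop has something to
minimize:

```python
import math

def dev_error(C, gamma):
    """Hypothetical stand-in: in practice, train an SVM with (C, gamma)
    and return its error rate on the held-out development set."""
    return (math.log10(C) - 1.0) ** 2 + (math.log10(gamma) + 2.0) ** 2

# search powers of ten around the rules of thumb C ~ 1, gamma ~ 1/K:
grid = [(C, gam) for C in (0.01, 0.1, 1.0, 10.0, 100.0)
                 for gam in (1e-4, 1e-3, 1e-2, 1e-1, 1.0)]
best = min(grid, key=lambda cg: dev_error(*cg))
print(best)  # -> (10.0, 0.01)
```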
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Phonological Distinctive Features}
A phonological feature is a binary-valued label,
$y\in\left\{-1,1\right\}$, that distinguishes one set of phonemes from
another. Distinctive features were originally based on perceptual
characteristics of phonemes~\cite{JakFan52,MilNic55}, then were
redefined in terms of the articulatory or speech production
characteristics of phonemes~\cite{ChoHal68};
Stevens~\cite{Ste89,Ste99b} claims that, in order to be used commonly
by the languages of the world, a distinctive feature must have compact
articulatory correlates (compact=capable of being distinguished using
a low-dimensional classifier) and compact perceptual correlates.
Table~\ref{tab:features} lists one set of distinctive features that
may be useful. ``Sonorant'' is a sound that may be sung.
``Continuant'' is a sound with uninterrupted airflow through the vocal
tract. ``Labial'' is made with the lips. ``Anterior'' is made
anterior to the alveolar ridge. ``Front'' requires a fronted tongue
body position; this feature is only distinctive in English for glides
(\verb:[+sonorant,+continuant]: phonemes). ``Strident'' involves
high-amplitude turbulence, and ``voiced'' has vocal fold vibration;
these features are only distinctive in English for \verb:[-sonorant]:
consonants. Stevens~\cite{Ste99b} argues that /hh/ is
\verb:[+sonorant]:.
\begin{table}
\centerline{
\begin{tabular}{|l|ccccccc|}\hline
phone & sonorant & continuant & labial &
anterior & front & strident & voiced \\ \hline
w & + & + & + & + & - && \\
l & + & + & - & + & - && \\
r & + & + & - & - & - && \\
y & + & + & - & - & + && \\
m & + & - & + & + &&& \\
n & + & - & - & + &&& \\
ng & + & - & - & - &&& \\
b & - & - & + & + && - & + \\
d & - & - & - & + && - & + \\
jh & - & - & - & - && + & + \\
g & - & - & - & - && - & + \\
p & - & - & + & + && - & - \\
t & - & - & - & + && - & - \\
ch & - & - & - & - && + & - \\
k & - & - & - & - && - & - \\
f & - & + & + & + && - & - \\
th & - & + & - & + && - & - \\
s & - & + & - & + && + & - \\
sh & - & + & - & - && + & - \\
hh & - & + & - & - && - & - \\
v & - & + & + & + && - & + \\
dh & - & + & - & + && - & + \\
z & - & + & - & + && + & + \\
zh & - & + & - & - && + & + \\ \hline
\end{tabular}}
\caption{An example distinctive feature notation for the consonants of
English.}
\label{tab:features}
\end{table}
``Acoustic correlates'' of a distinctive feature are acoustic
measurements that can be used to determine whether a phoneme
is~\verb:[+feature]: or~\verb:[-feature]:. Phoneticians are divided
into two camps: (1) those who believe that each distinctive feature is
primarily cued by pinpoint measurements at particular times, in
particular frequency bands~\cite{BluSte79}, and (2) those who believe
that every possible acoustic measurement carries information about
every possible distinctive feature~\cite{Kew83}. The engineering
response to this debate is, as always, to conclude that both camps are
correct: every measurement informs us about almost every distinctive
feature, but some measurements are more informative than others (in
very precise terms: some measurements have higher mutual information
with the target distinctive feature than others).
Table~\ref{tab:correlates} lists some widely attested acoustic
correlates. With some practice, you can use this table to teach
yourself spectrogram reading. These acoustic correlates can also be
useful in automatic speech recognition, but they should usually
augment the mel-scale spectral observation vector, not replace it. In
this table, a ``formant'' is a resonant frequency of the vocal tract;
it shows up in the spectrogram as a thick fuzzy bar, like a horizontal
caterpillar. During most vowels, $250\le F_1\le 1000$Hz, $900\le
F_2\le 2400$Hz, $2200\le F_3\le 3000$Hz; F3 may or may not be
observable on spectrograms computed from telephone speech. The
``formant locus'' is the frequency that the formant would take right
at the instant of consonant closure or release, {\em if} you could
actually measure the formant at that time---often it is impossible to
actually measure the formant at that time, so you must interpolate the
formant backward in time from a following vowel, or forward in time
from a preceding vowel. All English consonants have an F1 locus of
about 200Hz. The F2 and F3 loci are variable but useful cues for
place of articulation (lips vs. blade vs. body). Other place cues
include stop burst spectrum (the spectrum of the noise that occurs at
$t=0$) and frication spectrum (the spectrum of noise during the closed
portion of a fricative). Stop consonant voicing, in English, is
primarily cued by voice onset time (VOT), the time delay between the
burst and the onset of vowel voicing.
\begin{table}
\begin{tabular}{|c|c|p{4in}|}\hline
FEATURE & CONTEXT & INFORMATIVE ACOUSTIC CORRELATES \\ \hline
sonorant & all & strong periodic voicing, with a total energy
that doesn't change much from frame to frame during consonant closure,
and with a spectral peak during closure between 250Hz and 1000Hz \\ \hline
continuant & all & high-frequency energy during closure (above 1000Hz)
is not more than 30dB below the low-frequency energy during closure,
and high-frequency energy is therefore visible in the spectrogram \\ \hline
lips & any & F2 and F3 formant loci are lower than the formant frequencies of
any preceding or following vowel, thus formants rise into a following
vowel, fall from a preceding vowel\\
& fricative & frication spectrum (during closure) has very low amplitude,
with roughly equal energy at all frequencies above 1000Hz \\
& stop release & burst spectrum (at t=0ms) has very low amplitude,
with roughly equal energy at all frequencies above 1000Hz \\
& stop release & VOT is shorter than predicted by voicing features \\ \hline
blade & any & formant loci are above 1600Hz \\ \hline
\end{tabular}
\caption{Some informative acoustic correlates of the distinctive
features in Table~\ref{tab:features}.}
\label{tab:correlates}
\end{table}

The lab exercises use \verb:VExtract: to extract labeled observation
vectors from the corpus. For example, the following command extracts
tokens for the phonemes /r/ and /l/ (class $+1$) and /s/ and /sh/
(class $-1$):
\begin{verbatim}
VExtract -T 1 -b -p /r/ /+1/ -p /l/ /+1/ -p /s/ /-1/ -p /sh/ /-1/ \
  -o group2.toks -I mlf/phn_wrd_tones.mlf -S scp/VExtract2.scp \
  &> group2.log;
\end{verbatim}
The command line options have the following meanings.
\begin{itemize}
\item
The ``-T'' option tells VExtract to output trace (debugging)
information. Specifically, it will output the frame number and
filename of every vector that it selects from the database, along with
the label it believes appropriate. The notation \verb:&> group1.log:
at the end causes all of the debugging information to be saved to the
file group1.log, provided that your unix system is using bash to parse
the command line. (If the command doesn't work, try typing ``bash''
first to get a bash shell, then re-enter the command.)
\item
Each of the ``-p'' options specifies a phoneme label that VExtract
should look for, and the class (+1 or -1) to which that phoneme
belongs. The ``-b'' option specifies that these phoneme labels should
be surrounded by word boundaries (whitespace or newline), e.g., so
that the symbol ``er'' will not count as equivalent to ``r''.
\item
The ``-o'' option specifies the SVM tokens file to which data should
be stored.
\item
The ``-S'' option specifies a script file that lists the PLP files
from which data should be extracted.
\item
The ``-I'' option specifies that VExtract should read through the file
\verb:mlf/phn_wrd_tones.mlf: in order to find example phonemes. Open
this file with a text editor, and take a look. This file contains
transcriptions of all of the ICSI Switchboard utterances, not just the
prosodically transcribed ones. If you want to see how the
prosodically transcribed ones come out, search through the file for
the symbol \%. This file was created by using \verb:rnc2mlf.pl: to
convert the ICSI Switchboard phn transcriptions, \verb:praat2mlf.pl:
to convert the Illinois Switchboard TOBI transcriptions, and
\verb:layermlfs.pl: to combine the two.
\end{itemize}
Read through the log files, \verb:group1.log: and \verb:group2.log:.
Open the MLF, \verb:phn_wrd_tones.mlf:. Check a few of the extracted
vector times, in order to make sure that they really do correspond to
the phoneme labels that you were trying to extract. Use \verb:HList:
to find one such frame in the original PLP file, and compare the
numbers in the PLP file with the numbers in \verb:group1.toks:, in
order to make sure that they are the same.
VExtract is not yet bug-free. Most likely, if you have followed
instructions so far, your files group1.toks and group2.toks have nans
in them (not-a-number entries). Open one with a text editor, and take
a look. NaNs will screw up libSVM or svm\_light, so you need to get
rid of them. Try the following, to set all nans to zero (and thereby
get them out of the way):
\begin{verbatim}
cp group1.toks foo.toks; sed 's/nan/0/g' foo.toks > group1.toks;
cp group2.toks foo.toks; sed 's/nan/0/g' foo.toks > group2.toks;
\end{verbatim}
Now, use libSVM to scale your tokens. First, decide which toks file
will be training data, and which will be testing data. Suppose that
you decide to use group2.toks as training data, and group1.toks as
testing data. In order to compile the scaling statistics, and scale
group2, enter
\begin{verbatim}
svm-scale -l -1 -u 1 -s scaling.stats group2.toks > g2_scaled.toks;
\end{verbatim}
Now, you can use the same scaling statistics (compiled from the
training tokens) in order to scale the testing tokens (because if
you're going to scale the data, you'd better scale both training and
testing data equally!!). Type
\begin{verbatim}
svm-scale -r scaling.stats group1.toks > g1_scaled.toks;
\end{verbatim}
... in order to read in scaling stats from \verb:scaling.stats:, and
apply them to group1.toks.
Train a linear SVM, using the default value of $C$. Type
\begin{verbatim}
svm-train -t 0 g2_scaled.toks my.svm
\end{verbatim}
Open the file \verb:my.svm: with a text editor. The header specifies
the kernel type (linear or RBF), the number of classes, and the number
of support vectors (the number of training vectors that are used to
compute the discriminant function). After the codeword \verb:SV:, the
support vectors themselves are stored.
Test the SVM on the training data, then on the independent test data:
\begin{verbatim}
svm-predict g2_scaled.toks my.svm g2.out;
svm-predict g1_scaled.toks my.svm g1.out;
\end{verbatim}
How well did you do?
Warning: I have found that, because of the very small size of this
training corpus, I sometimes get quite poor phoneme classification
performance. Please don't be discouraged if that happens to you...
\bibliography{/home/hasegawa/ref/references}
\bibliographystyle{plain}
\end{document}