\documentclass[]{article}
\usepackage{graphicx}\setlength{\oddsidemargin}{0in}
\setlength{\topmargin}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9in}
\title{Lecture 6: OpenFST}
\author{Lecturer: Mark Hasegawa-Johnson (jhasegaw@uiuc.edu)\\TA: Sarah
Borys (sborys@uiuc.edu)}
\begin{document}
\maketitle
\tableofcontents
\section{Introduction: Finite State Transducers}
A hidden Markov model can be viewed either as a finite state machine or
as a dynamic Bayesian network. This lecture discusses finite state
machines.

The previous lectures described how we can model each context-dependent
phone with a mixture Gaussian hidden Markov model. Typically, the
triphone states are clustered until we have perhaps 1000-2000 different
states. The possible transitions among states are specified by the
transition probability matrices of the triphone models. The possible
transitions among triphones are specified by the dictionary. The
possible transitions among words are specified by the language model.
During recognition (``search''), the recognizer must compute, at each
time step, the probabilities of the N highest paths leading up to time
$t$, $p(Q_1(t)),\ldots,p(Q_N(t))$, where $Q_i(t)$ is the path up to time
$t$ that has the $i$th highest probability:
\begin{equation}
Q_i(t) = \left[\begin{array}{cccc}
\mbox{state}(1) & \mbox{state}(2) & \ldots & \mbox{state}(t) \\
\mbox{phone}(1) & \mbox{phone}(2) & \ldots & \mbox{phone}(t) \\
\mbox{word}(1) & \mbox{word}(2) & \ldots & \mbox{word}(t)
\end{array}\right]
\label{eq:dynamicsearch}
\end{equation}

Dynamic search algorithms, like HVite, keep a vector of state
information at each time step, as suggested by
Eq.~\ref{eq:dynamicsearch}. Dynamic Bayesian networks provide a formal
representation of dynamic search, which we will discuss in lecture 7
(GMTK).

Static search algorithms, like HDecode and Julius, pre-compile all of
the information in Eq.~\ref{eq:dynamicsearch} into a single integer
state index at each time step. This is done by (1) creating search
graphs (finite state transducers, or FSTs) that describe the sequencing
information at each level of abstraction, e.g., one FST for the language
model, one for the dictionary, and one for each phone model, and (2)
composing the search graphs together.
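The composition step can be written compactly in the notation of
Mohri's tutorial~\cite{Mohri02}: if $H$ is the FST built from the HMM
phone models, $C$ the context-dependency transducer, $L$ the
pronunciation dictionary, and $G$ the language model, then the full
search graph is (roughly; practical recipes interleave determinization
and minimization with the compositions)
\begin{equation}
N = H \circ C \circ L \circ G
\end{equation}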

OpenFST (http://www.openfst.org/,~\cite{Allauzen07}) is a tool for
creating, composing, and optimizing finite state transducers. OpenFST
started as an open-source re-implementation of the AT\&T finite state
machine library (http://www.research.att.com/\~{}fsmtools/fsm/). OpenFST
and FSMlib use the same text file format to specify finite state
machines, so it's easy to mix and match the tools provided by the two
toolkits, in case you find a tool that works better in one toolkit than
the other. In particular, AT\&T has also released their finite-state
transducer based speech recognition decoder,
http://www.research.att.com/\~{}fsmtools/dcd/, which uses FSM networks
specified in the same format as FSMlib. Mohri's tutorial, based on the
AT\&T decoder, is still the best available introduction to the use of
FSTs in speech recognition~\cite{Mohri02}. It's probably also possible
to read OpenFST-optimized search graphs into Julius~\cite{Kawahara00},
but I haven't done that yet.
\section{Tutorial}
This entire section is copied almost verbatim from an FSMlib tutorial
presented in 2004 by Eric Fosler-Lussier
(http://www.cse.ohio-state.edu/\~{}fosler/) at the NAACL summer course at
Johns Hopkins University
(http://www.cs.cornell.edu/home/llee/naacl/summer-school/04/;
http://www.clsp.jhu.edu/workshops/). I have changed program names and
command-line options to their OpenFST equivalents, and checked to make
sure that the commands run.
\subsection{Shake-and-bake language generation}
To get acquainted with OpenFST, we'll make a little language
generator. We'll generate ``sentences'' based on parts of speech and
fill in the lexical items randomly.
Open up your favorite editor, and type in the following:
\begin{verbatim}
0 1 DET
1 2 N
2 3 V
3 4 DET
4 5 N
5
\end{verbatim}
This means ``from state 0 to state 1, we have a DET'' (a determiner),
and so forth; the ``5'' by itself indicates that state 5 is a final
state. The first state mentioned in the file (0) is the initial state.
DET stands for determiner, N stands for noun, and V stands for verb.
Save the file as sent.fsa.txt. Now, open another file (pos.voc) into
which we'll put the part-of-speech vocabulary. This is a file that
gives mappings from symbols to integers. The epsilon symbol (-) should
always be symbol 0.
\begin{verbatim}
- 0
DET 1
N 2
V 3
\end{verbatim}
Save this file (pos.voc).
The first file gives the textual representation of the fsa. To compile
it into binary form, use the following command:
\begin{verbatim}
fstcompile --acceptor --isymbols=pos.voc sent.fsa.txt > sent.fsa
\end{verbatim}
The \verb:--acceptor: option tells it that this is a finite state
acceptor (only one symbol per edge) rather than a finite state
transducer (two symbols per edge: one input, one output).
You can print it out into text form using
\begin{verbatim}
fstprint --isymbols=pos.voc sent.fsa
\end{verbatim}
or you can draw it using the following commands:
\begin{verbatim}
fstdraw --isymbols=pos.voc sent.fsa | dot -Tps > sent.ps
\end{verbatim}
then you can view it using any PostScript viewer, e.g., evince, gv,
ghostview, or Acrobat. \verb:dot: is part of the graphviz package,
which can be installed under Ubuntu (for example) by typing
\begin{verbatim}
sudo apt-get install graphviz
\end{verbatim}
The resulting plot of the finite state transducer should look like
Fig.~\ref{fig:sent}.
\begin{figure}
\centerline{\includegraphics[angle=-90,width=5in]{sent.ps}}
\caption{Finite state acceptor: a regular grammar specifying the parts
of speech (POS) that compose a sentence.}\label{fig:sent}
\end{figure}
Now, create a second file that maps parts of speech to words. For example:
\begin{verbatim}
0 0 DET the
0 0 DET a
0 0 N cat
0 0 N dog
0 0 V chased
0 0 V bit
0
\end{verbatim}
Notice that this file specifies a finite state transducer (each edge has
two symbols: an input part of speech symbol, and an output word symbol).
Notice also that there is only one state in this FST: state 0. Every
edge is a self-loop from state 0 to itself; they differ only in the edge
labels (Fig.~\ref{fig:dict}).
\begin{figure}
\centerline{\includegraphics[angle=-90,width=1in]{dict.ps}}
\caption{Finite state transducer: a dictionary mapping POS to words.}
\label{fig:dict}
\end{figure}
Save this file as dict.fst.txt. You'll need to create a second
vocabulary file for the words:
\begin{verbatim}
- 0
a 1
the 2
cat 3
dog 4
chased 5
bit 6
\end{verbatim}
Save this as word.voc.
Now you can compile the transducer. This time, fstcompile is called
without the \verb:--acceptor: option, so that each edge carries both an
input and an output symbol:
\begin{verbatim}
fstcompile --isymbols=pos.voc --osymbols=word.voc dict.fst.txt > dict.fst
\end{verbatim}
Compose the two together and you get all possible strings in the
language.
\begin{verbatim}
fstcompose sent.fsa dict.fst > strings.fst
fstdraw --isymbols=pos.voc --osymbols=word.voc strings.fst | dot -Tps > strings.ps
ghostview strings.ps
\end{verbatim}
To get a random string from this language, use
\begin{verbatim}
fstrandgen strings.fst | fstproject --project_output |
fstprint --acceptor --isymbols=word.voc |
awk 'BEGIN{printf("\n")}{printf("%s ",$3)}END{printf("\n")}'
\end{verbatim}
\begin{figure}
\centerline{\includegraphics[angle=-90,width=5in]{strings.ps}}
\caption{Finite state transducer: a regular grammar specifying a
language in which there are 32 possible sentences.}
\label{fig:strings}
\end{figure}
To do:
\begin{enumerate}
\item Add more vocabulary to the POS/word map. What is the silliest
sentence you can create?
\item How would you handle ``The dog barked''?
\end{enumerate}
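For the second question, one possible answer (a sketch, not the only
one): treat ``barked'' as a verb that takes no object, so that a
sentence may also end immediately after the verb. In the text format,
that just means listing state 3 as an additional final state (you would
also need to add ``barked'' as a V entry in the POS-to-word map and in
word.voc):
\begin{verbatim}
0 1 DET
1 2 N
2 3 V
3 4 DET
4 5 N
3
5
\end{verbatim}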
\subsection{Cryptography}
The idea of this task is to try and decipher some encrypted text. You
have intercepted five messages that you believe are using the same
(simple) substitution cipher. Each letter is transformed into one and
only one other letter. (You may see this type of puzzle in the newspaper
as the ``Cryptoquip''). For example, if you believe that ``B'' is ``E''
in one instance, it is ``E'' in all other instances in the message, and
there is no other letter that can be changed to ``E''. To figure out
how to break this code, we will use some known information about the
statistics of letter frequencies in English. Then we will build a
decoder (similar to a decoder in a speech recognition system) to try to
figure out the code. The models we will use will include
\begin{itemize}
\item A finite state transducer to map cryptletters to letter pairs
\item A finite state transducer to map letter pairs to regular (unencrypted) letters
\item A finite state transducer to map letters to words
\item A finite state automaton describing the input text
\item A finite state automaton describing the output text
\end{itemize}
\subsubsection{Unpack the distribution}
Download crypto.tar.gz from the web page, then execute the following
command:
\begin{verbatim}
tar -xzf crypto.tar.gz
\end{verbatim}
This will create a directory crypto. In the directory ``data'' you will
see the file crypttext, which contains the text you need to break.
\subsubsection{Gather statistics}
The main thing that makes this system run is that the statistics of the
letters in the encrypted message are assumed to be similar to those of
general English. If you look in the file data/nyt-letterstats.txt, you
will see statistics of letters from the New York Times (in June 2002),
where the text was split into roughly 200-character chunks from which
the mean and standard deviation of the frequency of each letter were
computed. We now need to do something similar for the crypttext. Run
the following script:
\begin{verbatim}
scripts/get-letterstats.pl data/crypttext > data/cryptstats
\end{verbatim}
You should get a file out with the probability of each letter in the
crypttext. You should satisfy yourself that the result is somewhat close
to right.
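If you want to sanity-check the result, raw letter counts can be
computed with standard Unix tools alone. The following is only a rough
cross-check (the exact output format of get-letterstats.pl may differ):

```shell
# Count letter frequencies in a text, most frequent first.
# Only a rough cross-check for scripts/get-letterstats.pl.
count_letters () {
  tr -cd 'A-Za-z' |  # keep letters only
    tr 'a-z' 'A-Z' | # fold everything to upper case
    fold -w1 |       # print one letter per line
    sort | uniq -c | sort -rn
}
# Example: count_letters < data/crypttext
```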
\subsubsection{Build a letter-to-word FST}
This is where scripting skills will come in handy. First, let's make a
transducer that can convert the letters B O X into the word BOX. Open up
a text editor and enter the following transducer (BOX.fst.txt):
\begin{verbatim}
0 1 B -
1 2 O -
2 3 X BOX
3
\end{verbatim}
Save the file and compile it:
\begin{verbatim}
fstcompile --isymbols=data/letter.voc --osymbols=data/word.voc BOX.fst.txt > BOX.fst
\end{verbatim}
You can test to see if it's working by composing it with test1.fsa
(which contains B O X).
\begin{verbatim}
fstcompile --isymbols=data/letter.voc data/test1.fsa.txt |
fstcompose - BOX.fst |
fstprint --isymbols=data/letter.voc --osymbols=data/word.voc
\end{verbatim}
If you get nothing back, then there's a bug somewhere. Now, for the
scripting part: write a little script that takes an argument and creates
the appropriate FST for that word. For example, if you called the script
``my\_word\_generator'', then by calling
\begin{verbatim}
my_word_generator BOX
\end{verbatim}
you should get out exactly the FST above. If you need help with this,
contact me or the TA.
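In case it helps, here is a minimal sketch of such a script, written as
a shell function; it assumes a single-word argument whose letters all
appear in data/letter.voc:

```shell
# my_word_generator: print the FST text for one word on stdout.
# Usage: my_word_generator BOX > BOX.fst.txt
my_word_generator () {
  word=$1
  len=${#word}
  i=0
  while [ "$i" -lt "$len" ]; do
    letter=$(printf '%s\n' "$word" | cut -c$((i + 1)))
    if [ $((i + 1)) -eq "$len" ]; then
      out=$word   # the last arc emits the whole word
    else
      out=-       # earlier arcs emit epsilon
    fi
    echo "$i $((i + 1)) $letter $out"
    i=$((i + 1))
  done
  echo "$len"     # mark the last state as final
}
```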
\subsubsection{Build a dictionary}
Make a directory ``dict''. Remember that a dictionary is just the union
of a whole bunch of individual words. Write another script that takes
every word in data/wordlist, creates an FST for each word in the
wordlist, and puts it in the dict directory. You may want to look at
scripts/gen-allwords.sh for an example. Gather all of the FSTs into one
big dictionary by taking the union:
\begin{verbatim}
for txt in `ls dict/*.fst.txt`; do
  fst=`echo $txt | sed 's/\.txt//'`;
  fstcompile --isymbols=data/letter.voc --osymbols=data/word.voc $txt > $fst;
done
cp dict/ABLE.fst dict.fst;
for fst in `ls dict/*.fst`; do
fstunion dict.fst $fst > tmp.fst
mv tmp.fst dict.fst;
done
\end{verbatim}
Take a look to see how big the resulting FST is by using
\verb:fstinfo dict.fst:. We can compact it by determinizing the FST:
\begin{verbatim}
fstdeterminize dict.fst > dict-det.fst
\end{verbatim}
You now have an FST which can convert one sequence of letters into one
word. To get word sequences, you need to concatenate that with a word
boundary marker and then take the closure. First, create a file
``pound.fst.txt'' which has the following contents:
\begin{verbatim}
0 1 # #
1
\end{verbatim}
Compile this into a transducer:
\begin{verbatim}
fstcompile --isymbols=data/letter.voc --osymbols=data/word.voc \
  pound.fst.txt > pound.fst
\end{verbatim}
Now do the concatenation and closure. Here, I've also included some
determinization and minimization. It turns out that the transducer
itself is not determinizable, but there is a workaround: encode the
transducer as an automaton, do the determinization/minimization, and
then turn it back into a transducer. Note how you can pipe all of these
FSTs from one program to the next, performing a sequence of operations:
\begin{verbatim}
rm -f x.fst
fstconcat dict-det.fst pound.fst | fstclosure |
fstrmepsilon | fstencode --encode_labels - x.fst | fstdeterminize |
fstminimize | fstencode --decode --encode_labels - x.fst > dictstar.fst
rm -f x.fst
\end{verbatim}
\subsubsection{Generate the cryptletter-to-real letter mapping}
Running the following script will compute $P(\mbox{cryptletter} |
\mbox{actualletter})$. What happens is that we take the frequency of
each cryptletter and compare it against the Gaussian distribution for
each actual letter. For example, the letter ``E'' (in real text) occurs
roughly 12\% of the time. The cryptletters ``A'' ``X'' and ``U'' might
be good candidates for ``E''.
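The script's exact formula is in scripts/generate-likelihoods.pl, but a
plausible form for the substitution cost is the negative Gaussian log
likelihood of the observed cryptletter frequency under each real
letter's model: if cryptletter $c$ occurs with frequency $f_c$, and
real letter $a$ has mean frequency $\mu_a$ and standard deviation
$\sigma_a$ in the newspaper data, then
\begin{equation}
\mbox{cost}(c\rightarrow a) = -\log{\cal N}(f_c;\mu_a,\sigma_a^2)
= \frac{(f_c-\mu_a)^2}{2\sigma_a^2} + \log\sigma_a + \frac{1}{2}\log 2\pi
\end{equation}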
The script \verb:generate-likelihoods.pl: will take two statistics
files, plus an optional hypothesis file. The hypothesis file gives a
guess as to a cryptletter/real-letter combination. Initially you don't
have a hypothesis, so there is no hypothesis file.
\begin{verbatim}
scripts/generate-likelihoods.pl data/nyt-letterstats.txt \
  data/cryptstats > subs1.fst.txt
fstcompile --isymbols=data/letter.voc --osymbols=data/pairs.voc \
  subs1.fst.txt > subs1.fst
\end{verbatim}
\subsubsection{Generate a crypttext word sequence for input}
You now need to generate, for each sentence, an FSA that represents the
sentence. You can extend your my\_word\_generator script to do this; make
sure that a ``\#'' sign appears after each word. You also don't need to
output words (i.e., this doesn't need to be a transducer). Here's one
way to do it. First, you should create five files, with one line in each:
\begin{verbatim}
split -l 1 data/crypttext crypt_
\end{verbatim}
This will create crypt\_aa through crypt\_ae.
Now, generate your fsa:
\begin{verbatim}
scripts/gen-sent.sh crypt_aa > crypt_aa.fsa.txt
fstcompile --acceptor --isymbols=data/letter.voc crypt_aa.fsa.txt > crypt_aa.fsa
\end{verbatim}
Now, if you compose these together, you'll get the weighted graph of
all possible words. It's easier to look at if you just project to the
output words (i.e., get rid of the input letters).
\begin{verbatim}
fstcompile --isymbols=data/pairs.voc --osymbols=data/letter.voc \
  data/pair2real.fst.txt > pair2real.fst
fstcompose crypt_aa.fsa subs1.fst | fstcompose - pair2real.fst |
fstcompose - dictstar.fst | fstproject --project_output | fstrmepsilon |
fstdeterminize | fstminimize |
fstprint --isymbols=data/word.voc --acceptor | less
\end{verbatim}
Or, if you want to look at it graphically
\begin{verbatim}
fstcompose crypt_aa.fsa subs1.fst | fstcompose - pair2real.fst |
fstcompose - dictstar.fst | fstproject --project_output | fstrmepsilon |
fstdeterminize | fstminimize | fstdraw --acceptor --isymbols=data/word.voc |
dot -Tps > crypt_aa.ps
evince crypt_aa.ps
\end{verbatim}
Yikes! It's huge!
Try this out with the other sentences (crypt\_ab through crypt\_ae).
You can also put some pruning into the loop\ldots{} this means that
you're not guaranteed to get the right answer, but it can help you
guess. Try pruning away paths whose cost is more than 1 worse than the
best path:
\begin{verbatim}
fstcompose crypt_aa.fsa subs1.fst | fstcompose - pair2real.fst |
fstcompose - dictstar.fst | fstproject --project_output | fstrmepsilon |
fstdeterminize | fstminimize | fstprune --weight=1 |
fstprint --acceptor --isymbols=data/word.voc | less
\end{verbatim}
You'll notice that in one of the files you only get one possible answer
for a word mapping somewhere in the file. This gives you some
constraints that you can use as a first guess (see below).
You can also look at the best path by doing
\begin{verbatim}
fstcompose crypt_aa.fsa subs1.fst | fstcompose - pair2real.fst |
fstcompose - dictstar.fst | fstshortestpath |
fstdraw --isymbols=data/letter.voc --osymbols=data/word.voc | dot -Tps > crypt_aa_bp.ps
\end{verbatim}
Notice that this doesn't end up making much sense. Why? All of the
letter decisions are made independently at this point; we don't have a
constraint that says ``if you choose T for E, then you always choose T
for E''. We'll deal with that a bit later, in the section on path
constraints. When you figure out the one word that has no alternative,
you'll want to put the corresponding letters into a hypothesis file as
a guess. To get the corresponding letters out, take the file you found,
and run (making sure to replace XX with the appropriate file)
\begin{verbatim}
fstcompose crypt_XX.fsa subs1.fst | fstproject --project_output > tmp.fsa
fstcompose tmp.fsa data/pair2real.fst | fstcompose - dictstar.fst |
fstshortestpath |
fstprint --isymbols=data/pairs.voc --osymbols=data/word.voc | less
\end{verbatim}
Extract the letter pairs and put them into a file guess1. For example,
if you believe XEF means CAT, put into the file
\begin{verbatim}
XC
EA
FT
\end{verbatim}
Now, generate a second substitution fst, with your new guesses:
\begin{verbatim}
scripts/generate-likelihoods.pl data/nyt-letterstats.txt \
  data/cryptstats guess1 > subs2.fst.txt
fstcompile --isymbols=data/letter.voc --osymbols=data/pairs.voc \
  subs2.fst.txt > subs2.fst
\end{verbatim}
Repeat (with subs2.fst instead of subs1.fst) until you have a complete
mapping. The second round should be much better than the first, and you
should have most of it by the third round.
\subsubsection{Path constraints}
OK, you've solved it, but now what? Well, it turns out that we could
have made the problem easier to solve by adding more constraints. One
constraint is that the letter substitutions have to apply to the whole
string. Theoretically, you can do this with an automaton over the pairs
alphabet. For example, if X is really E, let PAIR be the set of all
pairs that do not start with X. You can write this language constraint
as:
\begin{equation}
\mbox{PAIR}^* \,\mbox{XE}\, (\mbox{PAIR}\cup\mbox{XE})^*
\end{equation}
which says that XE may occur in the string (and must occur at least
once), but XA (for example) can't. Of course, we can write a similar
constraint for XA, XB, etc. If you take the union of all 26 of these
constraints, then you end up with the language that says ``X must pair
with one and only one of the 26 letters''. Of course, this doesn't
say that only one cryptletter can be converted into E. For that
constraint, let PAIR2 be the set of all pairs that do not end with E.
\begin{equation}
\mbox{PAIR2}^* \,\mbox{XE}\, (\mbox{PAIR2}\cup\mbox{XE})^*
\end{equation}
says that only X (and no other letter) can be converted into E. These
are the ``backward'' constraints, whereas the previous constraints are
the ``forward'' constraints.
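Putting the two families together (the notation here is mine, not the
scripts'): write $F_x$ for the forward constraint on cryptletter $x$
(the 26-way union described above) and $B_y$ for the backward
constraint on real letter $y$. The complete substitution-cipher
constraint is then the intersection
\begin{equation}
\mbox{CONSTRAINT} = \left(\bigcap_{x={\rm A}}^{\rm Z} F_x\right) \cap
\left(\bigcap_{y={\rm A}}^{\rm Z} B_y\right)
\end{equation}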
By intersecting all 26 forward and all 26 backward constraints, in
theory you can impose the complete constraints on the language.
However, the state space becomes huge if you try this directly. On the
plus side, though, we can intersect each one of the constraints in
turn, following each intersection with determinization and
minimization, which often keeps the resulting FSTs from growing too
large. To try this out, first find the valid pairs of letters according
to the word models, and then apply the constraints to these pairs.
\begin{verbatim}
# link to the forward/backward constraints
ln -s /export/fosler/crypto/fwdconstr .
ln -s /export/fosler/crypto/revconstr .
# determine the pairs that can possibly arise
fstcompose crypt_aa.fsa subs1.fst | fstproject --project_output > tmp1.fsa
# eliminate the pairs that don't make valid words
fstcompose tmp1.fsa data/pair2real.fst | fstcompose - dictstar.fst |
fstproject > tmp2.fsa
# apply the constraints by intersecting one at a time
scripts/do-constraints.pl tmp2.fsa > tmp3.fsa
\end{verbatim}
You will find, if you use \verb:fstinfo: on tmp2.fsa and tmp3.fsa, that
the latter is much smaller. Now, if you compose with the dictionary
again, you'll see that there are many fewer hypotheses:
\begin{verbatim}
fstcompose tmp3.fsa data/pair2real.fst | fstcompose - dictstar.fst |
fstprint --isymbols=data/pairs.voc --osymbols=data/word.voc | less
\end{verbatim}
Try this with some of the other sample sentences. What happens if you
concatenate all five sentences together? To think about: what other
types of constraints could you put into the system to make the decoding
process faster (in terms of the number of iterations)?
\bibliographystyle{plain} \bibliography{/home/jhasegaw/ref/references}
\end{document}