Exercise: Using a Hidden Markov Model
Description:
In this exercise, we use a hidden Markov model (HMM) as a model of word
generation from part-of-speech sequences. We will:
- Train an HMM on a sample of English-like text
- Inspect the resulting model
- Generate sentences at random from the model
- Create test sentences and find the most likely hidden state sequence
To turn in: Do everything suggested in the
directions, and answer all questions that are asked. A question is
anything that ends in a question mark!
Credits: The HMM
package we are using in this exercise was written by Tapas Kanungo, and this
exercise, the accompanying scripts, etc. were written by Philip Resnik.
Prerequisites: This exercise assumes basic
familiarity with typical Unix commands, and the ability to create text
files (e.g. using a text editor such as vi or
emacs). No programming is required.
Notational Convention: The symbol <== will be used to
identify a comment from the instructor, on lines where you're typing
something in. So, for example, in
% cp file1.txt file2.txt <== The "cp" is short for "copy"
what you're supposed to type at the prompt (identified by the percent
sign, here) is
cp file1.txt file2.txt
followed by a carriage return.
Getting Logged In and Getting the Code
- Log in to your course account.
- Use anonymous ftp to get the code:
% ftp umiacs.umd.edu <== Invoke the "ftp" program
Name (yourname): anonymous <== Type "anonymous" (without quotes)
Password: name@address <== Type your e-mail address
ftp> cd pub/resnik/723 <== Go to directory pub/resnik/723
ftp> binary <== Tell ftp to use binary transfer mode
ftp> get hmm.tar <== Download the file containing exercise code
ftp> get hmm_src.tar <== Download the file containing the HMM source code
ftp> bye <== Exit from ftp
% tar xvf hmm.tar <== Extract code from the file
% rm hmm.tar <== Delete the file (to conserve space)
% mv hmm_src.tar hmm <== Move the tarfile into the subdirectory
% cd hmm <== Go into hmm subdirectory
% pickperl /usr/imports/bin/perl `which perl` <== get correct perl (note backquotes!)
% chmod u+x *.pl <== Make perl programs executable
- Execute uname -sr to see what operating system
you're running. If it's SunOS or Solaris, you can use the executable
programs that are part of the tarfile; skip to the next step.
If not, you'll need to do the
extra step of downloading and compiling the HMM source code. Here's
how:
First, execute the following lines:
% tar xvf hmm_src.tar <== Extract the HMM source code
% rm hmm_src.tar <== Delete the file (to conserve space)
% cd src <== Go into source subdirectory
Do one of the following to update the makefile:
- Execute 'make depend'
- Edit Makefile to delete every occurrence of
/usr/include/sys/feature_tests.h -- it's a Solaris-specific
header that won't exist on the machine you're using.
- Now that the makefile is updated, do the following:
% make <== Compile files. You can ignore warnings.
% mv esthmm genseq testvit .. <== Move executables to parent directory
Say yes if you're asked whether it's ok to overwrite the existing
files with the same names -- you're replacing the Solaris
executables with the executables that are appropriate for the
machine you're on.
- Now go into directory example0:
% cd ../example0
- Run link to make the HMM programs available for
running in this directory:
% link
Creating the Training Data
- Look at file example0.train, which contains a very small corpus generated
using a very simple English-like grammar.
% more example0.train
Alternatively, open the file in an editor. What parts of speech would
you say are represented in this file? Write down a list of parts of
speech and the words you think belong to each one. Notice that the
same word can appear in multiple parts of speech.
- Turn example0.train into training input for the HMM code, by
running the create_key.pl program. This reads in the
training file and converts each unique word into a unique number,
since the HMM program uses numbers rather than symbols. For example,
"the tall man saw the short man" might be translated into the sequence
"1 2 3 4 1 5 3".
% create_key.pl example0.key < example0.train > example0.seq
Take a look at file example0.key, which now contains the mapping from
words to their corresponding numbers, and at file example0.seq, which
now contains the training input represented as numbers rather than
words. The "T=" value at the top of example0.seq tells you how many
symbols there are in the training sequence. (In this case the number
should be 590.)
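If you're curious how this word-to-number conversion works, here is a
minimal sketch of the idea in Python -- not the actual create_key.pl
code, just an illustration of the mapping it builds:

# Sketch of the word-to-number conversion (not the actual create_key.pl code).
# Each distinct word gets the next unused integer, starting at 1.
def words_to_numbers(tokens):
    key = {}                           # word -> integer
    sequence = []
    for word in tokens:
        if word not in key:
            key[word] = len(key) + 1   # first unseen word gets 1, the next 2, ...
        sequence.append(key[word])
    return key, sequence

key, seq = words_to_numbers("the tall man saw the short man".split())
print(key)   # {'the': 1, 'tall': 2, 'man': 3, 'saw': 4, 'short': 5}
print(seq)   # [1, 2, 3, 4, 1, 5, 3]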
Training the HMM
- To train the model we will use the esthmm program.
This program needs to know the number of symbols in the alphabet of
the HMM (that is, symbols that can be emitted); you can obtain this by
typing
% wc -l example0.key
In this case the number of symbols is 13. The number of states is
something you can choose. Recall that for an HMM-based part-of-speech
model like this one, each state corresponds to a part of speech
(i.e. a syntactic category like noun, verb, etc.), so the number of
states to choose corresponds to the number of parts of speech you
believe are represented in the corpus. In this case, let's use 6
states. We run the esthmm program as follows:
% esthmm 6 13 example0.seq > example0.hmm
This creates file example0.hmm, which contains the trained model.
Depending on the computer you're running this on, this might take
a minute.
- Open file example0.hmm in an editor -- it's not the easiest
thing in the world to read, but you can see how the model is
represented there. At the top are specified the number of states and
the number of symbols (M=13 symbols, N=6 states). Then you have the
complete A matrix, i.e. the 6-by-6 transition probability matrix.
Next you have the 6-by-13 emission probability matrix, B. Finally you
have the pi vector, giving initial probabilities for the 6 states.
(A small code sketch of these three pieces appears at the end of this
section.)
- Delete the first line of example0.hmm, that is, the
line containing the words "num iterations". If you don't do so,
you'll get errors that look like "Bad format at A in example0.hmm"
later on.
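If it helps to picture what example0.hmm contains, here is a tiny
sketch, in Python with numpy, of the three pieces that make up an HMM.
The numbers are made up for a toy 2-state, 3-symbol model; example0.hmm
stores the same kind of A, B, and pi, with N=6 and M=13.

import numpy as np

# Made-up toy model: N=2 states, M=3 symbols (NOT the numbers in example0.hmm).
A  = np.array([[0.7, 0.3],            # A[i][j] = P(next state j | current state i)
               [0.4, 0.6]])           # each row sums to 1
B  = np.array([[0.5, 0.4, 0.1],       # B[i][k] = P(emit symbol k | state i)
               [0.1, 0.2, 0.7]])      # each row sums to 1
pi = np.array([0.9, 0.1])             # pi[i] = P(the first state is i); sums to 1

# Sanity checks that any well-formed model should pass:
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)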
Inspecting the Model
- To view the model you've just created a bit more readably,
keeping only the most useful information, run the
viewhmm.pl program. This shows the state-to-state
transition probabilities when they are above a certain threshold
probability, since very low transition probabilities probably don't
tell you much about the structure of the model. Similarly, for the
emission probabilities, it shows only the symbols emitted with a
sufficiently high probability. (And it shows them as words, not
numbers, for easier readability.) A rough code sketch of this
thresholding idea appears at the end of this section. Run
viewhmm.pl as follows:
% viewhmm.pl example0.hmm example0.key 0.01 0.01
Chances are it will scroll by too fast, so you may want to do this
instead:
% viewhmm.pl example0.hmm example0.key 0.01 0.01 | more
Or, you might want to save the output to a file:
% viewhmm.pl example0.hmm example0.key 0.01 0.01 > example0.view
- Based on what you see when you run viewhmm.pl,
draw, on paper, a transition diagram for this HMM. That is, write
each state as a node labeled by the state number, and write
probabilities on the arcs from state to state.
- Now, based on what you see when you run
viewhmm.pl, label each state in your transition
diagram with a part of speech. How good a match is there between your
intuitions, earlier, and the way the model has automatically decided
which are the high-probability symbols for each state? If there are
mismatches, are they linguistically interesting?
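As promised above, here is a rough sketch of the thresholding idea
behind viewhmm.pl -- this is not the actual script, and it assumes the
A matrix, B matrix, and a number-to-word key have already been loaded
(the toy values below are made up):

import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])
int2word = {1: "the", 2: "dog", 3: "barked"}    # made-up key

def view(A, B, int2word, trans_threshold, emit_threshold):
    # Print only the transitions above the transition threshold ...
    for i, row in enumerate(A, start=1):
        for j, p in enumerate(row, start=1):
            if p >= trans_threshold:
                print("state %d -> state %d: %.3f" % (i, j, p))
    # ... and only the emissions above the emission threshold, shown as words.
    for i, row in enumerate(B, start=1):
        for k, p in enumerate(row, start=1):
            if p >= emit_threshold:
                print("state %d emits '%s': %.3f" % (i, int2word[k], p))

view(A, B, int2word, 0.01, 0.01)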
Generating Sentences at Random from the Model
- Using your transition diagram and the output of
viewhmm.pl, start at the most likely start state,
and write down the most likely symbol to be emitted there. (Break
ties at random.) Then follow the most likely arc to the next state,
and write down the most likely symbol to be emitted there.
Continue in this fashion until you emit a punctuation mark, or until
you get bored. (A code sketch of this by-hand procedure, and of
genseq's random generation, appears at the end of this section.)
- Now we'll have the computer do this same process. Since it
doesn't ever get bored, we'll have to tell it how many symbols to emit
before it stops -- say, 100. To do this, run the
genseq program:
% genseq example0.hmm 100
Notice that the output isn't very readable, since the program
generates symbols as numbers. We can take that output, though, and
run it through the ints2words.pl program to replace
the numbers with the corresponding words:
% genseq example0.hmm 100 | ints2words.pl example0.key
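Here is a sketch, in Python, of both generation strategies: the
"most likely at every step" walk you just did by hand, and the random
sampling that genseq performs. This is not the actual genseq code, and
the toy model and key below are made up; the real program reads them
from example0.hmm and example0.key.

import numpy as np

# Made-up toy model and key, for illustration only.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])
pi = np.array([0.9, 0.1])
int2word = {1: "the", 2: "dog", 3: "barked"}

def greedy_walk(A, B, pi, length):
    # The by-hand procedure: always take the most likely option.
    state = int(np.argmax(pi))
    symbols = []
    for _ in range(length):
        symbols.append(int(np.argmax(B[state])) + 1)   # most likely symbol here
        state = int(np.argmax(A[state]))                # most likely next state
    return symbols

def random_walk(A, B, pi, length, rng=np.random.default_rng()):
    # What genseq does, conceptually: sample states and symbols at random.
    n_states, n_symbols = B.shape
    state = rng.choice(n_states, p=pi)
    symbols = []
    for _ in range(length):
        symbols.append(int(rng.choice(n_symbols, p=B[state])) + 1)
        state = rng.choice(n_states, p=A[state])
    return symbols

print(" ".join(int2word[s] for s in greedy_walk(A, B, pi, 5)))
print(" ".join(int2word[s] for s in random_walk(A, B, pi, 5)))

The difference to notice: the greedy walk always produces the same
output, while random sampling can produce any sequence the model gives
nonzero probability, which is why repeated runs of genseq look different.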
Finding the Hidden State Sequence for a Test Sentence
- Let's create a sentence to use as input to the model. First,
create a file example0.test.words containing the word
sequence that you got when you ran through the model state by state by
hand, above. You can do this using an editor, or you can do it by
typing
% cat > example0.test.words
then typing the sentence in, hitting return, and then typing
control-D. Don't forget to make sure everything is lowercase, and make
sure the punctuation mark is a separate word, not attached to the last
word of the sentence.
- Turn the file you've just created into the right format for the
HMM programs, by running the words2seq.pl program as follows:
% words2seq.pl example0.key < example0.test.words > example0.test
Examine file example0.test to see what the input sequence
looks like.
- Find the sequence of hidden states most likely to have generated
the symbol sequence in example0.test, by using the
testvit program:
% testvit example0.hmm example0.test
The program reports the probability of the symbol sequence, given the
most likely sequence of states, and it also reports the optimal state
sequence. (A sketch of the Viterbi computation that testvit performs
appears at the end of this section.)
- Take that optimal state sequence, and replace each state number
with the part-of-speech label you assigned to that state.
- Congratulations! You've just done some HMM-based part-of-speech
tagging. Does the sequence of parts of speech correspond to what you
expected? Explain.
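For the curious, here is a compact sketch of the Viterbi computation
that testvit performs. This is a standard textbook formulation, not
the actual testvit source; it assumes a model (A, B, pi) and a symbol
sequence, and returns the most likely state sequence along with its
probability. The toy model at the bottom is made up.

import numpy as np

def viterbi(A, B, pi, observations):
    # observations are 1-based symbol numbers, as in example0.seq / example0.test.
    n_states = len(pi)
    T = len(observations)
    delta = np.zeros((T, n_states))    # best probability of any path ending in state j at time t
    back  = np.zeros((T, n_states), dtype=int)

    delta[0] = pi * B[:, observations[0] - 1]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * A[:, j]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] * B[j, observations[t] - 1]

    # Trace back the best path (states numbered from 1 here).
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [s + 1 for s in path], float(delta[-1].max())

# Usage with a made-up 2-state, 3-symbol model:
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])
pi = np.array([0.9, 0.1])
states, prob = viterbi(A, B, pi, [1, 2, 3])
print(states, prob)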
Time for Fun
Now that you've gone through this exercise, we move on to
further exploration:
- Try getting the state sequence (part-of-speech tags) for some
more sentences -- you can look at example0.key to see what words
you're allowed to use. To speed things up, you can skip the step of
creating example0.test.words by executing:
% cat | words2seq.pl example0.key > example0.testA
Then type your sentence in, hit return, and type control-D to end the
input; this creates file example0.testA. (You can use suffixes testA,
testB, etc. for new sentences.) As before, don't forget to make sure
everything is lowercase, and make sure punctuation marks are separated
from other words by spaces. Once you've created example0.testA, run the
testvit program on that file, as described above.
- Go back to "Training the HMM", but this time increase the number
of states to 7 or 8 or 9. Go through the rest of the exercise of
labeling the resulting states with part-of-speech tags. What does the
model do when it has more states to play with, for this training set?
What do you think are some possible consequences of having more
states, in terms of the model's ability to tag accurately, and in
terms of the linguistic facts the model captures?
- Go back to "Training the HMM", but this time decrease the number
of states to, say, 3 or 4. Same questions as in the previous
paragraph: what are the consequences?
- Let's look at some more interesting data, using a more
interesting English-like language. Go into the
example1 directory:
% cd ../example1
Now do the following:
- Run the link program as described earlier. This
time, though, you won't do the "Creating the Training Data"
or "Training the HMM" steps, because the training would take too long.
(Could be a few hours, depending on the machine.) Instead, go
straight to "Inspecting the Model". Notice that this is a bigger HMM
with a richer language: there are 36 symbols in the vocabulary, and
the HMM has 12 states.
- Do some of the same steps as we did for example0, particularly
assigning a part-of-speech label by hand to each state, and generating
some text at random using the genseq program.
- If you're familiar with context-free grammars,
examine example1.train and construct (and turn in)
a context-free grammar that generates this language, or
something close to it.
- If you're not already familiar
with context-free grammars, look at *grammar1* in file gen.lisp
in your hmm directory: the format of the grammar is pretty
easy to interpret; e.g. it says that a noun phrase (NP)
can consist of a determiner (DET) followed by a noun (N),
or it can consist of a DET followed by an adjective (ADJ) followed
by a noun (N), etc. In what ways does the language described
by this grammar diverge from English? (A small made-up grammar of
this general kind appears in code form at the end of the exercise.)
- Looking at the random text you generated using the
genseq program, do you see some sentences
that could not have been generated by the context-free grammar
in the previous question?
Why do those sentences get generated by the HMM but not the CFG?
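If it helps to have a concrete point of comparison for the last two
questions, here is a sketch of generating sentences from a small
context-free grammar in Python. The grammar below is a made-up
fragment, NOT the actual *grammar1* from gen.lisp; the point is only
that a CFG expands whole constituents, while the HMM moves from state
to state with no memory beyond its current state.

import random

# A made-up CFG fragment (not the actual *grammar1* from gen.lisp).
grammar = {
    "S":   [["NP", "VP", "."]],
    "NP":  [["DET", "N"], ["DET", "ADJ", "N"]],
    "VP":  [["V"], ["V", "NP"]],
    "DET": [["the"], ["a"]],
    "ADJ": [["tall"], ["short"]],
    "N":   [["man"], ["dog"]],
    "V":   [["saw"], ["ran"]],
}

def generate(symbol="S"):
    if symbol not in grammar:                    # a terminal, i.e. a word
        return [symbol]
    expansion = random.choice(grammar[symbol])   # pick one rule for this symbol
    return [word for child in expansion for word in generate(child)]

print(" ".join(generate()))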