Exercise: Using a Hidden Markov Model


Description: In this exercise, we use a hidden Markov model (HMM) as a model of word generation from part-of-speech sequences. We will:

To turn in: Do everything suggested in the directions, and answer all questions that are asked. A question is anything that ends in a question mark!

Credits: The HMM package we are using in this exercise was written by Tapas Kanungo, and this exercise, the accompanying scripts, etc. were written by by Philip Resnik.

Prerequisites: This exercise assumes basic familiarity with typical Unix commands, and the ability to create text files (e.g. using a text editor such as vi or emacs). No programming is required.

Notational Convention: The symbols <== will be used to identify a comment from the instructor, on lines where you're typing something in. So, for example, in

    
    %  cp file1.txt file2.txt   <== The "cp" is short for "copy"
  

what you're supposed to type at the prompt (identified by the percent sign, here) is

    cp file1.txt file2.txt
  
followed by a carriage return.


Getting Logged In and Getting the Code

  1. Log in to your course account.

  2. Use anonymous ftp to get the code:
      % ftp umiacs.umd.edu        <== Invoke the "ftp" program
    
      Name (yourname): anonymous  <==   Type "anonymous" (without quotes)
      Password: name@address      <==   Type your e-mail address
    
      ftp> cd pub/resnik/723      <==   Go to directory pub/resnik/hmm
      ftp> binary                 <==   Tell ftp to use binary transfer mode
      ftp> get hmm.tar            <==   Download the file containing exercise code
      ftp> get hmm_src.tar        <==   Download the file containing the HMM source code
      ftp> bye                    <==   Exit from ftp
    
      % tar xvf hmm.tar           <== Extract code from the file
      % rm hmm.tar                <== Delete the file (to conserve space)
      % mv hmm_src.tar hmm        <== Move the tarfile into the subdirectory
      % cd hmm                    <== Go into hmm subdirectory
      % pickperl /usr/imports/bin/perl `which perl`  <== get correct perl (note backquotes!)
      % chmod u+x *.pl            <== Make perl programs executable
    

  3. Execute uname -sr to see what operating system you're running. If it's SunOS or Solaris, you can use the executable programs that are part of the tarfile; skip to the next step. If not, you'll need to do the extra step of downloading and compiling the HMM source code. Here's how:

    First, execute the following lines:

      % tar xvf hmm_src.tar       <== Extract the HMM source code
      % rm hmm_src.tar            <== Delete the file (to conserve space)
      % cd src                    <== Go into source subdirectory
    
    Do one of the following to update the makefile:

  4. Now that the makefile is updated, do the following:
      % make                         <== Compile files.  You can ignore warnings.
      % mv esthmm genseq testvit ..  <== Move executables to parent directory
    
    Say yes if you're asked whether it's ok to overwrite the existing files with the same names -- you're replacing the Solaris executables with the executables that are appropriate for the machine you're on.

  5. Now go into directory example0:
        
        %  cd ../example0
      
  6. Run link to make the HMM programs available for running in this directory:
        %  link 
      


Creating the Training Data

  1. Look at file example0.train, which contains a very small corpus generated using a very simple English-like grammar.
        
        %  more example0.train 
      
    Alternatively, open the file in an editor. What parts of speech would you say are represented in this file? Write down a list of parts of speech and the words you think belong to each one. Notice that the same word can appear in multiple parts of speech.

  2. Turn example0.train into training input for the HMM code, by running the create_key.pl program. This reads in the training file and converts each unique word into a unique number, since the HMM program uses numbers rather than symbols. For example, "the tall man saw the short man" might be translated into the sequence "1 2 3 4 1 5 3".
        %  create_key.pl example0.key < example0.train > example0.seq
      
    Take a look at file example0.key, which now contains the mapping from words to their corresponding numbers, and at file example0.seq, which now contains the training input represented as numbers rather than words. The "T=" value at the top of example0.seq tells you how many symbols there are in the training sequence. (In this case the number should be 590.)


Training the HMM

  1. To train the model we will use the esthmm program. This program needs to know the number of symbols in the alphabet of the HMM (that is, symbols that can be emitted); you can obtain this by typing
        %  wc -l example0.key
      
    In this case the number of symbols is 13. The number of states is something you can choose. Recall that for a HMM-based part-of-speech model like this one, each state corresponds to a part of speech (i.e. a syntactic category like noun, verb, etc.), so the number of states to choose corresponds to the number of parts of speech you believe are represented in the corpus. In this case, let's use 6 states. We run the esthmm program as follows:
        %  esthmm 6 13 example0.seq > example0.hmm
      
    This creates file example0.hmm, which contains the trained model. Depending on the computer you're running this on, this might take a minute.

  2. Edit file example0.hmm -- it's not the easiest thing in the world to read, but you can see how the model is represented there. At the top are specified the number of states and the number of symbols (M=13 symbols, N=6 states). Then you have the complete A matrix, i.e. the 6-by-6 transition probability matrix. Next you have the 6-by-13 emission probability matrix, B. Finally you have the pi vector, giving initial probabilities for the 6 states.

  3. Delete the first line of example0.hmm, that is, the line containing the words "num iterations". If you don't do so, you'll get errors that look like "Bad format at A in example0.hmm" later on.


Inspecting the Model

  1. To view the model you've just created a bit more readably, keeping only the most useful information, run the viewhmm.pl program. This shows the state-to-state transition probabilities when they are above a certain threshold probability, since very low transition probabilities probably don't tell you much about the structure of the model. Similarly, for the emission probabilities, it shows only the symbols emitted with a sufficiently high probability. (And it shows them as words, not numbers, for easier readability.) Run viewhmm.pl as follows:
        %  viewhmm.pl example0.hmm example0.key 0.01 0.01 
      
    Chances are it scrolls by too fast, you you may want to do this instead:
        %  viewhmm.pl example0.hmm example0.key 0.01 0.01 | more
      
    Or, you might want to save the output to a file:
        %  viewhmm.pl example0.hmm example0.key 0.01 0.01 > example0.view
      

  2. Based on what you see when you run viewhmm.pl, draw, on paper, a transition diagram for this HMM. That is, write each state as a node labeled by the state number, and write probabilities on the arcs from state to state.

  3. Now, based on what you see when you run viewhmm.pl, label each state in your transition diagram with a part of speech. How good a match is there between your intuitions, earlier, and the way the model has automatically decided which are the high-probability symbols for each state? If there are mismatches, are they linguistically interesting?


Generating Sentences at Random from the Model

  1. Using your transition diagram and the output of viewhmm.pl, start at the the most likely start state, and write down the most likely symbol to be emitted there. (Break ties at random.) Then follow the most likely arc to the next state, and write down the most likely symbol to be emitted there. Continue in this fashion until you emit a punctuation mark, or until you get bored.

  2. Now we'll have the computer do this same process. Since it doesn't ever get bored, we'll have to tell it how many symbols to emit before it stops -- say, 100. To do this, run the genseq program:
        %  genseq example0.hmm 100
      
    Notice that the output isn't very readable, since the program generates symbols as numbers. We can take that output, though, and run it through the ints2words.pl program to replace the numbers with the corresponding words:
        %  genseq example0.hmm 100 | ints2words.pl example0.key
      


Finding the Hidden State Sequence for a Test Sentence

  1. Let's create a sentence to use as input to the model. First, create a file example0.test.words containing the word sequence that you got when you ran through the model state by state by hand, above. You can do this using an editor, or you can do it by typing
        %  cat > example0.test.words
      
    then typing the sentence in, hitting return, and then typing control-D. Don't forget to make sure everything is lowercase, and make sure the punctuation mark is a separate word, not attached to the last word of the sentence.

  2. Turn the file you've just created into the right format for the HMM programs, by running words2seq.pl program as follows:
        %  words2seq.pl example0.key < example0.test.words > example0.test
      
    Examine file example0.test to see what the input sequence looks like.

  3. Find the sequence of hidden states most likely to have generated the symbol sequence in example0.test, by using the testvit program:
        %  testvit example0.hmm example0.test
      
    The program reports the probability of the symbol sequence, given the most likely sequence of states, and it also reports the optimal state sequence.

  4. Take that optimal state sequence, and replace each state number with the part-of-speech label you assigned to that state.

  5. Congratulations! You've just done some HMM-based part-of-speech tagging. Does the sequence of parts of speech correspond to what you expected? Explain.


Time for Fun

Now that you've gone through this exercise, we move on to further exploration:
  1. Try getting the state sequence (part-of-speech tags) for some more sentences -- you can look at example0.key to see what words you're allowed to use. To speed things up, you can skip the step of creating example0.test.words by executing:
      %  cat | words2seq.pl example0.key > example0.testA
    
    Typing your sentence in, hitting return, and then typing control-D to end and create file example0.testA. (You can use suffixes testA, testB, etc. for new sentences.) As before, don't forget to make sure everything is lowercase, and make sure punctuation are separated from other words by spaces. Once you've created example0.testA, run the testvit program on that file, as described above.

  2. Go back to "Training the HMM", but this time increase the number of states to 7 or 8 or 9. Go through the rest of the exercise of labeling the resulting states with part-of-speech tags. What does the model do when it has more states to play with, for this training set? What do you think are are some possible consequences of having more states, in terms of the model's ability to tag accurately, and in terms of the linguistic facts the model captures?

  3. Go back to "Training the HMM", but this time decrease the number of states to, say, 3 or 4. Same questions as in the previous paragraph: what are the consequences?

  4. Let's look at some more interesting data, using a more interesting English-like language. Go into the example1 directory:
      %  cd ../example1
      
    Now do the following:

    1. Run the link program as described earlier. This time, though, you won't do the "Creating the Training Data" or "Training the HMM" steps, because the training would take too long. (Could be a few hours, depending on the machine.) Instead, go straight to "Inspecting the Model". Notice that this is a bigger HMM with a richer language: there are 36 symbols in the vocabulary, and the HMM has 12 states.

    2. Do some of the same steps as we did for example0, particularly assigning a part-of-speech label by hand to each state, and generating some text at random using the genseq.pl program.

      • If you're familiar with context-free grammars, examine example1.train and construct (and turn in) a context-free grammar that generates this language, or something close to it.
      • If you're not already familiar with context free grammars, look at *grammar1* in file gen.lisp in your hmm directory: the format of the grammar is pretty easy to interpret; e.g. it says that an noun phrase NP can consist of a determiner (DET) followed by a noun (N), or it consist of a DET followed by an adjective (ADJ) followed by a noun (N), etc. In what ways does the language described by this grammar diverge from English?

    3. Looking at the random text you generated using the genseq.pl program, do you see some sentences that could not have been generated by the context-free grammar in the previous question? Why do those sentences get generated by the HMM but not the CFG?