Bigram Counts and Association Statistics


Description: In this exercise, we apply basic counts and some association statistics to a small corpus. We will:

To turn in: You should do everything that's suggested in this assignment and answer all questions -- if a sentence ends with a question mark, it's a question you need to answer.

Credits: The scripts used in this exercise, written by Philip Resnik, were derived from Ken Church's "NGRAMS" tutorial at ACL-1995. The likelihood ratio code was adapted from code written by Ted Dunning.

Prerequisites: This exercise assumes basic familiarity with typical Unix commands, and the ability to create text files (e.g. using a text editor such as vi or emacs). No programming is required.

Notational Convention: The symbols <== will be used to identify a comment from the instructor, on lines where you're typing something in. So, for example, in

    
    %  cp file1.txt file2.txt   <== The "cp" is short for "copy"
  

what you're supposed to type at the prompt (identified by the percent sign, here) is

    cp file1.txt file2.txt
  
followed by a carriage return.


Getting the code

  1. Log in to your course account.

  2. You will use ftp (file transfer protocol) to get the software for this exercise. You'll be downloading source code and compiling it.

    Here are the steps:

      % mkdir stats               <== Create a subdirectory called "stats"
      % cd stats                  <== Go into that directory
      % ftp umiacs.umd.edu        <== Invoke the "ftp" program
    
      Name (yourname): anonymous    <==   Type "anonymous" (without quotes)
      Password: name@address        <==   Type your e-mail address
    
      ftp> cd pub/resnik/723        <==   Go to directory pub/resnik/723
      ftp> binary                   <==   USe binary transfer mode
      ftp> get ngrams.tar          <==   Download the file
      ftp> bye                      <==   Exit from ftp
    
      % tar xvf ngrams.tar       <== Extract code from the file
      % rm ngrams.tar            <== Delete to conserve space
      % cd ngrams                   <==   Go into the directory with the code
      % chmod u+x *.pl            <== Make perl scripts executable
      % gcc -o filter_stopwords filter_stopwords.c   <== Compile filter_stopwords
      % gcc -o lr_simple lr_simple.c -lm             <== Compile lr_simple
    
    Don't forget the -lm flag when you're compiling lr_simple! If you forget it you'll get a warning that the log function is unresolved, since that flag tells the compiler to link to the built-in math library.

    Execute the command

      % which perl
    
    to see the full path name for the version of perl you're running on your system. Edit file lr_filter.pl and replace /usr/imports/bin/perl with that path name (on most Unix systems you'll be replacing it with /usr/local/bin/perl).


    Generating Statistics for a Corpus

    1. Take a look at file corpora/GEN.EN. You can do this as follows:
        %  more corpora/GEN.EN
      
      (Type spacebar for more pages, and "q" for "quit".) This contains an annotated version of the book of Genesis, King James version. It is a small corpus, by current standards -- somewhere on the order of 40,000 or 50,000 words. What words (unigrams) would you expect to have high frequency in this corpus? What bigrams do you think might be frequent?

    2. Create a subdirectory called genesis to contain the files with statistics generated from this corpus:
        %  mkdir genesis
      

      Then run the Stats program to analyze the corpus. The program requires an input file, and a "prefix" to be used in creating output files. The input file will be corpora/GEN.EN, and the prefix will be genesis/out, so that output files will be created in the genesis subdirectory. That is, you should execute the following:

        %  Stats corpora/GEN.EN genesis/out
      
      The program will tell you what it's doing, as it counts unigrams, counts bigrams, computes mutual information, and computes likelihood ratio statistics. Depending on the machine you're working on, this may take differing amount of time to run, but it should be less than 5 minutes for all but the DEC5000 machines, which are VERY slow! (It takes around 15-20 minutes on those machines.)

    3. If you're on a slow machine, or don't care about doing your own computing, you can skip the above step (or hit control-C to cancel it), and do the following to download a copy of the results:
        % ftp umiacs.umd.edu        <== Invoke the "ftp" program
      
        Name (yourname): anonymous    <==   Type "anonymous" (without quotes)
        Password: name@address        <==   Type your e-mail address
      
        ftp> cd pub/resnik/723        <==   Go to directory pub/resnik/723
        ftp> binary                   <==   Use binary transfer mode
        ftp> get ngram_out.tar        <==   Download the file 
        ftp> bye                      <==   Exit from ftp
      
        % tar xvf ngram_out.tar
      

    4. Whether you computed yourself or downloaded the results, you should now have a subdirectory called genesis containing a bunch of files that begin with out.


    Examining Unigram and Bigram Counts

    1. Go into directory genesis.
        %  cd genesis
      

    2. Look at file out.unigrams:
        %  more out.unigrams
      
      Seeing the vocabulary in alphabetical order isn't very useful, so let's sort the file by the unigram frequency, from highest to lowest:
        %  sort -nr out.unigrams > out.unigrams.sorted
        %  more out.unigrams.sorted
      
      Now examine out.unigrams.sorted. Note that v (verse), c (chapter), id, and GEN are part of the markup in file GEN.EN, for identifying verse boundaries. Other than those (which are a good example of why we need pre-processing to handle markup), are the high frequency words what you would expect?

    3. Analogously, look at the bigram counts out.bigrams:
        %  sort -nr out.bigrams > out.bigrams.sorted
        %  more out.bigrams.sorted
      
      Markup aside, again, are the high frequency bigrams what you would expect?


    Getting Quantitative with Mutual Information

    1. Now let's look at mutual information. File out.mi contains bigrams sorted by mutual information value. Each line contains:

      1. I(wordX,wordY)
      2. freq(wordX)
      3. freq(wordY)
      4. freq(wordX,wordY)
      5. wordX
      6. wordY

      Low-frequency bigrams (bigram count less than 5) were excluded.

      As an exercise, compute mutual information by hand for the first bigram on the list, "savoury meat". Recall that

         I(x,y)  =  log2 [p(x,y)/(p(x)p(y))]
      
      and that the simplest estimates of probabilities, the maximum likelihood estimates, are given by
        p(x)   = freq(x)/N
        p(y)   = freq(y)/N
        p(x,y) = freq(x,y)/N
      
      where N is the number of observed words in the corpus, 44850. (You can get this by counting the words in file out.words; it's also what you get by summing the frequencies in either out.unigrams or out.bigrams.)

      You can get a calculator on your screen on some systems (at least sunos and solaris) by executing:

        %  xcalc &
      
      Here's a sequence you can use to do the calculation:
        Compute p(savoury)         = freq(savoury)/N
        Compute p(meat)            = freq(meat)/N
        Compute p(savoury meat)    = freq(savoury,meat)/N
        Compute p(savoury)p(meat)  = p(savoury) * p(meat)
        Divide p(savoury,meat) by this value
        Take the log of the result (which in xcalc is log to the base 10)
        Convert that result to log base-2 by dividing by 0.30103 
          This uses the fact that for all M, N: logM(x) = logN(x)/logN(M).
      
      At some point, the calculator may give you scientific notation for a number. If you need to enter a number in scientific notation, you use EE:
         EE         Used for entering exponential numbers.  For example
                    to get "-2.3E-4" you'd enter "2 . 3 +/- EE 4 +/-".   
      
      The number you some up with should be close to the mutual information reported in out.mi. It may be slightly different because your calculation used different precision than the program's.

    2. As you've just seen, probabilities can be very low numbers. All the more so when using n-grams for n=3 or above! Underflow can be a problem in these sorts of calculations: when the probabilities are too low, they exceed the representational capacity of the computer. For this reason it's very common to do such calculations using the logs of the probability values (often called "log probabilities" or "logprobs"), using the following handy identities:
        log(a * b) = log(a) + log(b)
        log(a / b) = log(a) - log(b)
      
      Try converting the formula for mutual information using these identities so that probabilities are never multiplied or divided, before reading further.

      Solution: log[p(x,y)/p(x)p(y)] = log p(x,y) - log p(x) + log p(y)

      To really get a feel for things, first first substitute the maximum likelihood estimates in and then convert to using log probabilities, i.e.

      log[ (freq(x,y)/N)/(freq(x)/N)(freq(y)/N) ]


    Examining the Results

    1. Look at out.mi and the bigrams selected by mutual information as being strongly associated. What do you think of them? Notice how very many of them are low-frequency bigrams: it's well known that mutual information has overly high values for bigrams of low frequency, i.e. it reports word pairs as associated when they probably are not really that strongly associated after all.

    2. Compare this to out.lr, where the leftmost column is the likelihood ratio. There are a lot of common words of English in there, so try filtering those out using the filter_stopwords program. First, access the program so it's easy to run in this directory:
        %  ln -s ../filter_stopwords      <== Creates a symbolic link
        %  ln -s ../stop.wrd              <== Creates a symbolic link
      
      Then run it:
        %  filter_stopwords stop.wrd < out.lr > out.lr.filtered
      

      How does out.lr.filtered look as a file containing bigrams that are characteristic of this corpus?


    Time for Fun

    1. One thing you may have noticed is that there's more data sparseness because uppercase and lowercase are distinct, e.g. "Door" is treated as a different word from "door". In the corpora directory, you can create an all- lowercase version of GEN.EN by doing this:
        %   cat GEN.EN | tr "A-Z" "a-z" > GEN.EN.lc
      
      To save disk space, assuming you're done with GEN.EN, delete the original:
        %   rm GEN.EN
      
      Try re-doing the entire exercise with this version. (Yes, all of it!) What, if anything, changes?

    2. Ok, perhaps that last one wasn't exactly fun. But this probably will be. Go into your corpora subdirectory. Then ftp to site umiacs.umd.edu and go to directory pub/resnik/723/ebooks. Type dir to look at several options, all of them Sherlock Holmes stories: A Study in Scarlet, The Hound of the Baskervilles, or Adventures of Sherlock Holmes, the last of which contains 12 different stories. Get one or all of them. E.g.:
        % cd corpora                
        % ftp umiacs.umd.edu        
      
        Name (yourname): anonymous  
        Password: name@address      
      
        ftp> cd pub/resnik/723/ebooks
        ftp> dir
        ftp> get adventures.dyl        <== Choose one or more
        ftp> get hound.dyl       
        ftp> get study.dyl
        ftp> bye                       <==   Exit from ftp
      

      Now get back into your stats directory, create an output directory, say, holmes1, and run the Stats program for the file of interest, e.g.:

        %   cd ..
        %   mkdir holmes1
        %   Stats corpora/study.dyl holmes1/out
        %   cd holmes1
      

      Or perhaps convert to lowercase before running Stats:

        %   cd corpora
        %   cat study.dyl | tr "A-Z" "a-z" > study.lc
        %   rm study.dyl
        %   cat hound.dyl | tr "A-Z" "a-z" > hound.lc
        %   rm hound.dyl
        %   cat adventures.dyl | tr "A-Z" "a-z" > adventures.lc
        %   rm adventures.dyl
        %   cd ..
      

      Look at out.lr, etc. for this corpus. Now go through the same process again, but creating a directory holmes2 and using a different file. Same author, same main character, same genre... how do the high-association bigrams compare between the two cases? If you use filter_stopwords, how do the results look -- what kinds of bigrams are you getting? What natural language processing problems might this be useful for?


    When You are Done, We Need to Recover Disk Space

    Corpora and data take up a lot of disk space. When you are done, PLEASE delete the output directories you have created, and even the corpus directory itself if you no longer need it. For example, if you are in your stats directory, you can type:
      %   /bin/rm -rf corpora genesis holmes1 holmes2 holmes3
    
    to delete the entire directories. Your housekeeping will be much appreciated.