Description: In this exercise, we apply basic counts and some association statistics to a small corpus. We will count unigrams, count bigrams, compute mutual information, and compute likelihood ratio statistics.
Prerequisites: This exercise assumes basic familiarity with typical Unix commands, and the ability to create text files (e.g. using a text editor such as vi or emacs). No programming is required.
Notational Convention: The symbol <== will be used to identify a comment from the instructor, on lines where you're typing something in. So, for example, in
% cp file1.txt file2.txt <== The "cp" is short for "copy"
what you're supposed to type at the prompt (identified by the percent sign, here) is
cp file1.txt file2.txt

followed by a carriage return.
Change your password by executing:

% passwd

For security reasons, if you have not changed your password by the beginning of class on July 10, your account will be cancelled. (Note that the prompt % may be different on your system; for example, another frequently seen prompt is [thismachine], where "thismachine" is the name of your workstation, or simply ">".)
Check the front of the machine to see whether it's a Sun or a DEC, and execute uname -sr to see what operating system it's running. In the AVW1453 Lab, you should use the DEC 3000 machines if at all possible, since they're a LOT faster.
If you are on...             Substitute this where it says tarfile.tar...
Sun, Solaris 5.5             solaris.tar
Sun, SunOS 4.x               sunos.tar
DECstation 3000 (alpha)      dec3000.tar
DECstation 5000              dec5000.tar
Here are the steps:
% uname -sr                 <== See what OS you're running
% mkdir stats               <== Create a subdirectory called "stats"
% cd stats                  <== Go into that directory
% ftp umiacs.umd.edu        <== Invoke the "ftp" program
Name (yourname): anonymous  <== Type "anonymous" (without quotes)
Password: name@address      <== Type your e-mail address
ftp> cd pub/resnik/723      <== Go to directory pub/resnik/723
ftp> binary                 <== Use binary transfer mode
ftp> get tarfile.tar        <== Download the file
                            <== (Substituting the appropriate
                            <== tarfile name for your machine --
                            <== see above!)
ftp> bye                    <== Exit from ftp
% tar xvf tarfile.tar       <== Extract code from the file
% rm tarfile.tar            <== Delete to conserve space
% chmod u+x *.pl            <== Make perl scripts executable

So, for example, on a DEC 3000, you would do:
% mkdir stats               <== Create a subdirectory called "stats"
% cd stats                  <== Go into that directory
% ftp umiacs.umd.edu        <== Invoke the "ftp" program
Name (yourname): anonymous  <== Type "anonymous" (without quotes)
Password: name@address      <== Type your e-mail address
ftp> cd pub/resnik/723      <== Go to directory pub/resnik/723
ftp> binary                 <== Use binary transfer mode
ftp> get dec3000.tar        <== Download the file
ftp> bye                    <== Exit from ftp
% tar xvf dec3000.tar       <== Extract code from the file
% rm dec3000.tar            <== Delete to conserve space
% chmod u+x *.pl            <== Make perl scripts executable
% more corpora/GEN.EN

(Type spacebar for more pages, and "q" for "quit".) This contains an annotated version of the book of Genesis, King James Version. It is a small corpus by current standards -- somewhere on the order of 40,000 or 50,000 words. What words (unigrams) would you expect to have high frequency in this corpus? What bigrams do you think might be frequent?
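If you'd like to check your guesses quickly before running the full program, a few lines of Perl will do a rough count. This is only an illustrative sketch, not part of the distributed code (the script name count_words.pl is made up); it splits on whitespace, so markup tokens get counted along with everything else:

    #!/usr/bin/perl
    # count_words.pl -- quick-and-dirty unigram counts
    # (hypothetical sketch; not one of the distributed scripts)
    while (<>) {
        $count{$_}++ for split;    # split each line on whitespace
    }
    # print words from most to least frequent
    for my $w (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$count{$w}\t$w\n";
    }

You would run it as:

% perl count_words.pl corpora/GEN.EN | more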
First, create a directory for the output files:

% mkdir genesis
Then run the Stats program to analyze the corpus. The program requires an input file, and a "prefix" to be used in creating output files. The input file will be corpora/GEN.EN, and the prefix will be genesis/out, so that output files will be created in the genesis subdirectory. That is, you should execute the following:
% Stats corpora/GEN.EN genesis/out

The program will tell you what it's doing as it counts unigrams, counts bigrams, computes mutual information, and computes likelihood ratio statistics. Depending on the machine you're working on, this may take differing amounts of time to run, but it should be less than 5 minutes for all but the DEC 5000 machines, which are VERY slow! (It takes around 15-20 minutes on those machines.)
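Stats itself is treated as a black box in this exercise, but the heart of its bigram-counting step can be sketched in a few lines of Perl. This is a simplified illustration, assuming plain whitespace tokenization; the real program also computes the association statistics and produces its own output format:

    #!/usr/bin/perl
    # count_bigrams.pl -- simplified sketch of bigram counting
    # (hypothetical name; Stats' actual internals may differ)
    my @words;
    while (<>) {
        push @words, split;    # accumulate all tokens, across lines
    }
    # count each adjacent pair of words
    for my $i (0 .. $#words - 1) {
        $bigram{"$words[$i] $words[$i+1]"}++;
    }
    # print bigrams from most to least frequent
    for my $b (sort { $bigram{$b} <=> $bigram{$a} } keys %bigram) {
        print "$bigram{$b}\t$b\n";
    }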
If for some reason you can't run the program, you can download a tar file containing the output files instead:

% ftp umiacs.umd.edu        <== Invoke the "ftp" program
Name (yourname): anonymous  <== Type "anonymous" (without quotes)
Password: name@address      <== Type your e-mail address
ftp> cd pub/resnik/723      <== Go to directory pub/resnik/723
ftp> binary                 <== Use binary transfer mode
ftp> get ngram_out.tar      <== Download the file
ftp> bye                    <== Exit from ftp
% tar xvf ngram_out.tar     <== Extract the output files
Now go into the output directory and look at the unigram counts:

% cd genesis
% more out.unigrams

Seeing the vocabulary in alphabetical order isn't very useful, so let's sort the file by unigram frequency, from highest to lowest:
% sort -nr out.unigrams > out.unigrams.sorted
% more out.unigrams.sorted

Now examine out.unigrams.sorted. Note that v (verse), c (chapter), id, and GEN are part of the markup in file GEN.EN, for identifying verse boundaries. Other than those (which are a good example of why we need pre-processing to handle markup), are the high frequency words what you would expect?
Do the same for the bigrams:

% sort -nr out.bigrams > out.bigrams.sorted
% more out.bigrams.sorted

Markup aside, again: are the high frequency bigrams what you would expect?
Now look at out.mi, which ranks bigrams by mutual information. Low-frequency bigrams (bigram count less than 5) were excluded.
As an exercise, compute mutual information by hand for the first bigram on the list, "savoury meat". Recall that
I(x,y) = log2 [ p(x,y) / (p(x)p(y)) ]

and that the simplest estimates of probabilities, the maximum likelihood estimates, are given by
p(x)   = freq(x)/N
p(y)   = freq(y)/N
p(x,y) = freq(x,y)/N

where N is the number of observed words in the corpus, 44850. (You can get this by counting the words in file out.words; it's also what you get by summing the frequencies in either out.unigrams or out.bigrams.)
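If you'd like to verify N yourself, a Perl one-liner can sum the frequency column of out.unigrams. This assumes the frequency is the first whitespace-separated field on each line, which is consistent with the numeric sort used above:

% perl -lane '$n += $F[0]; END { print $n }' out.unigrams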
You can get a calculator on your screen on some systems (at least SunOS and Solaris) by executing:
% xcalc &

Here's a sequence you can use to do the calculation:
1. Compute p(savoury) = freq(savoury)/N
2. Compute p(meat) = freq(meat)/N
3. Compute p(savoury meat) = freq(savoury,meat)/N
4. Compute p(savoury)p(meat) = p(savoury) * p(meat)
5. Divide p(savoury,meat) by this value
6. Take the log of the result (which in xcalc is log to the base 10)
7. Convert that result to log base 2 by dividing by 0.30103, since
   log10(2) is approximately 0.30103. This uses the fact that for all
   M, N: logM(x) = logN(x)/logN(M).

At some point, the calculator may give you scientific notation for a number. If you need to enter a number in scientific notation, you use EE:
EE   Used for entering exponential numbers. For example, to get
     "-2.3E-4" you'd enter "2 . 3 +/- EE 4 +/-".

The number you come up with should be close to the mutual information reported in out.mi. It may be slightly different, because your calculation used different precision than the program's.
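If you'd rather check your arithmetic programmatically, the same computation can be scripted. The sketch below is a hypothetical helper, not part of the distribution; it assumes each line of out.unigrams is "freq word" and each line of out.bigrams is "freq word1 word2" (again consistent with the numeric sorts above), and that you run it from inside the genesis directory:

    #!/usr/bin/perl
    # mi_check.pl -- recompute I(x,y) for one bigram (hypothetical sketch)
    # Usage: perl mi_check.pl savoury meat
    my ($x, $y) = @ARGV;
    my $N = 44850;                  # corpus size, from the text above
    my ($fx, $fy, $fxy);
    open(UNI, "out.unigrams") or die "can't open out.unigrams: $!";
    while (<UNI>) {
        my ($f, $w) = split;        # assumed format: freq word
        $fx = $f if $w eq $x;
        $fy = $f if $w eq $y;
    }
    close UNI;
    open(BI, "out.bigrams") or die "can't open out.bigrams: $!";
    while (<BI>) {
        my ($f, $w1, $w2) = split;  # assumed format: freq word1 word2
        $fxy = $f if $w1 eq $x and $w2 eq $y;
    }
    close BI;
    # I(x,y) = log2[ p(x,y) / (p(x)p(y)) ]; Perl's log() is natural log,
    # so divide by log(2) to convert the base
    my $mi = log(($fxy / $N) / (($fx / $N) * ($fy / $N))) / log(2);
    printf "I(%s,%s) = %.4f\n", $x, $y, $mi;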
Multiplying and dividing very small probabilities is awkward, so it is common to work with log probabilities instead, using the identities:

log(a * b) = log(a) + log(b)
log(a / b) = log(a) - log(b)

Try converting the formula for mutual information using these identities, so that probabilities are never multiplied or divided, before reading further.
Solution: log[ p(x,y) / (p(x)p(y)) ] = log p(x,y) - log p(x) - log p(y)
To really get a feel for things, first substitute in the maximum likelihood estimates and then convert to using log probabilities, i.e.
log[ (freq(x,y)/N) / ((freq(x)/N)(freq(y)/N)) ]
  = log freq(x,y) + log N - log freq(x) - log freq(y)
The distribution also includes a stopword filter, filter_stopwords, and a stopword list, stop.wrd. From inside the genesis directory, create symbolic links to them:

% ln -s ../filter_stopwords  <== Creates a symbolic link
% ln -s ../stop.wrd          <== Creates a symbolic link

Then run it:
% filter_stopwords stop.wrd < out.lr > out.lr.filtered
How does out.lr.filtered look as a file containing bigrams that are characteristic of this corpus?
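You don't need to write filter_stopwords yourself, since it comes with the exercise code, but to make the step concrete, here is a rough sketch of what such a filter might do. It assumes stop.wrd lists one stopword per line and that input lines look like "score word1 word2", keeping a line only if neither word is a stopword; the actual script's behavior may differ:

    #!/usr/bin/perl
    # Sketch of a stopword filter (illustration only; the distributed
    # filter_stopwords may work differently)
    # Usage: perl this_sketch.pl stop.wrd < out.lr > out.lr.filtered
    my $stopfile = shift;
    open(STOP, $stopfile) or die "can't open $stopfile: $!";
    while (<STOP>) {
        chomp;
        $stop{$_} = 1;              # remember each stopword
    }
    close STOP;
    while (<STDIN>) {
        my ($score, $w1, $w2) = split;   # assumed: score word1 word2
        print unless $stop{$w1} or $stop{$w2};
    }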
Notice that the counts so far treat capitalized and lowercase words as different. To see what difference case makes, convert the corpus to lowercase (working in the corpora directory):

% cat GEN.EN | tr "A-Z" "a-z" > GEN.EN.lc

To save disk space, assuming you're done with GEN.EN, delete the original:
% rm GEN.EN

Try re-doing the exercise with this version. What, if anything, changes?
Next, download one or more of the Sherlock Holmes e-texts into your corpora directory:

% cd corpora
% ftp umiacs.umd.edu
Name (yourname): anonymous
Password: name@address
ftp> cd pub/resnik/723/ebooks
ftp> dir
ftp> get adventures.dyl      <== Choose one or more
ftp> get hound.dyl
ftp> get study.dyl
ftp> bye                     <== Exit from ftp
Now get back into your stats directory, create an output directory, say, holmes1, and run the Stats program for the file of interest, e.g.:
% cd ..
% mkdir holmes1
% Stats corpora/study.dyl holmes1/out
% cd holmes1
Or perhaps convert to lowercase before running Stats:
% cd corpora
% cat study.dyl | tr "A-Z" "a-z" > study.lc
% rm study.dyl
% cat hound.dyl | tr "A-Z" "a-z" > hound.lc
% rm hound.dyl
% cat adventures.dyl | tr "A-Z" "a-z" > adventures.lc
% rm adventures.dyl
% cd ..
Look at out.lr, etc. for this corpus. Now go through the same process again, but creating a directory holmes2 and using a different file. Same author, same main character, same genre... how do the high-association bigrams compare between the two cases? If you use filter_stopwords, how do the results look -- what kinds of bigrams are you getting? What natural language processing problems might this be useful for?
When you are completely done with the exercise, please clean up by executing

% /bin/rm -rf corpora genesis holmes1 holmes2 holmes3

to delete those directories entirely. Your housekeeping will be much appreciated.