10.1 Using Hidden Markov Models for part-of-speech tagging

Re-read Allen, Sec. 7.3, from p. 195 to p. 201, but don't worry about the Viterbi algorithm; we're covering only the "brute force algorithm" Allen refers to on p. 201, which is to compute ALL POSSIBLE part-of-speech category sequences and compare their probabilities given the observed words.

AN IMPORTANT NOTE TO AVOID CONFUSION: Allen uses C to denote part-of-speech sequences and W for words, but in class I used W to denote the part-of-speech sequences and the letter O for the sequence of words. Keep this notational difference in mind when comparing your class notes to the book.

(a) Given the table of frequencies in Allen, Figure 7.5 (p. 199), what would the estimated probability of "flies like flowers" be according to a unigram model? Use maximum likelihood estimates to estimate the probabilities from frequencies. (Don't just give a number; show how you got it.) [10 pts]

(b) Using the probability estimates in the rightmost column of Allen's Figure 7.4, what is the estimated probability of "N V N" according to a bigram model? (Again, show why.) Notice (p. 197) that Allen uses a special symbol, a 0 with a slash through it, as a special "start" symbol, in just the way we used '#' in Problem 9.1(b) of your previous assignment. [10 pts]

(c) Derive an expression for Pr("N V N" | "flies like flowers") using Bayes' rule. Give the corresponding expression for Pr("N N V" | "flies like flowers"). Using the numbers in Figures 7.5 and 7.6, show which of these two probabilities is higher. [20 pts]

NOTE: Figure 7.6 is incomplete! Use .059 as the value of Pr(flowers | N), and use .07 as the value of Pr(flowers | V).

(d) Draw a diagram identifying the two components of the noisy channel model for part-of-speech tagging, showing the part-of-speech sequence "N V N" and the sentence "flies like flowers" in the appropriate places.
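Parts (b) and (c) reduce to the same noisy-channel computation: score each candidate tag sequence by Pr(tags) x Pr(words | tags), and the brute-force algorithm simply does this for every sequence. The sketch below illustrates the arithmetic with made-up transition and lexical probabilities; these are NOT the actual values from Figures 7.4-7.6 (you must look those up yourself), except for the two corrected "flowers" entries given in the NOTE above.

```python
from itertools import product
from math import prod

# Hypothetical probabilities, for illustration only -- NOT Allen's actual
# Figure 7.4 / 7.5 / 7.6 values. Only the two "flowers" entries below are
# the corrected values from the NOTE.
bigram = {            # Pr(tag2 | tag1); '#' is the start symbol
    ('#', 'N'): 0.29, ('#', 'V'): 0.01,
    ('N', 'N'): 0.13, ('N', 'V'): 0.43,
    ('V', 'N'): 0.35, ('V', 'V'): 0.01,
}
lexical = {           # Pr(word | tag)
    ('flies', 'N'): 0.025,   ('flies', 'V'): 0.076,
    ('like', 'N'): 0.012,    ('like', 'V'): 0.1,
    ('flowers', 'N'): 0.059, ('flowers', 'V'): 0.07,  # from the NOTE
}

def score(tags, words):
    """Pr(tags) * Pr(words | tags): the numerator of Bayes' rule,
    which is all we need to compare candidate tag sequences."""
    prior = prod(bigram.get((t1, t2), 0.0)
                 for t1, t2 in zip(['#'] + list(tags), tags))
    likelihood = prod(lexical.get((w, t), 0.0)
                      for w, t in zip(words, tags))
    return prior * likelihood

words = ['flies', 'like', 'flowers']
# Brute force: enumerate ALL 2^3 tag sequences over {N, V} and compare.
scores = {tags: score(tags, words) for tags in product('NV', repeat=3)}
best = max(scores, key=scores.get)
```

With these invented numbers, ('N', 'V', 'N') wins; with the real figures from the book, you should verify the comparison in part (c) by hand, showing each factor.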
Next to each component, give an expression for the probability computed by that component. [5 pts]

10.2 [45 pts] Read Church and Hanks, "Word Association Norms, Mutual Information, and Lexicography". If you have not already done so, I recommend that you download the statistical software described in Problem 9.3 of the previous assignment, execute Run as described there, and have a look at "infile.mi". [Note that I will use the terms "mutual information" and "the association ratio" more or less interchangeably.]

(a) The association ratio can be expressed as log(A/B). Explain what A and B are, and why A > B would indicate that two words are associated. Give an example.

(b) Explain how mutual information might be used by the following people, including both WHAT they might do and WHY mutual information might be useful or problematic for the task. Support any arguments you make by reference to the paper, to material covered in class, and/or to your experience using the statistical software.

(i) A lexicographer writing the "Dictionary of English as Used on the World Wide Web".

(ii) A technical writer at Sun Microsystems (a major computer manufacturer) writing the glossary to be used with a new set of software documentation (length approximately 200,000 words).

(iii) A publishing house that needs to quickly create a back-of-the-book index for a new psychology textbook.

(iv) A computational linguist creating a lexicon and grammar for English following the approach James Allen uses in his textbook.

The grading for this problem will focus on what you write for part (b), including such factors as correctness, the logic of the points you make, and the quality of the writing.
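To make the log(A/B) comparison in part (a) concrete: A is the estimated probability of seeing the two words together (within some window), and B is the probability of seeing them together that independence would predict, so a positive ratio means the pair co-occurs more than chance. The sketch below uses invented counts from an imaginary corpus, not any real frequency data.

```python
from math import log2

def association_ratio(c_xy, c_x, c_y, n):
    """Church & Hanks-style association ratio, log2(A / B), where
    A = P(x, y), estimated from joint (windowed) co-occurrence counts, and
    B = P(x) * P(y), what we'd expect if x and y were independent.
    A > B (ratio > 0) indicates the words are associated."""
    a = c_xy / n                  # observed joint probability
    b = (c_x / n) * (c_y / n)     # expected under independence
    return log2(a / b)

# Invented counts for an imaginary 1,000,000-word corpus: "strong" occurs
# 500 times, "tea" 200 times, and they fall in the same window 30 times.
mi = association_ratio(30, 500, 200, 1_000_000)
```

Here the observed joint probability (30/10^6) is 300 times the independence prediction (10^-7), so the ratio is log2(300), about 8.2 bits, i.e. a strongly associated pair under these assumed counts.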
