PROBABILISTIC MODELS

Do TWO of the following THREE problems (the grade is taken from the highest two if you do all three, which I encourage), and have a good Thanksgiving!

9.1 Recall that you compute the probability of a string W = w[1]...w[N] in a bigram model as

    Pr(w[1]...w[N]) = Pr(w[1]) x Pr(w[2]|w[1]) x Pr(w[3]|w[2]) x ... x Pr(w[N]|w[N-1])
                    = Pr(w[1]) x PRODUCT{i = 2 to N} Pr(w[i]|w[i-1])

Here I've used [i] to indicate that i is a subscript, and PRODUCT stands for the product operator usually written with the capital Greek letter pi.

(a) What is the corresponding expression for the probability of a string using a trigram model? [10 pts]

(b) Suppose you can assume that *every* string W begins with the symbol "#" acting as a start-of-string marker. For example, the string "Edgar likes sushi" would be

    # Edgar likes sushi

And assume that the symbol '#' is special, i.e. it's never used anywhere except as a start-of-string marker. Finally, assume that the first "real" word of W is still w[1], and that '#' is numbered as w[0]. E.g.

    #      Edgar   likes   sushi
    w[0]   w[1]    w[2]    w[3]

Can you revise your expression for Pr(W) in a bigram model so that it's just a product of probabilities? Do the same for the trigram model, assuming strings start with TWO of the '#' symbols. [10 pts] (A code sketch of this computation appears after problem 9.2.)

9.2 Consider the following probabilistic CFG. (Corrigendum: the rules "NP --> Det N (0.6)" and "NP --> Adj N (0.1)" from an earlier version have been replaced by the CNP rules and "NP --> Det CNP (0.7)" shown here.)

    S    --> NP VP   (0.5)
    S    --> S PP    (0.35)
    S    --> V NP    (0.15)
    VP   --> V NP    (0.6)
    VP   --> VP PP   (0.4)
    CNP  --> Adj CNP (0.143)
    CNP  --> N       (0.857)
    NP   --> Det CNP (0.7)
    NP   --> Name    (0.2)
    NP   --> NP PP   (0.1)
    PP   --> P NP    (1.0)
    Name --> Edgar (0.5) | Susan (0.5)
    Det  --> the (0.4) | a (0.4) | every (0.1) | some (0.1)
    Adj  --> tall (0.2) | roasted (0.1) | big (0.4) | yummy (0.3)
    N    --> sushi (0.2) | restaurants (0.2) | peanuts (0.3) | people (0.3)
    V    --> eat (0.2) | eats (0.2) | like (0.3) | likes (0.2) | detests (0.1)
    P    --> in (0.2) | with (0.3) | for (0.3) | near (0.2)

Notice that the last group of rules uses a space-saving abbreviation, showing all the possible expansions and their probabilities on the right-hand side.

(a) What is the probability of "Edgar eats the sushi with the peanuts"? Show your work; you can show how you would compute the probability rather than multiplying out the numbers. Just make sure it's clear where the numbers come from. [10 pts]

(b) What is the probability of "The people like the roasted peanuts"? What about "The peanuts like the roasted people"? What's the problem here, and how might one solve it? [10 pts] (The last part is open-ended; feel free to speculate on some possible solutions, but argue why they make sense.)
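To make the chain of conditional probabilities in 9.1 concrete, here is a minimal Python sketch (not part of the assignment's files; the probability table is made up purely for illustration, not estimated from anything):

    # Sketch: probability of a string under an n-gram model with '#'
    # start-of-string markers, as in problem 9.1(b). The probability
    # table below is a made-up illustration, not real estimates.

    bigram_prob = {
        ('#', 'Edgar'): 0.1,
        ('Edgar', 'likes'): 0.4,
        ('likes', 'sushi'): 0.2,
    }

    def string_prob(words, probs, order=2):
        """Pr(W) as a single product: pad the front with order-1 '#'
        markers, then multiply Pr(w[i] | previous order-1 words)
        for every real word of the string."""
        padded = ['#'] * (order - 1) + words
        p = 1.0
        for i in range(order - 1, len(padded)):
            context = tuple(padded[i - order + 1:i])
            p *= probs[context + (padded[i],)]
        return p

    print(string_prob(['Edgar', 'likes', 'sushi'], bigram_prob))
    # 0.1 * 0.4 * 0.2 = 0.008

With order=3, a trigram table, and TWO '#' markers, the same loop computes the trigram version; that is exactly why the start markers let you write Pr(W) as a single uniform product.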
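Similarly, for 9.2: the probability of a single PCFG derivation is just the product of the probabilities of the rules it uses. A sketch, listing one derivation I have read off the grammar above (for a simpler sentence than the ones in the problem):

    # Sketch: the probability of one PCFG derivation is the product of
    # the probabilities of the rules it uses (grammar of problem 9.2).
    # One derivation of "Edgar likes the sushi", rule by rule:
    derivation = [
        ('S -> NP VP',    0.5),
        ('NP -> Name',    0.2),
        ('Name -> Edgar', 0.5),
        ('VP -> V NP',    0.6),
        ('V -> likes',    0.2),
        ('NP -> Det CNP', 0.7),
        ('Det -> the',    0.4),
        ('CNP -> N',      0.857),
        ('N -> sushi',    0.2),
    ]

    prob = 1.0
    for rule, p in derivation:
        prob *= p
    print(prob)   # about 2.9e-4

Keep in mind for (a) that if a sentence has more than one distinct parse, its total probability is the SUM of its derivation probabilities, each derivation being a product like the one above.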
9.3 Create a directory hw9, go into it, and use ftp to connect to umiacs.umd.edu (logname "anonymous", your e-mail address as password). Then:

    cd pub/resnik/ling645/hw9
    mget *
    quit

In directory hw9, you need to make several files executable. It's easiest to just make all of them executable:

    chmod a+x *

File GEN.EN is the book of Genesis, King James version. Execute

    Run

to count unigrams and bigrams in that document. File infile.unigrams contains the unigrams and infile.bigrams contains the bigrams, sorted with the most frequent at the top. Make a note of the total counts for observed unigrams and bigrams.

(a) Look at the most frequent 10 or 20 unigrams. What can you say about the linguistic character of the most frequent words? What about the most frequent bigrams? [5 points]

(b) Using a maximum likelihood estimate of probabilities, what is the probability of the following string according to a UNIGRAM model (same assumptions about showing the probabilities as before)?

    I said unto him that it shall be thy land

(Note that you can bring up infile.unigrams in emacs and search for words to find their frequencies.) [5 points]

(c) Look at the most frequent bigrams in infile.bigrams. Does it look like bigram frequency is a good way of identifying linguistically salient phrases? Why or why not? [5 points]

(d) Look at infile.mi. The columns are

    I(w1,w2)   freq(w1)   freq(w2)   freq(w1,w2)   w1   w2

For example, the first row is

    11.7737   6   11   6   savoury meat

indicating that "savoury" occurred 6 times, "meat" occurred 11 times, and "savoury meat" occurred as an adjacent pair 6 times. The value I(w1,w2) is the mutual information of w1 and w2, which is described in your reading for this week. Look at the top 20 bigrams in infile.mi -- can you give a linguistic characterization of the pairs that have high mutual information, perhaps grouping them into classes that have similar characteristics? [5 points] (A sketch of this computation appears below.)

IN ORDER NOT TO WASTE DISK SPACE: When you are done, run

    Clean

and say yes, removing infile.unigrams, etc. You can always re-run Run in order to get them back if you need to.
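For intuition about what Run produces and what the MLE in (b) involves, here is a minimal Python sketch. It is NOT the actual Run script: the whitespace tokenization, the lowercasing, and reading GEN.EN from the current directory are simplifying assumptions, so its counts may not match infile.unigrams exactly.

    # Sketch of a unigram/bigram counter in the spirit of Run,
    # plus a maximum likelihood unigram model for problem (b).
    from collections import Counter

    with open('GEN.EN') as f:
        tokens = f.read().lower().split()   # assumed tokenization

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    total = sum(unigrams.values())   # total observed unigram count

    def mle_unigram_prob(words):
        """MLE Pr(W) under a unigram model: the product of
        count(w)/total over the words of W."""
        p = 1.0
        for w in words:
            p *= unigrams[w] / total
        return p

    print(mle_unigram_prob('i said unto him that it shall be thy land'.split()))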

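And for (d), the I(w1,w2) score is the pointwise mutual information of the pair. A sketch, reusing the counters from the sketch above; the base-2 logarithm and this particular normalization are my assumptions about the script's convention, so don't expect an exact match with infile.mi:

    # Sketch: pointwise mutual information for adjacent word pairs,
    # in the spirit of infile.mi. Assumes the unigrams/bigrams
    # counters defined in the previous sketch; base-2 logs and the
    # normalization by total counts are assumptions.
    from math import log2

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def mi(w1, w2):
        p_pair = bigrams[(w1, w2)] / n_bi    # Pr(w1 w2 adjacent)
        p1 = unigrams[w1] / n_uni            # Pr(w1)
        p2 = unigrams[w2] / n_uni            # Pr(w2)
        return log2(p_pair / (p1 * p2))

    print(mi('savoury', 'meat'))   # compare with 11.7737 in infile.mi

Note how the formula rewards pairs whose joint frequency is high relative to what their individual frequencies would predict by chance; that is worth keeping in mind when you group the top-20 pairs into classes.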