# Assignment 9, Ling 645/CMSC 723, Fall 1997

```

PROBABILISTIC MODELS

Do TWO of the following THREE problems (if you do all three, which I
encourage, your grade comes from the highest two) and have a good
Thanksgiving!

9.1  Recall that you compute the probability of a string W = w1...wN
in a bigram model as

Pr(w[1]...w[N])

= Pr(w[1]) x Pr(w[2]|w[1]) x Pr(w[3]|w[2]) x ... x Pr(w[N]|w[N-1])

= Pr(w[1]) x    PRODUCT    Pr(w[i]|w[i-1])
             {i = 2 to N}

Here I've used [i] to indicate that [i] is a subscript, and PRODUCT is
written using the capital Greek letter pi.

(a) What is the corresponding expression for the probability of a
string using a trigram model?  [10 pts]

(b) Suppose you can assume that *every* string W begins with
the symbol '#' acting as a start-of-string marker.
For example, the string "Edgar likes sushi" would be

# Edgar likes sushi

And assume that the symbol '#' is special, i.e. it's never
used anywhere except as a start-of-string marker.  Finally,
assume that the first "real" word of W is still w[1], and
that '#' is numbered as w[0].  E.g.

#     Edgar  likes   sushi
w[0]  w[1]   w[2]    w[3]

Can you revise your expression for Pr(W) in a bigram model
so that it's just a product of probabilities?  Do the same
for a trigram model, assuming every string begins with two
'#' symbols. [10 pts]

9.2  Consider the following probabilistic CFG

S    --> NP VP   (0.5)
S    --> S PP    (0.35)
S    --> V NP    (0.15)
VP   --> V NP    (0.6)
VP   --> VP PP   (0.4)

Corrigendum: replace
NP   --> Det N   (0.6)
with:
CNP  --> N         (0.857)
NP   --> Det CNP   (0.7)

NP   --> Name    (0.2)
NP   --> NP PP   (0.1)
PP   --> P NP    (1.0)

Name --> Edgar (0.5) | Susan (0.5)
Det  --> the (0.4) | a (0.4) | every (0.1) | some (0.1)
Adj  --> tall (0.2) | roasted (0.1) | big (0.4) | yummy (0.3)
N    --> sushi (0.2) | restaurants (0.2) | peanuts (0.3) | people (0.3)
V    --> eat (0.2) | eats (0.2) | like (0.3) | likes (0.2) | detests (0.1)
P    --> in (0.2) | with (0.3) | for (0.3) | near (0.2)

Notice that the last group of rules uses a space-saving abbreviation,
showing all the possible expansions and their probabilities on the
right-hand side.

(a)  What is the probability of "Edgar eats the sushi with the
peanuts"?  Show your work; you can show how you would compute
the probability rather than multiplying out the numbers.   Just
make sure it's clear where the numbers come from. [10 pts]

(b)  What is the probability of "The people like the roasted peanuts"?
What about "The peanuts like the roasted people"?  What's the
problem here, and how might one solve it?   [10 pts]  (The last
part is open ended; feel free to speculate on some possible
solutions, but argue why they make sense.)

9.3  Create a directory hw9, go into it, and use ftp to connect to
the course server.  Then:

cd pub/resnik/ling645/hw9
mget *
quit

In directory hw9, you need to make several files executable.
It's easiest to just make all of them executable:

chmod a+x *

File GEN.EN is the book of Genesis, King James version.
Execute

Run

to count unigrams and bigrams in that document.  File
infile.unigrams contains the unigrams and infile.bigrams
contains bigrams, sorted with most frequent at the top.
Make a note of what the total counts are for observed unigrams
and bigrams.

(a) Look at the most frequent 10 or 20 unigrams.  What can you
say about the linguistic character of the most frequent
words?  What about the most frequent bigrams?  [5 points]

(b) Using a maximum likelihood estimate of probabilities,
what is the probability of the following string according to
a UNIGRAM model (same assumptions about showing the probabilities
as before):

I said unto him that it shall be thy land

(Note that you can bring up infile.unigrams in emacs and
search for words to find their frequencies).  [5 points]

(c) Look at the most frequent bigrams in infile.bigrams.  Does it
look like bigram frequency is a good way of identifying
linguistically salient phrases?  Why or why not?  [5 points]

(d) Look at infile.mi.  The columns are

I(w1,w2)  freq(w1)  freq(w2)  freq(w1,w2)   w1   w2

For example, the first row is

11.7737       6        11          6     savoury   meat

indicating that "savoury" occurred 6 times, "meat" occurred 11
times, and the pair "savoury meat" occurred adjacent 6 times.  The
value I(w1,w2) is the mutual information of w1 and w2, which
is described in your reading for this week.  Look at the top
20 bigrams in infile.mi -- can you give a linguistic
characterization of pairs that have high mutual information,
perhaps grouping them into classes that have similar
characteristics?  [5 points]

IN ORDER NOT TO WASTE DISK SPACE:   When you are done, run

Clean

and say yes, removing infile.unigrams, etc.  You can always
re-run Run in order to get them back if you need to.

```
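The bigram computation in problem 9.1 can be sketched in a few lines of
Python.  The toy corpus and its counts below are invented purely for
illustration (they are not from the assignment's data); each string begins
with the '#' start-of-string marker from 9.1(b), so Pr(W) is just a product
of conditional probabilities with w[0] = '#':

```python
from collections import Counter

# A hypothetical toy corpus; '#' marks the start of each string, as in 9.1(b).
corpus = [
    "# Edgar likes sushi",
    "# Susan likes peanuts",
    "# Edgar eats sushi",
]

unigrams = Counter()
bigrams = Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(sentence):
    """Pr(w[1]...w[N]) = PRODUCT{i = 1 to N} Pr(w[i]|w[i-1]), with w[0] = '#'.
    Maximum likelihood estimate: Pr(w|v) = count(v w) / count(v)."""
    words = ["#"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(bigram_prob("Edgar likes sushi"))
```

Here Pr(Edgar|#) = 2/3, Pr(likes|Edgar) = 1/2, and Pr(sushi|likes) = 1/2,
so the product is 1/6.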
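For problem 9.2, the probability of one parse under the PCFG is just the
product of the probabilities of the rules the parse uses.  The sketch below
scores a parse of "Edgar likes the sushi" (a shorter sentence than 9.2(a)'s,
chosen for illustration) using the rules from the grammar, with the
corrigendum's NP --> Det CNP and CNP --> N in place:

```python
from math import prod

# Rule probabilities taken from the grammar in 9.2 (with the corrigendum).
rule_prob = {
    ("S", ("NP", "VP")): 0.5,
    ("VP", ("V", "NP")): 0.6,
    ("NP", ("Name",)): 0.2,
    ("NP", ("Det", "CNP")): 0.7,
    ("CNP", ("N",)): 0.857,
    ("Name", ("Edgar",)): 0.5,
    ("V", ("likes",)): 0.3,
    ("Det", ("the",)): 0.4,
    ("N", ("sushi",)): 0.2,
}

# One parse of "Edgar likes the sushi", listed as the sequence of rules it uses.
parse = [
    ("S", ("NP", "VP")),
    ("NP", ("Name",)),
    ("Name", ("Edgar",)),
    ("VP", ("V", "NP")),
    ("V", ("likes",)),
    ("NP", ("Det", "CNP")),
    ("Det", ("the",)),
    ("CNP", ("N",)),
    ("N", ("sushi",)),
]

# Pr(parse) = 0.5 x 0.2 x 0.5 x 0.6 x 0.3 x 0.7 x 0.4 x 0.857 x 0.2
p = prod(rule_prob[r] for r in parse)
print(p)
```

This is exactly the "show where the numbers come from" bookkeeping that
9.2(a) asks for, applied to a smaller example.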
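The counting that Run performs for problem 9.3 can be approximated as below.
This is only a sketch of the idea behind infile.unigrams and infile.bigrams;
the real script's tokenization may differ.  The sample text is the (lowercased)
opening verse of Genesis, and the maximum likelihood unigram estimate for
9.3(b) is Pr(w) = count(w) / total tokens:

```python
from collections import Counter
from math import prod

# Sample text: the opening of Genesis, King James version, lowercased.
text = "in the beginning god created the heaven and the earth"
tokens = text.split()

unigrams = Counter(tokens)                  # word counts
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent-pair counts
total = len(tokens)

def unigram_prob(sentence):
    """Maximum likelihood unigram model: Pr(w1...wN) is the product of
    count(w[i]) / total over the words of the sentence."""
    return prod(unigrams[w] / total for w in sentence.split())

print(unigram_prob("the heaven"))
```

With 10 tokens, "the" occurring 3 times, and "heaven" once, the result is
(3/10) x (1/10) = 0.03.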
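Finally, the I(w1,w2) column of infile.mi in 9.3(d) can be reproduced with one
common formulation of mutual information for an adjacent word pair,
I(w1,w2) = log2( P(w1,w2) / (P(w1) P(w2)) ), estimating each probability as a
count over the corpus size N.  The corpus size used below is an assumed
round figure, not a number given in the assignment:

```python
from math import log2

def mutual_information(freq_pair, freq_w1, freq_w2, n_tokens):
    """I(w1,w2) = log2( P(w1,w2) / (P(w1) * P(w2)) ), with each probability
    estimated as a raw count divided by the number of tokens in the corpus."""
    p_pair = freq_pair / n_tokens
    p_w1 = freq_w1 / n_tokens
    p_w2 = freq_w2 / n_tokens
    return log2(p_pair / (p_w1 * p_w2))

# With the counts from the "savoury meat" row (pair 6, "savoury" 6, "meat" 11)
# and an assumed corpus size of roughly 38,000 tokens for Genesis, this lands
# near the 11.77 value shown in infile.mi.
print(mutual_information(6, 6, 11, 38000))
```

Note that the pair count sits in the numerator once but the two word counts
each contribute a factor, which is why rare words that always co-occur (like
"savoury meat") get the highest scores.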