Assignment 9, Ling 645/CMSC 723, Fall 1997


Do TWO of the following THREE problems (grade is from the highest two
if you do all three, which I encourage) and have a good Thanksgiving!

9.1  Recall that you compute the probability of a string W = w[1]...w[N]
     in a bigram model as

       Pr(W) = Pr(w[1]) x Pr(w[2]|w[1]) x Pr(w[3]|w[2]) ... x Pr(w[N]|w[N-1])

             = Pr(w[1]) x    PRODUCT    Pr(w[i]|w[i-1])
                          {i = 2 to N}

     Here I've used [i] to indicate that i is a subscript, and PRODUCT
     stands for the product sign, usually written with the capital Greek
     letter pi.
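
As a minimal sketch of how the bigram formula above turns into a computation, the following Python multiplies the first word's probability by each conditional probability in turn. The probability values and the function name bigram_string_prob are invented for illustration; they do not come from any real corpus.

```python
from math import prod

# Hypothetical toy probabilities, invented for illustration only.
unigram = {"Edgar": 0.1}
bigram = {("Edgar", "likes"): 0.5, ("likes", "sushi"): 0.2}

def bigram_string_prob(words, unigram, bigram):
    """Pr(w[1]) x PRODUCT {i = 2 to N} Pr(w[i] | w[i-1])."""
    first = unigram[words[0]]
    # zip(words, words[1:]) yields the adjacent pairs (w[i-1], w[i]).
    rest = prod(bigram[(prev, cur)] for prev, cur in zip(words, words[1:]))
    return first * rest

p = bigram_string_prob(["Edgar", "likes", "sushi"], unigram, bigram)
print(p)  # roughly 0.1 * 0.5 * 0.2
```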

     (a) What is the corresponding expression for the probability of a
         string using a trigram model?  [10 pts]

     (b) Suppose you can assume that *every* string W begins with
         the symbol '#' acting as a start-of-string marker.
         For example, the string "Edgar likes sushi" would be

            # Edgar likes sushi

         And assume that the symbol '#' is special, i.e. it's never
	 used anywhere except as a start-of-string marker.  Finally,
	 assume that the first "real" word of W is still w[1], and
	 that '#' is numbered as w[0].  E.g.

            #     Edgar  likes   sushi
            w[0]  w[1]   w[2]    w[3]

	 Can you revise your expression for Pr(W) in a bigram model 
         so that it's just a product of probabilities? Do the same
         for the trigram model, assuming strings start with TWO of the
	 '#' symbols. [10 pts]

9.2  Consider the following probabilistic CFG

  S    --> NP VP   (0.5)   
  S    --> S PP    (0.35)  
  S    --> V NP    (0.15)  
  VP   --> V NP    (0.6) 
  VP   --> VP PP   (0.4) 

  Corrigendum: replace
  NP   --> Det N   (0.6)
  NP   --> Adj N   (0.1)
  with
  CNP  --> Adj CNP   (0.143)
  CNP  --> N         (0.857)
  NP   --> Det CNP   (0.7)

  NP   --> Name    (0.2)
  NP   --> NP PP   (0.1)
  PP   --> P NP    (1.0)

  Name --> Edgar (0.5) | Susan (0.5)
  Det  --> the (0.4) | a (0.4) | every (0.1) | some (0.1)
  Adj  --> tall (0.2) | roasted (0.1) | big (0.4) | yummy (0.3)
  N    --> sushi (0.2) | restaurants (0.2) | peanuts (0.3) | people (0.3)
  V    --> eat (0.2) | eats (0.2) | like (0.3) | likes (0.2) | detests (0.1)
  P    --> in (0.2) | with (0.3) | for (0.3) | near (0.2)

  Notice that the last group of rules uses a space-saving abbreviation,
  showing all the possible expansions and their probabilities on the
  right hand side.
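
The following Python sketch encodes the grammar as a table, checks that the expansions of each nonterminal sum to 1, and computes the probability of one parse tree as the product of the probabilities of the rules used in it. It assumes the corrigendum's CNP rules are the ones in force; the names GRAMMAR and tree_prob and the tuple encoding of trees are illustrative choices, and the example sentence is deliberately not one of the exercise sentences.

```python
from math import isclose, prod

# Each left-hand side maps to {right-hand side: probability}.
# Assumption: the corrigendum's NP and CNP rules replace the original ones.
GRAMMAR = {
    "S": {("NP", "VP"): 0.5, ("S", "PP"): 0.35, ("V", "NP"): 0.15},
    "VP": {("V", "NP"): 0.6, ("VP", "PP"): 0.4},
    "CNP": {("Adj", "CNP"): 0.143, ("N",): 0.857},
    "NP": {("Det", "CNP"): 0.7, ("Name",): 0.2, ("NP", "PP"): 0.1},
    "PP": {("P", "NP"): 1.0},
    "Name": {("Edgar",): 0.5, ("Susan",): 0.5},
    "Det": {("the",): 0.4, ("a",): 0.4, ("every",): 0.1, ("some",): 0.1},
    "Adj": {("tall",): 0.2, ("roasted",): 0.1, ("big",): 0.4, ("yummy",): 0.3},
    "N": {("sushi",): 0.2, ("restaurants",): 0.2,
          ("peanuts",): 0.3, ("people",): 0.3},
    "V": {("eat",): 0.2, ("eats",): 0.2, ("like",): 0.3,
          ("likes",): 0.2, ("detests",): 0.1},
    "P": {("in",): 0.2, ("with",): 0.3, ("for",): 0.3, ("near",): 0.2},
}

# The expansions of every nonterminal should sum to 1.
sums_ok = all(isclose(sum(rhs.values()), 1.0) for rhs in GRAMMAR.values())

def tree_prob(tree):
    """Probability of a parse tree written as (label, child, child, ...).

    A child is either another tuple (a subtree) or a string (a word).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = GRAMMAR[label][rhs]
    return p * prod(tree_prob(c) for c in children if not isinstance(c, str))

# One parse of "Susan likes Edgar".
tree = ("S",
        ("NP", ("Name", "Susan")),
        ("VP", ("V", "likes"), ("NP", ("Name", "Edgar"))))
p = tree_prob(tree)  # 0.5 * 0.2 * 0.5 * 0.6 * 0.2 * 0.2 * 0.5
print(sums_ok, p)
```

Note that this gives the probability of a single tree. When a string has more than one parse, the probability of the string is the sum over all of its parse trees.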

  (a)  What is the probability of "Edgar eats the sushi with the
       peanuts"?  Show your work; you can show how you would compute
       the probability rather than multiplying out the numbers.   Just
       make sure it's clear where the numbers come from. [10 pts]

  (b)  What is the probability of "The people like the roasted peanuts"? 
       What about "The peanuts like the roasted people"?  What's the
       problem here, and how might one solve it?   [10 pts]  (The last
       part is open ended; feel free to speculate on some possible
       solutions, but argue why they make sense.) 

9.3  Create a directory hw9, go into it, and use ftp to connect to (logname "anonymous", your e-mail address as password).

        cd pub/resnik/ling645/hw9
        mget *

     In directory hw9, you need to make several files executable.
     It's easiest to just make all of them executable:

        chmod a+x *

     File GEN.EN is the book of Genesis, King James version.

     Execute Run to count unigrams and bigrams in that document.  File
     infile.unigrams contains the unigrams and infile.bigrams
     contains bigrams, sorted with most frequent at the top.
     Make a note of what the total counts are for observed unigrams 
     and bigrams.
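
For readers curious about what such counting involves, here is a minimal Python sketch. It is not the course's script: the function name count_ngrams, the lowercasing, and the whitespace tokenization are all assumptions, so its counts on GEN.EN may differ from those in infile.unigrams and infile.bigrams.

```python
from collections import Counter

def count_ngrams(text):
    """Return (unigram counts, bigram counts) for whitespace-separated tokens."""
    tokens = text.lower().split()
    unigrams = Counter(tokens)
    # Adjacent pairs of tokens; a text of N tokens has N - 1 bigrams.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

# Toy input; to try the real text, pass open("GEN.EN").read() instead.
unigrams, bigrams = count_ngrams(
    "and God said let there be light and there was light")

total_unigrams = sum(unigrams.values())
total_bigrams = sum(bigrams.values())
print(total_unigrams, total_bigrams)
print(unigrams.most_common(3))  # most frequent items first
```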

     (a) Look at the most frequent 10 or 20 unigrams.  What can you 
         say about the linguistic character of the most frequent
	 words?  What about the most frequent bigrams?  [5 points]

     (b) Using a maximum likelihood estimate of probabilities,
         what is the probability of the following string according to
         a UNIGRAM model (same assumptions about showing the probabilities
         as before):

           I said unto him that it shall be thy land

         (Note that you can bring up infile.unigrams in emacs and
	 search for words to find their frequencies).  [5 points]
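
As a sketch of the maximum likelihood estimate in a unigram model, each word's probability is its count divided by the total number of observed unigrams, and the string's probability is the product of those values. The counts below are hypothetical and are not taken from infile.unigrams; the function name unigram_mle_prob is likewise illustrative.

```python
from math import prod

def unigram_mle_prob(words, counts, total):
    """Product of count(w) / total over the words of the string."""
    # A word that was never observed gets probability 0 under MLE.
    return prod(counts.get(w, 0) / total for w in words)

# Hypothetical counts, invented for illustration.
counts = {"in": 4, "the": 10, "beginning": 1}
total = 100

p = unigram_mle_prob(["in", "the", "beginning"], counts, total)
print(p)  # (4/100) * (10/100) * (1/100)
```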

     (c) Look at the most frequent bigrams in infile.bigrams.  Does it
         look like bigram frequency is a good way of identifying
	 linguistically salient phrases?  Why or why not?  [5 points]

     (d) Look at infile.mi.  The columns are

           I(w1,w2)  freq(w1)  freq(w2)  freq(w1,w2)   w1   w2

         For example, the first row is 

           11.7737          6         11    6 savoury   meat

        indicating that "savoury" occurred 6 times, "meat" occurred 11
        times, and "savoury meat" occurred as an adjacent pair 6 times.  The
	value I(w1,w2) is the mutual information of w1 and w2, which
	is described in your reading for this week.  Look at the top
	20 bigrams in infile.mi -- can you give a linguistic 
        characterization of pairs that have high mutual information,
	perhaps grouping them into classes that have similar
	characteristics?  [5 points]
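
A common way to compute the mutual information of a word pair from counts like those in the columns above is log2 of Pr(w1,w2) / (Pr(w1) x Pr(w2)), with each probability estimated by maximum likelihood. The Python sketch below assumes that definition, and assumes a single total is used to normalize both unigram and bigram counts; the reading for this week may define it slightly differently, and the counts in the example are hypothetical.

```python
from math import log2

def mutual_information(freq_w1, freq_w2, freq_pair, total):
    """log2( Pr(w1,w2) / (Pr(w1) * Pr(w2)) ) using relative frequencies."""
    p_w1 = freq_w1 / total
    p_w2 = freq_w2 / total
    p_pair = freq_pair / total
    return log2(p_pair / (p_w1 * p_w2))

# Hypothetical counts chosen so the arithmetic is easy to follow:
# (2/16) / ((2/16) * (4/16)) = 4, and log2(4) = 2.
score = mutual_information(freq_w1=2, freq_w2=4, freq_pair=2, total=16)
print(score)
```

Under this definition, a pair scores highly when the two words occur together far more often than their individual frequencies would predict, which is why rare words that nearly always co-occur rise to the top.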

      IN ORDER NOT TO WASTE DISK SPACE:   When you are done, run


      and say yes, removing infile.unigrams, etc.  You can always
      re-run Run in order to get them back if you need to.
