Project
Acquiring a Probabilistic Context-Free Grammar From a Treebank


Summary

In this project, students will write a program that learns a probabilistic context-free grammar from a treebank (i.e. from a set of correct parse trees).

Some Definitions

A context-free grammar (CFG) contains rewrite rules of the form
  X -> Y1 Y2 ... Yn
where X is a non-terminal symbol, and each symbol on the right-hand side of the rule is either a non-terminal symbol or a terminal symbol. We've seen and used simple grammars for a fragment of English many times in this course. Here's an example:
  S  -> NP VP            Adj -> tall   
  NP -> Det N            Adj -> short  
  NP -> Det Adj N        N -> man      
  NP -> John             N -> woman    
  NP -> Mary             N -> building 
  VP -> Vi               N -> song     
  VP -> Vt NP            Vi -> ate     
  Det -> the             Vi -> laughed 
  Det -> a               Vi -> sang    
  Det -> some            Vt -> saw     
                         Vt -> sang    

A probabilistic context-free grammar (PCFG) is a context-free grammar where, for every non-terminal symbol, the possible right-hand sides all have probabilities associated with them. For any given non-terminal symbol (e.g. NP), all the probabilities of the different right-hand sides must sum to 1. For example:

  S  -> NP VP      [1.0]

  NP -> Det N      [0.6]
  NP -> Det Adj N  [0.2]
  NP -> John       [0.1]
  NP -> Mary       [0.1]

  VP -> Vi         [0.3]
  VP -> Vt NP      [0.7]

  ...etc.
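
Notice, for instance, that the four NP rules above satisfy this constraint: 0.6 + 0.2 + 0.1 + 0.1 = 1.0.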

Learning a PCFG

A common approach to learning PCFGs is to take a corpus of parsed sentences ("treebank") as input, and to extract context-free rules and their probabilities from the parse trees. For example, given a treebank containing these two trees,
       S                    S
      / \                  / \
     /   \                /   \
    NP    VP             NP    VP
    |    /  \            |     |
  John  Vt   NP         Mary   Vi
        |    |                 |
       saw  Mary            laughed
the first tree can be seen as containing the context-free rules
  S -> NP VP
  NP -> John
  VP -> Vt NP
  Vt -> saw
  NP -> Mary
and the second tree as containing the rules
  S -> NP VP
  NP -> Mary
  VP -> Vi
  Vi -> laughed
Thus the treebank contains the following rules and frequencies:
  S -> NP VP     (2)

  NP -> John     (1)
  NP -> Mary     (2)

  VP -> Vt NP    (1)
  VP -> Vi       (1)

  Vt -> saw      (1)

  Vi -> laughed  (1)
These can be converted to probabilities by normalizing so that the right-hand sides for each non-terminal have probabilities that sum to 1:
  S -> NP VP       Prob = 2/2 = 1.0

  NP -> John       Prob = 1/3 = 0.3333
  NP -> Mary       Prob = 2/3 = 0.6667

  VP -> Vt NP      Prob = 1/2 = 0.5
  VP -> Vi         Prob = 1/2 = 0.5

  Vt -> saw        Prob = 1/1 = 1.0

  Vi -> laughed    Prob = 1/1 = 1.0
To take one example in detail, the NP rule probabilities can be calculated by keeping track of (or computing) the total number of NP rules (here equal to 3), and converting the frequency of each NP rule to a probability by dividing by that total. In this example, we see that for this corpus, whenever there's an NP, there is a 1/3 chance that it will be expanded to the symbol John and a 2/3 chance that it will be expanded to the symbol Mary.

Let's express this more precisely as an algorithm (what follows is only one of many possible ways of doing it). Given a treebank:

  For each tree in the treebank
    Get the context-free rules from the tree
    For each such context-free rule R 
      Update the frequency of R
      Update the frequency of LHS(R)

  For every context-free rule R encountered in the treebank
    Let freq(R) be the frequency of R 
    Let freq(LHS(R)) be the frequency of the LHS of R
    Let prob = freq(R)/freq(LHS(R))
    Output:  R prob
Thus the end result of the algorithm is a probabilistic CFG, represented as a table of context-free rules and their probabilities. (LHS = left-hand-side, i.e. the nonterminal being expanded.)
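
To make the bookkeeping concrete, here is a minimal Common Lisp sketch of the counting-and-normalizing step. It assumes a helper function EXTRACT-RULES (one possible version is sketched after the treebank example below) that returns the rules contained in a single tree, each rule represented as a cons of the LHS symbol and the list of RHS symbols. The names and the representation are just illustrative choices, not requirements.

  ;; Sketch only: count rule and LHS frequencies, then print probabilities.
  ;; Assumes EXTRACT-RULES returns rules like (NP . (DT NN)) for one tree.
  (defun learn-pcfg (treebank)
    (let ((rule-counts (make-hash-table :test #'equal))  ; freq(R)
          (lhs-counts  (make-hash-table :test #'eq)))    ; freq(LHS(R))
      (dolist (tree treebank)
        (dolist (rule (extract-rules tree))
          (incf (gethash rule rule-counts 0))
          (incf (gethash (car rule) lhs-counts 0))))
      ;; Emit each rule with prob = freq(R) / freq(LHS(R)).
      (maphash (lambda (rule count)
                 (let ((prob (/ count (gethash (car rule) lhs-counts))))
                   (format t "~S ~S~%"
                           (list (car rule) '-> (cdr rule))
                           prob)))
               rule-counts)))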

The Project

You will be given a LISP-readable treebank file, treebank.dat, containing parse trees from the Penn Treebank. For example:

  ( (S 
      (NP (DT The) (NNP Fulton) (NNP County) (NNP Grand) (NNP Jury) )
      (VP (VBD said) 
        (NP (NNP Friday) )
        (SBAR (-NONE- 0) 
          (S 
            (NP (DT an) (NN investigation) 
              (PP (IN of) 
                (NP 
                  (NP (NNP Atlanta) )
                  (POS 's) (JJ recent) (JJ primary) (NN election) )))
            (VP (VBD produced) 
              (NP (OQUOTE OQUOTE) (DT no) (NN evidence) (CQUOTE CQUOTE) 
                (SBAR (IN that) 
                  (S 
                    (NP (DT any) (NNS irregularities) )
                    (VP (VBD took) 
                      (NP (NN place) ))))))))))
    (PERIOD PERIOD) )

  ((S 
     (NP (DT the) (JJ new) (NN management) )
     (VP (VBZ takes) 
       (NP (NN charge) ))
     (NP (NNP JanPERIOD) (CD 1) ))
   (PERIOD PERIOD) )
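
Each tree in the file is just a nested list: the first element of a list is the node label, and the remaining elements are its subtrees (with the words themselves at the leaves). That means the rules can be read off with a short recursive walk. Here is one possible sketch; the function name and the (LHS . RHS-list) rule representation are the same illustrative choices used in the earlier counting sketch, not requirements.

  ;; Sketch only: collect the rules contained in one parse tree.
  ;; A tree is either a word (an atom) or a list (LABEL child1 child2 ...).
  (defun extract-rules (tree)
    (if (atom tree)
        '()                                 ; a bare word contributes no rule
        (let* ((lhs (first tree))
               (children (rest tree))
               ;; the RHS is each child's label (or the word itself at a leaf)
               (rhs (mapcar (lambda (child)
                              (if (atom child) child (first child)))
                            children)))
          (cons (cons lhs rhs)              ; the rule rooted at this node
                (mapcan #'extract-rules children)))))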

You should produce a file called "pcfg.dat" that contains, on each line, a PCFG rule followed by its probability. The rules don't have to be in any particular order, and the probabilities can be represented either as rational numbers (e.g. 1/2) or as floating-point numbers (e.g. 0.5); it doesn't matter which. For example, here are some rules from an output file, sorted so that rules with the same left-hand side are grouped together:
  (DT -> (A)) 54/245
  (DT -> (AN)) 6/245
  (DT -> (ANY)) 1/245
  (DT -> (BOTH)) 3/245
  (MD -> (MAY)) 1/35
  (MD -> (MIGHT)) 1/35
  (MD -> (MUST)) 3/35
  (MD -> (SHOULD)) 6/35
  (MD -> (WILL)) 2/5
  (NNP -> (WILLIAM)) 2/251
  (NNP -> (WILLIAMS)) 5/251
  (NNP -> (WPERIOD)) 1/251
  (NNS -> (ACTIONS)) 1/115
  (NNS -> (ADJUSTMENTS)) 1/115
  (NNS -> (ADMINISTRATORS)) 1/115
  (NP -> (DT JJ NN)) 12/727
  (NP -> (DT JJ NNP)) 1/727
  (NP -> (DT NN PP)) 23/727
  (NP -> (DT NN)) 82/727
Notice the output format for CFG rules: a "rule" is a list containing three elements:
  1. The left-hand-side (LHS) nonterminal being expanded
  2. The symbol '-> (included for easier readability)
  3. A list containing the sequence of symbols to which the LHS expands
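
If you keep your counts as integers, LISP's rational arithmetic produces fractions like 82/727 with no extra work, and FORMAT can print lines in exactly this shape. A minimal sketch, again assuming the illustrative (LHS . RHS-list) rule representation from the earlier sketches:

  ;; Sketch only: write one "(LHS -> (RHS ...)) prob" line to a stream.
  (defun print-pcfg-rule (rule prob stream)
    (format stream "~S ~S~%" (list (car rule) '-> (cdr rule)) prob))

  ;; e.g. (print-pcfg-rule '(NP DT NN) 82/727 t)
  ;; prints:  (NP -> (DT NN)) 82/727
  ;;
  ;; Writing the whole table could then be wrapped in something like
  ;; (with-open-file (out "pcfg.dat" :direction :output :if-exists :supersede)
  ;;   ...)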

What to Turn In

  1. I'd like to see an on-paper sketch of your overall design by November 30.

  2. Give me a hardcopy of your code, including header documentation (author, date, etc., plus a top-level sketch of your approach).

  3. E-mail me the output file.

  4. Due date: December 19, 2000.

Hints

  1. I strongly recommend you get started on this early.

  2. I also strongly recommend that you start by going through the top-down process we've talked about, sketching out the high-level structure of what needs to be done and iteratively refining each piece. (You don't have to use the algorithm I presented above if you don't want to.)

  3. You need to understand how the trees are represented in LISP. Look at the examples above for reference.

  4. It might be useful to define a tree or two to test on while developing your code (a tiny example appears after this list), and/or to create an input file containing just one or two trees when you get to the file I/O. If you try to work with the entire treebank file from the start, you are headed for disaster. Start with tiny little trees and work your way up.

  5. I strongly recommend you get started on this early. (This is repeated on purpose.)
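
For hint 4, a test tree can be as small as this (any tiny tree in the same nested-list format as the treebank will do):

  (defparameter *tiny-tree*
    '(S (NP (NNP Mary))
        (VP (VBD laughed))))

  ;; With the extraction sketch above, (extract-rules *tiny-tree*) yields
  ;; rules equivalent to:  S -> NP VP, NP -> NNP, NNP -> Mary,
  ;;                       VP -> VBD, VBD -> laughed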

Have fun!