CS 723 Project 2: Syntactic Parsing

In contrast to P1, where most of the info was here, this doc is now just a schematic and the main details are in the relevant .py files. We're doing parsing :).

The code for this project consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download all the code and supporting files (including this description) as a tar archive.

Files you'll edit:
  extractGrammar.py   The place where you'll put your code for part III (parsing English).
  grammar.py          The place where you'll put your code for part II (time flies).
  parser.py           The place where you'll put your code for part I (CKY).

Files you might want to look at:
  tree.py             The tree data structure we'll use.
  util.py             A handful of useful utility functions: these will undoubtedly be helpful to you, so take a look!

Unary Rules in CKY (25%)

Open up parser.py. I've given you an almost-complete implementation of CKY. It initializes the bottom cells of the chart, then applies unary rules there, and applies binary rules in the entire chart. All you have to do is apply unary rules in the recursive case. If you've correctly implemented this, and loaded all the relevant files (util, tree, grammar and parser), you should be able to run:
>>> print str(timeFliesPCFG)
Noun => flies	| 0.4
Noun => arrow	| 0.4
Noun => time	| 0.2
TOP => S	| 1.0
Det => an	| 1.0
VP => Verb PP	| 0.2
VP => Verb NP_PP	| 0.1
VP => Verb NP	| 0.6
VP => Verb	| 0.1
S => VP	| 0.2
S => VP PP	| 0.1
S => NP VP_PP	| 0.2
S => NP VP	| 0.5
VP_PP => VP PP	| 1.0
NP_PP => NP PP	| 1.0
Prep => like	| 1.0
PP => Prep NP	| 1.0
Verb => flies	| 0.5
Verb => time	| 0.3
Verb => like	| 0.2
NP => Noun	| 0.3
NP => Det Noun	| 0.7

>>> print timeFliesSent
['time', 'flies', 'like', 'an', 'arrow']

>>> parse(timeFliesPCFG, timeFliesSent)
(TOP: (S: (NP: (Noun: 'time')) (VP: (Verb: 'flies') (PP: (Prep: 'like') (NP: (Det: 'an') (Noun: 'arrow'))))))
Note that you cannot get this output without correctly handling unary rules: TOP => S is itself a unary rule, so without them you'll never be able to get the TOP at the root.
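
To make "applying unary rules" to a chart cell concrete, here is a minimal sketch that is independent of parser.py. It assumes a cell is a plain dict from label to best probability and that unary rules come as (parent, child, prob) triples; both layouts and the name applyUnaryRules are hypothetical, and the real chart in parser.py will likely need backpointers as well.

```python
def applyUnaryRules(cell, unaryRules):
    """Apply unary rules to one chart cell until no entry improves.

    cell:       dict mapping label -> best probability so far (assumed layout)
    unaryRules: list of (parent, child, prob) triples (assumed layout)
    """
    changed = True
    while changed:
        changed = False
        for parent, child, prob in unaryRules:
            if child not in cell:
                continue
            candidate = prob * cell[child]
            # Keep the parent only if it beats what the cell already holds.
            if candidate > cell.get(parent, 0.0):
                cell[parent] = candidate
                changed = True
    return cell


# The unary rules over nonterminals in timeFliesPCFG.
unaryRules = [('TOP', 'S', 1.0), ('S', 'VP', 0.2),
              ('VP', 'Verb', 0.1), ('NP', 'Noun', 0.3)]

# A cell holding only an S (the probability is made up for illustration).
topCell = applyUnaryRules({'S': 0.01}, unaryRules)

# Chains of unary rules are followed too: Verb -> VP -> S -> TOP.
chainCell = applyUnaryRules({'Verb': 0.5}, unaryRules)
```

The loop repeats until nothing changes because one unary rule can enable another, as in the Verb, VP, S, TOP chain above.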

Warming up with Time Flies (25%)

The analysis from before gave us a specific interpretation of the time flies sentence. Suppose we want to adjust the grammar so that we get a different (specific) interpretation. Open up grammar.py and take a look at the definition of timeFliesPCFG and the desired analysis, in desiredTimeFliesParse. Create a new grammar called timeFliesPCFG2 that gives the same parse as in desiredTimeFliesParse. Note: you may NOT change the rules of the grammar, only their probabilities; none of your probabilities may be less than 0.1, and the probabilities of all rules with the same left-hand side must add to one. You can compare your output to the desired tree with:
>>> myTree = parse(timeFliesPCFG, timeFliesSent)
>>> print myTree
(TOP:
  (S:
    (NP: (Noun: 'time'))
    (VP:
      (Verb: 'flies')
      (PP: (Prep: 'like') (NP: (Det: 'an') (Noun: 'arrow'))))))

>>> print desiredTimeFliesParse
(TOP:
  (S:
    (VP:
      (Verb: 'time')
      (NP: (Noun: 'flies'))
      (PP: (Prep: 'like') (NP: (Det: 'an') (Noun: 'arrow'))))))

>>> evaluate(desiredTimeFliesParse, myTree)
This is the result using the grammar I gave you. With your new grammar, you should be able to get an evaluation (accuracy) of 1.0.
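
Before parsing with a re-weighted grammar, it can help to check the two constraints mechanically. The sketch below assumes rules are given as (lhs, rhs, prob) triples; that layout and the name checkGrammar are hypothetical, so adapt it to however grammar.py actually stores timeFliesPCFG.

```python
from collections import defaultdict


def checkGrammar(rules, minProb=0.1, tol=1e-9):
    """Return a list of constraint violations (empty if the grammar is fine)."""
    problems = []
    totals = defaultdict(float)
    for lhs, rhs, prob in rules:
        totals[lhs] += prob
        if prob < minProb - tol:
            problems.append('%s => %s has probability %g < %g'
                            % (lhs, ' '.join(rhs), prob, minProb))
    # Probabilities must sum to one for each left-hand side separately.
    for lhs, total in sorted(totals.items()):
        if abs(total - 1.0) > tol:
            problems.append('rules for %s sum to %g, not 1' % (lhs, total))
    return problems


# The VP rules from timeFliesPCFG satisfy both constraints.
vpRules = [
    ('VP', ('Verb', 'PP'), 0.2),
    ('VP', ('Verb', 'NP_PP'), 0.1),
    ('VP', ('Verb', 'NP'), 0.6),
    ('VP', ('Verb',), 0.1),
]

# A made-up re-weighting that sums to one but dips below 0.1.
badRules = [('S', ('VP',), 0.95), ('S', ('NP', 'VP'), 0.05)]

vpProblems = checkGrammar(vpRules)
badProblems = checkGrammar(badRules)
```

The small tolerance is there because sums such as 0.2 + 0.1 + 0.6 + 0.1 are not exactly 1.0 in floating point.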

Parsing English (50%)

Your final task is to do a good job of parsing English. Open up extractGrammar.py and take a look at computePCFG. This will read data and compute a PCFG out of it. For instance:
>>> pcfg = computePCFG('wsj.dev')

>>> len(pcfg)
650

>>> print str(pcfg)
PP => VBG PP	| 1
PP => TO NP	| 23
PP => IN ADJP	| 1
PP => IN_NP NP	| 2
PP => TO S	| 1
PP => VBN PP	| 3
PP => IN SBAR	| 1
PP => IN NP	| 139
This shows that there are 650 unique rules in this PCFG, and that the most frequent PP rules were "TO NP" (count of 23) and "IN NP" (count of 139). We can look at a larger data set:
>>> pcfg = computePCFG('wsj.train')

>>> len(pcfg)
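
The values printed for wsj.dev are raw counts rather than probabilities. As a small worked example, here is the usual relative-frequency estimate applied to just the PP rules listed above. This is a sketch: whether and where computePCFG or the parser performs this normalization is not stated here, and the name normalize is hypothetical.

```python
# Counts of the PP rules as printed for wsj.dev, keyed by right-hand side.
ppCounts = {
    ('VBG', 'PP'): 1, ('TO', 'NP'): 23, ('IN', 'ADJP'): 1, ('IN_NP', 'NP'): 2,
    ('TO', 'S'): 1, ('VBN', 'PP'): 3, ('IN', 'SBAR'): 1, ('IN', 'NP'): 139,
}


def normalize(counts):
    """Turn the counts for one left-hand side into probabilities."""
    total = float(sum(counts.values()))
    return dict((rhs, count / total) for rhs, count in counts.items())


ppProbs = normalize(ppCounts)
# The listed counts sum to 171, so P(PP => IN NP) = 139/171, about 0.81.
```
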
By default, computePCFG will learn a completely unlexicalized PCFG, which means that it can only parse POS sequences, as in the following two time-flies-esque examples:
>>> parse(pcfg, ['NN', 'VBZ', 'IN', 'DT', 'NN'])
(TOP: (S: (NP: 'NN') (VP: 'VBZ' (PP: 'IN' (NP: 'DT' 'NN')))))

>>> parse(pcfg, ['VBZ', 'NN', 'IN', 'DT', 'NN'])
(TOP: (S: (VP: (_VBZ_NP: 'VBZ' (NP: 'NN')) (PP: 'IN' (NP: 'DT' 'NN')))))
You'll notice that the tree that came out the second time is binarized. We can de-binarize it:
>>> print nonBinaryTree
      (DT: 'the')
      (RB: 'really')
      (JJ: 'happy')
      (NN: 'computer')
      (NN: 'science')
      (NN: 'student'))
    (VP: (VBD: 'loves') (NP: (NNP: 'CL1')))
    (.: '.')))

>>> print binarizeTree(nonBinaryTree)
              (_DT_RB: (DT: 'the') (RB: 'really'))
              (JJ: 'happy'))
            (NN: 'computer'))
          (NN: 'science'))
        (NN: 'student'))
      (VP: (VBD: 'loves') (NP: (NNP: 'CL1'))))
    (.: '.')))

>>> print debinarizeTree(binarizeTree(nonBinaryTree))
      (DT: 'the')
      (RB: 'really')
      (JJ: 'happy')
      (NN: 'computer')
      (NN: 'science')
      (NN: 'student'))
    (VP: (VBD: 'loves') (NP: (NNP: 'CL1')))
    (.: '.')))
You'll note that this implementation of binarization does NO parent annotation (i.e., vertical order is 1 in the Klein+Manning notation) and NO horizontal Markovization (i.e., horizontal order is infinity, so each intermediate label remembers all of the siblings it covers). One very important thing is that the debinarization assumes that the label of any node introduced by binarization starts with "_". So please maintain this invariant or debinarization won't work!
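
To illustrate the "_" invariant, here is a minimal sketch of left-branching binarization and its inverse on plain (label, children) tuples with string leaves. The tuple layout, the function names, and the exact spelling of the longer intermediate labels (e.g. _DT_RB_JJ) are assumptions for illustration; the real binarizeTree and debinarizeTree work on the tree.py data structure and may differ in detail.

```python
def labelOf(node):
    """Label of a subtree; a bare string leaf is its own label."""
    return node if isinstance(node, str) else node[0]


def binarize(tree):
    """Left-branching binarization; new nodes get labels starting with '_'."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(child) for child in children]
    if len(children) <= 2:
        return (label, children)
    left = children[0]
    covered = [labelOf(left)]
    for child in children[1:-1]:
        covered.append(labelOf(child))
        left = ('_' + '_'.join(covered), [left, child])
    return (label, [left, children[-1]])


def debinarize(tree):
    """Undo binarize by splicing out every node whose label starts with '_'."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    flat = []
    for child in children:
        child = debinarize(child)
        if not isinstance(child, str) and child[0].startswith('_'):
            flat.extend(child[1])  # intermediate node: keep only its children
        else:
            flat.append(child)
    return (label, flat)


np = ('NP', [('DT', ['the']), ('RB', ['really']), ('JJ', ['happy']),
             ('NN', ['computer']), ('NN', ['science']), ('NN', ['student'])])
binarized = binarize(np)
restored = debinarize(binarized)
```

If an intermediate label did not start with "_", debinarize would leave that node in place and the restored tree would no longer match the original.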

You can evaluate this PCFG by loading parser.py and running:

>>> evaluateParser(pcfg, 'wsj.dev')
You might notice this takes forever. To make it faster, you can specify a pruning threshold. A pruning threshold of 0.1 means that once a cell is filled up, we take the probability of the best item in it. Say that's probability 0.23. Multiply it by 0.1 to get 0.023. Now, anything else in that cell whose probability is less than 0.023 will be deleted. This makes parsing faster at the expense of accuracy. You can pass this threshold to evaluateParser:
>>> evaluateParser(pcfg, 'wsj.dev', pruningPercent=0.00001)

>>> evaluateParser(pcfg, 'wsj.dev', pruningPercent=0.001)

>>> evaluateParser(pcfg, 'wsj.dev', pruningPercent=0.1)
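
The pruning rule can be sketched in a few lines. This assumes a cell is a plain dict from label to probability; that layout and the name pruneCell are hypothetical and not necessarily how parser.py does it.

```python
def pruneCell(cell, pruningPercent):
    """Drop every item whose probability is below best * pruningPercent."""
    if not cell:
        return cell
    threshold = max(cell.values()) * pruningPercent
    return dict((label, prob) for label, prob in cell.items()
                if prob >= threshold)


# The best item has probability 0.23, so with a pruning threshold of 0.1
# the cutoff is about 0.023: VP (0.05) survives and S (0.01) is deleted.
cell = {'NP': 0.23, 'VP': 0.05, 'S': 0.01}
pruned = pruneCell(cell, 0.1)
```

A smaller pruningPercent such as 0.00001 lowers the cutoff, so fewer items are deleted and parsing is slower but closer to exact.
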
Your first task is to implement horizontal Markovization in the extractGrammar file. You should be able to test this with:
>>> pcfg = computePCFG('wsj.train', horizSize=2)
>>> evaluateParser(pcfg, 'wsj.dev', pruningPercent=0.00001, horizSize=2)
This implementation is worth 20%.
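
One plausible reading of horizSize, sketched below, is that an intermediate label keeps only the last horizSize sibling labels it covers instead of all of them. The function name intermediateLabel and this exact labeling scheme are assumptions; Klein+Manning-style labels also record the parent category, which the labels shown earlier in this document do not.

```python
def intermediateLabel(coveredLabels, horizSize=None):
    """Label for a binarization node, remembering at most horizSize siblings.

    horizSize=None stands for infinite horizontal order (remember everything).
    """
    if horizSize is not None:
        start = max(0, len(coveredLabels) - horizSize)
        coveredLabels = coveredLabels[start:]
    # Keep the leading '_' so that debinarization still recognizes the node.
    return '_' + '_'.join(coveredLabels)


full = intermediateLabel(['DT', 'RB', 'JJ', 'NN'])
short = intermediateLabel(['DT', 'RB', 'JJ', 'NN'], horizSize=2)
other = intermediateLabel(['PRP$', 'JJ', 'NN'], horizSize=2)
```

With horizSize=2, the two different sibling sequences above end up with the same intermediate label, which is the point: rare long rules share statistics with more common ones.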

Your second task is to implement vertical annotation (i.e., ancestor annotations). Likewise, you can test this with:

>>> pcfg = computePCFG('wsj.train', verticSize=2)
>>> evaluateParser(pcfg, 'wsj.dev', pruningPercent=0.00001, verticSize=2)
This implementation is also worth 20%.
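
As an illustration of ancestor annotation, the sketch below appends the nearest verticSize-1 ancestor labels to each nonterminal of a (label, children) tuple tree with string leaves. The tuple layout, the "^" separator, and the name annotateParents are all assumptions for illustration, not what extractGrammar.py prescribes.

```python
def annotateParents(tree, verticSize=2, ancestors=()):
    """Annotate each nonterminal with its nearest verticSize-1 ancestors."""
    if isinstance(tree, str):
        return tree  # leaves (POS tags here) are left untouched
    label, children = tree
    # With verticSize=1 there is no annotation at all.
    context = ancestors[-(verticSize - 1):] if verticSize > 1 else ()
    newLabel = '^'.join((label,) + tuple(reversed(context)))
    return (newLabel, [annotateParents(child, verticSize, ancestors + (label,))
                       for child in children])


# The first POS-sequence parse shown earlier, as nested tuples.
tree = ('TOP', [('S', [('NP', ['NN']),
                       ('VP', ['VBZ', ('PP', ['IN', ('NP', ['DT', 'NN'])])])])])
annotated = annotateParents(tree, verticSize=2)
```

Here the subject becomes NP^S and the NP inside the PP becomes NP^PP, so the grammar can give them different expansions. Since the gold trees use plain labels, the annotations would presumably have to be removed again before evaluation, much like debinarization.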

The final challenge (worth the last 10%) is to make the best unlexicalized grammar that you can. You need to beat the best results obtained by varying horizSize in { 1, 2, 3, 4 } and verticSize in { 1, 2, 3, 4 }. For every percentage point of accuracy you get above the best of those, you'll get TWO percentage points on this task (up to 5*2=10, of course :P). Moreover, the top three teams will get extra credit of 10%, 7% and 5%, respectively, on this project. You can run the test as:

>>> runParserOnTest(pcfg, 'wsj.test', 'wsj.test.out', pruningPercent=0.001)
You can also pass verticSize and horizSize to runParserOnTest. In addition, you can pass "runFancyCode=True" to both evaluateParser and runParserOnTest, which will pass this flag down to the binarizeTree function, and you can do whatever you want in there to try to get better performance. The output of runParserOnTest will go to wsj.test.out, which you can submit for evaluation (the evaluation runs once per night until the last few days, at which point it will run once per hour).