X -> Y1 Y2 ... Ynwhere X is a non-terminal symbol, and each symbol on the right-hand side of the rule is either a non-terminal symbol or a terminal symbol. We've seen and used simple grammars for a fragment of English many times in this course. Here's an example:
S -> NP VP Adj -> tall NP -> Det N Adj -> short NP -> Det Adj N N -> man NP -> John N -> woman NP -> Mary N -> building VP -> Vi N -> song VP -> Vt NP Vi -> ate Det -> the Vi -> laughed Det -> a Vi -> sang Det -> some Vt -> saw Vt -> sang
A probabilistic context-free grammar (PCFG) is a context-free grammar where, for every non-terminal symbol, the possible right-hand sides all have probabilities associated with them. For any given non-terminal symbol (e.g. NP), all the probabilities of the different right-hand sides must sum to 1. For example:
S -> NP VP [1.0] NP -> Det N [0.6] NP -> Det Adj N [0.2] NP -> John [0.1] NP -> Mary [0.1] VP -> Vi [0.3] VP -> Vt NP [0.7] ...etc.
S S / \ / \ / \ / \ NP VP NP VP | | \ | | John Vt NP Mary Vi | \ | saw Mary laughedthe first tree can be seen as containing the context-free rules
S -> NP VP NP -> John VP -> Vt NP Vt -> saw NP -> Maryand the second tree as containing the rules
S -> NP VP NP -> Mary VP -> Vi Vi -> laughedThus the treebank contains the following rules and frequencies:
S -> NP VP (2) NP -> John (1) NP -> Mary (2) VP -> Vt NP (1) VP -> Vi (1) Vt -> saw (1) Vi -> laughed (1)These can be converted to probabilities by normalizing so that the right-hand sides for each non-terminal have probabilities that sum to 1:
S -> NP VP Prob = 2/2 = 1.0 NP -> John Prob = 1/3 = 0.3333 NP -> Mary Prob = 2/3 = 0.6666 VP -> Vt NP Prob = 1/2 = 0.5 VP -> Vi Prob = 1/2 = 0.5 Vt -> saw Prob = 1/1 = 1.0 Vi -> laughed Prob = 1/1 = 1.0To take one example in detail, the NP rule probabilities can be calculated by keeping track of (or computing) the total number of NP rules (here equal to 3), and converting the frequency of each NP rule to a probability by dividing by that total. In this example, we see that for this corpus, whenever there's an NP, there is a 1/3 chance that it will be expanded to the symbol John and a 2/3 chance it will be expanded as the symbol Mary.
Let's express this more as an algorithm (what follows is only one of many different ways of doing it). Given a treebank:
For each tree in the treebank Get the context-free rules from the tree For each such context-free rule R Update the frequency of R Update the frequency of LHS(R) For every context-free rule R encountered in the treebank Let freq(R) be the frequency of R Let freq(LHS(R)) be the frequency of the LHS of R Let prob = freq(R)/freq(LHS(R)) Output: R probThus the end result of the algorithm is a probabilistic CFG, represented as a table of context-free rules and their probabilities. (LHS = left-hand-side, i.e. the nonterminal being expanded.)
( (S (NP (DT The) (NNP Fulton) (NNP County) (NNP Grand) (NNP Jury) ) (VP (VBD said) (NP (NNP Friday) ) (SBAR (-NONE- 0) (S (NP (DT an) (NN investigation) (PP (IN of) (NP (NP (NNP Atlanta) ) (POS 's) (JJ recent) (JJ primary) (NN election) ))) (VP (VBD produced) (NP (OQUOTE OQUOTE) (DT no) (NN evidence) (CQUOTE CQUOTE) (SBAR (IN that) (S (NP (DT any) (NNS irregularities) ) (VP (VBD took) (NP (NN place) )))))))))) (PERIOD PERIOD) ) ((S (NP (DT the) (JJ new) (NN management) ) (VP (VBZ takes) (NP (NN charge) )) (NP (NNP JanPERIOD) (CD 1) )) (PERIOD PERIOD) )You should produce a file called "pcfg.dat" that contains, on each line, a PCFG rule followed by its probability. The rules don't have to be in any particular order, and the probabilities can be represented either as rational numbers (e.g. 1/2) or floating-point numbers (e.g. 0.5), it doesn't matter which. For example, here are some rules from an output file, sorted so matching left-hand-sides of rules are grouped together:
(DT -> (A)) 54/245 (DT -> (AN)) 6/245 (DT -> (ANY)) 1/245 (DT -> (BOTH)) 3/245 (MD -> (MAY)) 1/35 (MD -> (MIGHT)) 1/35 (MD -> (MUST)) 3/35 (MD -> (SHOULD)) 6/35 (MD -> (WILL)) 2/5 (NNP -> (WILLIAM)) 2/251 (NNP -> (WILLIAMS)) 5/251 (NNP -> (WPERIOD)) 1/251 (NNS -> (ACTIONS)) 1/115 (NNS -> (ADJUSTMENTS)) 1/115 (NNS -> (ADMINISTRATORS)) 1/115 (NP -> (DT JJ NN)) 12/727 (NP -> (DT JJ NNP)) 1/727 (NP -> (DT NN PP)) 23/727 (NP -> (DT NN)) 82/727Notice the output format for CFG rules: a "rule" is a list containing three elements: