Prototype-Driven Grammar Induction

Aria Haghighi
Computer Science Division
University of California Berkeley
aria42@cs.berkeley.edu

Dan Klein
Computer Science Division
University of California Berkeley
klein@cs.berkeley.edu

Abstract

We investigate prototype-driven learning for primarily unsupervised grammar induction. Prior knowledge is specified declaratively, by providing a few canonical examples of each target phrase type. This sparse prototype information is then propagated across a corpus using distributional similarity features, which augment an otherwise standard PCFG model. We show that distributional features are effective at distinguishing bracket labels, but not at determining bracket locations. To improve the quality of the induced trees, we combine our PCFG induction with the CCM model of Klein and Manning (2002), which has complementary strengths: it identifies brackets but does not label them. Using only a handful of prototypes, we show substantial improvements over naive PCFG induction for English and Chinese grammar induction.

1 Introduction

There has been a great deal of work on unsupervised grammar induction, with motivations ranging from scientific interest in language acquisition to engineering interest in parser construction (Carroll and Charniak, 1992; Clark, 2001). Recent work has successfully induced unlabeled grammatical structure, but has not successfully learned labeled tree structure (Klein and Manning, 2002; Klein and Manning, 2004; Smith and Eisner, 2004).

In this paper, our goal is to build a system capable of producing labeled parses in a target grammar with as little total effort as possible. We investigate a prototype-driven approach to grammar induction, in which one supplies canonical examples of each target concept. For example, we might specify that we are interested in trees which use the symbol NP and then list several examples of prototypical NPs (determiner noun, pronouns, etc.; see figure 1 for a sample prototype list). This prototype information is similar to specifying an annotation scheme, which even human annotators must be provided with before they can begin the construction of a treebank. In principle, prototype-driven learning is just a kind of semi-supervised learning. However, in practice, the information we provide is on the order of dozens of total seed instances, instead of a handful of fully parsed trees, and is of a different nature.

The prototype-driven approach has three strengths. First, since we provide a set of target symbols, we can evaluate induced trees using standard labeled parsing metrics, rather than the far more forgiving unlabeled metrics described in, for example, Klein and Manning (2004). Second, knowledge is declaratively specified in an interpretable way (see figure 1). If a user of the system is unhappy with its systematic behavior, they can alter it by altering the prototype information (see section 7.1 for examples). Third, and related to the first two, one does not confuse the ability of the system to learn a consistent grammar with its ability to learn the grammar a user has in mind.

In this paper, we present a series of experiments in the induction of labeled context-free trees using a combination of unlabeled data and sparse prototypes. We first affirm the well-known result that simple, unconstrained PCFG induction produces grammars of poor quality as measured against treebank structures.
We then augment a PCFG with prototype features, and show that these features, when propagated to non-prototype sequences using distributional similarity, are effective at learning bracket labels on fixed unlabeled trees, but are still not enough to learn good tree structures without bracketing information. Finally, we intersect the feature-augmented PCFG with the CCM model of Klein and Manning (2002), a high-quality bracketing model. The intersected model is able to learn trees with higher unlabeled F1 than those in Klein and Manning (2004). More importantly, its trees are labeled and can be evaluated according to labeled metrics. Against the English Penn Treebank, our final trees achieve a labeled F1 of 65.1 on short sentences, a 51.7% error reduction over naive PCFG induction.

2 Experimental Setup

The majority of our experiments induced tree structures from the WSJ section of the English Penn treebank (Marcus et al., 1994), though see section 7.4 for an experiment on Chinese. To facilitate comparison with previous work, we extracted WSJ-10, the 7,422 sentences which contain 10 or fewer words after the removal of punctuation and null elements according to the scheme detailed in Klein (2005). We learned models on all or part of this data and compared their predictions to the manually annotated treebank trees for the sentences on which the model was trained. As in previous work, we begin with the part-of-speech (POS) tag sequences for each sentence rather than lexical sequences (Carroll and Charniak, 1992; Klein and Manning, 2002).

Following Klein and Manning (2004), we report unlabeled bracket precision, recall, and F1. Note that according to their metric, brackets of size 1 are omitted from the evaluation. Unlike that work, all of our induction methods produce trees labeled with symbols which are identified with treebank categories. Therefore, we also report labeled precision, recall, and F1, still ignoring brackets of size 1.[1]

[1] In cases where multiple gold labels exist for a span in the gold trees, precision and recall were calculated as in Collins (1999).

3 Experiments in PCFG induction

As an initial experiment, we used the inside-outside algorithm to induce a PCFG in the straightforward way (Lari and Young, 1990; Manning and Schütze, 1999). For all the experiments in this paper, we considered binary PCFGs over the nonterminals and terminals occurring in WSJ-10. The PCFG rules were of the following forms:

· X → Y Z, for nonterminal types X, Y, and Z, with Y ≠ X or Z ≠ X
· X → t Y and X → Y t, for each terminal t
· X → t t′, for terminals t and t′

For a given sentence S, our CFG generates labeled trees T over S.[2] Each tree consists of binary productions X → α over constituent spans (i, j), where α is a pair of non-terminal and/or terminal symbols in the grammar. The generative probability of a tree T for S is:

    P_{CFG}(T, S) = \prod_{(i,j) \in T} P(\alpha | X)

[2] Restricting our CFG to a binary branching grammar results in an upper bound of 88.1% on unlabeled F1.

In the inside-outside algorithm, we iteratively compute posterior expectations over production occurrences at each training span, then use those expectations to re-estimate production probabilities.
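To make the generative model concrete, the following minimal sketch (an illustration only, not code from this paper; the tree encoding, grammar, and probabilities are hypothetical) scores a labeled binary tree as the product of its production probabilities, exactly as in the equation above.

```python
# Minimal sketch (not from the paper): score a labeled binary tree as the
# product over spans of P(alpha | X), where alpha is the pair of child symbols.
from typing import Dict, Tuple, Union

Tree = Union[str, Tuple[str, "Tree", "Tree"]]   # a POS tag, or (label, left, right)
RuleProbs = Dict[Tuple[str, Tuple[str, str]], float]

def symbol(t: Tree) -> str:
    """Root symbol of a subtree: the tag itself for terminals, else the label."""
    return t if isinstance(t, str) else t[0]

def tree_prob(t: Tree, rules: RuleProbs) -> float:
    """P_CFG(T, S): product of P(alpha | X) over all binary productions in T."""
    if isinstance(t, str):                       # POS tags generate no production
        return 1.0
    label, left, right = t
    alpha = (symbol(left), symbol(right))
    return rules[(label, alpha)] * tree_prob(left, rules) * tree_prob(right, rules)

# Hypothetical toy grammar and tree over the tag sequence "NN NNS VBD IN NN".
rules: RuleProbs = {
    ("S", ("NP", "VP")): 0.7,
    ("NP", ("NN", "NNS")): 0.2,
    ("VP", ("VBD", "PP")): 0.3,
    ("PP", ("IN", "NN")): 0.5,
}
tree: Tree = ("S", ("NP", "NN", "NNS"), ("VP", "VBD", ("PP", "IN", "NN")))
print(tree_prob(tree, rules))                    # 0.7 * 0.2 * 0.3 * 0.5 = 0.021
```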
This process is guaranteed to converge to a local extremum of the data likelihood, but initial production probability estimates greatly influence the final grammar (Carroll and Charniak, 1992). In particular, uniform initial estimates are an (unstable) fixed point. The classic approach is to add a small amount of random noise to the initial probabilities in order to break the symmetry between grammar symbols. We randomly initialized 5 grammars using treebank non-terminals and trained each to convergence on the first 2000 sentences of WSJ-10. Viterbi parses were extracted for each of these 2000 sentences according to each grammar. Of course, the parses' symbols have nothing to anchor them to our intended treebank symbols. That is, an NP in one of these grammars may correspond to the target symbol VP, or may not correspond well to any target symbol.

To evaluate these learned grammars, we must map the models' phrase types to target phrase types. For each grammar, we followed the common approach of greedily mapping model symbols to target symbols in the way which maximizes the labeled F1. Note that this can, and does, result in mapping multiple model symbols to the most frequent target symbols. This experiment, labeled PCFG × NONE in figure 4, resulted in an average labeled F1 of 26.3 and an unlabeled F1 of 45.7. The unlabeled F1 is better than randomly choosing a tree (34.7), but not better than always choosing a right-branching structure (61.7).

Klein and Manning (2002) suggest that the task of labeling constituents is significantly easier than identifying them. Perhaps it is too much to ask a PCFG induction algorithm to perform both of these tasks simultaneously. Along the lines of Pereira and Schabes (1992), we reran the inside-outside algorithm, but this time placed zero mass on all trees which did not respect the bracketing of the gold trees. This constraint does not fully eliminate the structural uncertainty, since we are inducing binary trees and the gold trees are flatter than binary in many cases. This approach of course achieved the upper bound on unlabeled F1, because of the gold bracket constraints. However, it only resulted in an average labeled F1 of 52.6 (experiment PCFG × GOLD in figure 4). While this labeled score is an improvement over the PCFG × NONE experiment, it is still relatively disappointing.

Phrase   Prototypes
NP       DT NN; JJ NNS; NNP NNP
VP       VBN IN NN; VBD DT NN; MD VB CD
S        PRP VBD DT NN; DT NN VBD IN DT NN; DT VBZ DT JJ NN
PP       IN NN; TO CD CD; IN PRP
QP       CD CD; RB CD; DT CD CD
ADJP     RB JJ; JJ JJ; JJ CC JJ
ADVP     RB RB; RB CD; RB CC RB

VP-INF   VB NN
NP-POS   NN POS

Figure 1: English phrase type prototype list, manually specified (the entire supervision for our system). The second part of the table gives the additional prototypes discussed in section 7.1.

3.1 Encoding Prior Knowledge with Prototypes

Clearly, we need to do something more than adding structural bias (e.g. bracketing information) if we are to learn a PCFG in which the symbols have the meaning and behavior we intend. How might we encode information about our prior knowledge or intentions? Providing labeled trees is clearly an option. This approach tells the learner how symbols should recursively relate to each other. Another option is to provide fully linearized yields as prototypes.
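For concreteness, the supervision in figure 1 amounts to nothing more than a small mapping from target phrase types to a handful of POS yields. The following is a sketch of that data (not of any code used in this work), with only some of the entries written out:

```python
# The prototype list of figure 1 as a mapping from target phrase types to
# their few canonical POS yields (a sketch of the data, not the authors' code).
PROTOTYPES = {
    "NP": [("DT", "NN"), ("JJ", "NNS"), ("NNP", "NNP")],
    "VP": [("VBN", "IN", "NN"), ("VBD", "DT", "NN"), ("MD", "VB", "CD")],
    "S":  [("PRP", "VBD", "DT", "NN"),
           ("DT", "NN", "VBD", "IN", "DT", "NN"),
           ("DT", "VBZ", "DT", "JJ", "NN")],
    "PP": [("IN", "NN"), ("TO", "CD", "CD"), ("IN", "PRP")],
    "QP": [("CD", "CD"), ("RB", "CD"), ("DT", "CD", "CD")],
    # ... ADJP and ADVP entries as in figure 1
}
```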
We take this approach here, manually creating a list of POS sequences typical of the 7 most frequent categories in the Penn Treebank (see figure 1).[3] Our grammar is limited to these 7 phrase types plus an additional type which has no prototypes and is unconstrained.[4] This list grounds each symbol in terms of an observable portion of the data, rather than attempting to relate unknown symbols to other unknown symbols.

[3] A possible objection to this approach is the introduction of improper researcher bias via specifying prototypes. See section 7.3 for an experiment utilizing an automatically generated prototype list with comparable results.
[4] In our experiments we found that adding prototypes for more categories did not improve performance and took more time. We note that we still evaluate against all phrase types regardless of whether or not they are modeled by our grammar.

Broadly, we would like to learn a grammar which explains the observed data (EM's objective) but also meets our prior expectations or requirements of the target grammar. How might we use such a list to constrain the learning of a PCFG with the inside-outside algorithm? We might require that all occurrences of a prototype sequence, say DT NN, be constituents of the corresponding type (NP). However, human-elicited prototypes are not likely to have the property that, when they occur, they are (nearly) always constituents. For example, DT NN is a perfectly reasonable example of a noun phrase, but it is not a constituent when it is part of a longer DT NN NN constituent. Therefore, when summing over trees with the inside-outside algorithm, we could require a weaker property: whenever a prototype sequence is a constituent, it must be given the label specified in the prototype file.[5] This constraint is enough to break the symmetry between the model labels, and therefore requires neither random initialization for training, nor post-hoc mapping of labels for evaluation. Adding prototypes in this way and keeping the gold bracket constraint gave 59.9 labeled F1. The labeled F1 is again an improvement over naive PCFG induction, but is perhaps less than we might expect given that the model has been given bracketing information and has prototypes as a form of supervision to direct it.

[5] Even this property is likely too strong: prototypes may have multiple possible labels; for example, DT NN may also be a QP in the English treebank.

In response to a prototype, however, we may wish to conclude something stronger than a constraint on that particular POS sequence. We might hope that sequences which are similar to a prototype in some sense are generally given the same label as that prototype. For example, since DT NN is a noun phrase prototype, the sequence DT JJ NN is another good candidate for being a noun phrase. This kind of propagation of constraints requires that we have a good way of defining and detecting similarity between POS sequences.
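As an illustration of the weaker constraint just described, a chart-based E-step could restrict which labels are allowed on a span whose yield exactly matches a prototype. The sketch below is hypothetical (helper names and data structures are ours), but captures the idea:

```python
# Sketch (ours) of the weaker prototype constraint: when a span's yield exactly
# matches a prototype of some type X, only the matching label(s) may be placed
# on that span if it is built as a constituent; other spans are unrestricted.
from typing import Dict, List, Sequence, Set, Tuple

def allowed_labels(yield_tags: Sequence[str],
                   prototypes: Dict[str, List[Tuple[str, ...]]],
                   all_labels: Set[str]) -> Set[str]:
    matches = {x for x, yields in prototypes.items() if tuple(yield_tags) in yields}
    return matches if matches else all_labels

# During the E-step, a chart cell for span (i, j) would only be populated with
# labels in allowed_labels(tags[i:j], PROTOTYPES, all_labels); this alone is
# enough to break the symmetry between the grammar's nonterminal symbols.
```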
3.2 Phrasal Distributional Similarity

A central linguistic argument for constituent types is substitutability: phrases of the same type appear in similar contexts and are mutually substitutable (Harris, 1954; Radford, 1988). For instance, DT JJ NN and DT NN occur in similar contexts, and are indeed both common NPs. This idea has been repeatedly and successfully operationalized using various kinds of distributional clustering, where we define a similarity measure between two items on the basis of their immediate left and right contexts (Schütze, 1995; Clark, 2000; Klein and Manning, 2002).

As in Clark (2001), we characterize the distribution of a sequence by the distribution of POS tags occurring to the left and right of that sequence in a corpus. Each occurrence of a POS sequence \alpha falls in a context x \alpha y, where x and y are the adjacent tags. The distribution over contexts x - y for a given \alpha is called its signature, and is denoted by \sigma(\alpha). Note that \sigma(\alpha) is composed of context counts from all occurrences, constituent and distituent, of \alpha. Let \sigma_c(\alpha) denote the context distribution for \alpha where the context counts are taken only from constituent occurrences of \alpha. For each phrase type X in our grammar, define \sigma_c(X) to be the context distribution obtained from the counts of all constituent occurrences of type X:

    \sigma_c(X) = E_{p(\alpha|X)} \sigma_c(\alpha)    (1)

where p(\alpha|X) is the distribution of yield types for phrase type X. We compare context distributions using the skewed KL divergence:

    D_{SKL}(p, q) = D_{KL}(p || \lambda p + (1 - \lambda) q)

where \lambda controls how much of the source distribution is mixed in with the target distribution.

A reasonable baseline rule for classifying the phrase type of a POS yield is to assign it to the phrase type from which it has minimal divergence:

    type(\alpha) = \arg\min_X D_{SKL}(\sigma_c(\alpha), \sigma_c(X))    (2)

However, this rule is not always accurate, and, moreover, we do not have access to \sigma_c(\alpha) or \sigma_c(X). We chose to approximate \sigma_c(X) using the prototype yields for X as samples from p(\alpha|X). Letting proto(X) denote the (few) prototype yields for phrase type X, we define \tilde{\sigma}(X):

    \tilde{\sigma}(X) = \frac{1}{|proto(X)|} \sum_{\alpha \in proto(X)} \sigma(\alpha)

Note that \tilde{\sigma}(X) is an approximation to (1) in several ways. We have replaced an expectation over p(\alpha|X) with a uniform weighting of proto(X), and we have replaced \sigma_c(\alpha) with \sigma(\alpha) for each term in that expectation. Because of this, we will rely only on high confidence guesses, and allow yields to be given a NONE type if their divergence from each \tilde{\sigma}(X) exceeds a fixed threshold t. This gives the following alternative to (2):

    type(\alpha) = NONE, if \min_X D_{SKL}(\sigma(\alpha), \tilde{\sigma}(X)) > t    (3)
    type(\alpha) = \arg\min_X D_{SKL}(\sigma(\alpha), \tilde{\sigma}(X)), otherwise

We built a distributional model implementing the rule in (3) by constructing \sigma(\alpha) from context counts in the WSJ portion of the Penn Treebank as well as the BLLIP corpus. Each \tilde{\sigma}(X) was approximated by a uniform mixture of \sigma(\alpha) for each of X's prototypes \alpha listed in figure 1. This method of classifying constituents is very precise if the threshold is chosen conservatively enough. For instance, using a threshold of t = 0.75 and \lambda = 0.1, this rule correctly classifies the majority label of a constituent type with 83% precision, and has a recall of 23% over constituent types. Figure 2 illustrates some sample yields, the prototype sequence to which each is least divergent, and the output of rule (3).

Yield                Prototype         Skew KL   Phrase type   Skew KL
DT JJ NN             DT NN             0.10      NP            0.39
IN DT VBG NN         IN NN             0.24      PP            0.45
DT NN MD VB DT NNS   PRP VBD DT NN     0.54      S             0.58
CC NN                IN NN             0.43      PP            0.71
MD NNS               PRP VBD DT NN     1.43      NONE          -

Figure 2: Yields along with their most similar prototypes and phrase types, guessed according to (3).
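The classification rule in (3) is straightforward to operationalize. The sketch below (hypothetical helper names; not the system's actual code) builds signatures as context distributions, compares them with the skewed KL divergence, and falls back to NONE when even the best divergence exceeds the threshold t.

```python
# Sketch (ours) of the classification rule in (3): signatures are distributions
# over left/right tag contexts, compared with the skewed KL divergence.
import math
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple

Signature = Dict[Tuple[str, str], float]        # context (x, y) -> probability

def signature(context_counts: Counter) -> Signature:
    total = sum(context_counts.values())
    return {ctx: n / total for ctx, n in context_counts.items()}

def skewed_kl(p: Signature, q: Signature, lam: float = 0.1) -> float:
    """D_SKL(p, q) = KL(p || lam*p + (1 - lam)*q)."""
    return sum(p[c] * math.log(p[c] / (lam * p[c] + (1 - lam) * q.get(c, 0.0)))
               for c in p)

def proto_signature(protos: Sequence[Tuple[str, ...]],
                    sig_of: Dict[Tuple[str, ...], Signature]) -> Signature:
    """sigma~(X): uniform mixture of the prototype yields' signatures."""
    mix: Dict[Tuple[str, str], float] = {}
    for alpha in protos:
        for ctx, p in sig_of[alpha].items():
            mix[ctx] = mix.get(ctx, 0.0) + p / len(protos)
    return mix

def classify(sig_alpha: Signature,
             type_sigs: Dict[str, Signature],
             t: float = 0.75) -> Optional[str]:
    """Rule (3): return None (i.e. NONE) if the best divergence exceeds t."""
    best = min(type_sigs, key=lambda x: skewed_kl(sig_alpha, type_sigs[x]))
    return None if skewed_kl(sig_alpha, type_sigs[best]) > t else best
```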
We incorporated this distributional information into our PCFG induction scheme by adding a prototype feature over each span (i, j) indicating the output of (3) for the yield in that span. Associated with each sentence S is a feature map F specifying, for each (i, j), a prototype feature p_{ij}. These features are generated using an augmented CFG model, CFG+, given by:[6]

    P_{CFG+}(T, F) = \prod_{(i,j) \in T} P(p_{ij} | X) P(\alpha | X) = \prod_{(i,j) \in T} \phi_{CFG+}(X \to \alpha, p_{ij})

where \phi_{CFG+}(X \to \alpha, p_{ij}) is the local factor for placing X \to \alpha on a span with prototype feature p_{ij}. An example is given in figure 3.

[6] Technically, all features in F must be generated for each assignment to T, which means that there should be terms in this equation for the prototype features on distituent spans. However, we fixed the prototype distribution to be uniform for distituent spans, so that the equation is correct up to a constant depending on F.

Figure 3 (tree diagram not reproduced): Illustration of the PCFG augmented with prototype similarity features. The example parses "Factory payrolls fell in November" with productions ROOT → S, S → NP VP, NP → NN NNS, VP → VBD PP, and PP → IN NN, and each span additionally scores a prototype-feature term such as P(p = NP | NP), P(p = VP | VP), P(p = PP | PP), or P(p = NONE | S).

For our experiments, we fixed P(p_{ij} | X) to be:

    P(p_{ij} | X) = 0.60        if p_{ij} = X
                  = uniform     otherwise

Modifying the model in this way, and keeping the gold bracketing information, gave 71.1 labeled F1 (see experiment PROTO × GOLD in figure 4), a 40.3% error reduction over naive PCFG induction in the presence of gold bracketing information. We note that our labeled F1 is upper-bounded by 86.0 due to unary chains and more-than-binary configurations in the treebank that cannot be obtained from our binary grammar. We conclude that in the presence of gold bracket information, we can achieve high labeled accuracy by using a CFG augmented with distributional prototype features.
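For illustration, the local factor of the augmented model is just the rule probability multiplied by the fixed feature distribution above. A minimal sketch (ours): how the remaining 0.40 of mass is spread over non-matching feature values is an assumption here, since the text only specifies "uniform, otherwise".

```python
# Sketch (ours) of the augmented local factor phi_CFG+(X -> alpha, p_ij):
# the rule probability multiplied by P(p_ij | X), which is fixed, not learned.
from typing import Optional

N_FEATURE_VALUES = 9   # assumption: the 8 grammar phrase types plus NONE

def p_feature_given_label(p_ij: Optional[str], X: str) -> float:
    """P(p_ij | X) = 0.60 if the span's prototype feature equals X, otherwise the
    remaining 0.40 spread uniformly (our reading of 'uniform, otherwise')."""
    return 0.60 if p_ij == X else 0.40 / (N_FEATURE_VALUES - 1)

def phi_cfg_plus(rule_prob: float, X: str, p_ij: Optional[str]) -> float:
    """phi_CFG+(X -> alpha, p_ij) = P(p_ij | X) * P(alpha | X)."""
    return p_feature_given_label(p_ij, X) * rule_prob
```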
4 Constituent Context Model

So far, we have shown that, given perfect bracketing information, distributional prototype features allow us to learn tree structures with fairly accurate labels. However, such bracketing information is not available in the unsupervised case. Perhaps we don't actually need bracketing constraints in the presence of prototypes and distributional similarity features. However, this experiment, labeled PROTO × NONE in figure 4, gave only 53.1 labeled F1 (61.1 unlabeled), suggesting that some amount of bracketing constraint is necessary to achieve high performance.

Fortunately, there are unsupervised systems which can induce unlabeled bracketings with reasonably high accuracy. One such model is the constituent-context model (CCM) of Klein and Manning (2002), a generative distributional model. For a given sentence S, the CCM generates a bracket matrix B, which for each span (i, j) indicates whether or not it is a constituent (B_{ij} = c) or a distituent (B_{ij} = d). In addition, it generates a feature map F', which for each span (i, j) in S specifies a pair of features, F'_{ij} = (y_{ij}, c_{ij}), where y_{ij} is the POS yield of the span and c_{ij} is the context of the span, i.e. the identity of the conjoined left and right POS tags:

    P_{CCM}(B, F') = P(B) \prod_{(i,j)} P(y_{ij} | B_{ij}) P(c_{ij} | B_{ij})

The distribution P(B) only places mass on bracketings which correspond to binary trees. We can efficiently compute P_{CCM}(B, F') (up to a constant depending on F') using local factors \phi_{CCM}(y_{ij}, c_{ij}) which decompose over constituent spans:[7]

    P_{CCM}(B, F') \propto \prod_{(i,j): B_{ij}=c} \frac{P(y_{ij} | c) P(c_{ij} | c)}{P(y_{ij} | d) P(c_{ij} | d)} = \prod_{(i,j): B_{ij}=c} \phi_{CCM}(y_{ij}, c_{ij})

[7] Klein (2005) gives a full presentation.

The CCM by itself yields an unlabeled F1 of 71.9 on WSJ-10, which is reasonably high, but it does not produce labeled trees.
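A sketch of the CCM local factor follows (ours, with hypothetical smoothing; the constituent and distituent yield/context models are assumed to be given as dictionaries):

```python
# Sketch (ours) of the CCM local factor phi_CCM(y, c): the odds of generating a
# span's yield y and context c as a constituent rather than as a distituent.
from typing import Dict, Tuple

def phi_ccm(y: Tuple[str, ...], c: Tuple[str, str],
            p_yield: Dict[bool, Dict[Tuple[str, ...], float]],
            p_context: Dict[bool, Dict[Tuple[str, str], float]]) -> float:
    """p_yield[True]/p_context[True] are the constituent models P(y|c), P(c|c);
    p_yield[False]/p_context[False] are the distituent models P(y|d), P(c|d).
    The 1e-10 floor is an assumed smoothing choice, not taken from the paper."""
    num = p_yield[True].get(y, 1e-10) * p_context[True].get(c, 1e-10)
    den = p_yield[False].get(y, 1e-10) * p_context[False].get(c, 1e-10)
    return num / den

# Up to a constant that does not depend on the bracketing, P_CCM(B, F') is the
# product of phi_ccm over the spans marked as constituents in B.
```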
5 Intersecting CCM and PCFG

The CCM and PCFG models provide complementary views of syntactic structure. The CCM explicitly learns the non-recursive contextual and yield properties of constituents and distituents. The PCFG model, on the other hand, does not explicitly model properties of distituents but instead focuses on modeling the hierarchical and recursive properties of natural language syntax. One would hope that modeling both of these aspects simultaneously would improve the overall quality of our induced grammar. We therefore combine the CCM with our feature-augmented PCFG, denoted by PROTO in experiment names.

When we run EM on either of the models alone, at each iteration and for each training example, we calculate posteriors over that model's latent variables. For the CCM, the latent variable is a bracketing matrix B (equivalent to an unlabeled binary tree), while for the CFG+ the latent variable is a labeled tree T. While these latent variables aren't exactly the same, there is a close relationship between them. A bracketing matrix constrains possible labeled trees, and a given labeled tree determines a bracketing matrix. One way to combine these models is to encourage both models to prefer latent variables which are compatible with each other. Similar to the approach of Klein and Manning (2004) on a different model pair, we intersect the CCM and CFG+ by multiplying their scores for any labeled tree. For each possible labeled tree T over a sentence S, our generative model is given as follows:

    P(T, F, F') = P_{CFG+}(T, F) P_{CCM}(B(T), F')    (4)

where B(T) corresponds to the bracketing matrix determined by T. The EM algorithm for the product model will maximize:

    P(S, F, F') = \sum_{T \in T(S)} P_{CCM}(B(T), F') P_{CFG+}(T, F)
                = \sum_{B} P_{CCM}(B, F') \sum_{T \in T(B, S)} P_{CFG+}(T, F)

where T(S) is the set of labeled trees consistent with the sentence S and T(B, S) is the set of labeled trees consistent with the bracketing matrix B and the sentence S. Notice that this quantity increases as the CCM and CFG+ models place probability mass on compatible latent structures, giving an intuitive justification for the success of this approach.

We can compute posterior expectations over (B, T) in the combined model (4) using a variant of the inside-outside algorithm. The local factor for a binary rule r = X → Y Z over span (i, j), with CCM features F'_{ij} = (y_{ij}, c_{ij}) and prototype feature p_{ij}, is given by the product of the local factors for the CCM and CFG+ models:

    \phi(r, (i, j)) = \phi_{CCM}(y_{ij}, c_{ij}) \phi_{CFG+}(r, p_{ij})

From these local factors, the inside-outside algorithm produces expected counts for each binary rule r, over each span (i, j) and split point k, denoted by P(r, (i, j), k | S, F, F'). These posteriors are sufficient to re-estimate all of our model parameters.
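A simplified sketch of the resulting inside pass follows (ours, not the authors' implementation): the callables rule_prob, feat_prob, and ccm_factor stand in for P(α|X), P(p_ij|X), and φ_CCM(y_ij, c_ij) respectively; outside scores and expected rule counts would be computed analogously.

```python
# Sketch (ours, simplified): inside scores for the intersected model.  Each
# binary production over a span contributes rule_prob * feat_prob * ccm_factor,
# multiplied by the inside scores of its two children.
from typing import Callable, Dict, List, Tuple

def inside(tags: List[str],
           labels: List[str],
           rule_prob: Callable[[str, str, str], float],   # P(Y Z | X), i.e. P(alpha | X)
           feat_prob: Callable[[str, int, int], float],   # P(p_ij | X) for span (i, j)
           ccm_factor: Callable[[int, int], float]        # phi_CCM for span (i, j)
           ) -> Dict[Tuple[int, int, str], float]:
    """Inside scores score[(i, j, X)]; length-1 spans are just the POS tags."""
    n = len(tags)
    score: Dict[Tuple[int, int, str], float] = {}

    def child_syms(i: int, j: int) -> List[Tuple[str, float]]:
        if j - i == 1:                                   # a single tag
            return [(tags[i], 1.0)]
        return [(x, score.get((i, j, x), 0.0)) for x in labels]

    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for x in labels:
                total = 0.0
                for k in range(i + 1, j):
                    for c1, s1 in child_syms(i, k):
                        if s1 == 0.0:
                            continue
                        for c2, s2 in child_syms(k, j):
                            if s2 > 0.0:
                                total += rule_prob(x, c1, c2) * s1 * s2
                score[(i, j, x)] = total * feat_prob(x, i, j) * ccm_factor(i, j)
    return score
```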
Setting          Labeled Prec.  Labeled Rec.  Labeled F1  Unlabeled Prec.  Unlabeled Rec.  Unlabeled F1
No Brackets
  PCFG × NONE        23.9          29.1         26.3          40.7            52.1            45.7
  PROTO × NONE       51.8          62.9         56.8          59.6            76.2            66.9
Gold Brackets
  PCFG × GOLD        47.0          57.2         51.6          78.8           100.0            88.1
  PROTO × GOLD       64.8          78.7         71.1          78.8           100.0            88.1
CCM Brackets
  CCM                  -             -            -           64.2            81.6            71.9
  PCFG × CCM         32.3          38.9         35.3          64.1            81.4            71.8
  PROTO × CCM        56.9          68.5         62.2          68.4            86.9            76.5
  BEST               59.4          72.1         65.1          69.7            89.1            78.2
  UBOUND             78.8          94.7         86.0          78.8           100.0            88.1

Figure 4: English grammar induction results. The upper bound on labeled recall is due to unary chains.

6 CCM as a Bracketer

We tested the product model described in section 5 on WSJ-10 under the same conditions as in section 3. Our initial experiment utilizes no prototype information, random initialization, and greedy remapping of its labels. This experiment, PCFG × CCM in figure 4, gave 35.3 labeled F1, compared to the 51.6 labeled F1 with gold bracketing information (PCFG × GOLD in figure 4).

Next we added the manually specified prototypes in figure 1, and constrained the model to give these yields their labels if chosen as constituents. This experiment gave 48.9 labeled F1 (73.3 unlabeled). The error reduction is 21.0% labeled (5.3% unlabeled) over PCFG × CCM. We then experimented with adding distributional prototype features as discussed in section 3.2, using a threshold of 0.75 and \lambda = 0.1. This experiment, PROTO × CCM in figure 4, gave 62.2 labeled F1 (76.5 unlabeled). The error reduction is 26.0% labeled (12.0% unlabeled) over the experiment using prototypes without the similarity features. The overall error reduction from PCFG × CCM is 41.6% (16.7%) in labeled (unlabeled) F1.

7 Error Analysis

The most common type of error by our PROTO × CCM system was due to the binary grammar restriction. For instance, common NPs such as DT JJ NN are analyzed as [NP DT [NP JJ NN]], which proposes additional N constituents compared to the flatter treebank analysis. This discrepancy greatly, and perhaps unfairly, damages NP precision (see figure 6). However, this error is unavoidable given our grammar restriction.

Label   Prec.   Rec.   F1
S       79.3    80.0   79.7
NP      49.0    74.4   59.1
VP      80.4    73.3   76.7
PP      45.6    78.6   57.8
QP      36.2    78.8   49.6
ADJP    29.4    33.3   31.2
ADVP    25.0    12.2   16.4

Figure 6: Precision, recall, and F1 for individual phrase types in the BEST model.

Figure 5(b) demonstrates three other errors. Possessive NPs are analyzed as [NP NN [PP POS NN]], with the POS element treated as a preposition and the possessed NP as its complement. While labeling the POS NN as a PP is clearly incorrect, placing a constituent over these elements is not unreasonable and in fact has been proposed by some linguists (Abney, 1987). Another type of error, also reported by Klein and Manning (2002), is MD VB groupings in infinitival VPs, which are also sometimes argued for by linguists (Halliday, 2004). More seriously, prepositional phrases are almost always attached "high" to the verb for longer NPs.

Figure 5 (tree diagrams not reproduced): Examples of corrections from adding VP-INF and NP-POS prototype categories, on the sentence "France can boast the lion's share of high-priced bottles". The tree in (a) is the treebank parse, (b) is the parse with the PROTO × CCM model, and (c) is the parse with the BEST model (added prototype categories), which fixes the possessive NP and infinitival VP problems, but not the PP attachment.

7.1 Augmenting Prototypes

One of the advantages of the prototype-driven approach, over a fully unsupervised approach, is the ability to refine or add to the annotation specification if we are not happy with the output of our system. We demonstrate this flexibility by augmenting the prototypes in figure 1 with two new categories, NP-POS and VP-INF, meant to model possessive noun phrases and infinitival verb phrases, which tend to have slightly different distributional properties from normal NPs and VPs. These new sub-categories are used during training and then stripped in post-processing. This prototype list gave 65.1 labeled F1 (78.2 unlabeled). This experiment is labeled BEST in figure 4. Looking at the CFG rules learned by the BEST model (figure 7), we see that the basic structure of the treebank grammar is captured.

Rule                Probability      Rule                Probability
S → NP VP               0.51         VP → VBZ NP             0.20
S → PRP VP              0.13         VP → VBD NP             0.15
S → NNP VP              0.06         VP → VBP NP             0.09
S → NNS VP              0.05         VP → VB NP              0.08
NP → DT NN              0.12         ROOT → S                0.95
NP → NP PP              0.09         ROOT → NP               0.05
NP → NNP NNP            0.09         QP → CD CD              0.35
NP → JJ NN              0.07         QP → CD NN              0.30
PP → IN NP              0.37         QP → QP PP              0.10
PP → CC NP              0.06         QP → QP NNS             0.05
PP → TO VP              0.05         ADVP → RB RB            0.25
PP → TO QP              0.04         ADVP → ADJP PRP         0.15
ADJP → RB VBN           0.37         ADVP → RB CD            0.10
ADJP → RB JJ            0.31
ADJP → RBR JJ           0.09

Figure 7: Top PCFG rules learned by the BEST model.

7.2 Parsing with only the PCFG

In order to judge how well the PCFG component of our model did in isolation, we experimented with training our BEST model with the CCM component, but dropping it at test time. This experiment gave 65.1 labeled F1 (76.8 unlabeled). This demonstrates that while our PCFG performance degrades without the CCM, it can be used on its own with reasonable accuracy.

7.3 Automatically Generated Prototypes

There are two types of bias which enter into the creation of prototype lists. One of them is the bias to choose examples which reflect the annotation semantics we wish our model to have. The second is the iterative change of prototypes in order to maximize F1. Whereas the first is appropriate, indeed the point, the latter is not. In order to guard against the second type of bias, we experimented with automatically extracted prototype lists, which would not be possible without labeled data. For each phrase type category, we extracted the three most common yields associated with that category that differed in either their first or last POS tag. Repeating our PROTO × CCM experiment with this list yielded 60.9 labeled F1 (76.5 unlabeled), comparable to the performance of our manual prototype list.
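One reading of this selection criterion can be sketched as follows (ours; in particular, the requirement that each newly chosen yield differ from those already chosen in its first or last tag is our interpretation of the sentence above):

```python
# Sketch (ours) of automatic prototype extraction: for each gold phrase type,
# take the three most frequent constituent yields, skipping a yield unless it
# differs from every already-chosen yield in its first or last POS tag.
from collections import Counter
from typing import Dict, List, Tuple

def auto_prototypes(gold_constituents: List[Tuple[str, Tuple[str, ...]]],
                    k: int = 3) -> Dict[str, List[Tuple[str, ...]]]:
    by_type: Dict[str, Counter] = {}
    for label, yield_tags in gold_constituents:
        by_type.setdefault(label, Counter())[yield_tags] += 1
    protos: Dict[str, List[Tuple[str, ...]]] = {}
    for label, counts in by_type.items():
        chosen: List[Tuple[str, ...]] = []
        for yield_tags, _ in counts.most_common():
            if all(yield_tags[0] != c[0] or yield_tags[-1] != c[-1] for c in chosen):
                chosen.append(yield_tags)
            if len(chosen) == k:
                break
        protos[label] = chosen
    return protos
```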
7.4 Chinese Grammar Induction

In order to demonstrate that our system is somewhat language independent, we tested our model on CTB-10, the 2,437 sentences of the Chinese Treebank (Xue, 2002) of length at most 10 after punctuation is stripped. Since the authors have no expertise in Chinese, we automatically extracted prototypes in the same way described in section 7.3. Since we did not have access to a large auxiliary POS-tagged Chinese corpus, our distributional model was built only from the treebank text, and the distributional similarities are presumably degraded relative to the English. Our PCFG × CCM experiment gave 18.0 labeled F1 (43.4 unlabeled). The PROTO × CCM model gave 39.0 labeled F1 (53.2 unlabeled). Presumably with access to more POS-tagged data, and the expertise of a Chinese speaker, our system would see increased performance. It is worth noting that our unlabeled F1 of 53.2 is the best reported from a primarily unsupervised system, with the next highest figure being 46.7, reported by Klein and Manning (2004).

8 Conclusion

We have shown that distributional prototype features can allow one to specify a target labeling scheme in a compact and declarative way. These features give substantial error reduction in labeled F1 measure for English and Chinese grammar induction. They also achieve the best reported unlabeled F1 measures.[8] Another positive property of this approach is that it tries to reconcile the success of distributional clustering approaches to grammar induction (Clark, 2001; Klein and Manning, 2002) with the CFG tree models in the supervised literature (Collins, 1999). Most importantly, this is the first work, to the authors' knowledge, which has learned CFGs in an unsupervised or semi-supervised setting and can parse natural language text with any reasonable accuracy.

[8] The next highest results are 77.1 and 46.7 for English and Chinese, respectively, from Klein and Manning (2004).

Acknowledgments

We would like to thank the anonymous reviewers for their comments. This work is supported by a Microsoft/CITRIS grant and by an equipment donation from Intel.

References

Stephen P. Abney. 1987. The English Noun Phrase in its Sentential Aspect. Ph.D. thesis, MIT.

Glenn Carroll and Eugene Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. Technical Report CS-92-16.

Alexander Clark. 2000. Inducing syntactic categories by context distribution clustering. In CoNLL, pages 91-94, Lisbon, Portugal.

Alexander Clark. 2001. The unsupervised induction of stochastic context-free grammars using distributional clustering. In CoNLL.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

M. A. K. Halliday. 2004. An Introduction to Functional Grammar. Edward Arnold, 2nd edition.

Zellig Harris. 1954. Distributional Structure. University of Chicago Press, Chicago.

Dan Klein and Christopher Manning. 2002. A generative constituent-context model for improved grammar induction. In ACL.

Dan Klein and Christopher Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL.

Dan Klein. 2005. The Unsupervised Learning of Natural Language Structure. Ph.D. thesis, Stanford University.

Karim Lari and Steve Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 2(4):35-56.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1994. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

Fernando C. N. Pereira and Yves Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In ACL, pages 128-135.

Andrew Radford. 1988. Transformational Grammar. Cambridge University Press, Cambridge.

Hinrich Schütze. 1995. Distributional part-of-speech tagging. In EACL.

Noah A. Smith and Jason Eisner. 2004. Guiding unsupervised grammar induction using contrastive estimation. In Working Notes of the IJCAI Workshop on Grammatical Inference Applications.

Nianwen Xue. 2002. Building a large-scale annotated Chinese corpus.