Combining Constituent Parsers

Victoria Fossum, Dept. of Computer Science, University of Michigan, Ann Arbor, MI 48104, vfossum@umich.edu
Kevin Knight, Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, knight@isi.edu

Proceedings of NAACL HLT 2009: Short Papers, pages 253-256, Boulder, Colorado, June 2009.

Abstract

Combining the 1-best output of multiple parsers via parse selection or parse hybridization improves f-score over the best individual parser (Henderson and Brill, 1999; Sagae and Lavie, 2006). We propose three ways to improve upon existing methods for parser combination. First, we propose a method of parse hybridization that recombines context-free productions instead of constituents, thereby preserving the structure of the output of the individual parsers to a greater extent. Second, we propose an efficient linear-time algorithm for computing expected f-score using Minimum Bayes Risk parse selection. Third, we extend these parser combination methods from multiple 1-best outputs to multiple n-best outputs. We present results on WSJ section 23 and on the English side of a Chinese-English parallel corpus.

1 Introduction

Parse quality impacts the quality of downstream applications such as syntax-based machine translation (Quirk and Corston-Oliver, 2006). Combining the output of multiple parsers can boost the accuracy of such applications. Parses can be combined in two ways: parse selection (selecting the best parse from the output of the individual parsers) or parse hybridization (constructing the best parse by recombining sub-sentential components from the output of the individual parsers).

1.1 Related Work

Henderson and Brill (1999) perform parse selection by maximizing the expected precision of the selected parse with respect to the set of parses being combined. Henderson and Brill (1999) and Sagae and Lavie (2006) propose methods for parse hybridization by recombining constituents.

1.2 Our Work

In this work, we propose three ways to improve upon existing methods for parser combination. First, while constituent recombination (Henderson and Brill, 1999; Sagae and Lavie, 2006) gives a significant improvement in f-score, it tends to flatten the structure of the individual parses. To illustrate, Figures 1 and 2 contrast the output of the Charniak parser with the output of constituent recombination on a sentence from WSJ section 24. We instead recombine context-free productions, producing trees that contain only context-free productions seen in the individual parsers' output (Figure 3). Second, whereas the parse selection method of Henderson and Brill (1999) selects the parse with maximum expected precision, we present an efficient, linear-time algorithm for selecting the parse with maximum expected f-score within the Minimum Bayes Risk (MBR) framework. Third, we extend these parser combination methods from 1-best outputs to n-best outputs. We present results on WSJ section 23 and on the English side of a Chinese-English parallel corpus.
[Figure 1: Output of Charniak Parser (parse tree not reproduced)]

[Figure 2: Output of Constituent Recombination (parse tree not reproduced)]

[Figure 3: Output of Context-Free Production Recombination (parse tree not reproduced)]

Parser                                    wsj-dev  wsj-test  ce-dev  ce-test
Berkeley (Petrov and Klein, 2007)           88.6     89.3     82.9     83.5
Bikel-Collins Model 2 (Bikel, 2002)         87.0     88.2     81.2     80.6
Charniak (Charniak and Johnson, 2005)       90.6     91.4     84.7     84.1
Soricut-Collins Model 2 (Soricut, 2004)     87.3     88.4     82.3     82.1
Stanford (Klein and Manning, 2003)          85.4     86.4     81.3     80.1

Table 1: F-scores of the 1-best output of the individual parsers

2 Parse Selection

In the MBR framework, although the true reference parse is unknown, we assume that the individual parsers' output forms a reasonable distribution over possible reference parses. We compute the expected f-score of each parse tree p_i under this distribution:

    expected_f(p_i) = \sum_{p_j} f(p_i, p_j) \cdot pr(p_j)

where f(p_i, p_j) is the f-score of parse p_i with respect to parse p_j, and pr(p_j) is the prior probability of parse p_j. We estimate pr(p_j) as follows:

    pr(p_j) = pr(parser_k) \cdot pr(p_j | parser_k)

where parser_k is the parser generating p_j. We set pr(parser_k) to the proportion of sentences in the development set for which the 1-best output of parser_k achieves the highest f-score of any individual parser, breaking ties randomly. When n = 1, pr(p_j | parser_k) = 1 for all p_j; when n > 1, we must estimate pr(p_j | parser_k), the distribution over parses in the n-best list output by a given parser. We estimate this distribution from the model score (log probability) score_{j,k} assigned by parser_k to each entry p_j in its n-best list, scaled by a parameter \alpha:

    pr(p_j | parser_k) = \frac{e^{\alpha \cdot score_{j,k}}}{\sum_{j'=1}^{n} e^{\alpha \cdot score_{j',k}}}

We tune \alpha on a development set to maximize f-score, and select the parse p_i with the highest expected f-score. (A low value of \alpha yields a near-uniform distribution, while a high value concentrates probability mass on the 1-best entry of the n-best list; in practice, tuning \alpha produces a higher f-score than setting it to the value that exactly reproduces the individual parser's probability distribution.)

Computing the exact expected f-score requires O(m^2) operations per sentence, where m is the number of parses being combined; we can compute an approximate expected f-score in O(m) time. To do so, we compute the expected precision of all parses in O(m) time by associating with each unique constituent c_i the list of parses in which it occurs, plus the total probability q_i of those parses. For each parse p associated with c_i, we increment the expected precision of that parse by q_i / size(p). This computation yields the same result as the O(m^2) algorithm. We carry out a similar operation for expected recall, then compute the harmonic mean of expected precision and expected recall, which closely approximates the true expected f-score.
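To make the selection step concrete, the following minimal Python sketch (ours, not from the paper; the (label, start, end) representation of constituents, the function names, and the max-shifted softmax are illustrative assumptions) computes the priors and the linear-time approximate expected f-score:

import math
from collections import defaultdict

def nbest_priors(score_lists, parser_weights, alpha):
    """Flatten per-parser n-best model scores (log probabilities) into one
    list of priors: pr(p_j) = pr(parser_k) * softmax_j(alpha * score_{j,k})."""
    priors = []
    for scores, weight in zip(score_lists, parser_weights):
        m = max(scores)  # shift scores before exponentiating, for stability
        exps = [math.exp(alpha * (s - m)) for s in scores]
        z = sum(exps)
        priors.extend(weight * e / z for e in exps)
    return priors

def mbr_select(parses, priors):
    """Return the parse with the highest approximate expected f-score.

    parses: one entry per candidate (flattened across all parsers' n-best
            lists), each a set of (label, start, end) constituents.
    priors: pr(p_j) for each candidate, e.g. from nbest_priors.
    Runs in time linear in the total number of constituents, rather than
    quadratic in the number of parses."""
    p_mass = defaultdict(float)  # total prior mass of parses containing c
    r_mass = defaultdict(float)  # the same mass, normalized by parse size
    for parse, prior in zip(parses, priors):
        for c in parse:
            p_mass[c] += prior
            r_mass[c] += prior / len(parse)
    best, best_f = None, -1.0
    for parse in parses:
        exp_p = sum(p_mass[c] for c in parse) / len(parse)  # expected precision
        exp_r = sum(r_mass[c] for c in parse)               # expected recall
        # Harmonic mean of expected P and R approximates expected f-score.
        f = 2 * exp_p * exp_r / (exp_p + exp_r) if exp_p + exp_r else 0.0
        if f > best_f:
            best, best_f = parse, f
    return best

Accumulating r_mass as pr(p_j)/|p_j| per constituent gives each candidate exactly \sum_j pr(p_j) |p_i \cap p_j| / |p_j|, its expected recall, without ever comparing parse pairs directly.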
Parse Selection: Minimum Bayes Risk

System                   wsj-dev          wsj-test         ce-dev           ce-test
                        P    R    F      P    R    F      P    R    F      P    R    F
best individual parser 91.3 89.9 90.6   91.8 91.0 91.4   86.1 83.4 84.7   85.6 82.6 84.1
n=1                    91.7 90.5 91.1   92.5 91.8 92.0   87.1 84.6 85.8   86.7 83.7 85.2
n=10                   92.1 90.8 91.5   92.4 91.7 92.0   87.9 85.3 86.6   87.7 84.4 86.0
n=25                   92.1 90.9 91.5   92.4 91.7 92.0   88.0 85.4 86.7   87.4 84.2 85.7
n=50                   92.1 91.0 91.5   92.4 91.7 92.1   88.0 85.3 86.6   87.6 84.3 85.9

Table 2: Precision, Recall, and F-score Results from Parse Selection

3 Constituent Recombination

Henderson and Brill (1999) convert each parse into constituents with syntactic labels and spans, and weight each constituent by summing pr(parser_k) over all parsers k in whose output the constituent appears. They include in the combined parse all constituents with weight above a threshold t = (m+1)/2, where m is the number of input parses.

Sagae and Lavie (2006) extend this method by tuning t on a development set to maximize f-score; a high threshold results in high precision, while a low threshold results in high recall. They populate a chart with constituents whose weight meets the threshold, and use a CKY-style parsing algorithm to find the heaviest tree, where the weight of a tree is the sum of its constituents' weights. Parsing is not constrained by a grammar; any context-free production is permitted. Thus, the combined parses may contain context-free productions not seen in the individual parsers' outputs. While this failure to preserve the structure of the individual parses does not affect f-score, it may hinder downstream applications.

To extend this method from 1-best to n-best lists, we weight each constituent by summing pr(parser_k) \cdot pr(p_j | parser_k) over all parses p_j generated by parser_k in which the constituent appears.

4 Context-Free Production Recombination

To ensure that all context-free productions in the combined parses have been seen in the individual parsers' outputs, we recombine context-free productions rather than constituents. We convert each parse into context-free productions, labelling each constituent in each production with its span and syntactic category, and weight each production by summing pr(parser_k) \cdot pr(p_j | parser_k) over all parses p_j generated by parser_k in which the production appears. We re-parse the sentence with these productions, returning the heaviest tree (where the weight of a tree is the sum of its context-free productions' weights). We optimize f-score by varying the trade-off between precision and recall using a derivation length penalty, which we tune on a development set: by subtracting higher (lower) values of this penalty from the weight of each production, we encourage the combination method to favor trees with shorter (longer) derivations and therefore higher precision (recall) at the constituent level. A sketch of the production-weighting step follows.
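The following minimal Python sketch (ours; the tuple representation of span-annotated productions and all names are illustrative assumptions, and the heaviest-tree search itself is omitted) shows how the production weights and the derivation length penalty would be computed:

from collections import defaultdict

def production_weights(nbest_lists, parser_weights):
    """Weight each span-annotated context-free production by summing
    pr(parser_k) * pr(p_j | parser_k) over all parses containing it.

    nbest_lists: for each parser k, a list of (parse, pr(p_j | parser_k))
        pairs, where a parse is a collection of productions, each a tuple
        ((parent_label, start, end), (child_1, ..., child_r)) whose
        children are themselves (label, start, end) triples.
    parser_weights: pr(parser_k) for each parser."""
    weights = defaultdict(float)
    for k, nbest in enumerate(nbest_lists):
        for parse, cond_prob in nbest:
            for prod in set(parse):  # count each production once per parse
                weights[prod] += parser_weights[k] * cond_prob
    return weights

def apply_length_penalty(weights, penalty):
    """Subtract a derivation length penalty from every production weight.
    A higher penalty favors shorter derivations (higher precision); a
    lower, or negative, penalty favors longer ones (higher recall)."""
    return {prod: w - penalty for prod, w in weights.items()}

A CKY-style search over these weighted productions then returns the heaviest tree; since every chart entry is anchored to a span and was observed in some individual parse, the combined tree contains only observed productions.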
5 Experiments

Table 1 lists the five parsers used in our combination experiments and the f-scores of their 1-best output on our data sets. We use the n-best output of the Berkeley, Charniak, and Soricut parsers, and the 1-best output of the Bikel and Stanford parsers. All parsers were trained on the standard WSJ training sections. We use two corpora: the WSJ (sections 24 and 23 are the development and test sets, respectively) and English text from the LDC2007T02 Chinese-English parallel corpus (the development and test sets contain 400 sentences each).

Parse Hybridization: Constituent Recombination

System                   wsj-dev          wsj-test         ce-dev           ce-test
                        P    R    F      P    R    F      P    R    F      P    R    F
best individual parser 91.3 89.9 90.6   91.8 91.0 91.4   86.1 83.4 84.7   85.6 82.6 84.1
n=1                    92.5 90.3 91.4   93.0 91.6 92.3   89.2 84.6 86.8   89.1 83.6 86.2
n=10                   92.6 90.5 91.5   93.1 91.7 92.4   89.9 84.4 87.1   89.9 83.2 86.4
n=25                   92.6 90.5 91.5   93.2 91.7 92.4   89.9 84.4 87.0   89.7 83.4 86.4
n=50                   92.6 90.5 91.5   93.1 91.7 92.4   89.9 84.4 87.1   89.7 83.2 86.3

Table 3: Precision, Recall, and F-score Results from Constituent Recombination

Parse Hybridization: Context-Free Production Recombination

System                   wsj-dev          wsj-test         ce-dev           ce-test
                        P    R    F      P    R    F      P    R    F      P    R    F
best individual parser 91.3 89.9 90.6   91.8 91.0 91.4   86.1 83.4 84.7   85.6 82.6 84.1
n=1                    91.7 91.0 91.4   92.1 91.9 92.0   86.9 85.4 86.2   86.2 84.3 85.2
n=10                   92.1 90.9 91.5   92.5 91.8 92.2   87.8 85.1 86.4   86.2 84.3 86.1
n=25                   92.2 91.0 91.6   92.5 91.8 92.2   87.8 85.1 86.4   87.6 84.6 86.1
n=50                   92.1 90.8 91.4   92.4 91.7 92.1   87.6 84.9 86.2   87.7 84.6 86.1

Table 4: Precision, Recall, and F-score Results from Context-Free Production Recombination

6 Discussion & Conclusion

Results are shown in Tables 2, 3, and 4. On both test sets, constituent recombination achieves the best f-score (improving over the best individual parser by 1.0 points on the WSJ test set and 2.3 points on the Chinese-English test set), followed by context-free production recombination, then parse selection, though the differences in f-score among the combination methods are not statistically significant. Increasing the n-best list size from 1 to 10 improves parse selection and context-free production recombination, though further increasing n does not, in general, help; these diminishing gains in f-score as n increases reflect the diminishing gains in the f-score of the oracle parse produced by each individual parser as n increases. The Chinese-English test set gets a bigger boost from combination than the WSJ test set, perhaps because the best individual parser's baseline f-score is lower on the out-of-domain data.

We have presented an algorithm for parse hybridization by recombining context-free productions. While constituent recombination results in the highest f-score of the methods explored, context-free production recombination produces trees that better preserve the syntactic structure of the individual parses. We have also presented an efficient linear-time algorithm for selecting the parse with maximum expected f-score.

Acknowledgments

We thank Steven Abney, John Henderson, and Kenji Sagae for helpful discussions. This research was supported by DARPA (contract HR0011-06-C0022) and by NSF ITR (grant IIS-0428020).

References

Daniel M. Bikel. 2002. Design of a Multi-lingual, Parallel-processing Statistical Parsing Engine. In Proceedings of HLT.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best Parsing and MaxEnt Discriminative Reranking. In Proceedings of ACL.

Michael Collins and Terry Koo. 2005. Discriminative Reranking for Natural Language Parsing. Computational Linguistics, 31(1):25-70.

John C. Henderson and Eric Brill. 1999. Exploiting Diversity in Natural Language Processing: Combining Parsers. In Proceedings of EMNLP.

Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of ACL.

Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In Proceedings of HLT-NAACL.

Chris Quirk and Simon Corston-Oliver. 2006. The Impact of Parse Quality on Syntactically-Informed Statistical Machine Translation. In Proceedings of EMNLP.

Kenji Sagae and Alon Lavie. 2006. Parser Combination by Reparsing. In Proceedings of HLT-NAACL.

Radu Soricut. 2004. A Reimplementation of Collins' Parsing Models. Technical report, Information Sciences Institute, Department of Computer Science, University of Southern California.