Efficient Parsing for Transducer Grammars

John DeNero, Mohit Bansal, Adam Pauls, and Dan Klein
Computer Science Division
University of California, Berkeley
{denero, mbansal, adpauls, klein}@cs.berkeley.edu

Abstract

The tree-transducer grammars that arise in current syntactic machine translation systems are large, flat, and highly lexicalized. We address the problem of parsing efficiently with such grammars in three ways. First, we present a pair of grammar transformations that admit an efficient cubic-time CKY-style parsing algorithm despite leaving most of the grammar in n-ary form. Second, we show how the number of intermediate symbols generated by this transformation can be substantially reduced through binarization choices. Finally, we describe a two-pass coarse-to-fine parsing approach that prunes the search space using predictions from a subset of the original grammar. In all, parsing time reduces by 81%. We also describe a coarse-to-fine pruning scheme for forest-based language model reranking that allows a 100-fold increase in beam size while reducing decoding time. The resulting translations improve by 1.3 BLEU.

1 Introduction

Current approaches to syntactic machine translation typically include two statistical models: a syntactic transfer model and an n-gram language model. Recent innovations have greatly improved the efficiency of language model integration through multi-pass techniques, such as forest reranking (Huang and Chiang, 2007), local search (Venugopal et al., 2007), and coarse-to-fine pruning (Petrov et al., 2008; Zhang and Gildea, 2008). Meanwhile, translation grammars have grown in complexity from simple inversion transduction grammars (Wu, 1997) to general tree-to-string transducers (Galley et al., 2004) and have increased in size by including more synchronous tree fragments (Galley et al., 2006; Marcu et al., 2006; DeNeefe et al., 2007). As a result of these trends, the syntactic component of machine translation decoding can now account for a substantial portion of total decoding time.

In this paper, we focus on efficient methods for parsing with very large tree-to-string grammars, which have flat n-ary rules with many adjacent non-terminals, as in Figure 1. These grammars are sufficiently complex that the purely syntactic pass of our multi-pass decoder is the compute-time bottleneck under some conditions.

Given that parsing is well-studied in the monolingual case, it is worth asking why MT grammars are not simply like those used for syntactic analysis. There are several good reasons. The most important is that MT grammars must do both analysis and generation. To generate, it is natural to memorize larger lexical chunks, and so rules are highly lexicalized. Second, syntax diverges between languages, and each divergence expands the minimal domain of translation rules, so rules are large and flat. Finally, we see most rules very few times, so it is challenging to subcategorize non-terminals to the degree done in analytic parsing.

This paper develops encodings, algorithms, and pruning strategies for such grammars. We first investigate the qualitative properties of MT grammars, then present a sequence of parsing methods adapted to their broad characteristics. We give normal forms which are more appropriate than Chomsky normal form, leaving the rules mostly flat. We then describe a CKY-like algorithm which applies such rules efficiently, working directly over the n-ary forms in cubic time. We show how thoughtful binarization can further increase parsing speed, and we present a new coarse-to-fine scheme that uses rule subsets rather than symbol clustering to build a coarse grammar projection. These techniques reduce parsing time by 81% in aggregate. Finally, we demonstrate that we can accelerate forest-based reranking with a language model by pruning with information from the parsing pass. This approach enables a 100-fold increase in maximum beam size, improving translation quality by 1.3 BLEU while decreasing total decoding time.

Figure 1: (a) A synchronous transducer rule, such as S -> <NNP1 did not slap DT2 green NN3, NNP1 no daba una bofetada a DT2 NN3 verde>, has co-indexed non-terminals on the source and target side; internal grammatical structure of the target side has been omitted. (b) The source-side projection of the rule is a monolingual source-language rule with target-side grammar symbols. (c) A training sentence pair (Mary did not slap the green witch / Maria no daba una bofetada a la bruja verde), annotated with a target-side parse tree and a word alignment, licenses this rule to be extracted.

Figure 2: Transducer grammars are composed of very flat rules. The histogram shows rule counts for each rule size among the 332,000 rules that apply to an individual 30-word sentence. The size of a rule is the total number of non-terminals and lexical items in its source-side yield.

2 Tree Transducer Grammars

Tree-to-string transducer grammars consist of weighted rules like the one depicted in Figure 1. Each n-ary rule consists of a root symbol, a sequence of lexical items and non-terminals on the source side, and a fragment of a syntax tree on the target side. Each non-terminal on the source side corresponds to a unique one on the target side. Aligned non-terminals share a grammar symbol derived from a target-side monolingual grammar.

These grammars are learned from word-aligned sentence pairs annotated with target-side phrase structure trees. Extraction proceeds by using word alignments to find correspondences between target-side constituents and source-side word spans, then discovering transducer rules that match these constituent alignments (Galley et al., 2004). Given this correspondence, an array of extraction procedures yields rules that are well-suited to machine translation (Galley et al., 2006; DeNeefe et al., 2007; Marcu et al., 2006). Rule weights are estimated by discriminatively combining relative frequency counts and other rule features.

A transducer grammar G can be projected onto its source language, inducing a monolingual grammar. If we weight each rule by the maximum weight of its projecting synchronous rules, then parsing with this projected grammar maximizes the translation model score for a source sentence. We need not even consider the target side of transducer rules until integrating an n-gram language model or other non-local features of the target language.

We conduct experiments with a grammar extracted from 220 million words of Arabic-English bitext, extracting rules with up to 6 non-terminals. A histogram of the size of rules applicable to a typical 30-word sentence appears in Figure 2. The grammar includes 149 grammatical symbols, an augmentation of the Penn Treebank symbol set. To evaluate, we decoded 300 sentences of up to 40 words in length from the NIST05 Arabic-English test set.
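The source projection just described is simple to compute directly from the rule table. Below is a minimal sketch, assuming a hypothetical SynchronousRule record with root, source, and weight fields (names of our choosing, not the paper's); for each distinct source-side production it keeps the maximum weight of any synchronous rule that projects onto it.

```python
from collections import defaultdict

def project_to_source(synchronous_rules):
    """Project a transducer grammar onto its source language: each distinct
    source-side production keeps the maximum weight over all synchronous
    rules that project onto it, so parsing with the projection maximizes
    the translation model score."""
    projected = defaultdict(lambda: float('-inf'))
    for rule in synchronous_rules:
        # rule.root:   left-hand-side symbol, e.g. 'S'
        # rule.source: source-side yield of lexical items and non-terminals,
        #              e.g. ('NNP', 'no', 'daba', 'una', 'bofetada', 'a',
        #                    'DT', 'NN', 'verde')
        # rule.weight: translation model weight of the synchronous rule
        key = (rule.root, tuple(rule.source))
        projected[key] = max(projected[key], rule.weight)
    return dict(projected)
```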
3 Efficient Grammar Encodings

Monolingual parsing with a source-projected transducer grammar is a natural first pass in multi-pass decoding. These grammars are qualitatively different from syntactic analysis grammars, such as the lexicalized grammars of Charniak (1997) or the heavily state-split grammars of Petrov et al. (2006). In this section, we develop an appropriate grammar encoding that enables efficient parsing.

It is problematic to convert these grammars into Chomsky normal form, which CKY requires. Because transducer rules are very flat and contain specific lexical items, binarization introduces a large number of intermediate grammar symbols. Rule size and lexicalization affect parsing complexity whether the grammar is binarized explicitly (Zhang et al., 2006) or implicitly binarized using Earley-style intermediate symbols (Zollmann et al., 2006). Moreover, the resulting binary rules cannot be Markovized to merge symbols, as in Klein and Manning (2003), because each rule is associated with a target-side tree that cannot be abstracted.

We also do not restrict the form of rules in the grammar, a common technique in syntactic machine translation. For instance, Zollmann et al. (2006) follow Chiang (2005) in disallowing adjacent non-terminals. Watanabe et al. (2006) limit grammars to Greibach normal form. However, general tree transducer grammars provide excellent translation performance (Galley et al., 2006), and so we focus on parsing with all available rules.

3.1 Lexical Normal Form

Sequences of consecutive non-terminals complicate parsing because they require a search over non-terminal boundaries when applied to a sentence span.
We transform the grammar to ensure that rules containing lexical items (lexical rules) do not contain sequences of non-terminals. We allow both unary and binary non-lexical rules. Let L be the set of lexical items and V the set of non-terminal symbols in the original grammar. Then, lexical normal form (LNF) limits productions to two forms:

Non-lexical:  X -> X1 (X2)
Lexical:      X -> (X1) w+ (Xi w+)* (X2)

Above, all Xi are in V and w+ is in L+. Symbols in parentheses are optional. The nucleus of lexical rules is a mixed sequence that has lexical items on each end and no adjacent non-terminals.

Figure 3: We transform the original grammar by first eliminating non-terminal sequences in lexical rules; next, we binarize, adding a minimal number of intermediate grammar symbols and binary non-lexical rules; finally, anchored LNF further transforms lexical rules to begin and end with lexical items by introducing additional symbols. For example, the original rules S -> NNP no daba una bofetada a DT NN verde and NP -> DT NN NNS become, after LNF, S -> NNP no daba una bofetada a DT+NN verde with DT+NN -> DT NN; binarization minimizing symbol count adds NP -> DT+NN NNS; and anchored LNF yields S\NNP -> no daba una bofetada a DT+NN verde with S -> NNP S\NNP.

Converting a grammar into LNF requires two steps. In the sequence elimination step, for every lexical rule we replace each sequence of consecutive non-terminals X1 . . . Xn with the intermediate symbol X1+. . .+Xn (abbreviated X1:n) and introduce a non-lexical rule X1+. . .+Xn -> X1 . . . Xn. In the binarization step, we introduce further intermediate symbols and rules to binarize all non-lexical rules in the grammar, including those added by sequence elimination.
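To make the sequence elimination step concrete, here is a small sketch of one possible implementation; the Rule tuple, the position-set representation of non-terminals, and the '+'-joined naming of intermediate symbols are illustrative assumptions rather than the paper's actual code. The example at the bottom mirrors the rule from Figure 3.

```python
from collections import namedtuple

# A grammar rule: left-hand side, right-hand side sequence, and the set of
# rhs positions that are non-terminals. Illustrative representation only.
Rule = namedtuple('Rule', ['lhs', 'rhs', 'nonterminal_positions'])

def eliminate_sequences(lexical_rule):
    """Sequence elimination: replace each run of adjacent non-terminals
    X1 ... Xn (n >= 2) in a lexical rule with the intermediate symbol
    X1+...+Xn, and emit the non-lexical rule X1+...+Xn -> X1 ... Xn."""
    new_rhs, new_nt_positions, introduced, run = [], set(), [], []

    def flush():
        if len(run) >= 2:
            merged = '+'.join(run)
            introduced.append(Rule(merged, tuple(run), frozenset(range(len(run)))))
            new_nt_positions.add(len(new_rhs))
            new_rhs.append(merged)
        elif run:
            new_nt_positions.add(len(new_rhs))
            new_rhs.append(run[0])
        run.clear()

    for position, symbol in enumerate(lexical_rule.rhs):
        if position in lexical_rule.nonterminal_positions:
            run.append(symbol)
        else:
            flush()
            new_rhs.append(symbol)
    flush()
    return Rule(lexical_rule.lhs, tuple(new_rhs), frozenset(new_nt_positions)), introduced

# Example, following Figure 3:
# S -> NNP no daba una bofetada a DT NN verde   (non-terminals at positions 0, 6, 7)
rule = Rule('S', ('NNP', 'no', 'daba', 'una', 'bofetada', 'a', 'DT', 'NN', 'verde'),
            frozenset({0, 6, 7}))
lnf_rule, extra_rules = eliminate_sequences(rule)
# lnf_rule.rhs  == ('NNP', 'no', 'daba', 'una', 'bofetada', 'a', 'DT+NN', 'verde')
# extra_rules   == [Rule(lhs='DT+NN', rhs=('DT', 'NN'), ...)]
```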
3.2 Non-terminal Binarization

Exactly how we binarize non-lexical rules affects the total number of intermediate symbols introduced by the LNF transformation. Binarization involves selecting a set of symbols that will allow us to assemble the right-hand side X1 . . . Xn of every non-lexical rule using binary productions. This symbol set must at least include the left-hand side of every rule in the grammar (lexical and non-lexical), including the intermediate symbols X1:n introduced by sequence elimination. To ensure that a symbol sequence X1 . . . Xn can be constructed, we select a split point k and add intermediate types X1:k and Xk+1:n to the grammar. We must also ensure that the sequences X1 . . . Xk and Xk+1 . . . Xn can be constructed.

As baselines, we used left-branching (where k = 1 always) and right-branching (where k = n - 1) binarizations. We also tested a greedy binarization approach, choosing k to minimize the number of grammar symbols introduced. We first try to select k such that both X1:k and Xk+1:n are already in the grammar. If no such k exists, we select k such that one of the intermediate types generated is already used. If no such k exists either, we choose k = floor(n/2). This policy only creates new intermediate types when necessary. Song et al. (2008) propose a similar greedy approach to binarization that uses corpus statistics to select common types rather than explicitly reusing types that have already been introduced.

Finally, we computed an optimal binarization that explicitly minimizes the number of symbols in the resulting grammar. We cast the minimization as an integer linear program (ILP). Let V be the set of all base non-terminal symbols in the grammar. We introduce an indicator variable T_Y for each symbol Y in V+ to indicate that Y is used in the grammar; Y can be either a base non-terminal symbol Xi or an intermediate symbol X1:n. We also introduce indicators A_{Y,Z} for each pair of symbols, indicating that both Y and Z are used in the grammar. Let L, a subset of V+, be the set of left-hand side symbols for all lexical and non-lexical rules already in the grammar, and let R be the set of symbol sequences on the right-hand side of all non-lexical rules. Then, the ILP takes the form:

  min   sum_Y T_Y                                                  (1)
  s.t.  T_Y = 1                             for all Y in L         (2)
        sum_k A_{X1:k, Xk+1:n} >= 1         for all X1 . . . Xn in R   (3)
        sum_k A_{X1:k, Xk+1:n} >= T_{X1:n}  for all X1:n           (4)
        A_{Y,Z} <= T_Y,  A_{Y,Z} <= T_Z     for all Y, Z           (5)

The solution to this ILP indicates which symbols appear in a minimal binarization. Equation 1 explicitly minimizes the number of symbols. Equation 2 ensures that all symbols already in the grammar remain in the grammar. Equation 3 does not require that a symbol represent the entire right-hand side of each non-lexical rule, but does ensure that each right-hand side sequence can be built from two subsequence symbols. Equation 4 ensures that any included intermediate type can also be built from two subsequence types. Finally, Equation 5 ensures that if a pair is used, each member of the pair is included. This program can be optimized with an off-the-shelf ILP solver; we used lp_solve (http://sourceforge.net/projects/lpsolve).

Figure 4: The number of non-terminal symbols introduced to the grammar through LNF binarization depends upon the policy for binarizing type sequences. For a grammar that has already been filtered for a particular short sentence, right-branching binarization introduced 8,095 symbols, left-branching 5,871, greedy 1,101, and optimal (ILP) 443. Both the greedy and optimal binarizations use far fewer symbols than naive binarizations.

Figure 4 shows the number of intermediate grammar symbols needed for the four binarization policies described above for a short sentence. Our ILP solver could only find optimal solutions for very short sentences (which have small grammars after relativization). Because greedy requires very little time to compute and generates symbol counts that are close to optimal when both can be computed, we use it for our remaining experiments.
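The greedy policy can be stated compactly in code. The sketch below assumes a merge helper that forms X1+...+Xn names and a set grammar_symbols holding every symbol currently in the grammar; when several split points qualify at the same step, the paper does not say which to take, so this sketch simply returns the first.

```python
def merge(symbols):
    """Name an intermediate symbol X1+...+Xn (naming convention assumed)."""
    return '+'.join(symbols)

def greedy_split(rhs, grammar_symbols):
    """Choose a split point k for rhs = X1 ... Xn (n >= 2), creating new
    intermediate symbols only when necessary."""
    n = len(rhs)

    def is_new(chunk):
        # A single base symbol never needs a new intermediate type; a longer
        # chunk is new only if its merged symbol is not already in the grammar.
        return len(chunk) > 1 and merge(chunk) not in grammar_symbols

    # 1. Prefer a k where both X1:k and Xk+1:n already exist.
    for k in range(1, n):
        if not is_new(rhs[:k]) and not is_new(rhs[k:]):
            return k
    # 2. Otherwise, a k where at least one of them already exists.
    for k in range(1, n):
        if not is_new(rhs[:k]) or not is_new(rhs[k:]):
            return k
    # 3. Otherwise, split in the middle: k = floor(n / 2).
    return n // 2
```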
3.3 Anchored Lexical Normal Form

We also consider a further grammar transformation, anchored lexical normal form (ALNF), in which the yield of lexical rules must begin and end with a lexical item. As shown in the following section, ALNF improves parsing performance over LNF by shifting work from lexical rule applications to non-lexical rule applications. ALNF consists of rules with the following two forms:

Non-lexical:  X -> X1 (X2)
Lexical:      X -> w+ (Xi w+)*

To convert a grammar into ALNF, we first transform it into LNF, then introduce additional binary rules that split off non-terminal symbols from the ends of lexical rules, as shown in Figure 3.

4 Efficient CKY Parsing

We now describe a CKY-style parsing algorithm for grammars in LNF. The dynamic program is organized into spans S_ij and computes the Viterbi score w(i, j, X) for each edge S_ij[X], the weight of the maximum parse over words i+1 to j, rooted at symbol X. For each S_ij, computation proceeds in three phases: binary, lexical, and unary.

4.1 Applying Non-lexical Binary Rules

For a span S_ij, we first apply the binary non-lexical rules just as in standard CKY, computing an intermediate Viterbi score w_b(i, j, X). Let omega_r be the weight of rule r. Then,

  w_b(i, j, X) = max_{r = X -> X1 X2} omega_r * max_{k = i+1 ... j-1} w(i, k, X1) * w(k, j, X2).

The quantities w(i, k, X1) and w(k, j, X2) will have already been computed by the dynamic program. The work in this phase is cubic in sentence length.

4.2 Applying Lexical Rules

On the other hand, lexical rules in LNF can be applied without binarization, because they only apply to particular spans that contain the appropriate lexical items. For a given S_ij, we first compute all the legal mappings of each rule onto the span. A mapping consists of a correspondence between non-terminals in the rule and subspans of S_ij. In practice, there is typically only one way that a lexical rule in LNF can map onto a span, because most lexical items will appear only once in the span.

Let m be a legal mapping and r its corresponding rule. Let S_kl^(i)[X] be the edge mapped to the i-th non-terminal of r under m, and omega_r the weight of r. Then,

  w_l(i, j, X) = max_m omega_r * prod_{S_kl^(i)[X]} w(k, l, X).

Again, w(k, l, X) will have been computed by the dynamic program. Assuming only a constant number of mappings per rule per span, the work in this phase is quadratic. We can then merge w_l and w_b: w(i, j, X) = max(w_l(i, j, X), w_b(i, j, X)).

To efficiently compute mappings, we store lexical rules in a trie (or suffix array), a searchable graph that indexes rules according to their sequence of lexical items and non-terminals. This data structure has been used similarly to index whole training sentences for efficient retrieval (Lopez, 2007). To find all rules that map onto a span, we traverse the trie using depth-first search.

4.3 Applying Unary Rules

Unary non-lexical rules are applied after lexical rules and non-lexical binary rules:

  w(i, j, X) = max_{r: r = X -> X1} omega_r * w(i, j, X1).

While this definition is recursive, we allow only one unary rule application per symbol X at each span to prevent infinite derivations. This choice does not limit the generality of our algorithm: chains of unaries can always be collapsed via a unary closure.

4.4 Bounding Split Points for Binary Rules

Non-lexical binary rules can in principle apply to any span S_ij where j - i >= 2, using any split point k such that i < k < j. In practice, however, many rules cannot apply to many (i, k, j) triples because the symbols for their children have not been constructed successfully over the subspans S_ik and S_kj. Therefore, the precise looping order over rules and split points can influence computation time. We found that the following nested looping order for the binary phase of processing an edge S_ij[X] gave the fastest parsing times for these grammars:

1. Loop over symbols X1 for the left child
2. Loop over all rules X -> X1 X2 containing X1
3. Loop over split points k : i < k < j
4. Update w_b(i, j, X) as necessary

This looping order allows for early stopping via additional bookkeeping in the algorithm. We track the following statistics as we parse:

minEND(i, X), maxEND(i, X): the minimum and maximum position k for which symbol X was successfully built over S_ik.
minSTART(j, X), maxSTART(j, X): the minimum and maximum position k for which symbol X was successfully built over S_kj.

We then bound k by min_k and max_k in the inner loop using these statistics. If ever min_k > max_k, then the loop is terminated early.

1. Set min_k = i + 1, max_k = j - 1
2. Loop over symbols X1 for the left child
   min_k = max(min_k, minEND(i, X1))
   max_k = min(max_k, maxEND(i, X1))
3. Loop over rules X -> X1 X2
   min_k = max(min_k, minSTART(j, X2))
   max_k = min(max_k, maxSTART(j, X2))
4. Loop over split points k : min_k <= k <= max_k
5. Update w_b(i, j, X) as necessary

In this way, we eliminate unnecessary work by avoiding split points that we know beforehand cannot contribute to w_b(i, j, X).
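A sketch of the binary phase with these bound checks follows. It is an illustration under assumed interfaces, not the paper's implementation: chart.score(i, k, X) returns the Viterbi score w(i, k, X) or None if the edge was never built, the min/max statistics are dictionaries keyed by (position, symbol), and binary rules are grouped by their left child.

```python
NEG_INF = float('-inf')

def binary_phase(i, j, chart, rules_by_left_child,
                 min_end, max_end, min_start, max_start):
    """Compute the binary-phase scores w_b(i, j, X) for one span, bounding
    split points k by where the child symbols were actually built."""
    w_b = {}
    for X1, rules in rules_by_left_child.items():
        if (i, X1) not in min_end:
            continue  # X1 was never built over any sub-span starting at i
        lo = max(i + 1, min_end[(i, X1)])
        hi = min(j - 1, max_end[(i, X1)])
        for X, X2, weight in rules:  # each rule X -> X1 X2 with weight omega_r
            if (j, X2) not in min_start:
                continue  # X2 was never built over any sub-span ending at j
            k_lo = max(lo, min_start[(j, X2)])
            k_hi = min(hi, max_start[(j, X2)])
            for k in range(k_lo, k_hi + 1):  # empty when k_lo > k_hi
                left = chart.score(i, k, X1)
                right = chart.score(k, j, X2)
                if left is None or right is None:
                    continue
                score = weight * left * right
                if score > w_b.get(X, NEG_INF):
                    w_b[X] = score
    return w_b
```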
4.5 Parsing Time Results

Table 1 shows the decrease in parsing time from including these bound checks, as well as from switching from lexical normal form to anchored LNF.

  Grammar   Bound checks   Parsing time (minutes)
  LNF       no             264
  LNF       yes            181
  ALNF      yes            104

Table 1: Adding bound checks to CKY and transforming the grammar from LNF to anchored LNF reduce parsing time by 61% for 300 sentences of length 40 or less. No approximations have been applied, so all three scenarios produce no search errors.

Using ALNF rather than LNF increases the number of grammar symbols and non-lexical binary rules, but makes parsing more efficient in three ways. First, it decreases the number of spans for which a lexical rule has a legal mapping; in this way, ALNF effectively shifts work from the lexical phase to the binary phase. Second, ALNF reduces the time spent searching the trie for mappings, because the first transition into the trie must use an edge with a lexical item. Finally, ALNF increases the frequency with which, when a lexical rule matches a span, we have successfully built every edge S_kl[X] in the mapping for that rule; this frequency increases from 45% to 96% with ALNF.

5 Coarse-to-Fine Search

We now consider two coarse-to-fine approximate search procedures for parsing with these grammars. Our first approach clusters grammar symbols together during the coarse parsing pass, following work in analytic parsing (Charniak and Caraballo, 1998; Petrov and Klein, 2007). We collapse all intermediate non-terminal grammar symbols (e.g., NP) to a single coarse symbol X, while pre-terminal symbols (e.g., NN) are hand-clustered into 7 classes (nouns, verbals, adjectives, punctuation, etc.). We then project the rules of the original grammar into this simplified symbol set, weighting each rule of the coarse grammar by the maximum weight of any rule that mapped onto it.

In our second and more successful approach, we select a subset of grammar symbols and then include exactly those rules that can be built using those symbols. Because the grammar includes many rules that are compositions of smaller rules, parsing with a subset of the grammar still provides meaningful scores that can be used to prune base grammar symbols while parsing under the full grammar.

5.1 Symbol Selection

To compress the grammar, we select a small subset of symbols that allow us to retain as much of the original grammar as possible. We use a voting scheme to select the symbol subset. After conversion to LNF (or ALNF), each lexical rule in the original grammar votes for the symbols that are required to build it. A rule votes as many times as it was observed in the training data, to promote frequent rules. We then select the top n_l symbols by vote count and include them in the coarse grammar C.

We would also like to retain as many non-lexical rules from the original grammar as possible, but the right-hand side of each rule can be binarized in many ways. We again use voting, but this time each non-lexical rule votes for its yield, a sequence of symbols. We select the top n_u symbol sequences as the set R of right-hand sides. Finally, we augment the symbol set of C with intermediate symbols that can construct all sequences in R using only binary rules. This step again requires choosing a binarization for each sequence such that a minimal number of additional symbols is introduced; we use the greedy approach from Section 3.2.

We then include in C all rules from the original grammar that can be built from the symbols we have chosen. Surprisingly, we are able to retain 76% of the grammar rules while excluding 92% of the grammar symbols (we used n_l = 500 and n_u = 4000, tuned on a development set), which speeds up parsing substantially.
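The voting step of this selection scheme might look like the sketch below; the rule attributes (required_symbols, rhs, count) and the Counter-based tallying are assumptions of this illustration.

```python
from collections import Counter

def select_coarse_symbols(lexical_rules, non_lexical_rules, n_l, n_u):
    """Vote for the symbols and right-hand sides retained in the coarse
    grammar C: lexical rules vote (weighted by their training count) for the
    symbols needed to build them; non-lexical rules vote for their yields."""
    symbol_votes = Counter()
    for rule in lexical_rules:
        for symbol in rule.required_symbols:    # symbols needed to build the rule
            symbol_votes[symbol] += rule.count  # frequent rules vote more often
    coarse_symbols = {s for s, _ in symbol_votes.most_common(n_l)}

    sequence_votes = Counter()
    for rule in non_lexical_rules:
        sequence_votes[tuple(rule.rhs)] += rule.count
    retained_rhs = [seq for seq, _ in sequence_votes.most_common(n_u)]

    # Intermediate symbols needed to build each retained right-hand side with
    # binary rules would be added here via the greedy binarization of
    # Section 3.2 (omitted in this sketch).
    return coarse_symbols, retained_rhs
```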
5.2 Max Marginal Thresholding

We parse first with the coarse grammar to find the Viterbi derivation score for each edge S_ij[X]. We then perform a Viterbi outside pass over the chart, like a standard outside pass but replacing summation with max (Goodman, 1999). The product of an edge's Viterbi score and its Viterbi outside score gives a max marginal, the score of the maximal parse that uses the edge. We then prune away regions of the chart whose coarse max marginal deviates from the global Viterbi score by more than a fixed margin, tuned on a development set.

Table 2 shows that both methods of constructing a coarse grammar are effective in pruning, but selecting symbol subsets outperformed the more typical clustering approach, reducing parsing time by an additional factor of 2.

  Pruning      Minutes   Model score   BLEU
  No pruning   104       60,179        44.84
  Clustering   79        60,179        44.84
  Subsets      50        60,163        44.82

Table 2: Coarse-to-fine pruning speeds up parsing with minimal effect on either model score or translation quality. The coarse grammar built using symbol subsets outperforms clustering grammar symbols, reducing parsing time by 52%. These experiments do not include a language model.
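In code, the max-marginal pruning decision can be expressed roughly as follows. This sketch assumes scores are kept as log probabilities, so the product of inside and outside scores becomes a sum and the pruning margin is a log-domain threshold; none of these interfaces come from the paper.

```python
def prune_chart(viterbi_inside, viterbi_outside, global_viterbi, margin):
    """Keep only edges whose coarse max marginal is within `margin` of the
    global Viterbi score (all scores assumed to be log probabilities).

    viterbi_inside / viterbi_outside map (i, j, X) -> log score under the
    coarse grammar; global_viterbi is the log score of the best full parse."""
    allowed = set()
    for edge, inside in viterbi_inside.items():
        outside = viterbi_outside.get(edge)
        if outside is None:
            continue
        max_marginal = inside + outside  # log of the best parse using this edge
        if max_marginal >= global_viterbi - margin:
            allowed.add(edge)
    return allowed
```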
6 Language Model Integration

Large n-gram language models (LMs) are critical to the performance of machine translation systems. Recent innovations have managed the complexity of LM integration using multi-pass architectures. Zhang and Gildea (2008) describe a coarse-to-fine approach that iteratively increases the order of the LM. Petrov et al. (2008) describe an additional coarse-to-fine hierarchy over language projections. Both of these approaches integrate LMs via bottom-up dynamic programs that employ beam search. As an alternative, Huang and Chiang (2007) describe a forest-based reranking algorithm called cube growing, which also employs beam search, but focuses computation only where necessary in a top-down pass through a parse forest.

In this section, we show that the coarse-to-fine idea of constraining each pass using marginal predictions of the previous pass also applies effectively to cube growing. Max marginal predictions from the parse can substantially reduce LM integration time.

6.1 Language Model Forest Reranking

Parsing produces a forest of derivations, where each edge in the forest holds its Viterbi (or one-best) derivation under the transducer grammar. In forest reranking via cube growing, edges in the forest produce k-best lists of derivations that are scored by both the grammar and an n-gram language model.

Using ALNF, each edge must first generate a k-best list of derivations that are not scored by the language model. These derivations are then flattened to remove the binarization introduced by ALNF, so that the resulting derivations are each rooted by an n-ary rule r from the original grammar. The leaves of r correspond to sub-edges in the chart, which are recursively queried for their best language-model-scored derivations. These sub-derivations are combined by r, and new n-grams at the boundaries of these derivations are scored by the language model. The language-model-scored derivations for the edge are placed on a priority queue. The top of the priority queue is repeatedly removed, and its successors are added back onto the queue, until k language-model-scored derivations have been discovered. These k derivations are then sorted and supplied to parent edges upon request. Huang and Chiang (2007) describe the cube growing algorithm in further detail, including the precise form of the successor function for derivations.

  Pruning strategy     Max beam   BLEU    TM score   LM score    Total score   Inside time   Outside time   LM time   Total time
  No pruning           20         57.67   58,570     -17,202     41,368        99            0              247       346
  CTF parsing          200        58.43   58,495     -16,929     41,556        53            0              186       239
  CTF reranking        200        58.63   58,582     -16,998     41,584        98            64             79        241
  CTF parse + rerank   2000       58.90   58,602     -16,980     41,622        53            52             148       253

Table 3: Time in minutes and performance for 300 sentences. We used a trigram language model trained on 220 million words of English text. The no-pruning baseline used a fixed beam size for forest-based language model reranking. Coarse-to-fine parsing included a coarse pruning pass using a symbol subset grammar. Coarse-to-fine reranking used max marginals to constrain the reranking pass. Coarse-to-fine parse + rerank employed both of these approximations.

6.2 Coarse-to-Fine Parsing

Even with this efficient reranking algorithm, integrating a language model substantially increased decoding time and memory use. As a baseline, we reranked using a small fixed-size beam of 20 derivations at each edge. Larger beams exceeded the memory of our hardware. Results appear in Table 3.

Coarse-to-fine parsing before LM integration substantially improved language model reranking time. By pruning the chart with max marginals from the coarse symbol subset grammar from Section 5, we were able to rerank with beams of length 200, leading to a 0.8 BLEU increase and a 31% reduction in total decoding time.

6.3 Coarse-to-Fine Forest Reranking

We realized similar performance and speed benefits by instead pruning with max marginals from the full grammar. We found that LM reranking explored many edges with low max marginals, but used few of them in the final decoder output. Following the coarse-to-fine paradigm, we restricted the reranker to edges with a max marginal above a fixed threshold.

Furthermore, we varied the beam size of each edge based on the parse. Let r_m be the ratio of the max marginal for edge m to the global Viterbi derivation score for the sentence. We used a beam of size k * 2^(ln r_m) for each edge.

Computing max marginals under the full grammar required an additional outside pass over the full parse forest, adding substantially to parsing time. However, soft coarse-to-fine pruning based on these max marginals also allowed for beams up to length 200, yielding a 1.0 BLEU increase over the baseline and a 30% reduction in total decoding time.

We also combined the coarse-to-fine parsing approach with this soft coarse-to-fine reranker. Tiling these approximate search methods allowed another 10-fold increase in beam size, further improving BLEU while only slightly increasing decoding time.
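The per-edge beam sizing rule is easy to restate in code. A minimal sketch, assuming log-probability scores and that k is the maximum beam length; the floor of one derivation and the integer rounding are additions of this sketch, not specified in the paper.

```python
def edge_beam_size(k, edge_log_max_marginal, global_log_viterbi):
    """Per-edge beam of size k * 2^(ln ratio), where ratio is the edge's max
    marginal divided by the global Viterbi score, so a ratio <= 1 can only
    shrink the beam."""
    log_ratio = edge_log_max_marginal - global_log_viterbi  # ln of the ratio
    return max(1, int(k * 2.0 ** log_ratio))
```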
7 Conclusion

As translation grammars increase in complexity while innovations drive down the computational cost of language model integration, the efficiency of the parsing phase of machine translation decoding is becoming increasingly important. Our grammar normal form, CKY improvements, and symbol subset coarse-to-fine procedure reduced parsing time for large transducer grammars by 81%. These techniques also improved forest-based language model reranking. A full decoding pass without any of our innovations required 511 minutes using only small beams. Coarse-to-fine pruning in both the parsing and language model passes allowed a 100-fold increase in beam size, giving a performance improvement of 1.3 BLEU while decreasing total decoding time by 50%.

Acknowledgements

This work was enabled by the Information Sciences Institute Natural Language Group, primarily through the invaluable assistance of Jens Voeckler, and was supported by the National Science Foundation (NSF) under grant IIS-0643742.

References

Eugene Charniak and Sharon Caraballo. 1998. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics.

Eugene Charniak. 1997. Statistical techniques for natural language parsing. In National Conference on Artificial Intelligence.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In The Annual Conference of the Association for Computational Linguistics.

Steve DeNeefe, Kevin Knight, Wei Wang, and Daniel Marcu. 2007. What can syntax-based MT learn from phrase-based MT? In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In The Annual Conference of the Association for Computational Linguistics.

Joshua Goodman. 1999. Semiring parsing. Computational Linguistics.

Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In The Annual Conference of the Association for Computational Linguistics.

Dan Klein and Chris Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the Association for Computational Linguistics.

Adam Lopez. 2007. Hierarchical phrase-based translation with suffix arrays. In The Conference on Empirical Methods in Natural Language Processing.

Daniel Marcu, Wei Wang, Abdessamad Echihabi, and Kevin Knight. 2006. SPMT: Statistical machine translation with syntactified target language phrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In The Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In The Annual Conference of the Association for Computational Linguistics.

Slav Petrov, Aria Haghighi, and Dan Klein. 2008. Coarse-to-fine syntactic machine translation using language projections. In The Conference on Empirical Methods in Natural Language Processing.
Xinying Song, Shilin Ding, and Chin-Yew Lin. 2008. Better binarization for the CKY parsing. In The Conference on Empirical Methods in Natural Language Processing.

Ashish Venugopal, Andreas Zollmann, and Stephan Vogel. 2007. An efficient two-pass approach to synchronous-CFG driven statistical MT. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference.

Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. 2006. Left-to-right target generation for hierarchical phrase-based translation. In The Annual Conference of the Association for Computational Linguistics.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23:377-404.

Hao Zhang and Daniel Gildea. 2008. Efficient multi-pass decoding for synchronous context free grammars. In The Annual Conference of the Association for Computational Linguistics.

Hao Zhang, Liang Huang, Daniel Gildea, and Kevin Knight. 2006. Synchronous binarization for machine translation. In North American Chapter of the Association for Computational Linguistics.

Andreas Zollmann, Ashish Venugopal, and Stephan Vogel. 2006. Syntax augmented machine translation via chart parsing. In The Statistical Machine Translation Workshop at the North American Association for Computational Linguistics Conference.