Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation

Ashish Venugopal  Andreas Zollmann  Noah A. Smith  Stephan Vogel
Language Technologies Institute, School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213, USA
{ashishv,zollmann,nasmith,vogel}@cs.cmu.edu

Abstract

We propose a novel probabilistic synchronous context-free grammar formalism for statistical machine translation, in which syntactic nonterminal labels are represented as "soft" preferences rather than as "hard" matching constraints. This formalism allows us to efficiently score unlabeled synchronous derivations without forgoing traditional syntactic constraints. Using this score as a feature in a log-linear model, we are able to approximate the selection of the most likely unlabeled derivation. This helps reduce fragmentation of probability across differently labeled derivations of the same translation. It also allows the importance of syntactic preferences to be learned alongside other features (e.g., the language model) and for particular labeling procedures. We show improvements in translation quality on small- and medium-sized Chinese-to-English translation tasks.

1 Introduction

Probabilistic synchronous context-free grammars (PSCFGs) define weighted production rules that are automatically learned from parallel training data. As in classical CFGs, these rules make use of nonterminal symbols to generalize beyond lexical modeling of sentences. In MT, this permits translation and reordering to be conditioned on more abstract notions of context. For example, the rule

VP → ne VB1 pas # do not VB1

represents the discontiguous translation of the French words "ne" and "pas" to "do not", in the context of the labeled nonterminal symbol "VB" (representing the syntactic category "verb"). Translation with PSCFGs is typically expressed as the problem of finding the maximum-weighted derivation consistent with the source sentence, where the scores are defined (at least in part) by real-valued weights associated with the rules. A PSCFG derivation is a synchronous parse tree.

Defining the translation function as finding the best derivation has the unfortunate side effect of forcing differently-derived versions of the same target sentence to compete with each other. In other words, the true score of each translation is "fragmented" across many derivations, so that only each translation's most probable derivation matters. The more Bayesian approach of finding the most probable translation (integrating out the derivations) instantiates an NP-hard inference problem even for simple word-based models (Knight, 1999); for grammar-based translation it is known as the consensus problem (Casacuberta and de la Higuera, 2000; Sima'an, 2002). With weights interpreted as probabilities, the maximum-weighted derivation is the maximum a posteriori (MAP) derivation:

\hat{e} = \arg\max_{e} \max_{d} p(e, d \mid f)

where f is the source sentence, e ranges over target sentences, and d ranges over PSCFG derivations (synchronous trees). This is often described as an approximation to the most probable translation, \arg\max_{e} \sum_{d} p(e, d \mid f). In this paper, we describe a technique that aims to find the most probable equivalence class of unlabeled derivations, rather than a single labeled derivation, reducing the fragmentation problem. Solving this problem exactly is still an NP-hard consensus problem, but we provide approximations that build on well-known PSCFG decoding methods.
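As a toy illustration of this fragmentation effect (with made-up numbers, not data from the paper), the sketch below contrasts the two decision rules: a translation whose probability is split across several derivations can lose under the max-derivation rule even though it is the more probable string once derivations are summed.

```python
from collections import defaultdict

# Hypothetical derivation probabilities p(e, d | f) for one source sentence:
# the first translation is derivable three ways (e.g., under different
# nonterminal labelings), the second only one way.
derivations = [
    ("a place where I can eat", 0.20),
    ("a place where I can eat", 0.18),
    ("a place where I can eat", 0.17),
    ("a place where I can dish", 0.30),
]

# MAP derivation rule: argmax_e max_d p(e, d | f)
best_by_max = max(derivations, key=lambda ed: ed[1])[0]

# Most probable translation: argmax_e sum_d p(e, d | f)
totals = defaultdict(float)
for e, p in derivations:
    totals[e] += p
best_by_sum = max(totals, key=totals.get)

print(best_by_max)  # "a place where I can dish": wins on a single derivation
print(best_by_sum)  # "a place where I can eat": wins once derivations are summed
```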
Our model falls somewhere between PSCFGs that extract nonterminal symbols from parse trees and treat them as part of the derivation (Zollmann and Venugopal, 2006) and unlabeled hierarchical structures (Chiang, 2005); we treat nonterminal labels as random variables chosen at each node, with each (unlabeled) rule expressing "preferences" for particular nonterminal labels, learned from data.

The paper is organized as follows. In Section 2, we summarize the use of PSCFG grammars for translation. We describe our model (Section 3). Section 4 explains the preference-related calculations, and Section 5 addresses decoding. Experimental results using preference grammars in a log-linear translation model are presented for two standard Chinese-to-English tasks in Section 6. We review related work (Section 7) and conclude.

2 PSCFGs for Machine Translation

Probabilistic synchronous context-free grammars (PSCFGs) are defined by a source terminal set (source vocabulary) T_S, a target terminal set (target vocabulary) T_T, a shared nonterminal set N, and a set R of rules of the form X → ⟨γ, α, w⟩ where:

· X ∈ N is a labeled nonterminal referred to as the left-hand side of the rule.
· γ ∈ (N ∪ T_S)* is the source side of the rule.
· α ∈ (N ∪ T_T)* is the target side of the rule.
· w ∈ [0, ∞) is a nonnegative real-valued weight assigned to the rule.

For visual clarity, we will use the # character to separate the source side of the rule γ from the target side α. PSCFG rules also have an implied one-to-one mapping between nonterminal symbols in γ and nonterminal symbols in α. Chiang (2005), Zollmann and Venugopal (2006) and Galley et al. (2006) all use parameterizations of this PSCFG formalism.¹

Given a source sentence f and a PSCFG G, the translation task can be expressed similarly to monolingual parsing with a PCFG. We aim to find the most likely derivation d of the input source sentence and read off the English translation, identified by composing α from each rule used in the derivation. This search for the most likely translation under the MAP approximation can be defined as:

\hat{e} = \operatorname{tgt}\Big( \arg\max_{d \in D(G)\,:\,\operatorname{src}(d)=f} p(d) \Big)   (1)

where tgt(d) is the target-side yield of a derivation d, src(d) is its source-side yield, and D(G) is the set of G's derivations. Using an n-gram language model to score derivations and rule labels to constrain the rules that form derivations, we define p(d) as a log-linear model in terms of the rules r ∈ R used in d:

p(d) = \frac{1}{Z(\lambda)}\; p_{\text{LM}}(\operatorname{tgt}(d))^{\lambda_0} \times p_{\text{syn}}(d)^{\lambda_{m+1}} \times \prod_{i=1}^{m} p_i(d)^{\lambda_i}   (2)

p_i(d) = \prod_{r \in R} h_i(r)^{\operatorname{freq}(r;\,d)}

p_{\text{syn}}(d) = \begin{cases} 1 & \text{if } d \text{ respects label constraints} \\ 0 & \text{otherwise} \end{cases}   (3)

where λ = λ_0 ... λ_{m+1} are weights that reflect the relative importance of the features in the model. The features include the n-gram language model (LM) score of the target yield sequence, a collection of m rule feature functions h_i : R → R_{≥0}, and a "syntax" feature that (redundantly) requires every nonterminal token to be expanded by a rule with that nonterminal on its left-hand side. freq(r; d) denotes the frequency of the rule r in the derivation d. Note that λ_{m+1} can be effectively ignored when p_syn is defined as in Equation 3. Z(λ) is a normalization constant that does not need to be computed during search under the argmax criterion in Equation 1.

¹ Galley et al. (2006) rules are formally defined as tree transducers but have equivalent PSCFG forms.
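To make Equations 1-3 concrete, here is a small sketch that scores a complete derivation as an unnormalized log-linear combination of an LM score, rule features, and the hard psyn constraint of Equation 3. The rule representation, the uppercase-means-nonterminal convention, and the feature names are illustrative assumptions rather than the paper's implementation; a real decoder computes this score incrementally inside dynamic programming.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Rule:
    lhs: str        # left-hand-side nonterminal label X
    src: tuple      # source side gamma (terminals and nonterminals)
    tgt: tuple      # target side alpha
    features: dict = field(default_factory=dict)  # h_i(r) values, assumed positive

def respects_label_constraints(derivation):
    """Hard p_syn of Eq. 3: every nonterminal token must be expanded by a
    rule whose left-hand side carries that same label.  The derivation is a
    list of (parent_rule, child_rules) pairs (a toy representation)."""
    for parent_rule, child_rules in derivation:
        # toy convention: nonterminal symbols on the target side are uppercase
        child_labels = [s for s in parent_rule.tgt if s.isupper()]
        if any(c.lhs != lab for c, lab in zip(child_rules, child_labels)):
            return False
    return True

def unnormalized_log_score(derivation, lm_logprob, weights):
    """Log of Eq. 2 without Z(lambda); sufficient for the argmax in Eq. 1."""
    if not respects_label_constraints(derivation):
        return float("-inf")                    # p_syn(d) = 0 vetoes the derivation
    score = weights["lm"] * lm_logprob          # lambda_0 * log p_LM(tgt(d))
    for r, _ in derivation:                     # each rule token counted once,
        for name, value in r.features.items():  # which realizes freq(r; d)
            score += weights[name] * math.log(value)
    return score
```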
Feature weights are trained discriminatively in concert with the language model weight to maximize the BLEU (Papineni et al., 2002) automatic evaluation metric via Minimum Error Rate Training (MERT) (Och, 2003). We use the open-source PSCFG rule extraction framework and decoder of Zollmann et al. (2008) for our experiments. The asymptotic runtime of this decoder is:

O\big( |f|^3 \, \big[ |N| \, |T_T|^{2(n-1)} \big]^{K} \big)   (4)

where K is the maximum number of nonterminal symbols per rule, |f| is the source sentence length, and n is the order of the n-gram LM that is used to compute p_LM. This factor, a constant with respect to sentence length, arises from the dynamic programming item structure used to perform search under this model. Using notation from Chiang (2007), the corresponding item structure is:

[X, i, j, q(·)] : w   (5)

where X is the nonterminal label of a derivation, i, j define a span in the source sentence, and q(·) maintains the state required to compute p_LM. Under the MAP criterion we can discard derivations of lower weight that share this item structure, but in practice we often require additional lossy pruning to limit the number of items produced.

The Syntax-Augmented MT model of Zollmann and Venugopal (2006), for instance, produces a very large nonterminal set using "slash" labels (NP/NN: "the great") and "plus" labels (NP+VB: "she went") to assign syntactically motivated labels to rules whose target words do not correspond to constituents in phrase-structure parse trees. These labels lead to fragmentation of probability across many derivations for the same target sentence, worsening the impact of the MAP approximation. In this work we address the increased fragmentation resulting from rules with labeled nonterminals compared to unlabeled rules (Chiang, 2005).

3 Preference Grammars

We extend the PSCFG formalism to include soft "label preferences" for unlabeled rules; these correspond to the alternative labelings that have been encountered in training data for the unlabeled rule form. These preferences, estimated via relative frequency counts from rule occurrence data, are used to estimate the feature psyn(d), the probability that an unlabeled derivation can be generated under traditional syntactic constraints. In classic PSCFG, psyn(d) enforces a hard syntactic constraint (Equation 3). In our approach, label preferences influence the value of psyn(d).
3.1 Motivating example

Consider the following labeled Chinese-to-English PSCFG rules:

S    → ( ê ý VB1 # a place where I can VB1    (4)
S    → ( ê ý VP1 # a place where I can VP1    (3)
SBAR → ( ê ý VP1 # a place where I can VP1    (2)
FRAG → ( ê ý AUX1 # a place where I can AUX1  (1)
VB   → m # eat    (8)
VP   → m # eat    (1)
NP   → m # eat    (1)
NN   → m # dish   (10)

where the numbers in parentheses are the frequencies of each rule in the training corpus. In classical PSCFG we can think of the nonterminals mentioned in the rules as hard constraints on which rules can be used to expand a particular node; e.g., a VP can only be expanded by a VP rule. In Equation 2, psyn(d) explicitly enforces this hard constraint. Instead, we propose softening these constraints. In the rules below, labels are represented as soft preferences:

(10) X → ( ê ý X1 # a place where I can X1
     p(H0 = S,    H1 = VB  | r) = 0.4
     p(H0 = S,    H1 = VP  | r) = 0.3
     p(H0 = SBAR, H1 = VP  | r) = 0.2
     p(H0 = FRAG, H1 = AUX | r) = 0.1

(10) X → m # eat
     p(H0 = VB | r) = 0.8
     p(H0 = VP | r) = 0.1
     p(H0 = NP | r) = 0.1

(10) X → m # dish
     p(H0 = NN | r) = 1.0

Each unlabeled form of the rule has an associated distribution over labels for the nonterminals referenced in the rule; the labels are random variables H_i, with H_0 the left-hand-side label. These unlabeled rule forms are simply packed representations of the original labeled PSCFG rules. In addition to the usual features h_i(r) for each rule, estimated based on unlabeled rule frequencies, we now have label preference distributions. These are estimated as relative frequencies from the labelings of the base, unlabeled rule.

Our primary contribution is how we compute psyn(d), the probability that an unlabeled derivation adheres to traditional syntactic constraints, for derivations built from preference grammar rules. By using psyn(d) as a feature in the log-linear model, we allow the MERT framework to evaluate the importance of syntactic structure relative to other features.

The example rules above highlight the potential for psyn(d) to affect the choice of translation. The translation of the Chinese word sequence "( ê ý m" can be performed by expanding the nonterminal in the rule "a place where I can X1" with either "eat" or "dish." A hierarchical system (Chiang, 2005) would allow either expansion, relying on features like p_LM to select the best translation, since both expansions occurred the same number of times in the data. A richly labeled PSCFG as in Zollmann and Venugopal (2006) would immediately reject the rule generating "dish" due to hard label-matching constraints, but would produce three identical, competing derivations. Two of these derivations would produce S as a root symbol, while one derivation would produce SBAR. The two S-labeled derivations compete rather than reinforce the choice of the word "eat," which they both make. They will also compete for consideration by any decoder that prunes derivations to keep runtime down. The rule preferences indicate that VB and VP are both valid labels for the rule translating to "eat", and both of these labels are compatible with the arguments expected by "a place where I can X1". Alternatively, "dish" produces an NN label which is not compatible with the arguments of this higher-up rule. We design psyn(d) to reflect compatibility between two rules (one expanding a right-hand-side nonterminal in the other), based on label preference distributions.
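The preference distributions above are simply relative frequencies over the observed labelings of each unlabeled rule form. The following sketch reproduces them from the example counts; the dictionary-based rule representation and the abbreviated source sides ("src") are assumptions made for illustration.

```python
from collections import defaultdict

def estimate_preferences(labeled_rule_counts):
    """Group labeled rules by their unlabeled form and normalize the counts
    of each label vector h = (H0, H1, ...) into p_pref(h | r)."""
    totals = defaultdict(float)
    by_unlabeled = defaultdict(dict)
    for (unlabeled_rule, labels), count in labeled_rule_counts.items():
        totals[unlabeled_rule] += count
        by_unlabeled[unlabeled_rule][labels] = count
    return {rule: {labels: c / totals[rule] for labels, c in variants.items()}
            for rule, variants in by_unlabeled.items()}

# Counts from the motivating example; the Chinese source sides are abbreviated.
counts = {
    (("src X1", "a place where I can X1"), ("S", "VB")): 4,
    (("src X1", "a place where I can X1"), ("S", "VP")): 3,
    (("src X1", "a place where I can X1"), ("SBAR", "VP")): 2,
    (("src X1", "a place where I can X1"), ("FRAG", "AUX")): 1,
    (("src", "eat"), ("VB",)): 8,
    (("src", "eat"), ("VP",)): 1,
    (("src", "eat"), ("NP",)): 1,
    (("src", "dish"), ("NN",)): 10,
}

p_pref = estimate_preferences(counts)
# p_pref[("src X1", "a place where I can X1")] -> {("S","VB"): 0.4, ("S","VP"): 0.3, ...}
# p_pref[("src", "eat")] -> {("VB",): 0.8, ("VP",): 0.1, ("NP",): 0.1}
```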
3.2 Formal definition

Probabilistic synchronous context-free preference grammars are defined as PSCFGs with the following additional elements:

· H: a set of implicit labels, not to be confused with the explicit label set N.
· φ : H → N, a function that associates each implicit label with a single explicit label. We can therefore think of H symbols as refinements of the nonterminals in N (Matsuzaki et al., 2005).
· For each rule r, we define a probability distribution over vectors h of implicit label bindings for its nonterminals, denoted ppref(h | r). h includes bindings for the left-hand-side nonterminal (h_0) as well as each right-hand-side nonterminal (h_1, ..., h_|h|). Each h_i ∈ H.

When N and H are defined to include just a single generic symbol as in Chiang (2005), we produce the unlabeled grammar discussed above. In this work, we define

· N = {S, X}
· H = {NP, DT, NN, ...} = N_SAMT

where N corresponds to the generic labels of Chiang (2005), H corresponds to the syntactically motivated SAMT labels of Zollmann and Venugopal (2006), and φ maps all elements of H to X.

We will use hargs(r) to denote the set of all h = ⟨h_0, h_1, ..., h_k⟩ ∈ H^{k+1} that are valid bindings for the rule, i.e., those with nonzero preference probability. The preference distributions ppref from each rule used in d are used to compute psyn(d) as described next.

4 Computing the feature psyn(d)

Let us view a derivation d as a collection of nonterminal tokens n_j, j ∈ {1, ..., |d|}. Each n_j takes an explicit label in N. The score psyn(d) is a product, with one factor ψ_j per nonterminal token n_j in the derivation d:

p_{\text{syn}}(d) = \prod_{j=1}^{|d|} \psi_j   (6)

Each ψ_j factor considers the two rules that n_j participates in. We will refer to the rule above nonterminal token n_j as \bar{r}_j (the nonterminal is a child in this rule) and the rule that expands nonterminal token n_j as r_j. The intuition is that derivations in which these two rules agree (at each j) about the implicit label for n_j in H are preferable to derivations in which they do not. Rather than making a decision about the implicit label, we want to reward psyn when \bar{r}_j and r_j are consistent. Our way of measuring this consistency is an inner product of preference distributions:

\psi_j = \sum_{h \in H} p_{\text{pref}}(h \mid \bar{r}_j)\; p_{\text{pref}}(h \mid r_j)   (7)

This is not quite the whole story, because ppref(· | r) is defined as a joint distribution over all the implicit labels within a rule; the implicit labels are not independent of each other. Indeed, we want the implicit labels within each rule to be mutually consistent, i.e., to correspond to one of the rule's preferred labelings, for both hargs(\bar{r}_j) and hargs(r_j). Our approach to calculating psyn within the dynamic programming algorithm is to recursively calculate preferences for each chart item based on (a) the smaller items used to construct the item and (b) the rule that permits combination of the smaller items into the larger one.

We now describe how the preferences for chart items are calculated. Let a chart item be denoted [X, i, j, u, ...], where X ∈ N, i and j are positions in the source sentence, and u : {h ∈ H | φ(h) = X} → [0, 1] (where Σ_h u(h) = 1) denotes a distribution over possible X-refinement labels. We will refer to it below as the left-hand-side preference distribution. Additional information (such as language model state) may also be included; it is not relevant here.

The simplest case is a nonterminal token n_j whose rule has no nonterminal children. Here the left-hand-side preference distribution is simply given by u(h) = ppref(h | r_j), and we define the psyn factor to be ψ_j = 1.

Now consider the dynamic programming step of combining an already-built item [X, i, j, u, ...], rooted by explicit nonterminal X, spanning source sentence positions i to j, with left-hand-side preference distribution u, to build a larger item rooted by Y through a rule r = Y → ⟨γ X1 γ', α X1 α', w⟩ with preferences ppref(· | r).² The new item will have signature [Y, i − |γ|, j + |γ'|, v, ...]. The left-hand-side preferences v for the new item are calculated as follows:

v(h) = \frac{\tilde{v}(h)}{\sum_{h'} \tilde{v}(h')}, \qquad \text{where} \quad \tilde{v}(h) = \sum_{h' \in H\,:\,\langle h, h' \rangle \in \operatorname{hargs}(r)} p_{\text{pref}}(\langle h, h' \rangle \mid r) \times u(h')   (8)

Renormalizing keeps the preference vectors on the same scale as those in the rules. The psyn factor ψ, which is factored into the value of the new item, is calculated as:

\psi = \sum_{h' \in H\,:\,\exists h,\ \langle h, h' \rangle \in \operatorname{hargs}(r)} u(h')   (9)

so that the value considered for the new item is w × ψ × ···, where factors relating to p_LM, for example, may also be included.

Coming back to our example, if we let r_1 be the leaf rule producing "eat" at shared nonterminal n_1, we generate an item with

u = ⟨u(VB) = 0.8, u(VP) = 0.1, u(NP) = 0.1⟩,  ψ_1 = 1.

Combining this item with X → ( ê ý X1 # a place where I can X1 as r_2 at nonterminal n_2 generates a new target item with translation "a place where I can eat", ψ_2 = 0.9, and v as calculated in Figure 1. In contrast, ψ_2 = 0 for the derivation where r_1 is the leaf rule that produces "dish". This calculation can be seen as a kind of single-pass, bottom-up message-passing inference method embedded within the usual dynamic programming search.

ṽ(S) = ppref(⟨S, VB⟩ | r) u(VB) + ppref(⟨S, VP⟩ | r) u(VP) = (0.4 × 0.8) + (0.3 × 0.1) = 0.35
ṽ(SBAR) = ppref(⟨SBAR, VP⟩ | r) u(VP) = (0.2 × 0.1) = 0.02
v(S) = ṽ(S) / (ṽ(S) + ṽ(SBAR)) = 0.35 / 0.37,   v(SBAR) = ṽ(SBAR) / (ṽ(S) + ṽ(SBAR)) = 0.02 / 0.37
ψ_2 = u(VB) + u(VP) = 0.8 + 0.1 = 0.9

Figure 1: Calculating v and ψ_2 for the running example.

² We assume for the discussion that γ, γ' ∈ T_S* and α, α' ∈ T_T*. If there are multiple nonterminals on the right-hand side of the rule, we sum over the longer sequences in hargs(r) and include appropriate values from the additional "child" items' preference vectors in the product.
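The combination step of Equations 8 and 9 can be sketched compactly and checked against Figure 1. The code below assumes a single right-hand-side nonterminal and represents hargs(r) implicitly as the support of the rule's preference table; both are simplifications of the general case covered by footnote 2.

```python
def combine(u_child, ppref_rule):
    """Propagate a child item's left-hand-side preferences u through a rule
    whose preferences map (parent_label, child_label) -> probability.
    Returns the new item's normalized preferences v (Eq. 8) and the factor psi (Eq. 9)."""
    v_tilde = {}
    for (h_parent, h_child), p in ppref_rule.items():
        if u_child.get(h_child, 0.0) > 0.0:
            v_tilde[h_parent] = v_tilde.get(h_parent, 0.0) + p * u_child[h_child]
    # psi sums u over the distinct child labels that occur in hargs(r)
    compatible_children = {h_child for (_, h_child) in ppref_rule}
    psi = sum(u_child.get(h, 0.0) for h in compatible_children)
    total = sum(v_tilde.values())
    v = {h: val / total for h, val in v_tilde.items()} if total > 0.0 else {}
    return v, psi

# Running example: leaf item for "eat" combined with "a place where I can X1".
u_eat = {"VB": 0.8, "VP": 0.1, "NP": 0.1}
rule_prefs = {("S", "VB"): 0.4, ("S", "VP"): 0.3, ("SBAR", "VP"): 0.2, ("FRAG", "AUX"): 0.1}
v, psi = combine(u_eat, rule_prefs)
# v   -> {"S": 0.35/0.37, "SBAR": 0.02/0.37}    psi -> 0.9, as in Figure 1
# For the "dish" item, u = {"NN": 1.0} gives psi = 0, so the derivation scores zero.
```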
5 Decoding Approximations

As defined above, accurately computing psyn(d) requires extending the chart item structure with u. For models that use the n-gram LM feature, the item structure would be:

[X, i, j, q(·), u] : w   (10)

Since u effectively summarizes the choice of rules in a derivation, this extension would partition the search space further. To prevent this partitioning, we follow the approach of Venugopal et al. (2007). We keep track of u for the best-performing derivation from the set of derivations that share [X, i, j, q(·)] in a first-pass decoding. In a second, top-down pass similar to Huang and Chiang (2007), we can recalculate psyn(d) for alternative derivations in the hypergraph, potentially correcting search errors made in the first pass.

We face another significant practical challenge during decoding. In real data conditions, the size of the preference vector for a single rule can be very large, especially for rules that include multiple nonterminal symbols located on the left and right boundaries of α. For example, the Chinese-to-English rule X → X1 的 X2 # X1 's X2 has over 24K elements in hargs(r) when learned for the medium-sized NIST task used below. In order to limit the explosive growth of nonterminals during decoding, for both memory and runtime reasons, we define the following label pruning parameters (a short sketch of these truncations follows the list):

· R: limits the size of hargs(r) to the R top-scoring preferences, defaulting all other values to zero.
· L: the same as R, but applied only to rules with no nonterminals. The stricter of L and R is applied if both thresholds apply.
· P: limits the number of labels in item preference vectors (Equation 8) to the P most likely labels during decoding, defaulting all other preferences to zero.
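A minimal sketch of how the R, L, and P limits might truncate preference tables, assuming they are stored as label-to-probability dictionaries. Whether the surviving mass is renormalized after truncation is not specified here, so this sketch leaves it unnormalized.

```python
def truncate_top_k(prefs, k):
    """Keep only the k highest-probability entries of a preference table,
    zeroing (dropping) the rest; no renormalization is applied here."""
    top = sorted(prefs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)

def prune_rule_preferences(rule_prefs, has_nonterminals, R, L):
    """Apply the R limit to hargs(r); for rules with no nonterminals the L limit
    also applies, and the stricter of the two wins."""
    k = R if has_nonterminals else min(R, L)
    return truncate_top_k(rule_prefs, k)

def prune_item_preferences(u, P):
    """Apply the P limit to an item's left-hand-side preference vector u."""
    return truncate_top_k(u, P)

# Example with the running example's rule preferences and P = 2:
rule_prefs = {("S", "VB"): 0.4, ("S", "VP"): 0.3, ("SBAR", "VP"): 0.2, ("FRAG", "AUX"): 0.1}
print(prune_rule_preferences(rule_prefs, has_nonterminals=True, R=2, L=5))
# {('S', 'VB'): 0.4, ('S', 'VP'): 0.3}
print(prune_item_preferences({"VB": 0.8, "VP": 0.1, "NP": 0.1}, P=2))
# {'VB': 0.8, 'VP': 0.1}
```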
6 Empirical Results

We evaluate our preference grammar model on small (IWSLT) and medium (NIST) data Chinese-to-English translation tasks (described in Table 1). IWSLT is a limited-domain, limited-resource task (Paul, 2006), while NIST is a broadcast news task with wide genre and domain coverage. We use a subset of the full training data (67M words of English text) from the annual NIST MT Evaluation. Development corpora are used to train model parameters via MERT. We use a variant of MERT that prefers sparse solutions, where λ_i = 0 for as many features as possible: at each MERT iteration, a subset of features are assigned zero weight and optimization is repeated; if the resulting BLEU score is not lower, these features are left at zero.

System Name | Words in Target Text | LM singleton 1-n-grams (n) | Dev.    | Test
IWSLT       | 632K                 | 431K (5)                   | IWSLT06 | IWSLT07,08
NIST        | 67M                  | 102M (4)                   | MT05    | MT06

Table 1: Training data configurations used to evaluate preference grammars. The number of words in the target text and the number of singleton 1-n-grams represented in the complete model are the defining statistics that characterize the scale of each task. For each LM we also indicate the order n of the n-gram model.

All systems are built on the SAMT framework described in Zollmann et al. (2008), using a trigram LM during search and the full-order LM during a second hypergraph rescoring pass. Reordering limits are set to 10 words for all systems. Pruning parameters during decoding limit the number of derivations at each source span to 300. The system "Hier." uses a grammar with a single nonterminal label as in Chiang (2005). The system "Syntax" applies the grammar from Zollmann and Venugopal (2006), which generates a large number of syntactically motivated nonterminal labels. For the NIST task, rare rules are discarded based on their frequency in the training data: purely lexical rules (those containing no nonterminal symbols) that occur fewer than 2 times, and non-lexical rules that occur fewer than 4 times, are discarded.

IWSLT task: We evaluate the preference grammar system "Pref." with parameters R = 100, L = 5, P = 2. Results comparing Pref. to Hier. and Syntax are shown in Table 2. Automatic evaluation results using the preference grammar translation model are positive. The preference grammar system shows improvements over both the Hier. and Syntax systems on both unseen evaluation sets, IWSLT 2007 and 2008. The improvements are clearest on the BLEU metric (matching the MERT training criterion). On the 2007 test data, Pref. shows a 1.2-point improvement over Hier., while on the 2008 data there is a 0.6-point improvement.

System | Dev BLEU↑ (lpen) | 2007 BLEU↑ (lpen) | 2008 BLEU↑ (lpen) | 2008 WER↓ | 2008 PER↓ | 2008 MET.↑ | 2008 GTM↑
Hier.  | 28.0 (0.89)      | 37.0 (0.89)       | 45.9 (0.91)       | 44.5      | 39.9      | 61.8       | 70.7
Syntax | 30.9 (0.96)      | 35.5 (0.94)       | 45.3 (0.95)       | 45.7      | 40.4      | 62.1       | 71.5
Pref.  | 28.3 (0.88)      | 38.2 (0.90)       | 46.3 (0.91)       | 43.8      | 40.0      | 61.7       | 71.2

Table 2: Translation quality metrics on the IWSLT translation task, with IWSLT 2006 as the development corpus and IWSLT 2007 and 2008 as test corpora. Each metric is annotated with ↑ if increases in the metric value correspond to increases in translation quality and with ↓ if the opposite is true. We also list length penalties (lpen) for the BLEU metric to show that improvements are not due to length optimization alone.
For the IWSLT task, we report additional automatic evaluation metrics that generally rank the Pref. system higher than Hier. and Syntax. As a further confirmation, our feature-selection-based MERT chooses to retain λ_{m+1} in the model. While the IWSLT results are promising, we perform a more complete evaluation on the NIST translation task.

NIST task: This task generates much larger rule preference vectors than the IWSLT task, simply due to the size of the training corpora. We build systems with both R = 100 and R = 10, varying P. Varying P isolates the relative impact of propagating alternative nonterminal labels within the preference grammar model. L = 5 for all NIST systems. Parameters λ are trained via MERT on the R = 100, L = 5, P = 2 system. BLEU scores for each preference grammar and baseline system are shown in Table 3, along with translation times on the test corpus. We also report length penalties to show that improvements are not simply due to better tuning of output length.

The preference grammar systems outperform the Hier. baseline by 0.5 points on development data, and by up to 0.8 points on unseen test data. While systems with R = 100 take significantly longer to translate the test data than Hier., setting R = 10 takes approximately as long as the Syntax system but produces slightly better results (0.3 points).

The improvements in translation quality with the preference grammar are encouraging, but how much of this improvement can simply be attributed to MERT finding a better local optimum for the parameters λ? To answer this question, we use the parameters optimized by MERT for the preference grammar system to run a purely hierarchical system, denoted Hier.(λ), which ignores the value of λ_{m+1} during decoding. While almost half of the improvement comes from the better parameters learned via MERT for the preference grammar systems, 0.5 points can still be attributed purely to the feature psyn. In addition, MERT does not set the parameter λ_{m+1} to 0, corroborating the value of the psyn feature. Note that Hier.(λ) achieves better scores than the Hier. system, which was trained via MERT without psyn. This highlights the local nature of MERT parameter search, but also points to the possibility that training with the feature psyn produced a more diverse derivation space, resulting in better parameters λ.

We see a very small improvement (0.1 point) from allowing the runtime propagation of more than one nonterminal label in the left-hand-side posterior distribution, but the improvement does not extend to P = 5. Improved integration of the feature psyn(d) into decoding might help to widen this gap.

System    | Dev. BLEU (lpen) | Test BLEU (lpen) | Test time (h:mm)
Baseline Systems
Hier.     | 34.1 (0.99)      | 31.8 (0.95)      | 0:12
Syntax    | 34.7 (0.99)      | 32.3 (0.95)      | 0:45
Hier.(λ)  |                  | 32.1 (0.95)      | 0:12
Preference Grammar: R = 100
P = 1     |                  | 32.5 (0.96)      | 3:00
P = 2     | 34.6 (0.99)      | 32.6 (0.95)      | 3:00
P = 5     |                  | 32.5 (0.95)      | 3:20
Preference Grammar: R = 10
P = 1     |                  | 32.5 (0.95)      | 1:03
P = 2     |                  | 32.6 (0.95)      | 1:10
P = 5     |                  | 32.5 (0.95)      | 1:10

Table 3: Translation quality and test set translation time (using 50 machines with 2 tasks per machine) measured by the BLEU metric for the NIST task. NIST 2006 is used as the development (Dev.) corpus and NIST 2007 is used as the unseen evaluation corpus (Test). Dev. scores are reported for systems that have been separately MERT trained; the Pref. systems share parameters from a single MERT training. Systems are described in the text.
7 Related Work

There have been significant efforts in both the monolingual parsing and machine translation literature to address the impact of the MAP approximation and the choice of labels in their respective models; we survey the work most closely related to our approach. May and Knight (2006) extract n-best lists containing unique translations rather than unique derivations, while Kumar and Byrne (2004) use the minimum Bayes risk decision rule to select the lowest-risk (highest BLEU score) translation, rather than derivation, from an n-best list. Tromble et al. (2008) extend this work to lattice structures. All of these approaches only marginalize over alternative candidate derivations generated by a MAP-driven decoding process. More recently, Blunsom et al. (2007) propose a purely discriminative model whose decoding step approximates the selection of the most likely translation via beam search.

Matsuzaki et al. (2005) and Petrov et al. (2006) propose automatically learning annotations that add information to categories to improve monolingual parsing quality. Since the parsing task requires selecting the most likely non-annotated tree, the annotations add an additional level of structure that must be marginalized during search. They demonstrate improvements in parse quality only when a variational approximation is used to select the most likely unannotated tree, rather than simply stripping annotations from the MAP annotated tree. In our work, we focused on approximating the selection of the most likely unlabeled derivation during search, rather than as a post-processing operation; the methods described above might improve this approximation, at some computational expense.

8 Conclusions and Future Work

We have proposed a novel grammar formalism that replaces hard syntactic constraints with "soft" preferences. These preferences are used to compute a machine translation feature, psyn(d), that scores unlabeled derivations while taking into account traditional syntactic constraints. Representing syntactic constraints as a feature allows MERT to train the corresponding weight for this feature relative to the others in the model, allowing systems to learn the relative importance of labels for particular resource and language scenarios as well as for alternative approaches to labeling PSCFG rules. This approach takes a step toward addressing the fragmentation problems of decoding based on maximum-weighted derivations, by summing the contributions of compatible label configurations rather than forcing them to compete. We have suggested an efficient technique to approximate psyn(d) that takes advantage of a natural factoring of derivation scores. Our approach results in improvements in translation quality on small and medium resource translation tasks. In future work we plan to focus on methods to improve the integration of the psyn(d) feature during decoding and on techniques that allow us to consider more of the search space through less pruning.

Acknowledgements

We appreciate helpful comments from three anonymous reviewers. Venugopal and Zollmann were supported by a Google Research Award. Smith was supported by NSF grant IIS-0836431.

References

Phil Blunsom, Trevor Cohn, and Miles Osborne. 2007. A discriminative latent variable model for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Francisco Casacuberta and Colin de la Higuera. 2000. Computational complexity of problems on probabilistic grammars and transducers. In Proceedings of the 5th International Colloquium on Grammatical Inference: Algorithms and Applications.
David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Sydney, Australia.

Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with integrated language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Kevin Knight. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, Squibs and Discussion.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL), Boston, MA, May 27-June 1.

Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jonathan May and Kevin Knight. 2006. A better N-best list: Practical determinization of weighted finite tree automata. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

Franz J. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Michael Paul. 2006. Overview of the IWSLT 2006 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Khalil Sima'an. 2002. Computational complexity of probabilistic disambiguation. Grammars, 5(2):125-151.

Roy Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ashish Venugopal, Andreas Zollmann, and Stephan Vogel. 2007. An efficient two-pass approach to synchronous-CFG driven statistical MT. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation, HLT/NAACL, New York, June.

Andreas Zollmann, Ashish Venugopal, Franz J. Och, and Jay Ponte. 2008. A systematic comparison of phrase-based, hierarchical and syntax-augmented statistical MT. In Proceedings of the Conference on Computational Linguistics (COLING).