The Complexity of Phrase Alignment Problems
John DeNero and Dan Klein Computer Science Division, EECS Department University of California at Berkeley {denero, klein}@cs.berkeley.edu

Abstract
Many phrase alignment models operate over the combinatorial space of bijective phrase alignments. We prove that finding an optimal alignment in this space is NP-hard, while computing alignment expectations is #P-hard. On the other hand, we show that the problem of finding an optimal alignment can be cast as an integer linear program, which provides a simple, declarative approach to Viterbi inference for phrase alignment models that is empirically quite efficient.

most problem instances. In an experiment intended to illustrate the practicality of the ILP approach, we show speed and search accuracy results for aligning phrases under a standard phrase translation model.

2 Phrase Alignment Problems
Rather than focus on a particular model, we describe four problems that arise in training phrase alignment models. 2.1 Weighted Sentence Pairs

1

Introduction

Learning in phrase alignment models generally requires computing either Viterbi phrase alignments or expectations of alignment links. For some restricted combinatorial spaces of alignments--those that arise in ITG-based phrase models (Cherry and Lin, 2007) or local distortion models (Zens et al., 2004)--inference can be accomplished using polynomial time dynamic programs. However, for more permissive models such as Marcu and Wong (2002) and DeNero et al. (2006), which operate over the full space of bijective phrase alignments (see below), no polynomial time algorithms for exact inference have been exhibited. Indeed, Marcu and Wong (2002) conjectures that none exist. In this paper, we show that Viterbi inference in this full space is NP-hard, while computing expectations is #P-hard. On the other hand, we give a compact formulation of Viterbi inference as an integer linear program (ILP). Using this formulation, exact solutions to the Viterbi search problem can be found by highly optimized, general purpose ILP solvers. While ILP is of course also NP-hard, we show that, empirically, exact solutions are found very quickly for 25

A sentence pair consists of two word sequences, e and f. A set of phrases {eij } contains all spans eij from between-word positions i to j of e. A link is an aligned pair of phrases, denoted (eij , fkl ).1 Let a weighted sentence pair additionally include a real-valued function  : {eij } ╫ {fkl }  R, which scores links. (eij , fkl ) can be sentence-specific, for example encoding the product of a translation model and a distortion model for (eij , fkl ). We impose no additional restrictions on  for our analysis. 2.2 Bijective Phrase Alignments

An alignment is a set of links. Given a weighted sentence pair, we will consider the space of bijective phrase alignments A : those a  {eij } ╫ {fkl } that use each word token in exactly one link. We first define the notion of a partition: iSi = T means Si are pairwise disjoint and cover T . Then, we can formally define the set of bijective phrase alignments:   (  ( A= a: eij = e ; fkl = f  
eij ,fkl )a eij ,fkl )a
1 As in parsing, the position between each word is assigned an index, where 0 is to the left of the first word. In this paper, we assume all phrases have length at least one: j > i and l > k.

Proceedings of ACL-08: HLT, Short Papers (Companion Volume), pages 25н28, Columbus, Ohio, USA, June 2008. c 2008 Association for Computational Linguistics


Both the conditional model of DeNero et al. (2006) and the joint model of Marcu and Wong (2002) operate in A , as does the phrase-based decoding framework of Koehn et al. (2003). 2.3 Problem Definitions For a weighted sentence pair (e, f, ), let the score of an alignment be the product of its link scores: ( (eij , fkl ). (a) =
eij ,fkl )a

3.1

Reducing Satisfiability to D

A reduction proof of NP-completeness gives a construction by which a known NP-complete problem can be solved via a newly proposed problem. From a SAT instance, we construct a weighted sentence pair for which alignments with positive score correspond exactly to the SAT solutions. Since SAT is NPcomplete and our construction requires only polynomial time, we conclude that D is NP-complete.2 SAT: Given vectors of boolean variables v = (v ) and propositional clauses3 C = (C ), decide whether there exists an assignment to v that simultaneously satisfies each clause in C. For a SAT instance (v, C), we construct f to contain one word for each clause, and e to contain several copies of the literals that appear in those clauses.  scores only alignments from clauses to literals that satisfy the clauses. The crux of the construction lies in ensuring that no variable is assigned both true and false. The details of constructing such a weighted sentence pair wsp(v, C) = (e, f, ), described below, are also depicted in figure 1. 1. f contains a word for each C , followed by an assignment word for each variable, assign(v ). 2. e contains c( ) consecutive words for each lita eral , where c( ) is the number of times that ppears in the clauses. Then, we set (╖, ╖) = 0 everywhere except: 3. For all clauses C and each satisfying literal , and each one-word phrase e in e containing , (e, fC ) = 1. fC is the one-word phrase containing C in f. 4. The assign(v ) words in f align to longer phrases of literals and serve to consistently assign each variable by using up inconsistent literals. They also align to unused literals to yield a bijection. Let ek ] be the phrase in e containing all literals [ and k negations of . fassign(v) is the one-word phrase for assign(v ). Then, (ek ] , fassign(v) ) = [ 1 for  {v , v } and all applicable k . п
2 Note that D is trivially in NP: given an alignment a, it is easy to determine whether or not (a)  1. 3 A clause is a disjunction of literals. A literal is a bare variable vn or its negation vn . For instance, v2  v7  v9 is a clause. п пп

Four related problems involving scored alignments arise when training phrase alignment models. O P T I M I Z AT I O N, O: Given (e, f, ), find the highest scoring alignment a. D E C I S I O N, D: Given (e, f, ), decide if there is an alignment a with (a)  1. O arises in the popular Viterbi approximation to EM (Hard EM) that assumes probability mass is concentrated at the mode of the posterior distribution over alignments. D is the corresponding decision problem for O, useful in analysis. E X P E C TAT I O N, E : Given a weighted sentence pair a (a) (e, f, ) and indices i, j, k , l, compute over all a  A such that (eij , fkl )  a. a S U M, S : Given (e, f, ), compute A (a). E arises in computing sufficient statistics for re-estimating phrase translation probabilities (Estep) when training models. The existence of a polynomial time algorithm for E implies a polynomial time algorithm for S , because A = |e | |f|-1 |f| j =1 k=0 l=k+1 {a : (e0j , fkl )  a, a  A } .

3

Complexity of Inference in A

For the space A of bijective alignments, problems E and O have long been suspected of being NP-hard, first asserted but not proven in Marcu and Wong (2002). We give a novel proof that O is NP-hard, showing that D is NP-complete by reduction from SAT, the boolean satisfiability problem. This result holds despite the fact that the related problem of finding an optimal matching in a weighted bipartite graph (the A S S I G N M E N T problem) is polynomialtime solvable using the Hungarian algorithm. 26


v1 v1 v1 v1 v2 v2 v2 v2 v3 v3 v3 v3 пп пп ппп

v1 v1 v1 v1 v2 v2 v2 v2 v3 v3 v3 v3 пп пп ппп

v1  v2  v3 v1  v2  v3 п п v1  v2  v3 п п п v1  v2  v3 п п assign(v1 ) assign(v2 ) assign(v3 ) (a) (b) (c) v1 is true v2 is false v3 is false (d)

Figure 1: (a) The clauses of an example SAT instance with v = (v1 , v2 , v3 ). (b) The weighted sentence pair wsp(v, C) constructed from the SAT instance. All links that have  = 1 are marked with a blue horizontal stripe. Stripes in the last three rows demarcate the alignment options for each assign(vn ), which consume all words for some literal. (c) A bijective alignment with score 1. (d) The corresponding satisfying assignment for the original SAT instance.

Claim 1. If wsp(v, C) has an alignment a with (a)  1, then (v, C) is satisfiable. Proof. The score implies that f aligns using all oneword phrases and ai  a, (ai ) = 1. By condition 4, each fassign(v) aligns to all v or all v in e. Then, п assign each v to true if fassign(v) aligns to all v , and п false otherwise. By condition 3, each C must align to a satisfying literal, while condition 4 assures that all available literals are consistent with this assignment to v, which therefore satisfies C. Claim 2. If (v, C) is satisfiable, then wsp(v, C) has an alignment a with (a) = 1. Proof. We construct such an alignment a from the satisfying assignment v. For each C , we choose a satisfying literal consistent with the assignment. Align fC to the first available token in e if the corresponding v is true, or the last if v is false. Align each fassign(v) to all remaining literals for v . Claims 1 and 2 together show that D is NPcomplete, and therefore that O is NP-hard. 3.2 Reducing Perfect Matching to S With another construction, we can show that S is #Phard, meaning that it is at least as hard as any #Pcomplete problem. #P is a class of counting problems related to NP, and #P-hard problems are NPhard as well. C O U N T I N G P E R F E C T M AT C H I N G S , C P M Given a bipartite graph G with 2n vertices, count the number of matchings of size n. 27

For a bipartite graph G with edge set E = {(vj , vl )}, we construct e and f with n words each, and set (ej -1 j , fl-1 l ) = 1 and 0 otherwise. The number of perfect matchings in G is the sum S for this weighted sentence pair. C P M is #P-complete (Valiant, 1979), so S (and hence E ) is #P-hard.

4

Solving the Optimization Problem

Although O is NP-hard, we present an approach to solving it using integer linear programming (ILP). 4.1 Previous Inference Approaches

Marcu and Wong (2002) describes an approximation to O . Given a weighted sentence pair, high scoring phrases are linked together greedily to reach an initial alignment. Then, local operators are applied to hill-climb A in search of the maximum a. This procedure also approximates E by collecting weighted counts as the space is traversed. DeNero et al. (2006) instead proposes an exponential-time dynamic program to systematically explore A , which can in principle solve either O or E . In practice, however, the space of alignments has to be pruned severely using word alignments to control the running time of EM. Notably, neither of these inference approaches offers any test to know if the optimal alignment is ever found. Furthermore, they both require small data sets due to computational expense. 4.2 Alignment via an Integer Program We cast O as an ILP problem, for which many optimization techniques are well known. First, we in-


troduce binary indicator variables ai,j,k,l denoting whether (eij , fkl )  a. Furthermore, we introduce binary indicators ei,j and fk,l that denote whether some (eij , ╖) or (╖, fkl ) appears in a, respectively. Finally, we represent the weight function  as a weight vector in the program: wi,j,k,l = log (eij , fkl ). Now, we can express an integer program that, when optimized, will yield the optimal alignment of our weighted sentence pair. i max wi,j,k,l ╖ ai,j,k,l
,j,k,l

Sentences per hour on a four-core server Frequency of optimal solutions found Frequency of -optimal solutions found

20,000 93.4% 99.2%

Table 1: The solver, tuned for speed, regularly reports solutions that are within 10-5 of optimal.

s.t.

i
,j :i<xj

ei,j = 1 fk,l = 1
,l:k<y l

x : 1  x  |e| (1) y : 1  y  |f | (2)

Using an off-the-shelf ILP solver,4 we were able to quickly and reliably find the globally optimal phrase alignment under (eij , fkl ) derived from the Moses pipeline (Koehn et al., 2007).5 Table 1 shows that finding the optimal phrase alignment is accurate and efficient.6 Hence, this simple search technique effectively addresses the intractability challenges inherent in evaluating new phrase alignment ideas.

k ei,j = fk,l = k

References
ai,j,k,l i, j k , l (3) (4)
Colin Cherry and Dekang Lin. 2007. Inversion transduction grammar for joint phrasal translation modeling. In NAACL-HLT Workshop on Syntax and Structure in Statistical Translation. John DeNero, Dan Gillick, James Zhang, and Dan Klein. 2006. Why generative phrase models underperform surface heuristics. In NAACL Workshop on Statistical Machine Translation. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In HLTNAACL. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL. ^┤ Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In ACL. Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In EMNLP. Leslie G. Valiant. 1979. The complexity of computing the permanent. In Theoretical Computer Science 8. Richard Zens, Hermann Ney, Taro Watanabeand, and E. Sumita. 2004. Reordering constraints for phrase based statistical machine translation. In Coling.
We used Mosek: www.mosek.com. (eij , fkl ) was estimated using the relative frequency of phrases extracted by the default Moses training script. We evaluated on English-Spanish Europarl, sentences up to length 25. 6 ILP solvers include many parameters that trade off speed for accuracy. Substantial speed gains also follow from explicitly pruning the values of ILP variables based on prior information.
5 4

,l

i
,j

ai,j,k,l

with the following constraints on index variables: 0  i < |e|, 0 < j  |e|, i < j 0  k < |f |, 0 < l  |f |, k < l . The objective function is log (a) for a implied by {ai,j,k,l = 1}. Constraint equation 1 ensures that the English phrases form a partition of e н each word in e appears in exactly one phrase н as does equation 2 for f. Constraint equation 3 ensures that each phrase in the chosen partition of e appears in exactly one link, and that phrases not in the partition are not aligned (and likewise constraint 4 for f).

5

Applications

The need to find an optimal phrase alignment for a weighted sentence pair arises in at least two applications. First, a generative phrase alignment model can be trained with Viterbi EM by finding optimal phrase alignments of a training corpus (approximate E-step), then re-estimating phrase translation parameters from those alignments (M-step). Second, this is an algorithm for forced decoding: finding the optimal phrase-based derivation of a particular target sentence. Forced decoding arises in online discriminative training, where model updates are made toward the most likely derivation of a gold translation (Liang et al., 2006). 28