A Clustered Global Phrase Reordering Model for Statistical Machine Translation
Masaaki Nagata
NTT Communication Science Laboratories
2-4 Hikaridai, Seika-cho, Souraku-gun
Kyoto, 619-0237 Japan

Kuniko Saito
NTT Cyber Space Laboratories
1-1 Hikarinooka, Yokoshuka-shi
Kanagawa, 239-0847 Japan

nagata.masaaki@labs.ntt.co.jp, saito.kuniko@labs.ntt.co.jp

Kazuhide Yamamoto, Kazuteru Ohashi
Nagaoka University of Technology
1603-1, Kamitomioka, Nagaoka City
Niigata, 940-2188 Japan
ykaz@nlp.nagaokaut.ac.jp, ohashi@nlp.nagaokaut.ac.jp

Abstract
In this paper, we present a novel global reordering model that can be incorporated into standard phrase-based statistical machine translation. Unlike previous local reordering models that emphasize the reordering of adjacent phrase pairs (Tillmann and Zhang, 2005), our model explicitly models long-distance reordering by directly estimating its parameters from the phrase alignments of bilingual training sentences. In principle, the global phrase reordering model is conditioned on the source and target phrases that are currently being translated, and on the previously translated source and target phrases. To cope with sparseness, we use N-best phrase alignments and bilingual phrase clustering, and investigate a variety of combinations of conditioning factors. Through experiments, we show that the global reordering model significantly improves the translation accuracy of a standard Japanese-English translation task.

1 Introduction
Global reordering is essential to the translation of languages with different word orders. Ideally, a model should allow reordering over any distance, because if we are to translate from Japanese to English, the verb at the end of the Japanese sentence must be moved to a position just after the subject, near the beginning of the English sentence.

Standard phrase-based translation systems use a word distance-based reordering model in which non-monotonic phrase alignment is penalized based on the word distance between successively translated source phrases, without considering the orientation of the phrase alignment or the identities of the source and target phrases (Koehn et al., 2003; Och and Ney, 2004).

(Tillmann and Zhang, 2005) introduced the notion of a block (a pair of source and target phrases that are translations of each other), and proposed the block orientation bigram, in which the local reordering of adjacent blocks is expressed as a three-valued orientation, namely Right (monotone), Left (swapped), or Neutral. A block with neutral orientation is supposed to be less strongly linked to its predecessor block; thus, in their model, global reordering is not explicitly modeled.

In this paper, we present a global reordering model that explicitly models long-distance reordering¹. It predicts four types of reordering patterns, namely MA (monotone adjacent), MG (monotone gap), RA (reverse adjacent), and RG (reverse gap). These are based on the identities of the source and target phrases currently being translated, and the previously translated source and target phrases. The parameters of the reordering model are estimated from the phrase alignments of training bilingual sentences. To cope with sparseness, we use N-best phrase alignments and bilingual phrase clustering.

In the following sections, we first describe the global phrase reordering model and its parameter estimation method, including N-best phrase alignments and bilingual phrase clustering. Next, through an experiment, we show that the global phrase reordering model significantly improves the translation accuracy of the IWSLT-2005 Japanese-English translation task (Eck and Hori, 2005).

¹ It might be misleading to call our reordering model "global" since it considers at most two phrases. A truly global reordering model would take the entire sentence structure into account.

* Graduated in March 2006

Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 713–720, Sydney, July 2006. © 2006 Association for Computational Linguistics

2 Baseline Translation Model
In statistical machine translation, the translation of a source (foreign) sentence $f$ is formulated as the search for a target (English) sentence $\hat{e}$ that maximizes the conditional probability $p(e \mid f)$, which can be rewritten using the Bayes rule as,

$$\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(f \mid e)\, p(e)$$

where $p(f \mid e)$ is a translation model and $p(e)$ is a target language model. In phrase-based statistical machine translation, the source sentence $f$ is segmented into a sequence of $I$ phrases $\bar{f}_1^I$, and each source phrase $\bar{f}_i$ is translated into a target phrase $\bar{e}_i$. Target phrases may be reordered. The translation model used in (Koehn et al., 2003) is the product of translation probability $\phi(\bar{f}_i \mid \bar{e}_i)$ and distortion probability $d(a_i - b_{i-1})$,

$$p(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(a_i - b_{i-1}) \tag{1}$$

where $a_i$ denotes the start position of the source phrase translated into the $i$-th target phrase, and $b_{i-1}$ denotes the end position of the source phrase translated into the $(i-1)$-th target phrase. The translation probability is calculated from the relative frequency as,

$$\phi(\bar{f} \mid \bar{e}) = \frac{count(\bar{f}, \bar{e})}{\sum_{\bar{f}'} count(\bar{f}', \bar{e})} \tag{2}$$

where $count(\bar{f}, \bar{e})$ is the frequency of alignments between the source phrase $\bar{f}$ and the target phrase $\bar{e}$. (Koehn et al., 2003) used the following distortion model, which simply penalizes non-monotonic phrase alignments based on the word distance of successively translated source phrases with an appropriate value for the parameter $\alpha$,

$$d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|} \tag{3}$$

3 The Global Phrase Reordering Model
Figure 1 shows an example of Japanese-English phrase alignment that consists of four phrase pairs. Note that the Japanese verb phrase at the end of the sentence is aligned to the English verb "is" at the beginning of the sentence just after the subject. Such reordering is typical in Japanese-English translations.

Figure 1: Phrase alignment and reordering

Figure 2: Four types of reordering patterns (panels d=MA, d=MG, d=RA, and d=RG; each panel plots the source phrases $\bar{f}_{i-1}$, $\bar{f}_i$ against the target phrases $\bar{e}_{i-1}$, $\bar{e}_i$ of the blocks $b_{i-1}$, $b_i$)

Motivated by the three-valued orientation for local reordering in (Tillmann and Zhang, 2005), we define the following four types of reordering patterns, as shown in Figure 2:

monotone adjacent (MA): The two source phrases are adjacent, and are in the same order as the two target phrases.
monotone gap (MG): The two source phrases are not adjacent, but are in the same order as the two target phrases.
reverse adjacent (RA): The two source phrases are adjacent, but are in the reverse order of the two target phrases.

reverse gap (RG): The two source phrases are not adjacent, and are in the reverse order of the two target phrases.

For the global reordering model, we only consider the cases in which the two target phrases are adjacent because, in decoding, the target sentence is generated from left to right and phrase by phrase. If we are to generate the $i$-th target phrase $\bar{e}_i$ from the source phrase $\bar{f}_i$, we call $\bar{e}_i$ and $\bar{f}_i$ the current block $b_i$, and $\bar{e}_{i-1}$ and $\bar{f}_{i-1}$ the previous block $b_{i-1}$.

Table 1: Percentage of reordering patterns

                    J-to-E   C-to-E
Monotone Adjacent   0.441    0.828
Monotone Gap        0.281    0.106
Reverse Adjacent    0.206    0.033
Reverse Gap         0.072    0.033

Table 1 shows the percentage of each reordering pattern that appeared in the N-best phrase alignments of the training bilingual sentences for the IWSLT 2005 Japanese-English and Chinese-English translation tasks (Eck and Hori, 2005). Since non-local reorderings such as monotone gap and reverse gap are more frequent in Japanese-to-English translation, they are worth modeling explicitly in this reordering model. Since the probability of reordering pattern $d$ (intended to stand for "distortion") is conditioned on the current and previous blocks, the global phrase reordering model is formalized as follows:

$$p(d \mid \bar{e}_{i-1}, \bar{e}_i, \bar{f}_{i-1}, \bar{f}_i) \tag{4}$$

We can replace the conventional word distance-based distortion probability in Equation (1) with the global phrase reordering model in Equation (4) with minimal modification of the underlying phrase-based decoding algorithm.

4 Parameter Estimation Method
In principle, the parameters of the global phrase reordering model in Equation (4) can be estimated from the relative frequencies of the respective events in the Viterbi phrase alignment of the training bilingual sentences. This straightforward estimation method, however, often suffers from a sparse data problem. To cope with this sparseness, we used N-best phrase alignment and bilingual phrase clustering. We also investigated various approximations of Equation (4) by reducing the conditioning factors.

4.1 N-best Phrase Alignment
In order to obtain the Viterbi phrase alignment of a bilingual sentence pair, we search for the phrase segmentation and phrase alignment that maximizes the product of the phrase translation probabilities $p(\bar{e}_i \mid \bar{f}_i)$,

$$(\hat{\bar{e}}_1^I, \hat{\bar{f}}_1^I) = \arg\max_{\bar{e}_1^I, \bar{f}_1^I} \prod_{i=1}^{I} p(\bar{e}_i \mid \bar{f}_i) \tag{5}$$

Phrase translation probabilities are approximated using word translation probabilities $p(e \mid f)$ and $p(f \mid e)$ as follows,

$$p(\bar{e} \mid \bar{f}) = \prod_{e \in \bar{e}} \sum_{f \in \bar{f}} p(e \mid f)\, p(f \mid e) \tag{6}$$

where $e$ and $f$ are words in the target and source phrases. The phrase alignment based on Equation (5) can be thought of as an extension of word alignment based on IBM Model 1 to phrase alignment. Note that bilingual phrase segmentation (phrase extraction) is also done using the same criteria. The approximation in Equation (6) is motivated by (Vogel et al., 2003). Here, we added the second term $p(f \mid e)$ to cope with the asymmetry between $p(e \mid f)$ and $p(f \mid e)$. The word translation probabilities are estimated using GIZA++ (Och and Ney, 2003). The above search is implemented in the following way:

1. All source word and target word pairs are considered to be initial phrase pairs.
2. If the phrase translation probability of the phrase pair is less than the threshold, it is deleted.
3. Each phrase pair is expanded toward the eight neighboring directions as shown in Figure 3.
4. If the phrase translation probability of the expanded phrase pair is less than the threshold, it is deleted.
5. The process of expansion and deletion is repeated until no further expansion is possible.
6. The consistent N-best phrase alignments are searched from all combinations of the above phrase pairs.

Figure 3: Expansion of a phrase pair

The search for consistent Viterbi phrase alignments can be implemented as a phrase-based decoder using a beam search whose outputs are constrained only to the target sentence. The consistent N-best phrase alignments can be obtained by using A* search as described in (Ueffing et al., 2002). We did not use any reordering constraints, such as the IBM constraint and the ITG constraint, in the search for the N-best phrase alignment (Zens et al., 2004).

The thresholds used in the search are the following: the minimum phrase translation probability is 0.0001. The maximum number of translation candidates for each phrase is 20. The beam width is 1e-10, and the stack size (for each target candidate word length) is 1000. We found that, compared with the decoding of sentence translation, we have to search a significantly larger space for the N-best phrase alignment.

Figure 3 shows an example of phrase pair expansion toward eight neighbors. If the current phrase pair is ( , of), the expanded phrase pairs are ( , means of), ( , means of), ( , means of), ( , of), ( , of), ( , of communication), ( , of communication), and ( , of communication). Figure 4 shows an example of the best three phrase alignments for a Japanese-English bilingual sentence. For the estimation of the global phrase reordering model, preliminary tests have shown that the appropriate N-best number is 20. In counting the events for the relative frequency estimation, we treat all N-best phrase alignments equally.
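The counting step can be illustrated with a minimal Python sketch. It assumes that each phrase alignment is given as a list of blocks in target order, each carrying the inclusive source word span it covers; the names `Block`, `reordering_pattern`, and `estimate_reordering_model`, as well as the placeholder source tokens `s0` to `s4`, are hypothetical and not part of the original system. The sketch conditions only on the current block (the e[0]f[0] case) and gives every alignment hypothesis the same weight, as described above.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    """One phrase pair of an alignment (field names are illustrative)."""
    src_start: int  # first source word position covered, inclusive
    src_end: int    # last source word position covered, inclusive
    src: str        # source phrase, or its class label
    tgt: str        # target phrase, or its class label


def reordering_pattern(prev: Block, cur: Block) -> str:
    """Classify two blocks whose target phrases are adjacent (prev, then cur)."""
    if cur.src_start > prev.src_end:   # source order matches target order
        return "MA" if cur.src_start == prev.src_end + 1 else "MG"
    if cur.src_end < prev.src_start:   # source order is reversed
        return "RA" if cur.src_end + 1 == prev.src_start else "RG"
    raise ValueError("source spans of successive blocks overlap")


def estimate_reordering_model(alignments):
    """Relative frequency of each pattern given the current block."""
    counts = defaultdict(Counter)
    for blocks in alignments:          # every hypothesis counts equally
        for prev, cur in zip(blocks, blocks[1:]):
            counts[(cur.tgt, cur.src)][reordering_pattern(prev, cur)] += 1
    return {
        context: {d: n / sum(c.values()) for d, n in c.items()}
        for context, c in counts.items()
    }


# Toy alignment in target order; "s0".."s4" stand in for the source words.
best = [
    Block(0, 0, "s0", "language"),
    Block(4, 4, "s4", "is"),
    Block(3, 3, "s3", "a_means"),
    Block(1, 2, "s1_s2", "of_communication"),
]
# A second, made-up hypothesis in which "is" follows its predecessor directly.
other = [Block(3, 3, "s3", "this"), Block(4, 4, "s4", "is")]

patterns = [reordering_pattern(p, c) for p, c in zip(best, best[1:])]
model = estimate_reordering_model([best, other])
```

In this toy example the sentence-final source word aligned to "is" yields a monotone gap, and the two blocks that follow are reverse adjacent. Replacing the phrase strings with class labels gives the clustered variant of the same count table.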
For comparison, we also implemented a different N-best phrase alignment method, where phrase pairs are extracted using the standard phrase extraction method described in (Koehn et al., 2003). We call this conventional phrase extraction method "grow-diag-final", and the proposed phrase extraction method "ppicker" (this is intended to stand for phrase picker).

Figure 4: N-best phrase alignments (English sides of the three alignments shown: (1) the_light_was_red; (2) the_light, was_red; (3) the_light, was, red)

4.2 Bilingual Phrase Clustering
The second approach to cope with the sparseness in Equation (4) is to group the phrases into equivalence classes. We used a bilingual word clustering tool, mkcls (Och et al., 1999), for this purpose. It forms partitions of the vocabulary of the two languages to maximize the joint probability of the training bilingual corpus.

In order to perform bilingual phrase clustering, all words in a phrase are concatenated by an underscore '_' to form a pseudo word. We then use the modified bilingual sentences as the input to mkcls. We treat all N-best phrase alignments equally. Thus, the phrase alignments in Figure 4 are converted to the following three bilingual sentence pairs.

___ _ the_light_was_red
_ _ _ the_light was_red
_ _ the_light was red

Preliminary tests have shown that the appropriate number of classes for the estimation of the global phrase reordering model is 20.

As a comparison, we also tried two phrase classification methods based on the part of speech of the head word (Ohashi et al., 2005). We defined (arguably) the first word of each English phrase and the last word of each Japanese phrase as the head word. We then used the part of speech of the head word as the phrase class. We call this method "1pos". Since we are not sure whether it is appropriate to introduce asymmetry in head word selection, we also tried a "2pos" method, where the parts of speech of both the first and the last words are used for phrase classification.

4.3 Conditioning Factor of Reordering
The third approach to cope with sparseness in Equation (4) is to approximate the equation by reducing the conditioning factors. Other than the baseline word distance-based reordering model and Equation (4) itself, we tried eight different approximations of Equation (4) as shown in Table 2, where the symbol in the left column is the shorthand for the reordering model in the right column.

Table 2: All reordering models tried in the experiments

shorthand         reordering model
baseline          $\alpha^{|a_i - b_{i-1} - 1|}$
(unconditioned)   $p(d)$
e[0]              $p(d \mid \bar{e}_i)$
f[0]              $p(d \mid \bar{f}_i)$
e[0]f[0]          $p(d \mid \bar{e}_i, \bar{f}_i)$
e[-1]f[0]         $p(d \mid \bar{e}_{i-1}, \bar{f}_i)$
e[0]f[-1,0]       $p(d \mid \bar{e}_i, \bar{f}_{i-1}, \bar{f}_i)$
e[-1]f[-1,0]      $p(d \mid \bar{e}_{i-1}, \bar{f}_{i-1}, \bar{f}_i)$
e[-1,0]f[0]       $p(d \mid \bar{e}_{i-1}, \bar{e}_i, \bar{f}_i)$
e[-1,0]f[-1,0]    $p(d \mid \bar{e}_{i-1}, \bar{e}_i, \bar{f}_{i-1}, \bar{f}_i)$

The approximations are designed based on two intuitions. The current block ($\bar{e}_i$ and $\bar{f}_i$) would probably be more important than the previous block ($\bar{e}_{i-1}$ and $\bar{f}_{i-1}$). The previous target phrase ($\bar{e}_{i-1}$) might be more important than the current target phrase ($\bar{e}_i$) because the distortion model of IBM Model 4 is conditioned on $\bar{e}_{i-1}$, $\bar{f}_{i-1}$ and $\bar{f}_i$. The appropriate form of the global phrase reordering model is decided through experimentation.
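The shorthand notation can be read mechanically: the letter names the side (e for target, f for source) and the bracketed offsets select the previous (-1) or current (0) phrase. The following Python sketch shows one way to turn a shorthand into the tuple of conditioning factors used as a lookup key. The function names `parse_shorthand` and `conditioning_context` are hypothetical helpers introduced for illustration, not part of the described system.

```python
import re


def parse_shorthand(shorthand: str) -> dict:
    """Map a shorthand such as 'e[-1,0]f[0]' to the offsets used on each side."""
    factors = {"e": [], "f": []}
    for side, offsets in re.findall(r"([ef])\[([-\d,]+)\]", shorthand):
        factors[side] = [int(o) for o in offsets.split(",")]
    return factors


def conditioning_context(shorthand, e_prev, e_cur, f_prev, f_cur) -> tuple:
    """Select the phrases (or phrase classes) that the model conditions on."""
    phrases = {
        ("e", -1): e_prev, ("e", 0): e_cur,
        ("f", -1): f_prev, ("f", 0): f_cur,
    }
    factors = parse_shorthand(shorthand)
    # Target-side factors come first, then source-side factors.
    return tuple(
        phrases[(side, off)] for side in ("e", "f") for off in factors[side]
    )


full = conditioning_context("e[-1,0]f[-1,0]", "E1", "E2", "F1", "F2")
ibm4_like = conditioning_context("e[-1]f[-1,0]", "E1", "E2", "F1", "F2")
current_target = conditioning_context("e[0]", "E1", "E2", "F1", "F2")
```

A shorthand with fewer factors produces a shorter key, so more training events share the same context. This is the intended effect of reducing the conditioning factors.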

5 Experiments
5.1 Corpus and Tools
We used the IWSLT-2005 Japanese-English translation task (Eck and Hori, 2005) for evaluating the proposed global phrase reordering model. We report results using the well-known automatic evaluation metric BLEU (Papineni et al., 2002).

IWSLT (International Workshop on Spoken Language Translation) 2005 is an evaluation campaign for spoken language translation. Its task domain encompasses basic travel conversations. 20,000 bilingual sentences are provided for training. Table 3 shows the number of words and the size of vocabulary of the training data. The average sentence length of Japanese is 9.9 words, while that of English is 9.2 words.

Table 3: IWSLT 2005 Japanese-English training data

            Sentences   Words     Vocabulary
Japanese    20,000      198,453   9,277
English     20,000      183,452   6,956

Two development sets, each containing 500 source sentences, are also provided, and each development sentence comes with 16 reference translations. We used the second development set (devset2) for the experiments described in this paper. This 20,000 sentence corpus allows for fast experimentation and enables us to study different aspects of the proposed global phrase reordering model.

Japanese word segmentation was done using ChaSen², and English tokenization was done using a tool provided by LDC³. For the phrase classification based on the parts of speech of the head word, we used the first two layers of ChaSen's part of speech tag for Japanese. For English part of speech tagging, we used MXPOST⁴.

Word translation probabilities are obtained by using GIZA++ (Och and Ney, 2003). For training, all English words are lowercased. We used a back-off word trigram model as the language model. It is trained from the lowercased English side of the training corpus using a statistical language modeling toolkit, Palmkit⁵.

We implemented our own decoder based on the algorithm described in (Ueffing et al., 2002). For decoding, we used phrase translation probability, lexical translation probability, word penalty, and distortion (phrase reordering) probability. Minimum error rate training was not used for weight optimization.

The thresholds used in the decoding are the following: the minimum phrase translation probability is 0.01. The maximum number of translation candidates for each phrase is 10. The beam width is 1e-5, and the stack size (for each target candidate word length) is 100.

² http://chasen.aist-nara.ac.jp/
³ http://www.cis.upenn.edu/~treebank/tokenizer.sed
⁴ http://www.cis.upenn.edu/~adwait/statnlp.html
⁵ http://palmkit.sourceforge.net/

5.2 Clustered and Lexicalized Model
Figure 5 shows the BLEU score of the clustered and lexical reordering models with different conditioning factors. Here, "class" shows the accuracy when the identity of each phrase is represented by its class, which is obtained by the bilingual phrase clustering, while "lex" shows the accuracy when the identity of each phrase is represented by its lexical form.

Figure 5: BLEU score for the clustered and lexical reordering model with different conditioning factors

The clustered reordering model "class" is generally better than the lexicalized reordering model "lex". The accuracy of "lex" drops rapidly as the number of conditioning factors increases. The reordering models using the part of speech of the head word for phrase classification, such as "1pos" and "2pos", are somewhere in between.

The best score is achieved by the clustered model when the phrase reordering pattern is conditioned on either the current target phrase $\bar{e}_i$ or the current block, namely the phrase pair $\bar{e}_i$ and $\bar{f}_i$. They are significantly better than the baseline of the word distance-based reordering model.

5.3 Interaction between Phrase Extraction and Phrase Alignment
Table 4 shows the BLEU score of reordering models with different phrase extraction methods. Here, "ppicker" shows the accuracy when phrases are extracted by using the N-best phrase alignment method described in Section 4.1, while "grow-diag-final" shows the accuracy when phrases are extracted using the standard phrase extraction algorithm described in (Koehn et al., 2003).

Table 4: BLEU score of reordering models with different phrase extraction methods

                  ppicker          grow-diag-final
                  class   lex      class   lex
baseline          0.400   0.400    0.343   0.343
(unconditioned)   0.407   0.407    0.350   0.350
f[0]              0.417   0.410    0.362   0.356
e[0]              0.422   0.416    0.356   0.360
e[0]f[0]          0.422   0.404    0.355   0.353
e[0]f[-1,0]       0.407   0.381    0.346   0.327
e[-1,0]f[0]       0.410   0.392    0.348   0.341
e[-1,0]f[-1,0]    0.394   0.387    0.339   0.340

It is obvious that, for building the global phrase reordering model, our phrase extraction method is significantly better than the conventional phrase extraction method. We assume this is because the proposed N-best phrase alignment method optimizes the combination of phrase extraction (segmentation) and phrase alignment in a sentence.

5.4 Global and Local Reordering Model
In order to show the advantages of explicitly modeling global phrase reordering, we implemented a different reordering model where the reordering pattern is classified into three values: monotone adjacent, reverse adjacent, and neutral. By collapsing monotone gap and reverse gap into neutral, it can be thought of as a local reordering model similar to the block orientation bigram (Tillmann and Zhang, 2005).

Figure 6 shows the BLEU score of the local and global reordering models. Here, "class3" and "lex3" represent the three-valued local reordering model, while "class4" and "lex4" represent the four-valued global reordering model. "Class" and "lex" represent clustered and lexical models, respectively. We used "grow-diag-final" for phrase extraction in this experiment.

It is obvious that the four-valued global reordering model consistently outperformed the three-valued local reordering model under various conditioning factors.

6 Discussion
As shown in Figure 5, the reordering model of Equation (4) (indicated as e[-1,0]f[-1,0] in shorthand) suffers from a sparse data problem even if phrase clustering is used. The empirically justifiable global reordering model seems to be the following, conditioned on the classes of source and target phrases:

$$p(d \mid class(\bar{e}_i), class(\bar{f}_i)) \tag{7}$$

which is similar to the block orientation bigram (Tillmann and Zhang, 2005). We should note, however, that the block orientation bigram is a joint probability model for the sequence of blocks (source and target phrases) as well as their orientations (reordering patterns), whose purpose is very different from that of our global phrase reordering model.

The advantage of the reordering model is that it can better model global phrase reordering using a four-valued reordering pattern, and it can be easily incorporated into a standard phrase-based translation decoder.

The problem of the global phrase reordering model is the cost of parameter estimation. In particular, the N-best phrase alignment described in Section 4.1 is computationally expensive. We must devise a more efficient phrase alignment algorithm that can globally optimize both phrase segmentation (phrase extraction) and phrase alignment.
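The relation between the four-valued global patterns and the three-valued local patterns of Section 5.4 can be made concrete with a short Python sketch. It merges monotone gap and reverse gap into a single neutral value and applies this to the Japanese-to-English figures of Table 1. The label "N" for neutral and the names `FOUR_TO_THREE` and `collapse_to_local` are assumptions made for illustration.

```python
# Mapping from the four global patterns to the three local ones.
# "N" (neutral) is an illustrative label for the merged gap patterns.
FOUR_TO_THREE = {"MA": "MA", "RA": "RA", "MG": "N", "RG": "N"}


def collapse_to_local(distribution: dict) -> dict:
    """Sum the probability mass of patterns that share a local label."""
    local = {}
    for pattern, prob in distribution.items():
        key = FOUR_TO_THREE[pattern]  # raises KeyError for unknown labels
        local[key] = local.get(key, 0.0) + prob
    return local


# Japanese-to-English column of Table 1.
japanese_to_english = {"MA": 0.441, "MG": 0.281, "RA": 0.206, "RG": 0.072}
local = collapse_to_local(japanese_to_english)
```

With these figures, roughly 35% of the Japanese-to-English events fall into the neutral value, where the local model can no longer tell a monotone gap from a reverse gap.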
Figure 6: BLEU score of local and global reordering model

7 Conclusion
In this paper, we presented a novel global phrase reordering model that is estimated from the N-best phrase alignments of training bilingual sentences. Through experiments, we were able to show that our reordering model offers improved translation accuracy over the baseline method.

References
Matthias Eck and Chiori Hori. 2005. Overview of the IWSLT 2005 evaluation campaign. In Proceedings of International Workshop on Spoken Language Translation (IWSLT 2005), pages 11–32.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL-03), pages 127–133.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/WVLC-99), pages 20–28.

Kazuteru Ohashi, Kazuhide Yamamoto, Kuniko Saito, and Masaaki Nagata. 2005. NUT-NTT statistical machine translation system for IWSLT 2005. In Proceedings of International Workshop on Spoken Language Translation, pages 128–133.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pages 311–318.

Christoph Tillmann and Tong Zhang. 2005. A localized prediction model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 557–564.

Nicola Ueffing, Franz Josef Och, and Hermann Ney. 2002. Generation of word graphs in statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-02), pages 156–163.

Stephan Vogel, Ying Zhang, Fei Huang, Alicia Tribble, Ashish Venugopal, Bing Zhao, and Alex Waibel. 2003. The CMU statistical machine translation system. In Proceedings of MT Summit IX.

Richard Zens, Hermann Ney, Taro Watanabe, and Eiichiro Sumita. 2004. Reordering constraints for phrase-based statistical machine translation. In Proceedings of 20th International Conference on Computational Linguistics (COLING-04), pages 205–211.
