A Discriminative Global Training Algorithm for Statistical MT

Christoph Tillmann, IBM T.J. Watson Research Center, Yorktown Heights, N.Y. 10598, ctill@us.ibm.com
Tong Zhang, Yahoo! Research, New York City, N.Y. 10011, tzhang@yahoo-inc.com

Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 721-728, Sydney, July 2006. (c) 2006 Association for Computational Linguistics

Abstract

This paper presents a novel training algorithm for a linearly-scored block sequence translation model. The key component is a new procedure to directly optimize the global scoring function used by an SMT decoder. No translation, language, or distortion model probabilities are used as in earlier work on SMT. Therefore our method, which employs less domain-specific knowledge, is both simpler and more extensible than previous approaches. Moreover, the training procedure treats the decoder as a black box, and thus can be used to optimize any decoding scheme. The training algorithm is evaluated on a standard Arabic-English translation task.

1 Introduction

This paper presents a view of phrase-based SMT as a sequential process that generates block orientation sequences. A block is a pair of phrases which are translations of each other. For example, Figure 1 shows an Arabic-English translation example that uses four blocks. During decoding, we view translation as a block segmentation process, where the input sentence is segmented from left to right and the target sentence is generated from bottom to top, one block at a time. A monotone block sequence is generated except for the possibility to handle some local phrase re-ordering. In this local re-ordering model (Tillmann and Zhang, 2005; Kumar and Byrne, 2005) a block b with orientation o is generated relative to its predecessor block b'.

[Figure 1: An Arabic-English block translation example that uses four blocks, where the Arabic words are romanized; the orientation sequence generated along with the blocks is shown in the figure.]

During decoding, we maximize the score s_w(b_1^n, o_1^n) of a block orientation sequence (b_1^n, o_1^n):

    s_w(b_1^n, o_1^n) = \sum_{i=1}^{n} w^T f(b_i, o_i, b_{i-1})    (1)

where b_i is a block, b_{i-1} is its predecessor block, and o_i \in {Left, Right, Neutral} is a three-valued orientation component linked to the block b_i: a block is generated to the left or the right of its predecessor block b_{i-1}, where the orientation of the predecessor block itself is ignored. Here, n is the number of blocks in the translation. We are interested in learning the weight vector w from the training data; f(b_i, o_i, b_{i-1}) is a high-dimensional binary feature representation of the block orientation pair.
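For concreteness, the block and orientation bookkeeping described above can be sketched as follows. The class and field names (Block, Orientation, source_phrase, target_phrase) are illustrative choices, not part of the system described in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Orientation(Enum):
    LEFT = "L"     # block generated to the left of its predecessor
    RIGHT = "R"    # block generated to the right of its predecessor
    NEUTRAL = "N"  # 'detached' block; relative position is not modeled


@dataclass(frozen=True)
class Block:
    source_phrase: tuple  # consecutive source-language words
    target_phrase: tuple  # their target-language translation


# A candidate translation is an oriented block sequence: a list of
# (Block, Orientation) pairs whose concatenated source phrases cover
# the input sentence. The orientation below is chosen only for illustration.
example_sequence = [
    (Block(("tnthk",), ("violate",)), Orientation.RIGHT),
]
```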
The block orientation sequence is generated under the restriction that the concatenated source phrases of the blocks yield the input sentence. In modeling a block sequence, we emphasize adjacent block neighbors that have right or left orientation, since in the current experiments only local block swapping is handled (neutral orientation is used for 'detached' blocks as described in (Tillmann and Zhang, 2005)).

This paper focuses on the discriminative training of the weight vector w used in Eq. 1. The decoding process is decomposed into local decision steps based on Eq. 1, but the model is trained in a global setting as shown below. The advantage of this approach is that it can easily handle millions of features, e.g. tens of millions for the experiments in this paper. Moreover, under this view, SMT becomes quite similar to sequential natural language annotation problems such as part-of-speech tagging and shallow parsing, and the novel training algorithm presented in this paper is most similar to training algorithms presented for these tasks, e.g. the on-line training algorithm in (McDonald et al., 2005) and the perceptron training algorithm in (Collins, 2002).

The current approach does not use specialized probability features as in (Och, 2003) at any stage during decoder parameter training. Such probability features include language model, translation, or distortion probabilities, which are commonly used in current SMT approaches. (A translation and distortion model is used in generating the block set used in the experiments, but these translation probabilities are not used during decoding.) We are able to achieve performance comparable to (Tillmann and Zhang, 2005). The novel algorithm differs computationally from earlier work in discriminative training algorithms for SMT (Och, 2003) as follows:

- No computationally expensive N-best lists are generated during training: for each input sentence a single block sequence is generated on each iteration over the training data.
- No additional development data set is necessary, as the weight vector w is trained on bilingual training data only.

The paper is structured as follows: Section 2 presents the baseline block sequence model and the feature representation. Section 3 presents the discriminative training algorithm that learns a good global ranking function used during decoding. Section 4 presents results on a standard Arabic-English translation task. Finally, some discussion and future work is presented in Section 5.

2 Block Sequence Model

This paper views phrase-based SMT as a block sequence generation process. Blocks are phrase pairs consisting of target and source phrases, and local phrase re-ordering is handled by including so-called block orientation. The starting point for the block-based translation model is a block set of several million Arabic-English phrase pairs for the experiments in this paper. This block set is used to decode the training sentences to obtain the block orientation sequences that are used in the discriminative parameter training. Nothing but the block set and the parallel training data is used to carry out the training. We use the block set described in (Al-Onaizan et al., 2004); the use of a different block set may affect translation results. Rather than predicting local block neighbors as in (Tillmann and Zhang, 2005), here the model parameters are trained in a global setting.
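As a minimal sketch of the linear scoring function in Eq. 1, the block-bigram score below sums sparse binary features over the sequence. The weight vector is represented as a dictionary, and `feature_fn` stands in for a feature extractor like the one defined later in this section; both names are illustrative assumptions.

```python
def sequence_score(weights, feature_fn, block_sequence):
    """Eq. 1: s_w(b_1^n, o_1^n) = sum_i w^T f(b_i, o_i, b_{i-1}).

    `weights` is a sparse weight vector (dict: feature -> float) and
    `feature_fn(block, orientation, prev_block)` returns the set of
    active binary features for one block-bigram decision; the first
    block has no predecessor (prev_block is None).
    """
    score, prev = 0.0, None
    for block, orientation in block_sequence:
        for feat in feature_fn(block, orientation, prev):
            score += weights.get(feat, 0.0)
        prev = block
    return score
```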
Starting with a simple model, the training data is decoded multiple times: the weight vector w is trained to discriminate block sequences with a high translation score against block sequences with a high BLEU score. (High-scoring block sequences may contain translation errors that are quantified by a lower BLEU score.) The high BLEU-scoring block sequences are obtained as follows: the regular phrase-based decoder is modified in such a way that it uses the BLEU score as its optimization criterion (independent of any translation model). Here, the search for the highest BLEU-scoring block sequence is restricted to local re-ordering, as is the model-based decoding (as shown in Fig. 1). The BLEU score is computed with respect to the single reference translation provided by the parallel training data, and a block sequence with a high average BLEU score is obtained for each training sentence. (The training BLEU score is computed for each training sentence pair separately, treating each sentence pair as a single-sentence corpus with a single reference, and then averaged over all training sentences. Although block sequences with a high BLEU score are found on average, there is no guarantee of finding the maximum BLEU block sequence for a given sentence pair. The target word sequence corresponding to a block sequence does not have to match the reference translation, i.e. maximum BLEU scores are quite low for some training sentences.)
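The per-sentence training BLEU just described can be sketched as below. The add-one smoothing of the modified n-gram precisions is an assumption made so that the score stays defined for a one-sentence corpus; the exact smoothing used in the experiments is not specified in the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(hypothesis, reference, max_n=4):
    """BLEU of one hypothesis against a single reference, i.e. the sentence
    pair is treated as a one-sentence corpus."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((clipped + 1.0) / (total + 1.0)) / max_n
    # brevity penalty: penalize hypotheses shorter than the reference
    bp = min(1.0, math.exp(1.0 - len(reference) / max(len(hypothesis), 1)))
    return bp * math.exp(log_precision)
```

Averaging this quantity over all training sentence pairs gives the training BLEU figures discussed below.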
The 'true' maximum BLEU block sequence as well as the high-scoring block sequences are represented by high-dimensional feature vectors using the binary features defined below, and the translation process is handled as a multi-class classification problem in which each block sequence represents a possible class. The effect of this training procedure can be seen in Figure 2: each decoding step on the training data adds a high-scoring block sequence to the discriminative training, and decoding performance on the training data is improved after each iteration (along with the test data decoding performance). A theoretical justification for the novel training procedure is given in Section 3.

We now define the feature components for the block bigram feature vector f(b_i, o_i, b_{i-1}) in Eq. 1. Although the training algorithm can handle real-valued features as used in (Och, 2003; Tillmann and Zhang, 2005), the current paper intentionally excludes them. The current feature functions are similar to those used in common phrase-based translation systems, for which it has been shown that good translation performance can be achieved. (BLEU scores on our test set are reported in (Tillmann and Zhang, 2005) and (Ittycheriah and Roukos, 2005).) A systematic analysis of the novel training algorithm will allow us to include much more sophisticated features in future experiments, i.e. POS-based features, syntactic or hierarchical features (Chiang, 2005). The dimensionality of the feature vector f(b_i, o_i, b_{i-1}) depends on the number of binary features. For illustration purposes, the binary features are chosen such that they fire on the example block sequence in Fig. 1. There are phrase-based and word-based features:

    f_1(b_i, o_i, b_{i-1}) = 1 if the block b_i consists of target phrase 'violate' and source phrase 'tnthk', and 0 otherwise
    f_2(b_i, o_i, b_{i-1}) = 1 if 'Lebanese' is a word in the target phrase of block b_i and 'AllbnAny' is a word in its source phrase, and 0 otherwise

The feature f_1 is a 'unigram' phrase-based feature capturing the identity of a block. Additional phrase-based features include block orientation, target and source phrase bigram features. Word-based features are used as well, e.g. feature f_2 captures word-to-word translation dependencies similar to the use of Model 1 probabilities in (Koehn et al., 2003). Additionally, we use distortion features involving relative source word position and m-gram features for adjacent target words. These features correspond to the use of a language model, but the weights for these features are trained on the parallel training data only. For the most complex model, the number of features is in the tens of millions (ignoring all features that occur only once).
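A sketch of some of the feature templates just listed is given below; it builds on the hypothetical Block/Orientation types sketched in Section 1 and covers only the block-identity, word-pair, and orientation templates, leaving out the phrase-bigram, distortion, and target m-gram templates used in the experiments. The template names are illustrative.

```python
def block_bigram_features(block, orientation, prev_block):
    """Active binary features for one block-bigram decision (value 1 each).

    Covers the 'unigram' block identity (like f_1), word-pair features
    (like f_2), and a simple orientation template."""
    feats = set()
    feats.add(("block", block.source_phrase, block.target_phrase))
    for src in block.source_phrase:
        for tgt in block.target_phrase:
            feats.add(("word_pair", src, tgt))
    if prev_block is not None:
        feats.add(("orientation", orientation.value,
                   prev_block.target_phrase, block.target_phrase))
    return feats
```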
3 Approximate Relevant Set Method

Throughout this section, we write z for an oriented block sequence (b_1^n, o_1^n) and s_w(z) for its score under Eq. 1. Each block sequence z corresponds to a candidate translation. In the training data, where target translations are given, a BLEU score Bl(z) can be calculated for each z against the target translations. In this setup, our goal is to find a weight vector w such that the higher s_w(z) is, the higher the corresponding BLEU score Bl(z) should be. If we can find such a weight vector, then block decoding by searching for the highest s_w(z) will lead to good translations with a high BLEU score.

Formally, we denote a source sentence by S, and let Δ(S) be the set of possible candidate oriented block sequences that the decoder can generate from S. For example, in a monotone decoder, the set Δ(S) contains block sequences that cover the source sentence in the same order. For a decoder with local re-ordering, the candidate set also includes additional block sequences with re-ordered block configurations that the decoder can efficiently search. Therefore, depending on the specific implementation of the decoder, the set Δ(S) can be different. In general, Δ(S) is a subset of all possible oriented block sequences that are consistent with the input sentence S. Given a scoring function s_w and an input sentence S, we can assume that the decoder implements the following decoding rule:

    \hat{z}(S) = \arg\max_{z \in Δ(S)} s_w(z)    (2)

Let S_1, ..., S_N be a set of N training sentences. Each sentence S_i is associated with a set Δ(S_i) of possible translation block sequences that are searchable by the decoder. Each translation block sequence z ∈ Δ(S_i) induces a translation, which is then assigned a BLEU score Bl(z) (obtained by comparing against the target translations). The goal of the training is to find a weight vector w such that for each training sentence S_i, the corresponding decoder output has the maximum BLEU score among all z ∈ Δ(S_i) based on Eq. 2. In other words, if \hat{z} maximizes the scoring function s_w(z), then \hat{z} also maximizes the BLEU metric.

Based on this description, a simple idea is to learn the BLEU score Bl(z) for each candidate block sequence z. That is, we would like to estimate w such that s_w(z) ≈ Bl(z), which can be achieved through least squares regression. It is easy to see that if we can find a weight vector w that approximates Bl(z), then the decoding rule in Eq. 2 automatically maximizes the BLEU score. However, it is usually difficult to estimate Bl(z) reliably based only on a linear combination of the feature vector as in Eq. 1. We note that a good decoder does not necessarily employ a scoring function that approximates the BLEU score. Instead, we only need to make sure that the top-ranked block sequence obtained by the decoder scoring function has a high BLEU score.

To formulate this idea, we attempt to find a decoding parameter w such that for each sentence S_i in the training data, sequences in Δ(S_i) with the highest BLEU scores get scores higher than those with low BLEU scores. Denote by K_i a set of block sequences in Δ(S_i) with the highest BLEU scores; our decoded result should lie in this set, and we call its members the "truth". The set of the remaining sequences is Δ(S_i) \ K_i, which we shall refer to as the "alternatives". We look for a weight vector w that minimizes the following training criterion:

    \hat{w} = \arg\min_w \left[ \frac{1}{N} \sum_{i=1}^{N} \sum_{z \in K_i} \sum_{z' \in Δ(S_i) \setminus K_i} φ(Bl(z), s_w(z), Bl(z'), s_w(z')) + λ \|w\|^2 \right]    (3)

where φ is a non-negative real-valued loss function (whose specific choice is not critical for the purposes of this paper) and λ ≥ 0 is a regularization parameter. In our experiments, results are obtained using the following convex loss:

    φ(b, s, b', s') = (b - b') \max(0, 1 - (s - s'))    (4)

where b, b' are BLEU scores and s, s' are translation scores. We refer to this formulation as the 'costMargin' (cost-sensitive margin) method: for each training sentence, the 'costMargin' between the 'true' block sequence set K_i and the 'alternative' block sequence set is maximized. Note that due to the truth and alternative setup, we always have b ≥ b'. This loss function gives an upper bound on the error we will suffer if the order of s and s' is wrongly predicted (that is, if we predict s ≤ s' instead of s > s'). It also has the property that if the BLEU scores b and b' are close, then the loss value is small (proportional to b - b').

A major contribution of this work is a procedure to solve Eq. 3 approximately. The main difficulty is that the search space Δ(S_i) covered by the decoder can be extremely large: it cannot be enumerated for practical purposes. Our idea is to replace this large space by a small subspace V_i ⊂ Δ(S_i) which we call the relevant set. The possibility of this reduction is based on the following theoretical result.

Lemma 1 Let φ be a non-negative, continuous, piece-wise differentiable function of w, and let \hat{w} be a local solution of Eq. 3. For each training sentence S_i, let V_i ⊂ Δ(S_i) \ K_i be the set of alternatives that incur a non-zero loss against some truth z ∈ K_i at \hat{w}, and consider the restricted criterion obtained from Eq. 3 by replacing the alternative set Δ(S_i) \ K_i with V_i:

    \hat{w} = \arg\min_w \left[ \frac{1}{N} \sum_{i=1}^{N} \sum_{z \in K_i} \sum_{z' \in V_i} φ(Bl(z), s_w(z), Bl(z'), s_w(z')) + λ \|w\|^2 \right]  s.t.  V_i ⊂ Δ(S_i)    (5)

Then \hat{w} is also a local solution of Eq. 5.

If φ is a convex function of w (as in our choice), then we know that the global optimal solution remains the same if the whole decoding space Δ(S_i) is replaced by the relevant set V_i. Each subspace V_i will be significantly smaller than Δ(S_i): it only includes those alternatives with score close to one of the selected truths, and these are the most important alternatives that are easily confused with the truth. Essentially, the lemma says that if the decoder works well on these difficult alternatives (the relevant points), then it works well on the whole space.
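The costMargin criterion can be sketched as follows, under the assumption that Eq. 4 is the BLEU-difference-weighted hinge shown above and that the regularizer in Eq. 3 is a plain L2 penalty; `sequence_score` is the scoring sketch from Section 2 and the data layout is an illustrative choice.

```python
def cost_margin_loss(bleu_true, score_true, bleu_alt, score_alt):
    """Convex costMargin loss of Eq. 4: the margin violation is weighted by
    the BLEU difference, so confusing near-equivalent sequences is cheap,
    while ranking a much worse alternative above the truth is expensive."""
    return (bleu_true - bleu_alt) * max(0.0, 1.0 - (score_true - score_alt))


def training_criterion(weights, training_data, feature_fn, lam):
    """Regularized criterion of Eq. 3 (or Eq. 5 when the alternatives are
    restricted to the relevant set). `training_data` is a list of
    (truths, alternatives) pairs per sentence, each entry being a
    (block_sequence, bleu) tuple."""
    data_loss = 0.0
    for truths, alternatives in training_data:
        for z_true, bleu_true in truths:
            s_true = sequence_score(weights, feature_fn, z_true)
            for z_alt, bleu_alt in alternatives:
                s_alt = sequence_score(weights, feature_fn, z_alt)
                data_loss += cost_margin_loss(bleu_true, s_true,
                                              bleu_alt, s_alt)
    regularizer = lam * sum(v * v for v in weights.values())
    return data_loss / max(len(training_data), 1) + regularizer
```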
The idea is closely related to active learning in standard classification problems, where we selectively pick the most important samples (often based on estimation uncertainty) for labeling in order to maximize classification performance (Lewis and Catlett, 1994). In the active learning setting, as long as we do well on the actively selected samples, we do well on the whole sample space. In our case, as long as we do well on the relevant set, the decoder will perform well.

Since the relevant set depends on the decoder parameter w, and the decoder parameter is optimized on the relevant set, it is necessary to estimate them jointly using an iterative algorithm. The basic idea is to start with a decoding parameter w and estimate the corresponding relevant set; we then update w based on the relevant set, and iterate this process. The procedure is outlined in Table 1. We intentionally leave the implementation details of the (*) step and the (**) step open. Moreover, in this general algorithm, we do not have to assume that s_w has the form of Eq. 1.

Table 1: Generic Approximate Relevant Set Method.
    for each decoding iteration l = 1, ..., L:
        for each data point S_i:
            (*) select relevant points from Δ(S_i) and add them to V_i
        (**) update w by solving Eq. 5 approximately

A natural question concerning the procedure is its convergence behavior. It can be shown that under mild assumptions, if in the (*) step we pick, for each truth z ∈ K_i, an alternative

    \bar{z}_i = \arg\max_{z' \in Δ(S_i) \setminus K_i} φ(Bl(z), s_w(z), Bl(z'), s_w(z'))    (6)

then the procedure converges to the solution of Eq. 3. Moreover, the rate of convergence depends only on the properties of the loss function, and not on the size of Δ(S_i). This property is critical, as it shows that as long as Eq. 6 can be computed efficiently, the approximate relevant set algorithm is efficient. Moreover, it gives a bound on the size of an approximate relevant set with a certain accuracy. (Due to the space limitation, we do not include a formal statement here. A detailed theoretical investigation of the method will be given in a journal paper.)

The approximate solution of Eq. 5 in the (**) step can be implemented using stochastic gradient descent (SGD), where for each selected truth/alternative pair we simply update w as

    w ← w - η \nabla_w φ(Bl(z), s_w(z), Bl(\bar{z}), s_w(\bar{z}))    (7)

The parameter η > 0 is a fixed constant often referred to as the learning rate. Again, convergence results can be proved for this procedure; due to the space limitation, we skip the formal statement as well as the corresponding analysis.

Up to this point, we have not assumed any specific form of the decoder scoring function in our algorithm. Now consider Eq. 1 used in our model. We may express it as s_w(z) = w^T F(z), where F(z) = \sum_{i=1}^{n} f(b_i, o_i, b_{i-1}) is the feature vector of the whole block orientation sequence. Using this feature representation and the loss function in Eq. 4, we obtain the following costMargin SGD update rule for each training data point and truth/alternative pair (z, \bar{z}): whenever the margin is violated, i.e. s_w(z) - s_w(\bar{z}) < 1,

    w ← w + η (Bl(z) - Bl(\bar{z})) (F(z) - F(\bar{z}))
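A sketch of this update is shown below, assuming the loss form of Eq. 4 and a sparse dictionary weight vector; `feature_fn` and `sequence_score` are the hypothetical helpers sketched earlier, and the L2 regularization term is left out for brevity.

```python
def sequence_features(feature_fn, block_sequence):
    """F(z): all active features of a block sequence, with multiplicity."""
    feats, prev = [], None
    for block, orientation in block_sequence:
        feats.extend(feature_fn(block, orientation, prev))
        prev = block
    return feats


def costmargin_sgd_update(weights, feature_fn, truth, alternative, eta=0.1):
    """One SGD step on the costMargin loss for a (truth, alternative) pair.

    With the linear score s_w(z) = w^T F(z), the loss subgradient is
    -(Bl(z) - Bl(z_bar)) * (F(z) - F(z_bar)) when the margin is violated
    and 0 otherwise, so only violated pairs change the weights."""
    (z_true, bleu_true), (z_alt, bleu_alt) = truth, alternative
    s_true = sequence_score(weights, feature_fn, z_true)
    s_alt = sequence_score(weights, feature_fn, z_alt)
    if s_true - s_alt < 1.0:                      # margin violated
        step = eta * (bleu_true - bleu_alt)
        for feat in sequence_features(feature_fn, z_true):
            weights[feat] = weights.get(feat, 0.0) + step
        for feat in sequence_features(feature_fn, z_alt):
            weights[feat] = weights.get(feat, 0.0) - step
```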
4 Experimental Results

We applied the novel discriminative training approach to a standard Arabic-to-English translation task. The training data comes from UN news sources; some punctuation tokenization and some number classing are carried out on the English and the Arabic training data. We show translation results in terms of the automatic BLEU evaluation metric (Papineni et al., 2002) on the MT03 Arabic-English DARPA evaluation test set, which consists of 670 sentences with reference translations. In order to speed up the parameter training, the original training data is filtered according to the test set: all the Arabic substrings that occur in the test set are computed, and the parallel training data is filtered to include only those training sentence pairs that contain at least one of these phrases; the resulting pre-filtered training data contains a few hundred thousand sentence pairs (several million Arabic and English words). The block set is generated using a phrase-pair selection algorithm similar to (Koehn et al., 2003; Al-Onaizan et al., 2004), which includes some heuristic filtering to increase phrase translation accuracy. Blocks that occur only once in the training data are included as well.

4.1 Practical Implementation Details

The training algorithm in Table 2 is adapted from Table 1. The training is carried out by running L times over the parallel training data, each time decoding all the training sentences and generating a single block translation sequence for each training sentence. The top five block sequences with the highest BLEU score are computed up-front for all training sentence pairs and are stored separately, as described in Section 2. The score-based decoding of the training sentence pairs is carried out in parallel on 64-bit Opteron machines. Here, the monotone decoding is much faster than the decoding with block swapping: the monotone decoding takes well under an hour, while the decoding with swapping takes about an hour. Since the training starts with only the parallel training data and a block set, some initial block sequences have to be generated in order to initialize the global model training: for each input sentence a simple bag-of-blocks translation is generated. For each input interval that is matched by some block b, a single block is added to the bag-of-blocks translation; the order in which the blocks are generated is ignored.
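A sketch of this initialization step is given below, assuming the block set is indexed by source phrase. Which single block is chosen for a matched interval is not specified in the text, so the sketch simply takes the first candidate; the function and parameter names are illustrative.

```python
def bag_of_blocks(source_sentence, block_index):
    """Initial 'bag of blocks' translation generated without a decoder:
    for every source interval matched by some block, a single block is
    added; the relative order of the generated blocks is ignored.

    `block_index` is assumed to map a source phrase (tuple of words) to
    a list of candidate Blocks from the block set."""
    bag = []
    n = len(source_sentence)
    for i in range(n):
        for j in range(i + 1, n + 1):
            candidates = block_index.get(tuple(source_sentence[i:j]), [])
            if candidates:
                bag.append(candidates[0])  # one block per matched interval
    return bag
```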
For this initial block set only block and word identity features are generated, i.e. features of type f_1 and f_2 in Section 2; this step does not require the use of a decoder, and the initial block sequence training data contains only a single alternative per sentence. The training procedure then proceeds by iteratively decoding the training data. After each decoding step, the resulting translation block sequences are stored on disk in binary format. A block sequence generated at decoding step k is used in all subsequent training steps k' > k, so that after the k-th decoding step each training sentence S_i is associated with its truth set K_i and a relevant alternative set V_i of size k.

Although in order to achieve fast convergence with a theoretical guarantee we should use Eq. 6 to update the relevant set, in reality this idea is difficult to implement because it requires a more costly decoding step. Therefore, in Table 2, we adopt an approximation in which the relevant set is updated by adding the decoder output at each stage. In this way, we are able to treat the decoding scheme as a black box. One way to approximate Eq. 6 would be to generate multiple decoding outputs and pick the most relevant points based on Eq. 6; since N-best list generation is computationally costly, only a single block sequence is generated for each training sentence pair, which also reduces the memory requirements of the training algorithm. Although we are not able to rigorously prove a fast convergence rate for this approximation, it works well in practice, as Figure 2 shows. Theoretically, this is because points achieving large values in Eq. 6 tend to have higher chances of becoming the top-ranked decoder output as well.

Table 2: Relevant set method (L = number of decoding iterations, N = number of training sentences).
    initialize the truth set K_i and the alternative set V_i for each input sentence S_i, i = 1, ..., N
    for each decoding iteration l = 1, ..., L:
        for each input sentence S_i:
            select the top-scoring block sequence under the current w and add it to V_i
        train w using SGD on the training data {(K_i, V_i)}

The SGD-based on-line training algorithm described in Section 3 is carried out after each decoding step to generate the weight vector for the subsequent decoding step. Since this training step is carried out on a single machine, it dominates the overall computation time. Since each iteration adds a single relevant alternative to the set V_i, computation time increases with the number of training iterations: the initial model is trained in a few minutes, while training the model after later iterations takes up to several hours for the most complex models.
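The overall loop of Table 2 can be sketched as follows; `decode`, `sgd_train`, and the data layout are assumed interfaces standing in for the real decoder and the SGD routine of Section 3.

```python
def relevant_set_training(sentences, truths, decode, sgd_train,
                          weights, iterations):
    """Sketch of Table 2. `truths[i]` holds the top BLEU-scoring block
    sequences (with their BLEU scores) precomputed for sentence i;
    `decode(weights, sentence)` returns the single top-scoring block
    sequence under the current model, paired with its BLEU score;
    `sgd_train(weights, truths, alternatives)` runs the costMargin SGD
    updates over all stored truth/alternative pairs."""
    alternatives = [[] for _ in sentences]   # relevant sets, grown each pass
    for _ in range(iterations):
        for i, sentence in enumerate(sentences):
            alternatives[i].append(decode(weights, sentence))
        weights = sgd_train(weights, truths, alternatives)
    return weights
```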
Table 3 presents experimental results in terms of uncased BLEU. (Translation performance in terms of cased BLEU is typically somewhat lower.) Two re-ordering restrictions are tested, i.e. monotone decoding ('MON') and local block re-ordering in which neighbor blocks can be swapped ('SWAP'). The 'SWAP' re-ordering uses the same features as the monotone models plus additional orientation-based and distortion-based features. Different feature sets include word-based features, phrase-based features, and the combination of both.

[Table 3: Translation results in terms of uncased BLEU on the training data and the MT03 test data (670 sentences), for the re-orderings 'MON' and 'SWAP' with the feature sets 'bleu' (oracle search), 'phrase', 'word', and 'both'.]

For the results with word-based features, the decoder still generates phrase-to-phrase translations, but all the scoring is done on the word level. The best-performing system uses all word-based and phrase-based features. (Within the given confidence margin, the differences between several of the top-scoring results are not statistically significant, but the other result differences are.) The 'bleu' lines of Table 3 show the training-data averaged BLEU score obtained by searching for the highest BLEU-scoring block sequence for each training sentence pair, as described in Section 2; allowing local block swapping in this search procedure yields a much improved BLEU score. The experimental results show that word-based models significantly outperform phrase-based models, and that the combination of word-based and phrase-based features performs better than either feature type taken separately. Additionally, swap-based re-ordering slightly improves performance over monotone decoding. For all experiments, the training BLEU score remains significantly lower than the maximum obtainable BLEU score shown in the 'bleu' lines; in this respect, there is significant room for improvement in terms of feature functions and alternative set generation. The word-based models perform surprisingly well: they use only three feature types, namely word-pair features like f_2 in Section 2, distortion features, and target-language m-gram features. Training speed varies depending on the feature types used: for the simplest model in Table 3 the training takes a matter of hours, for the models using word-based features it takes on the order of days, and the training for the most complex model takes the longest, also measured in days.

Figure 2 shows the BLEU performance for one of the models in Table 3 as a function of the number of training iterations. By adding top-scoring alternatives in the training algorithm in Table 2, the BLEU performance on the training data improves steadily from the initial model to the best model over the course of the iterations. After each training iteration the test data is decoded as well; here, the BLEU performance likewise improves from the initial model to the final model (we do not include the test data block sequences in the training). Figure 2 shows a typical learning curve for the experiments in Table 3: the training BLEU score is much higher than the test set BLEU score, despite the fact that the test set uses four reference translations.

[Figure 2: BLEU performance on the training set (upper graph; averaged BLEU with a single reference) and the test set (lower graph; BLEU with four references) as a function of the training iteration, for one of the models in Table 3.]

5 Discussion and Future Work

The work in this paper substantially differs from previous work in SMT based on the noisy channel approach presented in (Brown et al., 1993). While error-driven training techniques are commonly used to improve the performance of phrase-based translation systems (Chiang, 2005; Och, 2003), this paper presents a novel block sequence translation approach to SMT that is similar to sequential natural language annotation problems such as part-of-speech tagging or shallow parsing, both in modeling and in parameter training. Unlike earlier approaches to SMT training, which either rely heavily on domain knowledge or can only handle a small number of features, this approach treats the decoding process as a black box and can optimize tens of millions of parameters automatically, which makes it applicable to other problems as well. Our choice of formulation is convex, which ensures that we are able to find the global optimum even for large-scale problems. The loss function in Eq. 4 may not be optimal, and using different choices may lead to future improvements. Another important direction for performance improvement is to design methods that better approximate Eq. 6. Although at this stage the system performance is not yet better than previous approaches, good translation results are achieved on a standard translation task. While being similar to (Tillmann and Zhang, 2005), the current procedure is more automated, with comparable performance: the earlier approach requires a decomposition of the decoding scheme into local decision steps, with the inherent difficulty acknowledged in (Tillmann and Zhang, 2005). Since such a limitation is not present in the current model, improved results may be obtained in the future. A perceptron-like algorithm that handles global features in the context of re-ranking is also presented in (Shen et al., 2004).

The computational requirements for the training algorithm in Table 2 can be significantly reduced.
While the global training approach presented in this paper is simple, after a number of iterations the alternatives that are being added to the relevant set differ very little from each other, slowing down the training considerably, so that the set of possible block translations Δ(S) might not be fully explored. As mentioned in Section 2, the current approach is also able to handle real-valued features, e.g. the language model probability. This is important since the language model can be trained on a much larger monolingual corpus.

6 Acknowledgment

This work was partially supported by the GALE project under DARPA contract No. HR0011-06-2-0001. The authors would like to thank the anonymous reviewers for their detailed criticism of this paper.

References

Yaser Al-Onaizan, Niyu Ge, Young-Suk Lee, Kishore Papineni, Fei Xia, and Christoph Tillmann. 2004. IBM Site Report. In NIST 2004 MT Workshop, Alexandria, VA, June.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.

David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proc. of ACL 2005, pages 263-270, Ann Arbor, Michigan, June.

Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proc. of EMNLP 2002, Philadelphia, PA.

Abraham Ittycheriah and Salim Roukos. 2005. A Maximum Entropy Word Aligner for Arabic-English MT. In Proc. of HLT-EMNLP 2005, pages 89-96, Vancouver, British Columbia, Canada, October.

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In HLT-NAACL 2003: Main Proceedings, pages 127-133, Edmonton, Alberta, Canada, May-June.

Shankar Kumar and William Byrne. 2005. Local Phrase Reordering Models for Statistical Machine Translation. In Proc. of HLT-EMNLP 2005, pages 161-168, Vancouver, British Columbia, Canada, October.

David Lewis and Jason Catlett. 1994. Heterogeneous Uncertainty Sampling for Supervised Learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 148-156.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online Large-Margin Training of Dependency Parsers. In Proceedings of ACL 2005, pages 91-98, Ann Arbor, Michigan, June.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of ACL 2003, pages 160-167, Sapporo, Japan.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of ACL 2002, pages 311-318, Philadelphia, PA, July.

Libin Shen, Anoop Sarkar, and Franz-Josef Och. 2004. Discriminative Reranking for Machine Translation. In Proceedings of the Joint HLT and NAACL Conference (HLT-NAACL 2004), pages 177-184, Boston, MA, May.

Christoph Tillmann and Tong Zhang. 2005. A Localized Prediction Model for Statistical Machine Translation. In Proceedings of ACL 2005, pages 557-564, Ann Arbor, Michigan, June.