Stochastic Language Generation Using WIDL-expressions and its Application in Machine Translation and Summarization

Radu Soricut
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
radu@isi.edu

Daniel Marcu
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
marcu@isi.edu

Abstract

We propose WIDL-expressions as a flexible formalism that facilitates the integration of a generic sentence realization system within end-to-end language processing applications. WIDL-expressions compactly represent probability distributions over finite sets of candidate realizations, and have optimal algorithms for realization via interpolation with language model probability distributions. We show the effectiveness of a WIDL-based NLG system in two sentence realization tasks: automatic translation and headline generation.

1 Introduction

The Natural Language Generation (NLG) community has produced over the years a considerable number of generic sentence realization systems: Penman (Matthiessen and Bateman, 1991), FUF (Elhadad, 1991), Nitrogen (Knight and Hatzivassiloglou, 1995), Fergus (Bangalore and Rambow, 2000), HALogen (Langkilde-Geary, 2002), Amalgam (Corston-Oliver et al., 2002), etc. However, when it comes to end-to-end, text-to-text applications (Machine Translation, Summarization, Question Answering), these generic systems either cannot be employed, or, in instances where they can be, their results are significantly below those of state-of-the-art, application-specific systems (Hajic et al., 2002; Habash, 2003). We believe two reasons explain this state of affairs.

First, these generic NLG systems use input representation languages with complex syntax and semantics. These languages involve deep, semantic-based subject-verb or verb-object relations (such as ACTOR, AGENT, PATIENT, etc., for Penman and FUF), syntactic relations (such as subject, object, premod, etc., for HALogen), or lexical dependencies (Fergus, Amalgam). Such inputs cannot be accurately produced by state-of-the-art analysis components from arbitrary textual input in the context of text-to-text applications.

Second, most of the recent systems (starting with Nitrogen) have adopted a hybrid approach to generation, which has increased their robustness. These hybrid systems use, in a first phase, symbolic knowledge to (over)generate a large set of candidate realizations, and, in a second phase, statistical knowledge about the target language (such as stochastic language models) to rank the candidate realizations and find the best-scoring one. The disadvantage of the hybrid approach, from the perspective of integrating these systems within end-to-end applications, is that the two generation phases cannot be tightly coupled. More precisely, input-driven preferences and target-language-driven preferences cannot be integrated in a true probabilistic model that can be trained and tuned for maximum performance.

In this paper, we propose WIDL-expressions (WIDL stands for Weighted Interleave, Disjunction, and Lock, after the names of the main operators) as a representation formalism that facilitates the integration of a generic sentence realization system within end-to-end language applications. The WIDL formalism, an extension of the IDL-expressions formalism of Nederhof and Satta (2004), has several crucial properties that differentiate it from previously-proposed NLG representation formalisms.
First, it has a simple syntax (expressions are built using four operators) and a simple, formal semantics (probability distributions over finite sets of strings). Second, it is a compact representation that grows linearly in the number of words available for generation (see Section 2). (In contrast, representations such as word lattices (Knight and Hatzivassiloglou, 1995) or non-recursive CFGs (Langkilde-Geary, 2002) require exponential space in the number of words available for generation (Nederhof and Satta, 2004).) Third, it has good computational properties, such as optimal algorithms for intersection with n-gram language models (Section 3). Fourth, it is flexible with respect to the amount of linguistic processing required to produce WIDL-expressions directly from text (Sections 4 and 5). Fifth, it allows for a tight integration of input-specific preferences and target-language preferences via interpolation of probability distributions using log-linear models. We show the effectiveness of our proposal by directly employing a generic WIDL-based generation system in two end-to-end tasks: machine translation and automatic headline generation.

2 The WIDL Representation Language

2.1 WIDL-expressions

In this section, we introduce WIDL-expressions, a formal language used to compactly represent probability distributions over finite sets of strings. Given a finite alphabet of symbols $\Sigma$, atomic WIDL-expressions are of the form $a$, with $a \in \Sigma$. For an atomic WIDL-expression $w = a$, its semantics is the probability distribution $p_w : \{a\} \to [0,1]$ with $p_w(a) = 1$. Complex WIDL-expressions are created from other WIDL-expressions by employing the following four operators, together with operator distribution functions $\delta$ from an alphabet $\Delta$.

Weighted Disjunction. If $w_1, \ldots, w_n$ are WIDL-expressions, then $w = \vee_{\delta_0}(w_1, \ldots, w_n)$, with $\delta_0 \in \Delta$ specified such that $\delta_0(i) = p_i$ and $\sum_{i=1}^{n} p_i = 1$, is a WIDL-expression. Its semantics is a probability distribution $p_w : D \to [0,1]$, where $D = \bigcup_{i=1}^{n} \mathrm{dom}(p_{w_i})$, and the probability values are induced by $\delta_0$ and the $p_{w_i}$, $1 \le i \le n$. For example, if $w = \vee_{\delta_0}(a, b)$ with $\delta_0 = \{1 \to p_1, 2 \to p_2\}$, its semantics is a probability distribution over $D = \{a, b\}$, defined by $p_w(a) = p_1$ and $p_w(b) = p_2$.

Weighted Interleave. If $w_1, \ldots, w_n$ are WIDL-expressions, then $w = \parallel_{\delta}(w_1, \ldots, w_n)$, with $\delta \in \Delta$, is a WIDL-expression. Its semantics is a probability distribution $p_w : D \to [0,1]$, where $D$ consists of all the possible interleavings of strings from $\mathrm{dom}(p_{w_i})$, $1 \le i \le n$, and the probability values are induced by $\delta$ and the $p_{w_i}$. The distribution function $\delta$ is defined either explicitly, over $\mathrm{Perm}(n)$ (the set of all permutations of $n$ elements), or implicitly. Because the set of argument permutations is a subset of all possible interleavings, $\delta$ also needs to specify the probability mass reserved for the strings that are not argument permutations (the shuffles). For example, if $w = \parallel_{\delta}(a, b)$ with $\delta = \{1\,2 \to p_{12},\ 2\,1 \to p_{21}\}$, its semantics is a probability distribution with domain $D = \{ab, ba\}$, defined by $p_w(ab) = p_{12}$ and $p_w(ba) = p_{21}$.

Precedence. If $w_1, \ldots, w_n$ are WIDL-expressions, then $w = w_1 \cdot w_2 \cdots w_n$ is a WIDL-expression. Its semantics is a probability distribution $p_w : D \to [0,1]$, where $D$ is the set of all strings that obey the precedence order imposed over the arguments, and the probability values are induced by the $p_{w_i}$. For example, if $w = a \cdot b$, then $w$ represents the probability distribution with domain $\{ab\}$ and $p_w(ab) = 1$.

Lock. If $w_1$ is a WIDL-expression, then $w = \times(w_1)$ is a WIDL-expression. Its semantic mapping is the same as $p_{w_1}$, except that its domain contains strings in which no additional symbol can be interleaved. For example, if $w = \parallel_{\delta}(\times(a \cdot b), c)$, its semantics is a probability distribution with domain $\{abc, cab\}$: the lock prevents $c$ from being interleaved between $a$ and $b$.
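To make these definitions concrete, here is a small, brute-force Python sketch of the semantics (our own illustration, not the paper's implementation): each operator maps the distributions of its arguments into a new distribution over sequences of locked units, and an explicit `shuffle_mass` stands in for the probability that $\delta$ reserves for non-permutation interleavings.

```python
import itertools
from collections import defaultdict

# A realization is a tuple of locked "units"; each unit is a tuple of words.

class Atom:                                   # atomic expression: one symbol
    def __init__(self, word):
        self.word = word
    def dist(self):
        return {((self.word,),): 1.0}

class Seq:                                    # precedence: w1 . w2 ... wn
    def __init__(self, *args):
        self.args = args
    def dist(self):
        d = {(): 1.0}
        for a in self.args:
            d = {s + t: p * q for s, p in d.items()
                              for t, q in a.dist().items()}
        return d

class Or:                                     # weighted disjunction V_delta0
    def __init__(self, weighted):             # [(expr, prob)]
        self.weighted = weighted
    def dist(self):
        d = defaultdict(float)
        for a, p in self.weighted:
            for s, q in a.dist().items():
                d[s] += p * q
        return dict(d)

class Lock:                                   # x(w): fuse into a single unit
    def __init__(self, arg):
        self.arg = arg
    def dist(self):
        return {(tuple(w for u in s for w in u),): p
                for s, p in self.arg.dist().items()}

def interleavings(seqs):
    """All interleavings of unit sequences, preserving internal orders."""
    if all(not s for s in seqs):
        yield ()
        return
    for i, s in enumerate(seqs):
        if s:
            for tail in interleavings(seqs[:i] + [s[1:]] + seqs[i + 1:]):
                yield (s[0],) + tail

class Interleave:                             # ||_delta with explicit perms
    def __init__(self, args, perm_probs, shuffle_mass=0.0):
        self.args, self.perm_probs = args, perm_probs
        self.shuffle_mass = shuffle_mass      # mass for non-permutations
    def dist(self):
        d = defaultdict(float)
        for choice in itertools.product(*(a.dist().items() for a in self.args)):
            seqs = [list(s) for s, _ in choice]
            p_args = 1.0
            for _, q in choice:
                p_args *= q
            blocks = {tuple(u for i in o for u in seqs[i])
                      for o in itertools.permutations(range(len(seqs)))}
            shuffles = set(interleavings(seqs)) - blocks
            for order, p in self.perm_probs.items():   # orders are 1-based
                d[tuple(u for i in order for u in seqs[i - 1])] += p_args * p
            for s in shuffles:                # spread shuffle mass uniformly
                d[s] += p_args * self.shuffle_mass / len(shuffles)
        return dict(d)

def strings(expr):
    return {' '.join(w for u in s for w in u): p for s, p in expr.dist().items()}

# One atom and one unlocked two-word sequence: two argument orders, plus a
# true shuffle 'b a c' that receives the whole 0.1 shuffle mass.
w = Interleave([Atom('a'), Seq(Atom('b'), Atom('c'))],
               {(1, 2): 0.6, (2, 1): 0.3}, shuffle_mass=0.1)
print(strings(w))  # {'a b c': 0.6, 'b a c': 0.1, 'b c a': 0.3}
```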
In Figure 1, we show a more complex WIDL-expression. The probability distribution $\delta_1$ associated with the $\parallel_{\delta_1}$ operator assigns probability 0.2 to the argument order $2\,1\,3$; from a probability mass of 0.7, it assigns uniformly, for each of the remaining argument permutations, a permutation probability value of $0.7/5 = 0.14$. The remaining probability mass of 0.1 is left for the 12 shuffles associated with the unlocked expression, for a shuffle probability of $0.1/12 \approx 0.008$. The list below enumerates some of the (string, probability) pairs that belong to the probability distribution defined by our example:

  rebels fighting turkish government in iraq    0.130
  in iraq attacked rebels turkish government    0.049
  in turkish government iraq rebels fighting    0.005

Figure 1: An example of a WIDL-expression, using the operator distributions $\delta_1 = \{2\,1\,3 \to 0.2,\ \mathrm{other\ perms} \to 0.7,\ \mathrm{shuffles} \to 0.1\}$ and $\delta_2 = \{1 \to 0.65,\ 2 \to 0.35\}$.

The following result characterizes an important representation property for WIDL-expressions.

Theorem 1 A WIDL-expression over $\Sigma$ and $\Delta$ using $n$ atomic expressions has space complexity $O(n)$, if the operator distribution functions of its $\Delta$ have space complexity at most $O(n)$.

For proofs and more details regarding WIDL-expressions, we refer the interested reader to (Soricut, 2006). Theorem 1 ensures that high-complexity hypothesis spaces can be represented efficiently by WIDL-expressions (Section 5).
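A toy illustration of this gap between representation size and hypothesis-space size (our own, not from the paper): an interleave over $k$ locked phrases is written with $k$ sub-expressions and one distribution, yet its argument permutations alone already number $k!$.

```python
import math

# Linear-size expression, factorial-size candidate set: an interleave of k
# locked phrases stores k sub-expressions, but denotes k! permutations
# (and more once sub-expressions are unlocked or disjunctive).
for k in range(2, 9):
    print(f'{k} phrases -> {math.factorial(k):5d} argument permutations')
```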
2.2 WIDL-graphs and Probabilistic Finite-State Acceptors

WIDL-graphs. Equivalent at the representation level with WIDL-expressions, WIDL-graphs allow for convenient formulations of the algorithms that process them. For each WIDL-expression $w$, there exists an equivalent WIDL-graph $G_w$. As an example, we illustrate in Figure 2(a) the WIDL-graph corresponding to the WIDL-expression in Figure 1. WIDL-graphs have an initial vertex $v_s$ and a final vertex $v_e$. Vertices with in-going and out-going edges labeled $\langle_{\delta_1}$ and $\rangle_{\delta_1}$, respectively, result from the expansion of the $\parallel_{\delta_1}$ operator, and vertices with in-going and out-going edges labeled $(_{\delta_2}$ and $)_{\delta_2}$, respectively, result from the expansion of the $\vee_{\delta_2}$ operator.

With each WIDL-graph $G_w$, we associate a probability distribution. The domain of this distribution is the finite collection of strings that can be generated from the paths of a WIDL-specific traversal of $G_w$, starting from $v_s$ and ending in $v_e$. Each path (and its associated string) has a probability value induced by the probability distribution functions associated with the edge labels of $G_w$. A WIDL-expression $w$ and its corresponding WIDL-graph $G_w$ are said to be equivalent because they represent the same distribution $p_w$.

WIDL-graphs and probabilistic FSA. Probabilistic finite-state acceptors (pFSA) are a well-known formalism for representing probability distributions (Mohri et al., 2002). For a WIDL-expression $w$, we define a mapping, called UNFOLD, between the WIDL-graph $G_w$ and a pFSA $A_w$. A state in $A_w$ is created for each set of WIDL-graph vertices that can be reached simultaneously when traversing the graph. Each state records, in what we call a $\parallel$-stack, the order in which the $\langle_\delta$-, $\rangle_\delta$-bordered sub-graphs are traversed; for instance, the bottom state in Figure 2(b) corresponds to reaching vertices in two different $\parallel$-bordered sub-graphs, in a particular order. A transition labeled $\alpha$ between two states $s$ and $s'$ in $A_w$ exists if there exist a vertex $v_i$ in the description of $s$ and a vertex $v_j$ in the description of $s'$ such that there exists a path in $G_w$ between $v_i$ and $v_j$, and $\alpha$ is the only $\alpha$-labeled transition in this path. The $\epsilon$-transitions are responsible for adding and removing, respectively, the $\langle_\delta$, $\rangle_\delta$ symbols in the $\parallel$-stack. The probabilities associated with $A_w$ transitions are computed using the vertex set and the $\parallel$-stack of each $A_w$ state, together with the distribution functions of the $\parallel$ and $\vee$ operators. For a detailed presentation of the UNFOLD relation we refer the reader to (Soricut, 2006).

Figure 2: The WIDL-graph corresponding to the WIDL-expression in Figure 1 is shown in (a). The probabilistic finite-state acceptor (pFSA) that corresponds to the WIDL-graph is shown in (b).
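The following sketch shows the shape of an on-demand UNFOLD step, under our own simplified encoding (a `graph.edges(v)` iterator yielding `(label, target, prob)` triples is assumed; the real construction in (Soricut, 2006) is more involved):

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass(frozen=True)
class PState:
    """A pFSA state: the WIDL-graph vertices reached simultaneously,
    plus a ||-stack recording which ||-bordered sub-graphs are open."""
    vertices: FrozenSet[int]
    stack: Tuple[str, ...]

def successors(state, graph):
    """Lazily expand one state of the pFSA A_w."""
    for v in state.vertices:
        for label, target, prob in graph.edges(v):
            verts = (state.vertices - {v}) | {target}
            if label.startswith('<'):            # epsilon: open a || sub-graph
                yield None, PState(verts, state.stack + (label,)), prob
            elif label.startswith('>'):          # epsilon: close it again
                yield None, PState(verts, state.stack[:-1]), prob
            else:                                # emit a word of the realization
                yield label, PState(verts, state.stack), prob
```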
3 Stochastic Language Generation from WIDL-expressions

3.1 Interpolating Probability Distributions in a Log-linear Framework

Let us assume a finite set $E$ of strings over a finite alphabet $\Sigma$, representing the set of possible sentence realizations. In a log-linear framework, we have a vector of feature functions $h = \langle h_0, h_1, \ldots, h_M \rangle$ and a vector of parameters $\lambda = \langle \lambda_0, \lambda_1, \ldots, \lambda_M \rangle$. For any $e \in E$, the interpolated probability $P(e)$ can be written under a log-linear model as in Equation 1:

  $P(e) = \dfrac{\exp\left(\sum_{m=0}^{M} \lambda_m h_m(e)\right)}{\sum_{e' \in E} \exp\left(\sum_{m=0}^{M} \lambda_m h_m(e')\right)}$   (1)

We can formulate the search problem of finding the most probable realization under this model as shown in Equation 2, and therefore we do not need to be concerned about computing expensive normalization factors:

  $e^{*} = \arg\max_{e \in E} \exp\left(\sum_{m=0}^{M} \lambda_m h_m(e)\right)$   (2)

For a given WIDL-expression $w$ over $\Sigma$, the set $E$ is defined by $\mathrm{dom}(p_w)$, and feature function $h_0$ is taken to be $p_w$. Any language model we want to employ may be added in Equation 2 as an additional feature function $h_m$.

3.2 Algorithms for Intersecting WIDL-expressions with Language Models

Algorithm WIDL-NGLM-A* (Figure 3) solves the search problem defined by Equation 2 for a WIDL-expression (which provides feature function $h_0$) and n-gram language models (which provide feature functions $h_1, \ldots, h_M$). It does so by incrementally computing UNFOLD (i.e., on-demand computation of the corresponding pFSA $A_w$), while keeping track of a set of active states. Using Equation 1 (unnormalized), the algorithm EVALUATEs the current scores of the newly UNFOLDed states; EVALUATE also uses an admissible heuristic function to compute future (admissible) scores for these states. The algorithm PUSHes each newly unfolded state into a priority queue, which sorts the states according to their total score (current + admissible). In the next iteration, the set of active states is a singleton containing the state POPed from the top of the queue. The admissible heuristic function we use is the one defined in (Soricut and Marcu, 2005), using Equation 1 (unnormalized) for computing the event costs. Given the existence of the admissible heuristic and the monotonicity property of the unfolding provided by the priority queue, the proof of A* optimality (Russell and Norvig, 1995) guarantees that WIDL-NGLM-A* finds a path in $A_w$ that provides an optimal solution.

Figure 3: A* algorithm for interpolating WIDL-expressions with n-gram language models (pseudocode: UNFOLD the active states, EVALUATE them, PUSH them into the priority queue, POP the best; see the sketch below).
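A minimal sketch of this loop (our paraphrase of Figure 3, not the released system), assuming the `successors` function above, feature callbacks `features` with weights `lam`, and an admissible future-score estimate `admissible`:

```python
import heapq, itertools, math

def widl_nglm_astar(start, graph, features, lam, admissible, is_final):
    """Equation 2 as best-first search: maximize sum_m lam[m] * h_m(e).
    Heap keys are negated (current + admissible) scores; states of the
    pFSA A_w are UNFOLDed only when popped, never eagerly."""
    tick = itertools.count()                  # tie-breaker for equal scores
    heap = [(-admissible(start), next(tick), 0.0, start, ())]
    while heap:
        _, _, score, state, words = heapq.heappop(heap)
        if is_final(state):
            return words, math.exp(score)     # unnormalized model score
        for word, nxt, prob in successors(state, graph):
            nwords = words + (word,) if word else words
            gain = sum(l * h(word, nwords, prob)
                       for l, h in zip(lam, features))
            heapq.heappush(heap, (-(score + gain + admissible(nxt)),
                                  next(tick), score + gain, nxt, nwords))
    return None, 0.0
```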
An important property of the WIDL-NGLM-A* algorithm is that the UNFOLD relation (and, implicitly, the $A_w$ acceptor) is computed only partially, for those states for which the total cost is less than the cost of the optimal path. This results in important savings, both in space and time, over simply running a single-source shortest-path algorithm for directed acyclic graphs (Cormen et al., 2001) over the full $A_w$ acceptor (Soricut and Marcu, 2005).

4 Headline Generation using WIDL-expressions

We employ the WIDL formalism (Section 2) and the WIDL-NGLM-A* algorithm (Section 3) in a summarization application that aims at producing both informative and fluent headlines. Our headlines are generated in an abstractive, bottom-up manner, starting from words and phrases. A more common, extractive approach operates top-down, by starting from an extracted sentence that is compressed (Dorr et al., 2003) and annotated with additional information (Zajic et al., 2004).

Automatic Creation of WIDL-expressions for Headline Generation. We generate WIDL-expressions starting from an input document. First, we extract a weighted list of topic keywords from the input document using the algorithm of Zhou and Hovy (2003). This list is enriched with phrases created from the lexical dependencies the topic keywords have in the input document. We associate probability distributions with these phrases using their frequency (we assume that higher frequency is indicative of increased importance) and their position in the document (we assume that proximity to the beginning of the document is also indicative of importance).
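The paper does not spell out the weighting formula, so the sketch below is one plausible reading (the `alpha` mix and the normalization are our assumptions): frequency rewards repeated phrases, position rewards early mentions, and the result is renormalized into a distribution.

```python
from collections import Counter

def phrase_weights(occurrences, first_pos, doc_len, alpha=0.5):
    """occurrences: list of phrase mentions; first_pos: phrase -> token
    offset of its first mention. Returns a probability distribution."""
    freq = Counter(occurrences)
    raw = {p: alpha * n / len(occurrences)                  # frequency share
              + (1 - alpha) * (1 - first_pos[p] / doc_len)  # earliness bonus
           for p, n in freq.items()}
    z = sum(raw.values())
    return {p: v / z for p, v in raw.items()}

print(phrase_weights(['attacked rebels', 'rebels fighting', 'attacked rebels'],
                     {'attacked rebels': 40, 'rebels fighting': 310}, 1000))
```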
In Figure 4, we present an example of input keywords and lexical-dependency phrases automatically extracted from a document describing incidents at the Turkey-Iraq border.

Figure 4: Input and output for our automatic headline generation system.
  Keywords: iraq 0.32, syria 0.25, rebels 0.22, kurdish 0.17, turkish 0.14, attack 0.10
  Phrases: iraq: {in iraq 0.4, northern iraq 0.5, iraq and iran 0.1}; syria: {into syria 0.6, and syria 0.4}; rebels: {attacked rebels 0.7, rebels fighting 0.3}; ...
  Output: TURKISH GOVERNMENT ATTACKED REBELS IN IRAQ AND SYRIA

The algorithm for producing WIDL-expressions combines the lexical-dependency phrases for each keyword using a $\vee$ operator, with the probability value of each phrase multiplied by the probability value of its topic keyword. It then combines all the $\vee$-headed expressions into a single WIDL-expression using a $\parallel$ operator with a uniform distribution. The WIDL-expression in Figure 1 is a (scaled-down) example of the expressions created by this algorithm. On average, a WIDL-expression created by this algorithm from the extracted keywords and their lexical-dependency phrases compactly encodes a candidate set of about 3 million possible realizations; as the specification of a uniform $\parallel$ distribution takes constant space, Theorem 1 guarantees that the space complexity of these expressions is linear in the input size. Finally, we generate headlines from WIDL-expressions using the WIDL-NGLM-A* algorithm, which interpolates the probability distributions represented by the WIDL-expressions with n-gram language model distributions. The output presented in Figure 4 is the most likely headline realization produced by our system; the construction is sketched below.
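Reusing the expression classes from the Section 2.1 sketch, the construction could look as follows; the `Phrase` helper, the shuffle mass, and the unnormalized keyword-times-phrase weights are our reading of the paper's description:

```python
import itertools, math

def Phrase(text):
    """A locked multi-word phrase (single words need no lock)."""
    ws = text.split()
    return Lock(Seq(*[Atom(w) for w in ws])) if len(ws) > 1 else Atom(text)

def headline_expr(keywords, shuffle_mass=0.1):
    """keywords: [(keyword_weight, [(phrase, prob), ...]), ...].
    One V per keyword (phrase prob scaled by keyword weight), all joined
    by a || with a uniform distribution over argument permutations."""
    args = [Or([(Phrase(p), kw * q) for p, q in phrases])
            for kw, phrases in keywords]
    k = len(args)
    uniform = {perm: (1.0 - shuffle_mass) / math.factorial(k)
               for perm in itertools.permutations(range(1, k + 1))}
    return Interleave(args, uniform, shuffle_mass)

expr = headline_expr([
    (0.32, [('in iraq', 0.4), ('northern iraq', 0.5), ('iraq and iran', 0.1)]),
    (0.22, [('attacked rebels', 0.7), ('rebels fighting', 0.3)]),
])
print(sorted(strings(expr).items(), key=lambda kv: -kv[1])[:3])
```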
Headline Generation Evaluation. To evaluate the accuracy of our headline generation system, we use the documents from the DUC 2003 evaluation competition. Half of these documents are used as development set (283 documents), and the other half is used as test set (273 documents). We automatically measure performance by comparing the produced headlines against one reference headline produced by a human, using ROUGE (Lin, 2004). For each input document, we train two language models, using the SRI Language Modeling Toolkit (with modified Kneser-Ney smoothing). A general trigram language model, trained on 170M English words from the Wall Street Journal, is used to model fluency. A document-specific trigram language model, trained on-the-fly for each input document, accounts for both fluency and content validity. We also employ a word-count model (which counts the number of words in a proposed realization) and a phrase-count model (which counts the number of phrases in a proposed realization), which allow us to learn to produce headlines that observe the restriction on the number of words allowed (10, in our case).

The interpolation weights (Equation 2) are trained using discriminative training (Och, 2003), with ROUGE as the objective function, on the development set. The results are presented in Table 1. We compare the performance of several extractive algorithms (which operate on an extracted sentence to arrive at a headline) against several abstractive algorithms (which create headlines starting from scratch). For the extractive algorithms, Lead10 is a baseline which simply proposes as headline the lead sentence, cut after the first 10 words. HedgeTrimmer is our implementation of the Hedge Trimmer system (Dorr et al., 2003), and Topiary is our implementation of the Topiary system (Zajic et al., 2004). For the abstractive algorithms, Keywords is a baseline that proposes as headline the sequence of topic keywords, Webcl is the system described in (Zhou and Hovy, 2003), and WIDL-A* is the algorithm described in this paper.

Table 1: Headline generation evaluation. We compare extractive algorithms against abstractive algorithms, including our WIDL-based algorithm.

               ALG            uni   bi   Len.  Rouge(uni)  Rouge(bi)
  Extractive   Lead10         458   114   9.9     20.8       11.1
               HedgeTrimmer   399   104   7.4     18.1        9.9
               Topiary        576   115   9.9     26.2       12.5
  Abstractive  Keywords       585    22   9.9     26.6        5.5
               Webcl          311    76   7.3     14.1        7.5
               WIDL-A*        562   126  10.0     25.5       12.9

This evaluation shows that our WIDL-based approach to generation is capable of obtaining headlines that compare favorably, in both content and fluency, with extractive, state-of-the-art results (Zajic et al., 2004), while it outperforms a previously-proposed abstractive system by a wide margin (Zhou and Hovy, 2003). Also note that our evaluation makes these results directly comparable, as the systems use the same parsing and topic identification algorithms. In Figure 5, we present a sample of headlines produced by our system, which includes both good and not-so-good outputs.

Figure 5: Headlines generated automatically using a WIDL-based sentence realization system:
  THREE GORGES PROJECT IN CHINA HAS WON APPROVAL
  WATER IS LINK BETWEEN CLUSTER OF E. COLI CASES
  SRI LANKA'S JOINT VENTURE TO EXPAND EXPORTS
  OPPOSITION TO EUROPEAN UNION SINGLE CURRENCY EURO
  OF INDIA AND BANGLADESH WATER BARRAGE

5 Machine Translation using WIDL-expressions

We also employ our WIDL-based realization engine in a machine translation application that uses a two-phase generation approach: in a first phase, WIDL-expressions representing large sets of possible translations are created from input foreign-language sentences; in a second phase, we use our generic, WIDL-based sentence realization engine to intersect the WIDL-expressions with an n-gram language model. In the experiments reported here, we translate between Chinese (source language) and English (target language).

Automatic Creation of WIDL-expressions for MT. We generate WIDL-expressions from Chinese strings by exploiting a phrase-based translation table (Koehn et al., 2003). We use an algorithm resembling probabilistic bottom-up parsing to build a WIDL-expression for an input Chinese string: each contiguous span of the Chinese string is considered a possible "constituent", and the "non-terminals" associated with each constituent are the English phrase translations that correspond to it in the translation table. Multiple-word English phrases are represented as WIDL-expressions using the precedence ($\cdot$) and lock ($\times$) operators. To limit the number of possible translations corresponding to a Chinese span, we use a probabilistic beam and a histogram beam to beam out low-probability translation alternatives. At this point, each span is "tiled" with likely translations taken from the translation table. Tiles that are adjacent are joined together in a larger tile by a $\parallel$ operator; that is, reorderings of the component tiles are permitted by the $\parallel$ operators (assigned non-zero probability), but the longer the movement from the original order of the tiles, the lower the probability. (This distortion model is similar to the one used in (Koehn, 2004).) When multiple tiles are available for the same span, they are joined by a $\vee$ operator, whose distribution is specified by the probability distributions recorded in the translation table. Usually, statistical phrase-based translation tables specify not one, but multiple distributions that account for context preferences; in our experiments, we consider four such probability distributions over the Chinese-English phrase translations in the table (conditional phrase and lexical probabilities in both directions). In Figure 6, we show an example of a WIDL-expression created by this algorithm. On average, a WIDL-expression created by this algorithm for an input sentence of average length (30 words) encodes a candidate set of possible translations that is exponentially large in the sentence length. As the specification of the operator distributions takes linear space, Theorem 1 guarantees that these WIDL-expressions encode these huge spaces compactly, in linear space.

Figure 6: A Chinese string is converted into a WIDL-expression, which provides a translation as the best-scoring hypothesis under the interpolation with a trigram language model (output: "the gunman was killed by police."; the English reference is "the gunman was shot dead by the police.").
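A compressed sketch of this first phase, again on top of the Section 2.1 classes; the greedy longest-match tiling, the exponential distortion decay, and the table layout are our assumptions (the paper only fixes the beams, 0.01 and 10):

```python
import itertools

def distortion(perm, eta=0.5):
    """Decay with total displacement from the original tile order."""
    return eta ** sum(abs(pos - i) for i, pos in enumerate(perm, start=1))

def mt_expr(src_words, table, prob_beam=0.01, hist_beam=10, shuffle_mass=0.1):
    """table: {chinese_span: [(english_phrase, prob), ...]}."""
    tiles, i = [], 0
    while i < len(src_words):
        for j in range(len(src_words), i, -1):       # longest span in table
            span = ' '.join(src_words[i:j])
            if span in table or j == i + 1:
                break
        alts = sorted(table.get(span, [(span, 1.0)]), key=lambda x: -x[1])
        alts = [(e, p) for e, p in alts[:hist_beam]   # histogram beam
                if p >= prob_beam * alts[0][1]]       # probabilistic beam
        tiles.append(Or([(Phrase(e), p) for e, p in alts]))
        i = j
    k = len(tiles)            # join tiles: || with distortion-decayed perms
    raw = {o: distortion(o) for o in itertools.permutations(range(1, k + 1))}
    z = sum(raw.values())
    perms = {o: (1.0 - shuffle_mass) * v / z for o, v in raw.items()}
    return Interleave(tiles, perms, shuffle_mass)
```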
In the second phase, we employ our WIDL-based realization engine to interpolate the probability distributions of WIDL-expressions with a trigram language model. In the notation of Equation 2, we use four feature functions for the WIDL-expression distributions (one for each encoded probability distribution), a feature function for the trigram language model, a feature function for a word-count model, and a feature function for a phrase-count model. As acknowledged in the Machine Translation literature (Germann et al., 2003), full A* search is not usually possible, due to the large size of the search spaces. We therefore use an approximation algorithm, WIDL-NGLM-A*_b, which considers for unfolding only the nodes extracted from the priority queue that have already unfolded a path of length greater than or equal to the maximum length already unfolded minus a constant b.

MT Performance Evaluation. We evaluate against the state-of-the-art phrase-based decoder Pharaoh (Koehn, 2004), using the same experimental conditions: translation table trained on the FBIS corpus (7.2M Chinese words and 9.2M English words of parallel text), trigram language model trained on 155M words of English newswire, interpolation weights (Equation 2) trained using discriminative training (Och, 2003) on the 2002 NIST MT evaluation set, probabilistic beam set to 0.01, histogram beam set to 10, and BLEU (Papineni et al., 2002) as the metric. The WIDL-NGLM-A* algorithm produces translations with a BLEU score of 0.2570, while Pharaoh translations have a BLEU score of 0.2635; the difference is not statistically significant at the 95% confidence level. These results show that the WIDL-based approach is powerful enough to achieve translation accuracy comparable with state-of-the-art phrase-based systems.

6 Conclusions

The approach to sentence realization we advocate in this paper relies on WIDL-expressions, a formal language with convenient theoretical properties that can accommodate a wide range of generation scenarios. In the worst case, one can work with simple bags of words that encode no context
preferences (Soricut and Marcu, 2005). One can also work with bags of words and phrases that encode context preferences, a scenario that applies to current approaches in statistical machine translation (Section 5). And one can also encode the context and ordering preferences typically used in summarization (Section 4). The generation engine we describe enables a tight coupling of content selection with sentence realization preferences, and its algorithm comes with theoretical guarantees about optimality. Because the requirements for producing WIDL-expressions are minimal, our WIDL-based generation engine can be employed, with state-of-the-art results, in a variety of text-to-text applications.

Acknowledgments

This work was partially supported under the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022.

References

Srinivas Bangalore and Owen Rambow. 2000. Using TAG, a tree model, and a language model for generation. In Proceedings of the Fifth International Workshop on Tree-Adjoining Grammars (TAG+).

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001. Introduction to Algorithms. The MIT Press and McGraw-Hill.

Simon Corston-Oliver, Michael Gamon, Eric K. Ringger, and Robert Moore. 2002. An overview of Amalgam: A machine-learned generation module. In Proceedings of the INLG.

Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge Trimmer: a parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL Text Summarization Workshop, pages 1-8.

Michael Elhadad. 1991. FUF User Manual, Version 5.0. Technical Report CUCS-038-91, Department of Computer Science, Columbia University.

Ulrich Germann, Mike Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2003. Fast decoding and optimal decoding for machine translation. Artificial Intelligence, 154(1-2):127-143.

Nizar Habash. 2003. Matador: A large-scale Spanish-English GHMT system. In Proceedings of the AMTA.

J. Hajic, M. Cmejrek, B. Dorr, Y. Ding, J. Eisner, D. Gildea, T. Koo, K. Parton, G. Penn, D. Radev, and O. Rambow. 2002. Natural language generation in the context of machine translation. Summer workshop final report, Johns Hopkins University.

Kevin Knight and Vasileios Hatzivassiloglou. 1995. Two-level, many-paths generation. In Proceedings of the ACL.

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the HLT-NAACL, pages 127-133.

Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the AMTA, pages 115-124.

Irene Langkilde-Geary. 2002. A foundation for general-purpose natural language generation: sentence realization using probabilistic models of language. Ph.D. thesis, University of Southern California.

Chin-Yew Lin. 2004. ROUGE: a package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).

Christian Matthiessen and John Bateman. 1991. Text Generation and Systemic-Functional Linguistics. Pinter Publishers, London.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69-88.

Mark-Jan Nederhof and Giorgio Satta. 2004. IDL-expressions: a formalism for representing and parsing finite languages in natural language processing. Journal of Artificial Intelligence Research, 21:287-317.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the ACL, pages 160-167.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the ACL, pages 311-318.

Stuart Russell and Peter Norvig. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.

Radu Soricut and Daniel Marcu. 2005. Towards developing generation algorithms for text-to-text applications. In Proceedings of the ACL, pages 66-74.

Radu Soricut. 2006. Natural Language Generation for Text-to-Text Applications Using an Information-Slim Representation. Ph.D. thesis, University of Southern California.

David Zajic, Bonnie J. Dorr, and Richard Schwartz. 2004. BBN/UMD at DUC-2004: Topiary. In Proceedings of the NAACL Workshop on Document Understanding, pages 112-119.

Liang Zhou and Eduard Hovy. 2003. Headline summarization at ISI. In Proceedings of the NAACL Workshop on Document Understanding.