Learning to Generate Naturalistic Utterances Using Reviews in Spoken Dialogue Systems

Ryuichiro Higashinaka
NTT Corporation
rh@cslab.kecl.ntt.co.jp

Rashmi Prasad
University of Pennsylvania
rjprasad@linc.cis.upenn.edu

Marilyn A. Walker
University of Sheffield
walker@dcs.shef.ac.uk

Abstract

Spoken language generation for dialogue systems requires a dictionary of mappings between semantic representations of concepts the system wants to express and realizations of those concepts. Dictionary creation is a costly process; it is currently done by hand for each dialogue domain. We propose a novel unsupervised method for learning such mappings from user reviews in the target domain, and test it on restaurant reviews. We test the hypothesis that user reviews that provide individual ratings for distinguished attributes of the domain entity make it possible to map review sentences to their semantic representation with high precision. Experimental analyses show that the mappings learned cover most of the domain ontology, and provide good linguistic variation. A subjective user evaluation shows that the consistency between the semantic representations and the learned realizations is high and that the naturalness of the realizations is higher than a hand-crafted baseline.

An example user review (we8there.com)
  Ratings: Food=5, Service=5, Atmosphere=5, Value=5, Overall=5
  Review comment: The best Spanish food in New York. I am from Spain and I had my 28th birthday there and we all had a great time. Salud!

Review comment after named entity recognition
  The best {NE=foodtype, string=Spanish} {NE=food, string=food, rating=5} in {NE=location, string=New York}. ...

Mapping between a semantic representation (a set of relations) and a syntactic structure (DSyntS)
  · Relations:
      RESTAURANT has FOODTYPE
      RESTAURANT has foodquality=5
      RESTAURANT has LOCATION
      ([foodtype, food=5, location] for shorthand.)
  · DSyntS:
      lexeme: food, class: common noun, number: sg, article: def
        ATTR  lexeme: best, class: adjective
        ATTR  lexeme: FOODTYPE, class: common noun, number: sg, article: no-art
        ATTR  lexeme: in, class: preposition
          II  lexeme: LOCATION, class: proper noun, number: sg, article: no-art

Figure 1: Example of procedure for acquiring a generation dictionary mapping.

1 Introduction

One obstacle to the widespread deployment of spoken dialogue systems is the cost involved with hand-crafting the spoken language generation module. Spoken language generation requires a dictionary of mappings between semantic representations of concepts the system wants to express and realizations of those concepts. Dictionary creation is a costly process: an automatic method for creating such dictionaries would make dialogue technology more scalable. A secondary benefit is that a learned dictionary may produce more natural and colloquial utterances.

We propose a novel method for mining user reviews to automatically acquire a domain specific generation dictionary for information presentation in a dialogue system. Our hypothesis is that reviews that provide individual ratings for various distinguished attributes of review entities can be used to map review sentences to a semantic representation. Figure 1 shows a user review in the restaurant domain, where we hypothesize that the user rating food=5 indicates that the semantic representation for the sentence "The best Spanish food in New York" includes the relation 'RESTAURANT has foodquality=5'. We apply the method to extract 451 mappings from restaurant reviews.
Experimental analyses show that the mappings learned cover most of the domain ontology, and provide good linguistic variation. A subjective user evaluation indicates that the consistency between the semantic representations and the learned realizations is high and that the naturalness of the realizations is significantly higher than a hand-crafted baseline.

Section 2 provides a step-by-step description of the method. Sections 3 and 4 present the evaluation results. Section 5 covers related work. Section 6 summarizes and discusses future work.

Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 265-272, Sydney, July 2006. © 2006 Association for Computational Linguistics

2 Learning a Generation Dictionary

Our automatically created generation dictionary consists of triples (U, R, S) representing a mapping between the original utterance U in the user review, its semantic representation R(U), and its syntactic structure S(U). Although templates are widely used in many practical systems (Seneff and Polifroni, 2000; Theune, 2003), we derive syntactic structures to represent the potential realizations, in order to allow aggregation, and other syntactic transformations of utterances, as well as context specific prosody assignment (Walker et al., 2003; Moore et al., 2004).

The method is outlined briefly in Fig. 1 and described below. It comprises the following steps:

1. Collect user reviews on the web to create a population of utterances U.
2. To derive semantic representations R(U):
   · Identify distinguished attributes and construct a domain ontology;
   · Specify lexicalizations of attributes;
   · Scrape webpages' structured data for named-entities;
   · Tag named-entities.
3. Derive syntactic representations S(U).
4. Filter inappropriate mappings.
5. Add mappings (U, R, S) to dictionary.

2.1 Creating the corpus

We created a corpus of restaurant reviews by scraping 3,004 user reviews of 1,810 restaurants posted at we8there.com (http://www.we8there.com/), where each individual review includes a 1-to-5 Likert-scale rating of different restaurant attributes. The corpus consists of 18,466 sentences.

2.2 Deriving semantic representations

The distinguished attributes are extracted from the webpages for each restaurant entity. They include attributes that the users are asked to rate, i.e. food, service, atmosphere, value, and overall, which have scalar values. In addition, other attributes are extracted from the webpage, such as the name, foodtype and location of the restaurant, which have categorical values. The name attribute is assumed to correspond to the restaurant entity.

Dist. Attr.   Lexicalization
food          food, meal
service       service, staff, waitstaff, wait staff, server, waiter, waitress
atmosphere    atmosphere, decor, ambience, decoration
value         value, price, overprice, pricey, expensive, inexpensive, cheap, affordable, afford
overall       recommend, place, experience, establishment

Table 1: Lexicalizations for distinguished attributes.

Given the distinguished attributes, a simple domain ontology can be automatically derived by assuming that a meronymy relation, represented by the predicate 'has', holds between the entity type (RESTAURANT) and the distinguished attributes. Thus, the domain ontology consists of the relations:

  RESTAURANT has foodquality
  RESTAURANT has servicequality
  RESTAURANT has valuequality
  RESTAURANT has overallquality
  RESTAURANT has foodtype
  RESTAURANT has atmospherequality
  RESTAURANT has location

We assume that, although users may discuss other attributes of the entity, at least some of the utterances in the reviews realize the relations specified in the ontology. Our problem then is to identify these utterances.
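The ontology construction just described can be illustrated with a short sketch. It is not the authors' implementation: the function name `build_ontology` and the tuple layout are assumptions made for illustration. It only shows how the seven 'has' relations follow mechanically from the scalar and categorical distinguished attributes.

```python
# Hypothetical sketch: derive the simple domain ontology from the
# distinguished attributes. Names and data layout are illustrative only.
SCALAR_ATTRS = ["food", "service", "atmosphere", "value", "overall"]
CATEGORICAL_ATTRS = ["foodtype", "location"]


def build_ontology(entity, scalar_attrs, categorical_attrs):
    """Return (entity, 'has', attribute) triples for every distinguished attribute."""
    # Scalar attributes become '<attr>quality' relations, as in the paper's list.
    relations = [(entity, "has", f"{attr}quality") for attr in scalar_attrs]
    # Categorical attributes keep their own name.
    relations += [(entity, "has", attr) for attr in categorical_attrs]
    return relations


ontology = build_ontology("RESTAURANT", SCALAR_ATTRS, CATEGORICAL_ATTRS)
for subject, predicate, attribute in ontology:
    print(subject, predicate, attribute)
```

The name attribute is left out on purpose, since the text treats it as the restaurant entity itself rather than as a relation.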
We test the hypothesis that, if an utterance U contains named-entities corresponding to the distinguished attributes, then R for that utterance includes the relation concerning that attribute in the domain ontology. We define named-entities for lexicalizations of the distinguished attributes, starting with the seed word for that attribute on the webpage (Table 1).[1] For named-entity recognition, we use GATE (Cunningham et al., 2002), augmented with named-entity lists for locations, food types, restaurant names, and food subtypes (e.g. pizza), scraped from the we8there webpages.

We also hypothesize that the rating given for the distinguished attribute specifies the scalar value of the relation. For example, a sentence containing food or meal is assumed to realize the relation 'RESTAURANT has foodquality', and the value of the foodquality attribute is assumed to be the value specified in the user rating for that attribute, e.g. 'RESTAURANT has foodquality = 5' in Fig. 1. Similarly, the other relations in Fig. 1 are assumed to be realized by the utterance "The best Spanish food in New York" because it contains one FOODTYPE named-entity and one LOCATION named-entity. Values of categorical attributes are replaced by variables representing their type before the learned mappings are added to the dictionary, as shown in Fig. 1.

[1] In future, we will investigate other techniques for bootstrapping these lexicalizations from the seed word on the webpage.

filter                   filtered   retained
No Relations Filter         7,947     10,519
Other Relations Filter      5,351      5,168
Contextual Filter           2,973      2,195
Unknown Words Filter        1,467        728
Parsing Filter                216        512

Table 2: Filtering statistics: the number of sentences filtered and retained by each filter.
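The derivation of R(U) for the Fig. 1 example can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names `semantic_representation` and `generalize` are hypothetical, the lexicalization sets copy only the single-word entries of Table 1, matching is done on exact lower-cased tokens, and the categorical named-entities are passed in by hand rather than produced by GATE.

```python
# Hypothetical sketch of mapping a tagged review sentence to R(U).
# LEXICALIZATIONS follows Table 1 (single-word entries only).
LEXICALIZATIONS = {
    "food": {"food", "meal"},
    "service": {"service", "staff", "waitstaff", "server", "waiter", "waitress"},
    "atmosphere": {"atmosphere", "decor", "ambience", "decoration"},
    "value": {"value", "price", "overprice", "pricey", "expensive",
              "inexpensive", "cheap", "affordable", "afford"},
    "overall": {"recommend", "place", "experience", "establishment"},
}


def semantic_representation(utterance, ratings, categorical_entities):
    """Build the relation list for one utterance.

    ratings: attribute -> user rating (1-5) from the review.
    categorical_entities: (type, string) pairs assumed to come from NER.
    """
    tokens = {word.strip(".,!?").lower() for word in utterance.split()}
    relations = []
    for attr, words in LEXICALIZATIONS.items():
        if tokens & words:
            # The review's rating supplies the scalar value of the relation.
            relations.append(f"RESTAURANT has {attr}quality={ratings[attr]}")
    for ne_type, _string in categorical_entities:
        relation = f"RESTAURANT has {ne_type.upper()}"
        if relation not in relations:
            relations.append(relation)
    return relations


def generalize(utterance, categorical_entities):
    """Replace categorical values by variables naming their type."""
    for ne_type, string in categorical_entities:
        utterance = utterance.replace(string, ne_type.upper())
    return utterance


utterance = "The best Spanish food in New York."
ratings = {"food": 5, "service": 5, "atmosphere": 5, "value": 5, "overall": 5}
entities = [("foodtype", "Spanish"), ("location", "New York")]
relations = semantic_representation(utterance, ratings, entities)
template = generalize(utterance, entities)
print(relations)
print(template)
```

Exact token matching is the simplest choice here; it would miss inflected forms such as "prices" and the two-word entry "wait staff", which a real named-entity recognizer would handle.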
2.3 Parsing and DSyntS conversion

We adopt Deep Syntactic Structures (DSyntSs) as a format for syntactic structures because they can be realized by the fast portable realizer RealPro (Lavoie and Rambow, 1997). Since DSyntSs are a type of dependency structure, we first process the sentences with Minipar (Lin, 1998), and then convert Minipar's representation into DSyntS. Since user reviews are different from the newspaper articles on which Minipar was trained, the output of Minipar can be inaccurate, leading to failure in conversion. We check whether conversion is successful in the filtering stage.

2.4 Filtering

The goal of filtering is to identify U that realize the distinguished attributes and to guarantee high precision for the learned mappings. Recall is less important since systems need to convey requested information as accurately as possible. Our procedure for deriving semantic representations is based on the hypothesis that if U contains named-entities that realize the distinguished attributes, then R will include the relevant relation in the domain ontology. We also assume that if U contains named-entities that are not covered by the domain ontology, or words indicating that the meaning of U depends on the surrounding context, then R will not completely characterize the meaning of U, and so U should be eliminated. We also require an accurate S for U. Therefore, the filters described below eliminate U that (1) realize semantic relations not in the ontology; (2) contain words indicating that their meaning depends on the context; (3) contain unknown words; or (4) cannot be parsed accurately.

No Relations Filter: The sentence does not contain any named-entities for the distinguished attributes.
Other Relations Filter: The sentence contains named-entities for food subtypes, person names, country names, dates (e.g., today, tomorrow, Aug. 26th) or prices (e.g., 12 dollars), or POS tag CD for numerals. These indicate relations not in the ontology.

Contextual Filter: The sentence contains indexicals such as I, you, that or cohesive markers of rhetorical relations that connect it to some part of the preceding text, which means that the sentence cannot be interpreted out of context. These include discourse markers, such as list item markers with LS as the POS tag, that signal the organization structure of the text (Hirschberg and Litman, 1987), as well as discourse connectives that signal semantic and pragmatic relations of the sentence with other parts of the text (Knott, 1996), such as coordinating conjunctions at the beginning of the utterance like and and but, and conjunct adverbs such as however, also, then.

Unknown Words Filter: The sentence contains words not in WordNet (Fellbaum, 1998) (which includes typographical errors), or POS tags contain NN (Noun), which may indicate an unknown named-entity, or the sentence has more than a fixed length of words,[2] indicating that its meaning may not be estimated solely by named entities.

Parsing Filter: The sentence fails the parsing to DSyntS conversion. Failures are automatically detected by comparing the original sentence with the one realized by RealPro taking the converted DSyntS as an input.

Rating   food   service   atmosphere   value   overall   Total
1           5        15            0       0         3      23
2           8         3            3       0         2      16
3           6         6            3       1         5      21
4          18        17            8       8        15      66
5          57        56           31      12        45     201
Total      94        97           45      21        70     327

Table 3: Domain coverage of single scalar-valued relation mappings.
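The cascading application of the filters can be sketched as below. This is an illustrative toy, not the authors' pipeline: the three predicates are simplified word-list stand-ins (the real filters rely on named-entity recognition, POS tags, WordNet and a RealPro round trip), and the names `cascade`, `no_relations`, `contextual` and `too_long` are assumptions. Only the 20-word threshold and the example indexicals, conjunctions and adverbs come from the text.

```python
# Hypothetical sketch of a filter cascade that records, for each filter,
# how many sentences it removes and how many survive (as in Table 2).
ATTRIBUTE_WORDS = {"food", "meal", "service", "staff", "atmosphere", "decor",
                   "value", "price"}
INDEXICALS = {"i", "you", "that"}            # naive: also hits complementizer "that"
INITIAL_CONJUNCTIONS = {"and", "but"}
CONJUNCT_ADVERBS = {"however", "also", "then"}
MAX_WORDS = 20                               # threshold reported in the paper


def words(sentence):
    return [w.strip(".,!?").lower() for w in sentence.split()]


def no_relations(sentence):
    """Reject sentences that mention no distinguished attribute."""
    return not any(w in ATTRIBUTE_WORDS for w in words(sentence))


def contextual(sentence):
    """Reject sentences whose meaning depends on surrounding context."""
    ws = words(sentence)
    if not ws:
        return False
    return ws[0] in INITIAL_CONJUNCTIONS or any(
        w in INDEXICALS or w in CONJUNCT_ADVERBS for w in ws)


def too_long(sentence):
    return len(words(sentence)) > MAX_WORDS


def cascade(sentences, filters):
    """Apply filters in order; each filter sees only what earlier ones retained."""
    remaining = list(sentences)
    stats = []
    for name, rejects in filters:
        kept = [s for s in remaining if not rejects(s)]
        stats.append((name, len(remaining) - len(kept), len(kept)))
        remaining = kept
    return remaining, stats


sentences = [
    "The food was flavorful but cold.",
    "Salud!",
    "But the service was great.",
    "I loved the food.",
    "The atmosphere is wonderful.",
]
filters = [("No Relations", no_relations), ("Contextual", contextual),
           ("Length", too_long)]
retained, stats = cascade(sentences, filters)
for name, filtered, kept in stats:
    print(f"{name}: filtered={filtered} retained={kept}")
```

Because each filter only sees what the previous one retained, the filtered and retained counts at every stage add up to the retained count of the stage before, which is the bookkeeping that Table 2 reports.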
We apply the filters, in a cascading manner, to the 18,466 sentences with semantic representations. As a result, we obtain 512 (2.8%) mappings of (U, R, S). After removing 61 duplicates, 451 distinct (2.4%) mappings remain. Table 2 shows the number of sentences eliminated by each filter.

[2] We used 20 as a threshold.

3 Objective Evaluation

We evaluate the learned expressions with respect to domain coverage, linguistic variation and generativity.

3.1 Domain Coverage

To be usable for a dialogue system, the mappings must have good domain coverage. Table 3 shows the distribution of the 327 mappings realizing a single scalar-valued relation, categorized by the associated rating score.[3] For example, there are 57 mappings with R of 'RESTAURANT has foodquality=5', and a large number of mappings for both the foodquality and servicequality relations. Although we could not obtain mappings for some relations such as price={1,2}, coverage for expressing a single relation is fairly complete.

#    Combination of Dist. Attrs                   Count
1    food-service                                    39
2    food-value                                      21
3    atmosphere-food                                 14
4    atmosphere-service                              10
5    atmosphere-food-service                          7
6    food-foodtype                                    4
7    atmosphere-food-value                            4
8    location-overall                                 3
9    food-foodtype-value                              3
10   food-service-value                               2
11   food-foodtype-location                           2
12   food-overall                                     2
13   atmosphere-foodtype                              2
14   atmosphere-overall                               2
15   service-value                                    1
16   overall-service                                  1
17   overall-value                                    1
18   foodtype-overall                                 1
19   food-foodtype-location-overall                   1
20   atmosphere-food-service-value                    1
21   atmosphere-food-overall-service-value            1
     Total                                          122

Table 4: Counts for multi-relation mappings.

There are also mappings that express several relations. Table 4 shows the counts of mappings for multi-relation mappings, with those containing a food or service relation occurring more frequently, as in the single scalar-valued relation mappings.
We found only 21 combinations of relations, which is surprising given the large potential number of combinations (there are 50 combinations if we treat relations with different scalar values differently). We also find that most of the mappings have two or three relations, perhaps suggesting that system utterances should not express too many relations in a single sentence.

[3] There are two other single-relation but not scalar-valued mappings that concern LOCATION in our mappings.

3.2 Linguistic Variation

We also wish to assess whether the linguistic variation of the learned mappings was greater than what we could easily have generated with a hand-crafted dictionary, or a hand-crafted dictionary augmented with aggregation operators, as in (Walker et al., 2003). Thus, we first categorized the mappings by the patterns of the DSyntSs. Table 5 shows the most common syntactic patterns (more than 10 occurrences), indicating that 30% of the learned patterns consist of the simple form "X is ADJ" where ADJ is an adjective, or "X is RB ADJ," where RB is a degree modifier. Furthermore, up to 55% of the learned mappings could be generated from these basic patterns by the application of a combination operator that coordinates multiple adjectives, or coordinates predications over distinct attributes. However, there are 137 syntactic patterns in all, 97 with unique syntactic structures and 21 with two occurrences, accounting for 45% of the learned mappings. Table 6 shows examples of learned mappings with distinct syntactic structures. It would be surprising to see this type of variety in a hand-crafted generation dictionary.
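The 30% and 55% figures quoted above follow from the pattern counts reported in Table 5 and the 451 learned mappings. The sketch below simply recomputes the ratio and cumulative columns; the function name `coverage` is an assumption for illustration, while the counts are copied from Table 5.

```python
# Recompute the "ratio" and "accum." columns of Table 5 from the raw counts.
TOTAL_MAPPINGS = 451
PATTERN_COUNTS = [
    ("NN VB JJ", 92),
    ("NN VB RB JJ", 52),
    ("JJ NN", 36),
    ("NN VB JJ CC JJ", 25),
    ("RB JJ NN", 22),
    ("NN VB JJ CC NN VB JJ", 13),
    ("NN CC NN VB JJ", 10),
]


def coverage(counts, total):
    """Return (pattern, count, ratio %, cumulative %) rows."""
    rows = []
    running = 0
    for pattern, count in counts:
        running += count
        rows.append((pattern, count,
                     round(100 * count / total, 1),
                     round(100 * running / total, 1)))
    return rows


rows = coverage(PATTERN_COUNTS, TOTAL_MAPPINGS)
for pattern, count, ratio, accumulated in rows:
    print(f"{pattern:<22} {count:>3} {ratio:>5}% {accumulated:>5}%")
```

The first two patterns ("X is ADJ" and "X is RB ADJ") together account for the roughly 30% mentioned in the text, and all seven common patterns for roughly 55%.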
In addition, the learned mappings contain 275 distinct lexemes, with a minimum of 2, maximum of 15, and mean of 4.63 lexemes per DSyntS, indicating that the method extracts a wide variety of expressions of varying lengths.

Another interesting aspect of the learned mappings is the wide variety of adjectival phrases (APs) in the common patterns. Tables 7 and 8 show the APs in single scalar-valued relation mappings for food and service categorized by the associated ratings. Tables for atmosphere, value and overall can be found in the Appendix. Moreover, the meanings for some of the learned APs are very specific to the particular attribute, e.g. cold and burnt associated with foodquality of 1, attentive and prompt for servicequality of 5, silly and inattentive for servicequality of 1, and mellow for atmosphere of 5. In addition, our method places the adjectival phrases (APs) in the common patterns on a more fine-grained scale of 1 to 5, similar to the strength classifications in (Wilson et al., 2004), in contrast to other automatic methods that classify expressions into a binary positive or negative polarity (e.g. (Turney, 2002)).

#   syntactic pattern       example utterance                                    count   ratio   accum.
1   NN VB JJ                The atmosphere is wonderful.                            92   20.4%    20.4%
2   NN VB RB JJ             The atmosphere was very nice.                           52   11.5%    31.9%
3   JJ NN                   Bad service.                                            36    8.0%    39.9%
4   NN VB JJ CC JJ          The food was flavorful but cold.                        25    5.5%    45.5%
5   RB JJ NN                Very trendy ambience.                                   22    4.9%    50.3%
6   NN VB JJ CC NN VB JJ    The food is excellent and the atmosphere is great.      13    2.9%    53.2%
7   NN CC NN VB JJ          The food and service were fantastic.                    10    2.2%    55.4%

Table 5: Common syntactic patterns of DSyntSs, flattened to a POS sequence for readability. NN, VB, JJ, RB, CC stand for noun, verb, adjective, adverb, and conjunction, respectively.

[overall=1, value=2] Very disappointing experience for the money charged.
[food=5, value=5] The food is excellent and plentiful at a reasonable price.
[food=5, service=5] The food is exquisite as well as the service and setting.
[food=5, service=5] The food was spectacular and so was the service.
[food=5, foodtype, value=5] Best FOODTYPE food with a great value for money.
[food=5, foodtype, value=5] An absolutely outstanding value with fantastic FOODTYPE food.
[food=5, foodtype, location, overall=5] This is the best place to eat FOODTYPE food in LOCATION.
[food=5, foodtype] Simply amazing FOODTYPE food.
[food=5, foodtype] RESTAURANT NAME is the best of the best for FOODTYPE food.
[food=5] The food is to die for.
[food=5] What incredible food.
[food=4] Very pleasantly surprised by the food.
[food=1] The food has gone downhill.
[atmosphere=5, overall=5] This is a quiet little place with great atmosphere.
[atmosphere=5, food=5, overall=5, service=5, value=5] The food, service and ambience of the place are all fabulous and the prices are downright cheap.

Table 6: Acquired generation patterns (with shorthand for relations in square brackets) whose syntactic patterns occurred only once.

food=1: awful, bad, burnt, cold, very ordinary
food=2: acceptable, bad, flavored, not enough, very bland, very good
food=3: adequate, bland and mediocre, flavorful but cold, pretty good, rather bland, very good
food=4: absolutely wonderful, awesome, decent, excellent, good, good and generous, great, outstanding, rather good, really good, traditional, very fresh and tasty, very good, very very good
food=5: absolutely delicious, absolutely fantastic, absolutely great, absolutely terrific, ample, well seasoned and hot, awesome, best, delectable and plentiful, delicious, delicious but simple, excellent, exquisite, fabulous, fancy but tasty, fantastic, fresh, good, great, hot, incredible, just fantastic, large and satisfying, outstanding, plentiful and outstanding, plentiful and tasty, quick and hot, simply great, so delicious, so very tasty, superb, terrific, tremendous, very good, wonderful

Table 7: Adjectival phrases (APs) in single scalar-valued relation mappings for foodquality.

service=1: awful, bad, great, horrendous, horrible, inattentive, forgetful and slow, marginal, really slow, silly and inattentive, still marginal, terrible, young
service=2: overly slow, very slow and inattentive
service=3: bad, bland and mediocre, friendly and knowledgeable, good, pleasant, prompt, very friendly
service=4: all very warm and welcoming, attentive, extremely friendly and good, extremely pleasant, fantastic, friendly, friendly and helpful, good, great, great and courteous, prompt and friendly, really friendly, so nice, swift and friendly, very friendly, very friendly and accommodating
service=5: all courteous, excellent, excellent and friendly, extremely friendly, fabulous, fantastic, friendly, friendly and helpful, friendly and very attentive, good, great, great, prompt and courteous, happy and friendly, impeccable, intrusive, legendary, outstanding, pleasant, polite, attentive and prompt, prompt and courteous, prompt and pleasant, quick and cheerful, stupendous, superb, the most attentive, unbelievable, very attentive, very congenial, very courteous, very friendly, very friendly and helpful, very friendly and pleasant, very friendly and totally personal, very friendly and welcoming, very good, very helpful, very timely, warm and friendly, wonderful

Table 8: Adjectival phrases (APs) in single scalar-valued relation mappings for servicequality.

3.3 Generativity

Our motivation for deriving syntactic representations for the learned expressions was the possibility of using an off-the-shelf sentence planner to derive new combinations of relations, and apply aggregation and other syntactic transformations. We examined how many of the learned DSyntSs can be combined with each other, by taking every pair of DSyntSs in the mappings and applying the built-in merge operation in the SPaRKy generator (Walker et al., 2003). We found that only 306 combinations out of a potential 81,318 combinations (0.37%) were successful. This is because the merge operation in SPaRKy requires that the subjects and the verbs of the two DSyntSs are identical, e.g. the subject is RESTAURANT and verb is has, whereas the learned DSyntSs often place the attribute in subject position as a definite noun phrase. However, the learned DSyntS can be incorporated into SPaRKy using the semantic representations to substitute learned DSyntSs into nodes in the sentence plan tree. Figure 2 shows some example utterances generated by SPaRKy with its original dictionary and example utterances when the learned mappings are incorporated. The resulting utterances seem more natural and colloquial; we examine whether this is true in the next section.

Original SPaRKy utterances
· Babbo has the best overall quality among the selected restaurants with excellent decor, excellent service and superb food quality.
· Babbo has excellent decor and superb food quality with excellent service. It has the best overall quality among the selected restaurants.

Combination of SPaRKy and learned DSyntS
· Because the food is excellent, the wait staff is professional and the decor is beautiful and very comfortable, Babbo has the best overall quality among the selected restaurants.
· Babbo has the best overall quality among the selected restaurants because atmosphere is exceptionally nice, food is excellent and the service is superb.
· Babbo has superb food quality, the service is exceptional and the atmosphere is very creative. It has the best overall quality among the selected restaurants.

Figure 2: Utterances incorporating learned DSyntSs (Bold font) in SPaRKy.

4 Subjective Evaluation

We evaluate the obtained mappings in two respects: the consistency between the automatically derived semantic representation and the realization, and the naturalness of the realization.

For comparison, we used a baseline of hand-crafted mappings from (Walker et al., 2003) except that we changed the word decor to atmosphere and added five mappings for overall. For scalar relations, this consists of the realization "RESTAURANT has ADJ LEX" where ADJ is mediocre, decent, good, very good, or excellent for rating values 1-5, and LEX is food quality, service, atmosphere, value, or overall depending on the relation. RESTAURANT is filled with the name of a restaurant at runtime. For example, 'RESTAURANT has foodquality=1' is realized as "RESTAURANT has mediocre food quality." The location and food type relations are mapped to "RESTAURANT is located in LOCATION" and "RESTAURANT is a FOODTYPE restaurant."

The learned mappings include 23 distinct semantic representations for a single-relation (22 for scalar-valued relations and one for location) and 50 for multi-relations. Therefore, using the hand-crafted mappings, we first created 23 utterances for the single-relations. We then created three utterances for each of 50 multi-relations using different clause-combining operations from (Walker et al., 2003). This gave a total of 173 baseline utterances, which together with 451 learned mappings, yielded 624 utterances for evaluation.

Ten subjects, all native English speakers, evaluated the mappings by reading them from a webpage. For each system utterance, the subjects were asked to express their degree of agreement, on a scale of 1 (lowest) to 5 (highest), with the statement (a) The meaning of the utterance is consistent with the ratings expressing their semantics, and with the statement (b) The style of the utterance is very natural and colloquial. They were asked not to correct their decisions and also to rate each utterance on its own merit.

4.1 Results

Table 9 shows the means and standard deviations of the scores for baseline vs. learned utterances for consistency and naturalness. A t-test shows that the consistency of the learned expression is significantly lower than the baseline (df=4712, p < .001) but that their naturalness is significantly higher than the baseline (df=3107, p < .001). However, consistency is still high. Only 14 of the learned utterances (shown in Tab. 10) have a mean consistency score lower than 3, which indicates that, by and large, the human judges felt that the inferred semantic representations were consistent with the meaning of the learned expressions.

               baseline          learned          stat.
               mean     sd.      mean     sd.     sig.
Consistency    4.714    0.588    4.459    0.890    +
Naturalness    4.227    0.852    4.613    0.844    +
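The reported degrees of freedom can be sanity-checked from the summary statistics alone. The sketch below rests on two assumptions that the text does not state: that the test was an unequal-variance (Welch) t-test, and that each of the 10 subjects rated all 173 baseline and 451 learned utterances, giving 1,730 and 4,510 ratings. The function name `welch_from_summary` is hypothetical.

```python
import math


def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    computed from group means, standard deviations and sizes."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Assumed sample sizes: 10 subjects x number of utterances per condition.
N_BASELINE = 10 * 173
N_LEARNED = 10 * 451

# Means and standard deviations as reported for baseline vs. learned utterances.
t_cons, df_cons = welch_from_summary(4.714, 0.588, N_BASELINE, 4.459, 0.890, N_LEARNED)
t_nat, df_nat = welch_from_summary(4.227, 0.852, N_BASELINE, 4.613, 0.844, N_LEARNED)

print(f"consistency: t={t_cons:.2f}, df={df_cons:.0f}")
print(f"naturalness: t={t_nat:.2f}, df={df_nat:.0f}")
```

Under these assumptions the degrees of freedom come out close to the reported df=4712 and df=3107, with small differences attributable to the rounding of the published means and standard deviations. A positive t for consistency and a negative t for naturalness match the direction of the effects described in the text.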
Table 9: Consistency and naturalness scores averaged over 10 subjects.

The correlation coefficient between consistency and naturalness scores is 0.42, which indicates that consistency does not greatly relate to naturalness. We also performed an ANOVA (ANalysis Of VAriance) of the effect of each relation in R on naturalness and consistency. There were no significant effects except that mappings combining food, service, and atmosphere were significantly worse (df=1, F=7.79, p=0.005). However, there is a trend for mappings to be rated higher for the food attribute (df=1, F=3.14, p=0.08) and the value attribute (df=1, F=3.55, p=0.06) for consistency, suggesting that perhaps it is easier to learn some mappings than others.

shorthand for relations and utterance                                               score
[food=4] The food is delicious and beautifully prepared.                             2.9
[overall=4] A wonderful experience.                                                  2.9
[service=3] The service is bland and mediocre.                                       2.8
[atmosphere=2] The atmosphere here is eclectic.                                      2.6
[overall=3] Really fancy place.                                                      2.6
[food=3, service=4] Wonderful service and great food.                                2.5
[service=4] The service is fantastic.                                                2.5
[overall=2] The RESTAURANT NAME is once a great place to go and socialize.           2.2
[atmosphere=2] The atmosphere is unique and pleasant.                                2.0
[food=5, foodtype] FOODTYPE and FOODTYPE food.                                       1.8
[service=3] Waitstaff is friendly and knowledgeable.                                 1.7
[atmosphere=5, food=5, service=5] The atmosphere, food and service.                  1.6
[overall=3] Overall, a great experience.                                             1.4
[service=1] The waiter is great.                                                     1.4

Table 10: The 14 utterances with consistency scores below 3.

5 Related Work

Automatically finding sentences with the same meaning has been extensively studied in the field of automatic paraphrasing using parallel corpora and corpora with multiple descriptions of the same events (Barzilay and McKeown, 2001; Barzilay and Lee, 2003). Other work finds predicates of similar meanings by using the similarity of contexts around the predicates (Lin and Pantel, 2001). However, these studies find a set of sentences with the same meaning, but do not associate a specific meaning with the sentences. One exception is (Barzilay and Lee, 2002), which derives mappings between semantic representations and realizations using a parallel (but unaligned) corpus consisting of both complex semantic input and corresponding natural language verbalizations for mathematical proofs. However, our technique does not require parallel corpora or previously existing semantic transcripts or labeling, and user reviews are widely available in many different domains (see http://www.epinions.com/).

There is also significant previous work on mining user reviews. For example, Hu and Liu (2005) use reviews to find adjectives to describe products, and Popescu and Etzioni (2005) automatically find features of a product together with the polarity of adjectives used to describe them. They both aim at summarizing reviews so that users can make decisions easily. Our method is also capable of finding polarities of modifying expressions including adjectives, but on a more fine-grained scale of 1 to 5. However, it might be possible to use their approach to create rating information for raw review texts as in (Pang and Lee, 2005), so that we can create mappings from reviews without ratings.

6 Summary and Future Work

We proposed automatically obtaining mappings between semantic representations and realizations from reviews with individual ratings. The results show that: (1) the learned mappings provide good coverage of the domain ontology and exhibit good linguistic variation; (2) the consistency between the semantic representations and realizations is high; and (3) the naturalness of the realizations is significantly higher than the baseline.

There are also limitations in our method.
Even though consistency is rated highly by human subj e c t s , t h i s m a y a c t u a l l y b e a ju d g e m e n t o f w h e t h e r t h e p o l a r i t y o f t h e le a r n e d m a p p i n g i s c o r r e c t l y placed on the 1 to 5 rating scale. Thus, alternate ways of expressing, for example foodquality=5, shown in Table 7, cannot be guaranteed to be synonymous, which may be required for use in spoken language generation. Rather, an examination of the adjectival phrases in Table 7 shows that different aspects of the food are discussed. For example ample and plentiful refer to the portion size, fancy may refer to the p resentation, and delicious describes the flavors. This suggests that perhaps the ontology would benefit from representing these sub-attributes of the food attribute, and sub-attributes in general. Another problem with consistency is that the same AP, e.g. very good in Table 7 may appear with multiple ratings. For example, very good is used for every foodquality rating from 2 to 5. Thus some further automatic o r b y - h a n d a n a l y s i s i s r e q u i r e d to r e fi n e w h a t i s learned before actual use in spoken language gene r a t i o n . S t i l l , o u r m e t h o d c o u l d r e d u c e th e a m o u n t of time a system designer spends developing the spoken language generator, and increase the naturalness of spoken language generation. A n o t h e r is s u e is th a t th e r e c a l l a p p e a r s to b e quite low given that all of the sentences concern the same domain: only 2.4% of the sentences could be used to create the mappings. One way to increase recall m ight be to automatically augment the list of distinguished attribute lexicalizations, using WordNet or work on automatic identification of synonyms, such as (Lin and Pantel, 2001). However, the method here has high precision, and automatic techniques m ay introduce n o i s e . A r e l a t e d is s u e is th a t th e fi l t e r s a r e in s o m e cases too strict. 
For example, the contextual filter is based on POS tags, so that some sentences that do not require the prior context for their interpretation are nevertheless eliminated, such as sentences containing subordinating conjunctions like because, when, if, whose arguments are both given in the same sentence (Prasad et al., 2005). In addition, recall is affected by the domain ontology, and the domain ontology automatically constructed from the review webpages may not cover all of the domain. In some review domains, the attributes that get individual ratings are a limited subset of the domain ontology. Techniques for automatic feature identification (Hu and Liu, 2005; Popescu and Etzioni, 2005) could possibly help here, although these techniques currently have the limitation that they do not automatically identify different lexicalizations of the same feature.

A different type of limitation is that dialogue systems need to generate utterances for information gathering, whereas the mappings we obtained can only be used for information presentation. Thus utterances for information gathering would have to be constructed by hand, as in current practice, or perhaps other types of corpora or resources could be utilized. In addition, the utility of syntactic structures in the mappings should be further examined, especially given the failures in DSyntS conversion. An alternative would be to leave some sentences unparsed and use them as templates with hybrid generation techniques (White and Caldwell, 1998). Finally, while we believe that this technique will apply across domains, it would be useful to test it on domains such as movie reviews or product reviews, which have more complex domain ontologies.

Acknowledgments

We thank the anonymous reviewers for their helpful comments.
This work was supported by a Royal Society Wolfson award to Marilyn Walker and a research collaboration grant from NTT to the Cognitive Systems Group at the University of Sheffield.

References

Regina Barzilay and Lillian Lee. 2002. Bootstrapping lexical choice via multiple-sequence alignment. In Proc. EMNLP, pages 164–171.

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proc. HLT/NAACL, pages 16–23.

Regina Barzilay and Kathleen McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proc. 39th ACL, pages 50–57.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proc. 40th ACL.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.

Julia Hirschberg and Diane J. Litman. 1987. Now let's talk about NOW: Identifying cue phrases intonationally. In Proc. 25th ACL, pages 163–171.

Minqing Hu and Bing Liu. 2005. Mining and summarizing customer reviews. In Proc. KDD, pages 168–177.

Alistair Knott. 1996. A Data-Driven Methodology for Motivating a Set of Coherence Relations. Ph.D. thesis, University of Edinburgh, Edinburgh.

Benoit Lavoie and Owen Rambow. 1997. A fast and portable realizer for text generation systems. In Proc. 5th Applied NLP, pages 265–268.

Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question answering. Natural Language Engineering, 7(4):343–360.

Dekang Lin. 1998. Dependency-based evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems.

Johanna D. Moore, Mary Ellen Foster, Oliver Lemon, and Michael White. 2004. Generating tailored, comparative descriptions in spoken dialogue. In Proc. 7th FLAIR.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proc. 43rd ACL, pages 115–124.

Ana-Maria Popescu and Oren Etzioni. 2005. Extracting product features and opinions from reviews. In Proc. HLT/EMNLP, pages 339–346.

Rashmi Prasad, Aravind Joshi, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, and Bonnie Webber. 2005. The Penn Discourse TreeBank as a resource for natural language generation. In Proc. Corpus Linguistics Workshop on Using Corpora for NLG.

Stephanie Seneff and Joseph Polifroni. 2000. Formal and natural language generation in the mercury conversational system. In Proc. ICSLP, volume 2, pages 767–770.

Mariët Theune. 2003. From monologue to dialogue: natural language generation in OVIS. In AAAI 2003 Spring Symposium on Natural Language Generation in Written and Spoken Dialogue, pages 141–150.

Peter D. Turney. 2002. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In Proc. 40th ACL, pages 417–424.

Marilyn Walker, Rashmi Prasad, and Amanda Stent. 2003. A trainable generator for recommendations in multimodal dialog. In Proc. Eurospeech, pages 1697–1700.

Michael White and Ted Caldwell. 1998. EXEMPLARS: A practical, extensible framework for dynamic text generation. In Proc. INLG, pages 266–275.

Theresa Wilson, Janyce Wiebe, and Rebecca Hwa. 2004. Just how mad are you? finding strong and weak opinion clauses. In Proc. AAAI, pages 761–769.

Appendix

Adjectival phrases (APs) in single scalar-valued relation mappings for atmosphere, value, and overall.
atmosphere=2: eclectic, unique and pleasant
atmosphere=3: busy, pleasant but extremely hot
atmosphere=4: fantastic, great, quite nice and simple, typical, very casual, very trendy, wonderful
atmosphere=5: beautiful, comfortable, excellent, great, interior, lovely, mellow, nice, nice and comfortable, phenomenal, pleasant, quite pleasant, unbelievably beautiful, very comfortable, very cozy, very friendly, very intimate, very nice, very nice and relaxing, very pleasant, very relaxing, warm and contemporary, warm and very comfortable, wonderful
value=3: very reasonable
value=4: great, pretty good, reasonable, very good
value=5: best, extremely reasonable, good, great, reasonable, totally reasonable, very good, very reasonable
overall=1: just bad, nice, thoroughly humiliating
overall=2: great, really bad
overall=3: bad, decent, great, interesting, really fancy
overall=4: excellent, good, great, just great, never busy, not very busy, outstanding, recommended, wonderful
overall=5: amazing, awesome, capacious, delightful, extremely pleasant, fantastic, good, great, local, marvelous, neat, new, overall, overwhelmingly pleasant, pampering, peaceful but idyllic, really cool, really great, really neat, really nice, special, tasty, truly great, ultimate, unique and enjoyable, very enjoyable, very excellent, very good, very nice, very wonderful, warm and friendly, wonderful
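
The ambiguity noted in Section 6, where the same AP appears with several ratings, can also be seen in the overall entries of the table above. The following is a minimal sketch of how such APs might be flagged automatically before a learned dictionary is used for generation. It is an illustration only, not part of the method evaluated in this paper: the names OVERALL_APS, ratings_by_phrase, ambiguous_phrases, and the max_spread parameter are hypothetical, and the data is simply copied from the overall entries of the appendix.

```python
from collections import defaultdict

# APs for the "overall" attribute, copied from the appendix table above.
OVERALL_APS = {
    1: ["just bad", "nice", "thoroughly humiliating"],
    2: ["great", "really bad"],
    3: ["bad", "decent", "great", "interesting", "really fancy"],
    4: ["excellent", "good", "great", "just great", "never busy",
        "not very busy", "outstanding", "recommended", "wonderful"],
    5: ["amazing", "awesome", "capacious", "delightful",
        "extremely pleasant", "fantastic", "good", "great", "local",
        "marvelous", "neat", "new", "overall", "overwhelmingly pleasant",
        "pampering", "peaceful but idyllic", "really cool", "really great",
        "really neat", "really nice", "special", "tasty", "truly great",
        "ultimate", "unique and enjoyable", "very enjoyable",
        "very excellent", "very good", "very nice", "very wonderful",
        "warm and friendly", "wonderful"],
}


def ratings_by_phrase(aps_by_rating):
    """Invert {rating: [phrase, ...]} into {phrase: sorted list of ratings}."""
    index = defaultdict(set)
    for rating, phrases in aps_by_rating.items():
        for phrase in phrases:
            index[phrase].add(rating)
    return {phrase: sorted(ratings) for phrase, ratings in index.items()}


def ambiguous_phrases(aps_by_rating, max_spread=0):
    """Return phrases whose highest and lowest ratings differ by more
    than max_spread (hypothetical threshold; 0 flags any phrase that
    occurs with more than one rating)."""
    return {
        phrase: ratings
        for phrase, ratings in ratings_by_phrase(aps_by_rating).items()
        if ratings[-1] - ratings[0] > max_spread
    }
```

Calling ambiguous_phrases(OVERALL_APS) flags good, great, and wonderful; with max_spread=1 only great remains, since it occurs with every overall rating from 2 to 5. Whether such phrases should be discarded, assigned to a single rating, or kept for several ratings is a design decision that this sketch leaves open.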