Intrinsic vs. Extrinsic Evaluation Measures for Referring Expression Generation

Anja Belz
Natural Language Technology Group
University of Brighton
Brighton BN2 4GJ, UK
a.s.belz@brighton.ac.uk

Albert Gatt
Department of Computing Science
University of Aberdeen
Aberdeen AB24 3UE, UK
a.gatt@abdn.ac.uk

Abstract

In this paper we present research in which we apply (i) the kind of intrinsic evaluation metrics that are characteristic of current comparative HLT evaluation, and (ii) extrinsic, human task-performance evaluations more in keeping with NLG traditions, to 15 systems implementing a language generation task. We analyse the evaluation results and find that there are no significant correlations between intrinsic and extrinsic evaluation measures for this task.

1 Introduction

In recent years, NLG evaluation has taken on a more comparative character. NLG now has evaluation results for comparable, but independently developed, systems, including results for systems that regenerate the Penn Treebank (Langkilde, 2002) and systems that generate weather forecasts (Belz and Reiter, 2006). The growing interest in comparative evaluation has also resulted in a tentative interest in shared-task evaluation events, which led to the first such event for NLG (the Attribute Selection for Generation of Referring Expressions, or ASGRE, Challenge) in 2007 (Belz and Gatt, 2007), with a second event (the Referring Expression Generation, or REG, Challenge) currently underway. In HLT in general, comparative evaluations (and shared-task evaluation events in particular) are dominated by intrinsic evaluation methodologies, in contrast to the more extrinsic evaluation traditions of NLG.

In this paper, we present research in which we applied both intrinsic and extrinsic evaluation methods to the same task, in order to shed light on how the two correlate for NLG tasks. The results show a surprising lack of correlation between the two types of measures, suggesting that intrinsic metrics and extrinsic methods can represent two very different views of how well a system performs.

2 Task, Data and Systems

Referring expression generation (REG) is concerned with the generation of expressions that describe entities in a given piece of discourse. REG research goes back at least to the 1980s (Appelt, Grosz, Joshi, McDonald and others), but the field as it is today was shaped in particular by Dale and Reiter's work (Dale, 1989; Dale and Reiter, 1995). REG tends to be divided into the stages of attribute selection (selecting properties of entities) and realisation (converting selected properties into word strings). Attribute selection in its standard formulation was the shared task in the ASGRE Challenge: given an intended referent (the 'target') and the other domain entities (the 'distractors'), each with possible attributes, select a set of attributes for the target referent.

The ASGRE data (which is now publicly available) consists of all 780 singular items in the TUNA corpus (Gatt et al., 2007) in two subdomains, consisting of descriptions of furniture and people. Each data item is a paired attribute set (as derived from a human-produced RE) and domain representation (target and distractor entities represented as possible attributes and values). ASGRE participants were asked to submit the outputs produced by their systems for an unseen test data set. The outputs from 15 of these systems, shown in the left column of Table 1, were used in the experiments reported below. Systems differed in terms of whether they were trainable, whether they performed exhaustive search, and whether they hardwired the use of certain attribute types, among other algorithmic properties (see the ASGRE papers for full details). In the case of one system (IS-FBS), a buggy version was originally submitted and used in Exp 1. It was replaced in Exp 2 by a corrected version; the former is marked by a * in what follows.
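To make the task input and output concrete, the sketch below selects attributes for a target in the greedy, preference-ordered style of Dale and Reiter's incremental algorithm. It is an illustration of the task format only, not a reconstruction of any submitted system; the toy domain and attribute names are invented for the purpose.

```python
def select_attributes(target, distractors, preference_order):
    """Greedy, incremental-style attribute selection (Dale & Reiter flavour):
    walk through attributes in a fixed preference order, keep each one that
    rules out at least one distractor still in play, and stop once the
    target is uniquely identified."""
    selected = {}
    remaining = list(distractors)
    for attr in preference_order:
        if attr not in target:
            continue
        value = target[attr]
        still_compatible = [d for d in remaining if d.get(attr) == value]
        if len(still_compatible) < len(remaining):  # attribute rules something out
            selected[attr] = value
            remaining = still_compatible
        if not remaining:                           # referent is now unique
            break
    return selected

# Invented toy domain in the spirit of the TUNA furniture subdomain.
target = {"type": "chair", "colour": "red", "size": "large"}
distractors = [
    {"type": "chair", "colour": "blue", "size": "large"},
    {"type": "desk", "colour": "red", "size": "small"},
]
print(select_attributes(target, distractors, ["type", "colour", "size"]))
# -> {'type': 'chair', 'colour': 'red'}  (a distinguishing set for the target)
```

In the ASGRE data, each domain entity is represented in exactly this attribute-value fashion, and the attribute set derived from the human-produced referring expression serves as the reference output against which system outputs are compared.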
3 Evaluation Methods

1. Extrinsic evaluation measures: We conducted two task-performance evaluation experiments (the first was part of the ASGRE Challenge, the second is new), in which participants identified the referent denoted by a description by clicking on a picture in a visual display of target and distractor entities. To enable subjects to read the outputs of peer systems, we converted them from the attribute-value format described above into something more readable, using a simple attribute-to-word converter. Both experiments used a Repeated Latin Squares design, and involved 30 participants and 2,250 individual trials (see Belz and Gatt (2007) for full details). In Exp 1, subjects were shown the domain on the same screen as the description. Two dependent measures were used: (i) combined reading and identification time (RIT), measured from the point at which the description and pictures appeared on the screen to the point at which a picture was selected by mouse-click; and (ii) error rate (ER-1). In Exp 2, subjects first read the description and then initiated the presentation of the domain entities. We computed: (i) reading time (RT), measured from the presentation of a description to the point where a subject requested the presentation of the domain; (ii) identification time (IT), measured from the presentation of the domain to the point where a subject clicked on a picture; and (iii) error rate (ER-2).

2. REG-specific intrinsic measures: Uniqueness is the proportion of attribute sets generated by a system which identify the referent uniquely (i.e. none of the distractors). Minimality is the proportion of attribute sets which are minimal as well as unique (i.e. there is no smaller unique set of attributes). These measures were included because they are commonly named as desiderata for attribute selection algorithms in the REG field (Dale, 1989). The minimality check used in this paper treats referent type as a simple attribute, as the ASGRE systems tended to do (as a consequence, the Minimality results we report here look different from those in the ASGRE report).

3. Set-similarity measures: The Dice similarity coefficient computes the similarity between a peer attribute set A1 and a (human-produced) reference attribute set A2 as 2 x |A1 n A2| / (|A1| + |A2|). MASI (Passonneau, 2006) is similar, but is biased in favour of similarity where one set is a subset of the other. Both set-similarity measures are illustrated, together with SE, in the sketch at the end of this section.

4. String-similarity measures: In order to apply string-similarity metrics, peer and reference outputs were converted to word strings by the method described under 1 above. String-edit distance (SE) is straightforward Levenshtein distance with a substitution cost of 2 and an insertion/deletion cost of 1. We also used the version of string-edit distance ('SEB') of Bangalore et al. (2000), which normalises for length. BLEU computes the proportion of word n-grams (n <= 4 is standard) that a peer output shares with several reference outputs. The NIST MT evaluation metric (Doddington, 2002) is an adaptation of BLEU which gives more importance to less frequent (hence more informative) n-grams. We also used two versions of the ROUGE metric (Lin and Hovy, 2003), ROUGE-2 and ROUGE-SU4 (based on non-contiguous, or 'skip', n-grams), which were official scores in the DUC 2005 summarization task.
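As an illustration of measures 3 and 4 above, the following sketch implements Dice, MASI and the weighted string-edit distance with the costs given in the text. It is a minimal re-implementation for illustration only; the MASI monotonicity weights (1, 2/3, 1/3, 0) are the standard ones assumed here, and the code has not been checked against the scoring scripts actually used in the evaluation.

```python
def dice(a, b):
    """Dice coefficient between two attribute sets: 2|A1 n A2| / (|A1| + |A2|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def masi(a, b):
    """MASI (Passonneau, 2006): Jaccard coefficient weighted by a monotonicity
    term that favours subset relations (assumed weights: 1, 2/3, 1/3, 0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    jaccard = len(a & b) / len(a | b)
    if a == b:
        m = 1.0
    elif a <= b or b <= a:
        m = 2 / 3
    elif a & b:
        m = 1 / 3
    else:
        m = 0.0
    return jaccard * m

def string_edit(peer, ref, sub_cost=2, indel_cost=1):
    """Levenshtein distance over word tokens with substitution cost 2 and
    insertion/deletion cost 1, as used for the SE measure."""
    p, r = peer.split(), ref.split()
    prev = [j * indel_cost for j in range(len(r) + 1)]
    for i in range(1, len(p) + 1):
        curr = [i * indel_cost]
        for j in range(1, len(r) + 1):
            sub = prev[j - 1] + (0 if p[i - 1] == r[j - 1] else sub_cost)
            curr.append(min(prev[j] + indel_cost, curr[j - 1] + indel_cost, sub))
        prev = curr
    return prev[-1]

print(dice({"type:chair", "colour:red"}, {"type:chair", "size:large"}))              # 0.5
print(masi({"type:chair", "colour:red"}, {"type:chair", "colour:red", "size:large"}))  # ~0.44
print(string_edit("the red chair", "the large red chair"))                           # 1
```

The monotonicity term is what produces the subset bias mentioned under measure 3; SEB additionally normalises the edit distance for length, which is not shown here.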
4 Results

Results for all evaluation measures and all systems are shown in Table 1. Uniqueness results are not included, as all systems scored 100%.

We ran univariate analyses of variance (ANOVAs) using SYSTEM as the independent variable (15 levels), testing its effect on the extrinsic task-performance measures. For error rate (ER), we used a Kruskal-Wallis ranks test to compare identification accuracy rates across systems (a non-parametric test was more appropriate given the large number of zero values in the ER proportions and the high dependency of variance on the mean). The main effect of SYSTEM was significant on RIT (F(14, 2249) = 6.401, p < .001), RT (F(14, 2249) = 2.56, p < .01), and IT (F(14, 2249) = 1.93, p < .01). In neither experiment was there a significant effect on ER.

           RIT      RT       IT       ER-1   ER-2   Min    R-SU4  R-2    NIST   BLEU   SE     SEB    Dice   MASI
CAM-B      2784.80  1309.07  1952.39  9.33   5.33   8.11   .673   .647   2.70   .309   4.42   .307   .620   .403
CAM-BU     2659.37  1251.32  1877.95  9.33   4      10.14  .663   .638   2.61   .317   4.23   .359   .630   .420
CAM-T      2626.02  1475.31  1978.24  10     5.33   0      .698   .723   3.50   .415   3.67   .496   .725   .560
CAM-TU     2572.82  1297.37  1809.04  8.67   4      0      .677   .691   3.28   .407   3.71   .494   .721   .557
DIT-DS     2785.40  1304.12  1859.25  10.67  2      0      .651   .679   4.23   .457   3.55   .525   .750   .595
GR-FP      2724.56  1382.04  2053.33  8.67   3.33   4.73   .65    .649   3.24   .358   3.87   .441   .689   .480
GR-SC      2811.09  1349.05  1899.59  11.33  2      4.73   .644   .644   2.42   .305   4      .431   .671   .466
IS-FBN     3570.90  1837.55  2188.92  15.33  6      1.35   .771   .772   4.75   .521   3.15   .438   .770   .601
IS-FBS     -        1461.45  2181.88  -      7.33   100    .485   .448   2.11   .166   5.53   .089   .368   .182
*IS-FBS    4008.99  -        -        10     -      39.86  -      -      -      -      -      -      .527   .281
IS-IAC     2844.17  1356.15  1973.19  8.67   6      0      .612   .623   3.77   .442   3.43   .559   .746   .597
NIL        1960.31  1482.67  1960.31  10     5.33   20.27  .525   .509   3.32   .32    4.12   .447   .625   .477
T-AS+      2652.85  1321.20  1817.30  9.33   4.67   0      .671   .684   2.62   .298   4.24   .37    .660   .452
T-AS       2864.93  1229.42  1766.35  10     4.67   0      .683   .692   2.99   .342   4.10   .393   .645   .422
T-RS+      2759.76  1278.01  1814.93  6.67   1.33   0      .677   .697   2.85   .303   4.32   .36    .669   .459
T-RS       2514.37  1255.28  1866.94  8.67   4.67   0      .694   .711   3.16   .341   4.18   .383   .655   .432

Table 1: Results for all systems and evaluation measures (ER-1 = error rate in Exp 1, ER-2 = error rate in Exp 2; R = ROUGE; system IDs as in the ASGRE papers, except GR = GRAPH and T = TITCH; '-' = not available for that version of IS-FBS). Columns are grouped into extrinsic task-performance measures (RIT, RT, IT, ER-1, ER-2), the REG-specific measure Minimality (Min), string-similarity measures (R-SU4, R-2, NIST, BLEU, SE, SEB) and set-similarity measures (Dice, MASI).
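The per-system significance tests just described can be reproduced in outline as follows. This is a generic sketch using scipy.stats on invented per-trial data (150 trials for each of 15 systems, matching the 2,250 trials per experiment); it is not the analysis code actually used, and the variable names and simulated distributions are placeholders.

```python
# Sketch of the tests described above: one-way ANOVA on a timing measure,
# Kruskal-Wallis on the error rates. All data below is simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
systems = [f"sys{i}" for i in range(15)]

# Hypothetical per-trial observations: one array per system.
timings = {s: rng.normal(2700 + 20 * i, 400, size=150) for i, s in enumerate(systems)}
errors = {s: rng.binomial(1, 0.05 + 0.002 * i, size=150) for i, s in enumerate(systems)}

# One-way ANOVA: effect of SYSTEM on a timing measure (e.g. RIT or RT).
f_stat, p_anova = stats.f_oneway(*timings.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Kruskal-Wallis ranks test: effect of SYSTEM on identification error,
# preferred here because of the many zero values in the error proportions.
h_stat, p_kw = stats.kruskal(*errors.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```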
Table 2 shows correlations between the automatic metrics and the task-performance measures from Exp 2. RIT and ER-1 are not included because of the presence of *IS-FBS in Exp 1 (but see individual results below). For reasons of space, we refer the reader to the table for individual correlation results.

We also computed correlations between the task-performance measures across the two experiments (leaving out the IS-FBS system). The correlation between RIT and RT was .827**; between RIT and IT, .675**; and there was no significant correlation between the error rates. The one difference evident between RT and IT is that ER correlates only with IT (not RT) in Exp 2 (see Table 2).

        RT      IT      ER-2   | Min    | R-SU4  R-2     NIST    BLEU    SE      SEB    | Dice    MASI
RT      1       .8**    .46    | .18    | .10    .05     .54*    .39     -.30    .02    | .12     .23
IT      .8**    1       .59*   | .56*   | -.24   -.33    .22     .04     .09     -.31   | -.28    -.17
ER-2    .46     .59*    1      | .51    | -.29   -.36    .03     -.08    .22     -.34   | -.39    -.29
------------------------------------------------------------------------------------------------------
Min     .18     .56*    .51    | 1      | -.76** -.81**  -.46    -.66**  .79**   -.8**  | -.90**  -.79**
------------------------------------------------------------------------------------------------------
R-SU4   .10     -.24    -.29   | -.76** | 1      .98**   .45     .63*    -.63*   .42    | .72**   .57*
R-2     .05     -.33    -.36   | -.81** | .98**  1       .51     .68**   -.69**  .53*   | .78**   .65**
NIST    .54*    .22     .03    | -.46   | .45    .51     1       .94**   -.84**  .68**  | .74**   .82**
BLEU    .39     .04     -.08   | -.66** | .63*   .68**   .94**   1       -.96**  .82**  | .89**   .93**
SE      -.30    .09     .22    | .79**  | -.63*  -.69**  -.84**  -.96**  1       -.92** | -.96**  -.97**
SEB     .02     -.31    -.34   | -.8**  | .42    .53*    .68**   .82**   -.92**  1      | .92**   .95**
------------------------------------------------------------------------------------------------------
Dice    .12     -.28    -.39   | -.90** | .72**  .78**   .74**   .89**   -.96**  .92**  | 1       .97**
MASI    .23     -.17    -.29   | -.79** | .57*   .65**   .82**   .93**   -.97**  .95**  | .97**   1

Table 2: Pairwise correlations between all automatic measures and the task-performance results from Exp 2 (* = significant at .05; ** = significant at .01; R = ROUGE). Rows and columns are grouped, as in Table 1, into extrinsic (RT, IT, ER-2), REG-specific (Min), string-similarity (R-SU4, R-2, NIST, BLEU, SE, SEB) and set-similarity (Dice, MASI) measures.
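Pairwise correlations of the kind reported in Table 2 can be computed along the following lines. This is a minimal sketch that assumes Pearson's r with two-tailed p-values (via scipy.stats.pearsonr); the choice of coefficient is our assumption for illustration, and the scores dictionary only excerpts the first five systems' RT, IT, Dice and MASI values from Table 1, whereas the reported correlations are over all systems and measures.

```python
# Sketch of a correlation table like Table 2: Pearson's r over per-system
# scores, with * / ** marking two-tailed significance at .05 / .01.
from itertools import combinations
from scipy.stats import pearsonr

# Excerpt of Table 1 (first five systems) used purely as example input.
scores = {
    "RT":   [1309.07, 1251.32, 1475.31, 1297.37, 1304.12],
    "IT":   [1952.39, 1877.95, 1978.24, 1809.04, 1859.25],
    "Dice": [0.620, 0.630, 0.725, 0.721, 0.750],
    "MASI": [0.403, 0.420, 0.560, 0.557, 0.595],
}

def stars(p):
    """Significance marker in the style of Table 2."""
    return "**" if p < 0.01 else "*" if p < 0.05 else ""

for m1, m2 in combinations(scores, 2):
    r, p = pearsonr(scores[m1], scores[m2])
    print(f"{m1:>5} vs {m2:<5}  r = {r:+.2f}{stars(p)}")
```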
5 Discussion

In Table 2, the four broad types of metrics we have investigated (task-performance, REG-specific, string similarity, set similarity) are indicated by vertical and horizontal lines. The results within each of the resulting boxes are very homogeneous. There are significant (and mostly strong) correlations not only among the string-similarity metrics and among the set-similarity metrics, but also across the two types. There are also significant correlations between the three task-performance measures. However, the correlations between the task-performance measures and all others are weak and not significant. The one exception is the correlation between NIST and RT, which is actually in the wrong direction (better NIST implies worse reading times). This is an unambiguous result, and it shows clearly that similarity to human-produced reference texts is not necessarily indicative of quality as measured by human task performance.

The emergence of comparative evaluation in NLG raises the broader question of how systems that generate language should be compared. In MT and summarisation it is more or less taken as read that systems which generate more human-like language are better systems. However, it has not been shown that more human-like outputs result in better performance from an extrinsic perspective. Intuitively, it might be expected that higher humanlikeness entails better task performance (here, shorter reading/identification times and lower error rates). The lack of significant covariation between intrinsic and extrinsic measures in our experiments suggests otherwise.

6 Conclusions

Our aim in this paper was to shed light on how the intrinsic evaluation methodologies that dominate current comparative HLT evaluations correlate with human task-performance evaluations more in keeping with NLG traditions. We used the data and systems from the recent ASGRE Challenge, and compared a total of 17 different evaluation methods for 15 different systems implementing the ASGRE task. Our most striking result is that none of the metrics that assess humanlikeness correlate with any of the task-performance measures, while strong correlations are observed within the two classes of measures, intrinsic and extrinsic. Somewhat worryingly, our results show that a system's ability to produce human-like outputs may be completely unrelated to its effect on human task performance.

Our main conclusions for REG evaluation are that we need to be cautious in relying on humanlikeness as a quality criterion, and that we leave extrinsic evaluation behind at our peril as we move towards more comparative forms of evaluation. Given that the intrinsic metrics that dominate in competitive HLT evaluations are not assessed in terms of correlation with extrinsic notions of quality, our results sound a more general note of caution about using intrinsic measures (and humanlikeness metrics in particular) without extrinsic validation.

Acknowledgments

We gratefully acknowledge the contribution made to the evaluations by the faculty and staff at Brighton University who participated in the identification experiments. Thanks are also due to Robert Dale, Kees van Deemter, Ielka van der Sluis and the anonymous reviewers for very helpful comments. The biggest contribution was, of course, made by the participants in the ASGRE Challenge, who created the systems involved in the evaluations.

References

S. Bangalore, O. Rambow, and S. Whittaker. 2000. Evaluation metrics for generation. In Proceedings of the 1st International Conference on Natural Language Generation (INLG'00), pages 1-8.

A. Belz and A. Gatt. 2007. The attribute selection for GRE challenge: Overview and evaluation results. In Proceedings of the 2nd UCNLG Workshop: Language Generation and Machine Translation (UCNLG+MT), pages 75-83.

A. Belz and E. Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In Proceedings of EACL'06, pages 313-320.

R. Dale and E. Reiter. 1995. Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(2):233-263.

R. Dale. 1989. Cooking up referring expressions. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics.

G. Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the ARPA Workshop on Human Language Technology.

A. Gatt, I. van der Sluis, and K. van Deemter. 2007. Evaluating algorithms for the generation of referring expressions using a balanced corpus. In Proceedings of the 11th European Workshop on Natural Language Generation (ENLG'07), pages 49-56.

I. Langkilde. 2002. An empirical verification of coverage and correctness for a general-purpose sentence generator. In Proceedings of the 2nd International Natural Language Generation Conference (INLG'02).

C.-Y. Lin and E. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL 2003, pages 71-78.

R. Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the 5th Language Resources and Evaluation Conference (LREC'06).