A Compositional Context Sensitive Multi-document Summarizer: Exploring the Factors That Influence Summarization

Ani Nenkova, Stanford University (anenkova@stanford.edu)
Lucy Vanderwende, Microsoft Research (lucyv@microsoft.com)
Kathleen McKeown, Columbia University (kathy@cs.columbia.edu)

ABSTRACT

The usual approach for automatic summarization is sentence extraction, where key sentences from the input documents are selected based on a suite of features. While word frequency is often used as a feature in summarization, its impact on system performance has not been isolated. In this paper, we study the contribution to summarization of three factors related to frequency: content word frequency, composition functions for estimating sentence importance from word frequency, and adjustment of frequency weights based on context. We carry out our analysis using datasets from the Document Understanding Conferences, studying not only the impact of these factors on automatic summarizers, but also their role in human summarization. Our research shows that a frequency-based summarizer can achieve performance comparable to that of state-of-the-art systems, but only with a good composition function; context sensitivity improves performance and significantly reduces repetition.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms: Measurement, Experimentation, Human Factors

Keywords: multi-document summarization, frequency, compositionality, context-sensitivity

1. INTRODUCTION

Most current automatic summarization systems rely on sentence extraction, where key sentences in the input documents are selected to form the summary (descriptions of recent systems can be found in the online proceedings of the Document Understanding Conference, http://duc.nist.gov). Even systems that go beyond sentence extraction, reformulating or simplifying the text of the original articles, must decide which sentences should be simplified, compressed, fused together or rewritten [10, 11, 28, 2, 6]. Common approaches for identifying important sentences to include in the summary include training a binary classifier (e.g., [12]), training a Markov model (e.g., [4]), or directly assigning weights to sentences based on a variety of features and heuristically determined feature weights (e.g., [26, 14]). But the question of which components and features of automatic summarizers contribute most to their performance has largely remained unanswered [18].

In this paper, we examine several design decisions and the impact they have on the performance of generic multi-document summarizers of news. More specifically, we study the following issues:

Content word frequency. Word frequency is a feature that has been used in many summarization systems and originated in the earliest summarization research [17]. In this approach, content words such as nouns, verbs and adjectives serve as surrogates for the atomic units of meaning in text.
While frequency has been used as a feature in many summarization systems, no study has isolated its impact on system performance. Only recently have large test sets for evaluation become available, through the annual Document Understanding Conference (DUC) run by NIST, enabling such an analysis of performance; and by the time DUC began, most systems were using a combination of features rather than frequency alone. In this paper, we study the contribution of content word frequency in the input to system performance, showing that content word frequency also plays a role in human summarization behavior.

Choice of composition function. The frequency, and thus the importance, of content words can easily be estimated from the input to a summarizer. But is this enough to build a summarization system? Normally, a summarizer produces readable text as a summary, not a list of keywords, and thus it must estimate the importance of larger text units, typically sentences. A composition function needs to be chosen that estimates the importance of a sentence as a function of the importance of the content words that appear in it. There are many possibilities for the choice of composition function, and in Section 3 we discuss three of them, showing that the choice can have a significant impact on the performance of the summarizer, ranging from close to baseline performance to overall state-of-the-art performance.

Context sensitivity. The notion of importance is not static: it depends on what has already been said in the summary. Context adjustment is especially important for multi-document summarization (MDS), where the input consists of many articles on the same topic. Several articles might contain sentences expressing the same information, and it is possible that they all receive high importance weights, so that the summary contains repetitive information. Avoiding repetition in the summary is a goal of summarization systems, since the very purpose of the summary is to reduce redundancy. We propose a method for context sensitivity and pinpoint its contribution to multi-document summarization performance. In Section 4 we show how context adjustment improves content selection and reduces repetition in the summary.

We now proceed to a detailed discussion of these three aspects in the following sections.

2. FREQUENCY IN HUMAN SUMMARIES

One of the issues studied since the inception of automatic summarization in the 60s is that of human agreement [24]: different people choose different content for their summaries [27, 23, 19]. More recently, others have studied the degree of overlap between input documents and human summaries [5, 1]. The natural question that arises if we combine the two types of studies is whether features of the input can allow us to predict what content humans would choose for a summary, and what content they would agree on. If such predictors are identified, they could be used as features for content selection by an automatic system. In this section, we focus on frequency, investigating the association between content that appears frequently in the input and the likelihood that it will be selected by a human summarizer for inclusion in a summary. This question is especially important for the multi-document summarization task, where the input consists of several articles on the same topic and usually contains a considerable amount of repetition of the same facts across documents. We first discuss the link between frequency in the input at the word level and the appearance of words in human summaries (Section 2.1), and then look at frequency at a semantic level, using manually identified semantic content units (Section 2.2).
2.1 Content word frequency and importance

In order to study how frequency influences human summarization choices, we used the 30 test sets for the multi-document summarization task from the large-scale common data set evaluation conducted within DUC 2003. For each set, the input for summarization was available, along with four human abstracts of the input and the summaries produced by the automatic summarizers that participated in the conference that year. Each of the inputs contained around 10 documents and the summaries were 100 words long. The counts for frequency in the input were taken over the concatenation of the documents in the input set. The following instructions had been given to the human summarizers: "To write this summary, assume you have been given a set of stories on a news topic and that your job is to summarize them for the general news sections of the Washington Post. Your audience is the educated adult American reader with varied interests and background in current and recent events."

2.1.1 Words frequent in the input appear in human summaries

We first turn to the question: are content words that are very frequent in the input likely to appear in at least one of the human summaries? We exclude stop words from consideration in this study and use only nouns, verbs and adjectives. Table 1 shows the percentage of the N most frequent content words from the input documents that also appear in the human models, for N = 5, 8, 12. In order to compare how many of these matches are achieved by a good automatic summarizer, we picked one of the top-performing summarizers and computed how many of the N most frequent words from the input documents appeared in its automatic summaries; these numbers are shown in the second row of Table 1. For example, the table shows that, across the 30 sets, 95% of the five most frequent content words in the input were also used in at least one of the human summaries, while the automatic summarizer used only 84% (first column of Table 1). A comparison of this nature is helpful because in the commonly used intrinsic evaluations for summarization (discussed in more detail in Section 5), automatic summaries are evaluated by measuring their overlap with multiple human summaries (models).

Table 1: Percentage of the N most frequent words from the input documents that appear in the four human models and in a state-of-the-art automatic summarizer (average across the 30 input sets).

Used by   5 most freq   8 most freq   12 most freq
Human     94.66%        91.25%        85.25%
Machine   84.00%        77.87%        66.08%

Two observations can be made about the table:

1. The high-frequency words from the input are very likely to appear in the human models: the more frequent a word is in the input, the more likely it is to appear in a human summary. This confirms that frequency is one of the factors that influence a human's decision to include specific content in a summary. Observing frequency in the input probably also helps the writers indirectly resolve other constraints, such as personal interests and background knowledge.

2. For the automatic summarizer, the trend to include more frequent words is preserved: the automatic summaries include 84% of the 5 most frequent words in the input, 78% of the 8 most frequent, and 66% of the 12 most frequent. But the numbers are lower than those for the human summaries, and the overlap between the machine summary and the human models could be improved if the inclusion of these most frequent words were targeted. As we will show later, it is possible to develop a summarizer that includes a percentage of the most frequent words equivalent to that in the four human summaries taken together. Trying to maximize the number of matches with the human models is reasonable, since on average across the 30 sets the machine summary contained 30 content words that did not match any word in a human model. (Even though no rigorous study of the issue has been done, these non-matching content words can be considered to describe "off-topic" events; this is consistent with the quality evaluation of machine summaries, in which human judges perceived more than half of the summary content to be "unnecessary, distracting or confusing.")
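To make the Table 1 quantity concrete, the following sketch shows one way such coverage percentages could be computed. It is an illustration only, not the authors' code: it uses a small stop-word list and a regular-expression tokenizer as a rough stand-in for the part-of-speech filtering described above, and the documents and summaries shown are toy examples.

```python
import re
from collections import Counter

# A tiny stop list stands in for the noun/verb/adjective filter used in the paper.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was", "for", "that", "with"}

def content_words(text):
    """Lowercased alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def top_n_coverage(input_docs, human_summaries, n):
    """Fraction of the n most frequent input content words that occur in
    at least one human summary (the quantity reported in Table 1)."""
    freq = Counter(content_words(" ".join(input_docs)))
    top = [w for w, _ in freq.most_common(n)]
    summary_vocab = set(content_words(" ".join(human_summaries)))
    return sum(w in summary_vocab for w in top) / len(top)

# Toy usage; in the study this value is averaged over the 30 DUC 2003 sets for n = 5, 8, 12.
docs = ["Pinochet was arrested in London.",
        "The arrest of Pinochet in Britain caused international controversy."]
models = ["Pinochet was arrested in Britain."]
print(top_n_coverage(docs, models, 5))
```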
2.1.2 Humans agree on words that are frequent in the input

In the previous section we observed that the high-frequency words in the input tend to appear in some human model. But are high-frequency words also the words that humans agree on, appearing in many human summaries? In other words, we partition the words in the input into five classes C_n depending on how many human summaries they appear in, n = 0,...,4, and check whether a higher class number is associated with higher frequency in the input. A word falls in C0 if it does not appear in any of the human summaries, in C1 if it appears in only one human summary, and so on. We are then interested in how frequent the words in each class were in the respective input. We found that the words that human summarizers agreed to use in their summaries do include the high-frequency ones, and that the words appearing in only one human summary tend to be low-frequency words, as can be seen in Table 2. The content words used by all four summarizers (class C4) had an average frequency in the input of 31, while the words that never appeared in a human summary appeared on average about twice in the entire input of ten articles. In the 30 sets of DUC 2003 data, the state-of-the-art machine summary contained 69% of the words appearing in all 4 human models and 46% of the words that appeared in 3 models. This indicates that high-frequency words, which human summarizers tend to select and which are thus rewarded, for example, during automatic evaluation, are missing from the machine summary.

Table 2: C_i (first column) is the class of words that appear in i human summaries, Average |C_i| (second column) is the average size of class C_i, and the third column gives the average input frequency of words in each class. The averages are computed over the 30 DUC'03 test sets.

Class C_i   Average |C_i|   Average frequency
C4          7               31
C3          11              14
C2          24              9
C1          82              5
C0          1115            2

2.1.3 Formalizing frequency: the multinomial model

The findings from the previous sections suggest that frequency in the input is strongly indicative of whether a word will be used in a human summary. We start by assessing the plausibility of a formal method capturing the relation between the occurrence of content words in the input and in summaries, by modeling the appearance of words in the summary under a multinomial distribution estimated from the input. That is, for each word w in the input vocabulary, we associate a probability p(w) of it being emitted into a summary. Clearly, words with high frequency in the input will be assigned high emission probabilities. The likelihood of a summary then is

L[sum; p(w_i)] = N! / (n_1! ... n_r!) * p(w_1)^{n_1} * ... * p(w_r)^{n_r}    (1)

where N is the number of words in the summary, r is the number of unique words in the summary, n_1 + ... + n_r = N, and for each i, n_i is the number of times word w_i appears in the summary and p(w_i) is the probability of w_i appearing in the summary, estimated from the input documents.

In order to confirm the hypothesis that human summaries have high likelihood under a multinomial model, we computed the log-likelihood log L[sum; p(w_i)] of all human and machine summaries from DUC'03 (see Table 3). There were 30 summaries from each system and 12 summaries from each person. The log-likelihood is computed rather than the likelihood in order to avoid numeric problems such as underflow for very small probabilities. If human summaries have higher likelihood under the model than machine ones, we can conclude that a multinomial model captures more aspects of the human summarization process than of that of current automatic summarizers. And indeed, the log-likelihoods of summaries produced by human summarizers were overall higher than those of summaries produced by systems, and the fact that the top five log-likelihood scores belong to humans indicates that some humans indeed employ a summarization strategy informed by frequency. (Other humans might follow other strategies, such as giving maximum coverage of the topics mentioned in the input, even those mentioned only once; judging from an examination of his summaries, Human10 appears to follow such a strategy.)

Table 3: Average log-likelihood of the summaries of human and automatic summarizers in DUC'03. All summaries were truncated to 80 words to neutralize the effect of deviations from the required summary length of 100 words.

Summarizer   Log-likelihood   |   Summarizer   Log-likelihood
Human1       -198.65          |   System6      -213.65
Human2       -205.90          |   Human9       -215.65
Human3       -205.91          |   System7      -215.92
Human4       -206.21          |   System8      -216.04
Human5       -206.37          |   System9      -216.20
System1      -208.21          |   System10     -216.24
Human6       -208.23          |   System11     -218.53
Human7       -208.90          |   System12     -219.21
System2      -210.14          |   System13     -220.31
System3      -211.06          |   System14     -220.93
Human8       -211.95          |   System15     -223.03
System4      -212.57          |   System16     -225.20
System5      -213.08          |   Human10      -227.17
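A minimal sketch of the log-likelihood computation in Equation (1) follows. It assumes the emission probabilities p(w_i) are maximum-likelihood estimates from the input; the floor probability for summary words unseen in the input is an assumption made for the sketch, since the paper does not spell out a smoothing scheme, and the word lists shown are toy data.

```python
import math
from collections import Counter

def multinomial_log_likelihood(summary_words, input_words, floor=1e-6):
    """log L[sum; p(w_i)] from Equation (1): the log multinomial coefficient
    plus sum_i n_i * log p(w_i), with p(w_i) estimated from the input.
    Summary words absent from the input receive a small floor probability
    (an assumption; not specified in the paper)."""
    input_counts = Counter(input_words)
    total = sum(input_counts.values())
    summary_counts = Counter(summary_words)
    n = sum(summary_counts.values())
    # log of the multinomial coefficient N! / (n_1! ... n_r!)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in summary_counts.values())
    log_prob = 0.0
    for w, c in summary_counts.items():
        p = max(input_counts.get(w, 0) / total, floor)
        log_prob += c * math.log(p)
    return log_coeff + log_prob

# Toy usage with hypothetical token lists.
input_tokens = ["pinochet", "arrest", "britain", "pinochet", "court", "extradition"]
summary_tokens = ["pinochet", "arrest", "britain"]
print(multinomial_log_likelihood(summary_tokens, input_tokens))
```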
2.2 Frequency of semantic content units

We established that high-frequency content words in the input are very likely to be used in human summaries, and that different human summarizers reach a consensus about their inclusion. But the co-occurrence of words in the inputs and the human summaries does not necessarily entail that the same facts have been covered. A better granularity for such an investigation is the semantic content unit, an atomic fact expressed in a text, such as the summary content units that form the basis of the pyramid method used for evaluation in the most recent DUC [19, 22]. In this annotation procedure, the content units are manually annotated (using a convenient visualization tool, DUCView), and expressions with the same meaning are linked together, even when there are differences in wording. For example, two documents can contain the sentences "Pinochet was arrested in the UK" and "Pinochet's arrest in Britain caused international controversy".
While the wording is not exactly the same, both sentences express the content units Pinochet was arrested and The arrest took place in Britain.

Evans and McKeown [8] annotated 11 sets of DUC 2004 input documents and human-written summaries for content units following the pyramid approach. Based on their annotation, we were able to measure how predictive the frequency of a content unit in the documents is of its selection for a human summary. As in our study of words, we looked at the N most frequent content units in the inputs and calculated the percentage of these that appeared in any of the human summaries. Similarly to the case of words, 96% of the 5 most frequent content units appeared in a human summary across the 11 sets; the respective percentages for the top 8 and top 12 content units were 92% and 85%. Thus content unit frequency is highly predictive of inclusion in a human summary, with the percentage of high-frequency content units expressed in human summaries almost identical to the percentage for content words presented in Table 1. Content units that are expressed in more human summaries also occurred more often in the input, in agreement with the conclusion we drew from the analogous investigation at the word level.

In an additional experiment to confirm the hypothesis that the frequency of content units is a predictive feature for summarization, we used the summarizer evaluation based on the 11 sets reported in [7], and computed the correlation between the weight of a content unit in the input documents (equal to the number of times the content unit was expressed in the input, i.e., its frequency) and the content unit weight from the human summaries (equal to the number of summarizers that expressed the content unit in their summaries of that input). The Pearson correlation coefficient between the input and human summary weights is 0.64 (p-value of approximately 0), strongly indicating that content units repeated in several documents are likely to be picked in consensus by several humans, and showing that frequency in the input helps predict human agreement in terms of content units. The less than perfect correlation shows that there are other factors at play influencing human content selection decisions, which is not surprising; their discovery will be the focus of future work.
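Once the two weight vectors have been read off the annotation, the correlation reported above takes only a few lines to reproduce. The sketch below is illustrative: the weights shown are hypothetical, and SciPy's pearsonr is used simply as a convenient implementation of the Pearson coefficient.

```python
from scipy.stats import pearsonr

# Hypothetical weights for a handful of annotated content units:
#   input_weight   = number of times the unit is expressed in the input documents
#   summary_weight = number of human summarizers expressing the unit
input_weight = [6, 5, 4, 3, 2, 2, 1, 1]
summary_weight = [4, 4, 3, 2, 1, 2, 0, 1]

r, p_value = pearsonr(input_weight, summary_weight)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```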
3. COMPOSITION FUNCTIONS

Now that we have shown that frequency is a good predictor of content in human summaries and that human summaries have higher likelihood under a multinomial model, how can we extend these empirical findings to building a summarizer? The question is not trivial: the frequency of content words can easily be obtained from the input, but how should word frequencies be combined to obtain an estimate of the importance of sentences, the usual units for extraction in summarization? We can define a family of summarizers, SUM_CF, where CF is the composition function yielding the importance of a sentence based on the words it contains. Different choices of CF give different summarizers from the frequency-based summarizer family. Below we outline the overall summarization algorithm and discuss possible choices of CF.

Context-sensitive frequency-based summarizer

Step 1: Compute the probability distribution over the words w_i appearing in the input, p(w_i) for every i: p(w_i) = n/N, where n is the number of times the word appeared in the input and N is the total number of content word tokens in the input. Only verbs, nouns, adjectives and numbers are considered in the computation of the probability distribution. (If part-of-speech tags were unavailable, a simple stop-word list could be used instead to decide which words count as content words.)

Step 2: Assign an importance weight to each sentence S_j in the input as a function of the importance of its content words: Weight(S_j) = CF[{p(w_i) : w_i in S_j}].

Step 3: Pick the best-scoring sentence under the scoring function CF from the previous step.

Step 4: If the desired summary length has not been reached, go back to Step 2.

Different summarizers SUM_CF can be obtained by making different choices for the composition function CF. Three obvious candidates for CF are:

Product (CF_Prod):  Weight(S_j) = prod_{w_i in S_j} p(w_i)
Average (CF_Avr):   Weight(S_j) = ( sum_{w_i in S_j} p(w_i) ) / |{w_i : w_i in S_j}|
Sum (CF_Sum):       Weight(S_j) = sum_{w_i in S_j} p(w_i)

Each of these choices of CF leads to a different frequency-based summarizer, and we will see that the specific choice has a huge impact on the performance of the summarizer; not all frequency-based summarizers perform well.

How does a summarizer SUM_CF do in terms of inclusion of top-frequency words compared to humans and other top-performing systems? Table 4 shows the percentage of the N most frequent words from the DUC'03 documents that also appear in SUM_Avr summaries. As expected, these percentages are much higher than those of the non-frequency-oriented machine summarizer; moreover, they are even higher than in all four human models taken together.

Table 4: Percentage of the N most frequent words from the input documents that appear in one of the four human models, in a state-of-the-art machine summarizer, and in SUM_Avr, a new frequency-based machine summarizer that uses the average as a composition function.

Used by   5 most freq   8 most freq   12 most freq
Human     94.66%        91.25%        85.25%
Machine   84.00%        77.87%        66.08%
SUM_Avr   96.00%        95.00%        90.83%

4. CONTEXT ADJUSTMENT

Using frequency alone to determine summary content in multi-document summarization will result in a repetitive summary. We can adjust the algorithm to account for the information included so far by adding Step 3.5.

Step 3.5: For each word w_i in the sentence chosen at Step 3, update its probability by setting it to a very small number close to 0 (we used 0.0001). This serves a threefold purpose:

1. It gives the summarizer sensitivity to context: the notion of what is most important to include in the summary changes depending on what information has already been included.

2. By updating the probabilities in this intuitive way, we also allow words with initially low probability to have a higher impact on the choice of subsequent sentences. Looking back at Table 2, this is a reasonable goal, since the large class of words that were expressed in only one model were not that frequent; that is, content that humans will not necessarily agree on, but that is still good for inclusion, is not characterized by high frequency.

3. The update of word probabilities gives a natural way to deal with the redundancy of the multi-document input. In fact, in terms of content units, inclusion of the same unit twice in the same summary becomes rather improbable. As we see in the following evaluation section, no further checks for duplication seem to be necessary.
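Step 3.5 amounts to a small change in the selection loop of the preceding sketch. The variant below reuses the imports and the words() and CF helpers defined in that sketch; the 0.0001 floor is the value reported above, while the sentence splitter and word filter remain illustrative simplifications rather than the authors' implementation.

```python
def summarize_with_context(documents, cf="avr", max_words=100, floor=0.0001):
    """Greedy frequency summarizer with Step 3.5: after a sentence is chosen,
    the probability of every word it contains is dropped to a value near zero,
    so later sentences repeating those words are unlikely to be selected."""
    counts = Counter(w for doc in documents for w in words(doc))
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}          # Step 1
    sentences = [s.strip() for doc in documents
                 for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    summary, length = [], 0
    while length < max_words and sentences:
        # Step 2: rescore the remaining sentences under the (now updated) p.
        scored = [(CF[cf]([p.get(w, 0.0) for w in words(s)]), s)
                  for s in sentences if words(s)]
        if not scored:
            break
        best = max(scored)[1]                               # Step 3
        summary.append(best)
        length += len(best.split())
        sentences.remove(best)
        # Step 3.5: context adjustment.
        for w in words(best):
            p[w] = floor
    return " ".join(summary)                                # Step 4 loops above
```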
In the next section, we evaluate the algorithm both with and without Step 3.5, showing that when it is removed, the summarizer does worse on content selection and there is a substantial increase in information repetition in the summary.

5. EVALUATION RESULTS

To evaluate the performance of the three SUM_CF summarizers, both with and without the context-sensitive adjustment, we use the test data from two large common data set evaluation initiatives: the 50 test sets for the multi-document summarization task of DUC 2004, and the common test set provided in the 2005 Machine Translation and Summarization Evaluation (MSE) initiative. Both tasks required a generic 100-word summary of several related articles, but in the MSE task some of the input consisted of machine-translated text.

Document Understanding Conference

We used the data from DUC 2003 for development and the data from DUC 2004 as test data, which we report on here. We tested the SUM_CF family of summarizers on the 50 sets from the generic summarization task of DUC 2004. Even before analyzing quantitative metrics, we can see that the choice of composition function CF has a significant impact on summarizer behavior. One would expect the probabilistic summarizer SUM_Prod to favor shorter sentences: as a sentence gets longer, its overall weight involves the multiplication of more word probabilities (numbers between 0 and 1), so longer sentences will have lower weight. Exactly the opposite is expected from SUM_Sum, which assigns each sentence a weight equal to the sum of the probabilities of its words: the more words in the sentence, the higher its weight will tend to be. SUM_Avr is a compromise between the two extremes. To confirm this intuition about the behavior of the summarizers depending on the choice of CF, we looked at the length in sentences of the summaries they produced. Table 5 shows the number of sentences across the 50 summaries produced by each of the systems. Our intuition is confirmed: SUM_Sum produces summaries of about three sentences and SUM_Prod about five sentences per summary, for the same size in words. The average human summary for the same topics has around four sentences, close to that of SUM_Avr.

Table 5: Number of sentences in the systems' summaries: the choice of composition function CF affects a system's preference for longer or shorter sentences, with SUM_Avr the most balanced.

System     # of sentences   Sentences per summary
SUM_Prod   270              5.40
SUM_Avr    223              4.46
SUM_Sum    155              3.10
For the evaluation, we use the ROUGE-1 automatic metric, which has been shown to correlate well with human judgments based on comparison with a single model [15, 13] and which was found to have one of the best correlations with human judgment on the DUC 2004 data [21] among the several possible automatic metrics. In addition, we report the ROUGE-2 and ROUGE-SU4 metrics, which were used as official automatic evaluation metrics for MSE 2005 and DUC 2005. The results are obtained with ROUGE version 1.5.5 with the settings used for DUC 2005 (the exact parameters were -n 2 -x -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -d, with the -s option added to remove stopwords for ROUGE-1). All summaries were truncated to 100 words (space-delimited tokens) for the evaluation, as is normally done in DUC evaluations. The first column of Table 6 also lists the number of words in the 50 summaries of each system. Some systems did not generate the longest possible summary; peer 120 was an extreme example, producing summaries with an average length of 78 words. But the impact of peer summary length on the final ranking of the systems is unlikely to be big, since most systems produced summaries very close to the required 100-word limit. Table 6 shows the scores of the SUM_CF summarizers, the 16 other participating systems from DUC 2004, and the baseline, which selects the beginning of the latest article as the summary.

Table 6: DUC'04 ROUGE-1, ROUGE-2 and ROUGE-SU4 test set scores (stemmed, with stop words removed for ROUGE-1) and their 95% confidence intervals for participating systems, the baseline, and the SUM_CF summarizers.

SYSTEM                   ROUGE-1                ROUGE-2                ROUGE-SU4
peer 65 (4988)           0.305 (0.289; 0.320)   0.089 (0.081; 0.098)   0.130 (0.123; 0.137)
peer 34 (4954)           0.287 (0.271; 0.305)   0.074 (0.065; 0.083)   0.121 (0.113; 0.129)
peer 102 (4951)          0.285 (0.268; 0.303)   0.084 (0.076; 0.091)   0.126 (0.119; 0.132)
SUM_Sum (5000)           0.283 (0.267; 0.300)   0.079 (0.072; 0.087)   0.122 (0.115; 0.129)
peer 124 (4988)          0.282 (0.265; 0.300)   0.081 (0.073; 0.088)   0.123 (0.116; 0.131)
SUM_Avr (5000)           0.280 (0.265; 0.297)   0.076 (0.069; 0.084)   0.121 (0.115; 0.127)
peer 44 (4854)           0.273 (0.256; 0.290)   0.076 (0.067; 0.084)   0.119 (0.111; 0.126)
peer 81 (4994)           0.268 (0.251; 0.285)   0.078 (0.070; 0.087)   0.121 (0.113; 0.128)
peer 55 (4971)           0.262 (0.247; 0.280)   0.069 (0.062; 0.077)   0.114 (0.107; 0.121)
peer 93 (4612)           0.253 (0.235; 0.271)   0.072 (0.066; 0.080)   0.107 (0.101; 0.114)
SUM_Avr_NoAdjust (5000)  0.252 (0.235; 0.269)   0.075 (0.069; 0.083)   0.116 (0.108; 0.124)
peer 120 (3903)          0.251 (0.231; 0.271)   0.077 (0.068; 0.085)   0.108 (0.099; 0.117)
peer 117 (4997)          0.238 (0.221; 0.257)   0.057 (0.051; 0.063)   0.107 (0.100; 0.113)
peer 140 (5000)          0.239 (0.219; 0.260)   0.068 (0.060; 0.076)   0.108 (0.101; 0.116)
peer 11 (4172)           0.239 (0.218; 0.259)   0.071 (0.062; 0.080)   0.105 (0.096; 0.114)
peer 138 (5000)          0.230 (0.211; 0.253)   0.069 (0.061; 0.077)   0.106 (0.098; 0.113)
SUM_Prod (5000)          0.227 (0.210; 0.245)   0.058 (0.050; 0.065)   0.104 (0.097; 0.110)
Baseline (4899)          0.202 (0.183; 0.221)   0.061 (0.052; 0.070)   0.098 (0.092; 0.106)
peer 27 (4686)           0.185 (0.166; 0.204)   0.046 (0.039; 0.055)   0.090 (0.083; 0.098)
peer 123 (4338)          0.189 (0.173; 0.206)   0.049 (0.043; 0.056)   0.090 (0.084; 0.096)
peer 111 (5000)          0.063 (0.053; 0.073)   0.016 (0.013; 0.019)   0.057 (0.053; 0.061)

An approximate determination of which differences in scores are significant can be obtained by comparing the 95% confidence intervals of the means. Significant differences are those where the confidence intervals for the two systems either do not overlap at all, or where the two intervals overlap but neither contains the best estimate of the mean of the other system, though [25] warns that the latter criterion may indicate significance more often than it should.
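The significance convention just described is easy to state directly in code. The sketch below is an illustration of that convention, assuming each system's score comes with the 95% interval reported by ROUGE; the example values are taken from the peer 65 and SUM_Prod rows of Table 6.

```python
def significantly_different(mean_a, ci_a, mean_b, ci_b):
    """Approximate significance criterion used for Table 6: the difference counts
    as significant if the 95% confidence intervals do not overlap at all, or if
    they overlap but neither interval contains the other system's mean estimate.
    (Per [25], the second condition can indicate significance too often.)"""
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    disjoint = hi_a < lo_b or hi_b < lo_a
    neither_contains_other_mean = not (lo_a <= mean_b <= hi_a) and not (lo_b <= mean_a <= hi_b)
    return disjoint or neither_contains_other_mean

# ROUGE-1 rows for peer 65 and SUM_Prod from Table 6: intervals are disjoint.
print(significantly_different(0.305, (0.289, 0.320), 0.227, (0.210, 0.245)))  # True
```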
Several conclusions can be drawn from the table.

Comparison between SUM_CF summarizers. All three SUM_CF summarizers use word frequency in the input as a feature but differ in the composition function CF used to assign weights to sentences. SUM_Prod is a probabilistic summarizer: the weight it assigns to each sentence is in fact the probability of the sentence. SUM_Avr and SUM_Sum assign to sentences a weight equal to the average and the sum, respectively, of the probabilities of the words in the sentence; for these two, the raw frequency of words could be used instead of word probabilities. For all three automatic metrics, SUM_Prod is significantly worse than SUM_Sum and SUM_Avr and is in fact very close to baseline performance. SUM_Avr and SUM_Sum are almost identical in terms of ROUGE scores.

The effect of context adjustment. The table also lists the automatic scores for SUM_Avr_NoAdjust. This is the summarizer with composition function CF_Avr but without Step 3.5 of the summarization algorithm, which is responsible for adjusting the weights of words that appear in sentences already chosen for inclusion in the summary. All three metrics indicate that the content selection capability of the summarizer suffers when the context adjustment step is removed. According to ROUGE-1, removing the context adjustment leads to significantly lower results, while for the other two metrics the deterioration is not significant. In order to assess how much Step 3.5 affected the occurrence of repetition in the summaries, we analyzed 10 of the produced summaries for repeated content units. There were 3 repeated content units in the SUM_Avr summaries and 13 repeated content units in the SUM_Avr_NoAdjust summaries, a substantial increase.

Comparison with other DUC systems. SUM_Sum and SUM_Avr perform extremely well compared to the other DUC 2004 systems. Peer 65 is the only system that significantly outperforms them, while ten (more than half) of the other systems are significantly worse. It is worth noting that peer 65 is a supervised HMM system [4], requiring training data and parameter adjustment, while the SUM_CF summarizers are unsupervised and entirely data-driven. In sum, the SUM_CF summarizers are about as good as the best DUC 2004 participants.

Overall, SUM_Avr is the best of the SUM_CF family at balancing content selection scores and sentence length preference, and this is the summarizer we choose for later comparisons. Its sentence selection scores are comparable to those of the best DUC 2004 summarizers, it has the most success within the frequency summarizer family at avoiding repetition in the summary, and it is the least sensitive to the influence of sentence length on sentence weight.

Machine Translation and Summarization Evaluation 2005

In April 2005, a multi-document summarization evaluation task was conducted as part of the Machine Translation and Summarization Workshop at ACL (http://www.isi.edu/~cyl/MTSE2005/MLSummEval.html). The task was to produce a 100-word summary from multi-document inputs consisting of a mixture of English documents and machine translations into English of Arabic documents on the same topic. Some summarizers were modified for this task to use redundancy to correct errors in the machine translations, or to avoid MT text altogether and choose only sentences from the English input. We ran SUM_Avr without any modifications to account for the non-standard input [29].
The light-weight version of the summarizer was run, which did not require part-of-speech tags and which excluded stop words from a given stop-word list. The official evaluation metrics adopted for the workshop were the manual pyramid score, ROUGE-2 (the bigram overlap metric) and ROUGE-SU4 (skip bigram). The skip bigram metric measures the occurrence of a pair of words in their original sentence order, permitting up to four intervening words; it was originally proposed for machine translation evaluation and was shown to correlate well with human judgments both for machine translation and for summarization [13, 16]. The pyramid method was used to evaluate only 10 of the test sets, while the automatic metrics were applied to all 25 test sets. The average results for each peer on the three metrics are shown in Table 7.

Table 7: Results from the MSE evaluation. Pyramid scores and repetition are computed over 10 test sets, automatic scores over all 25 test sets. Numbers flagged with "***" are significantly different from the results for SUM_Avr. For repetition, higher numbers are worse, indicating more repetition in the summary.

System    Pyramid   ROUGE-2      ROUGE-SU4    Repetition
1         0.52859   0.13076      0.15670      1.4
28        0.48926   0.16036***   0.18627***   3.4***
19        0.45852   0.11849      0.14971***   1.3
SUM_Avr   0.45274   0.12678      0.15938      0.6
10        0.44254   0.13038      0.16568      1.2
16        0.45059   0.13355      0.16177      0.9
13        0.43429   0.08580***   0.11141***   0.4
25        0.39823   0.11678      0.15079      2.7***
4         0.37297   0.12010      0.15394      4.1***
7         0.37159   0.09654***   0.13593      0.4

For the manual pyramid scores, none of the differences between systems were significant according to a paired t-test at the 5% level of significance. This is not surprising, given the small number of test points. There were only three peers with average pyramid scores larger than that of SUM_Avr, and six systems with lower average pyramid performance. We again see that SUM_Avr is competitive with other, more sophisticated MDS systems in terms of content selection, and it is one of the best systems at avoiding repetition in the summaries. For the automatic metrics, significance was again based on the 95% confidence intervals provided by ROUGE. One system was significantly better than SUM_Avr, and for each of the automatic metrics two systems were significantly worse; the remaining differences were not significant. In Table 7, results that are significantly different from those of SUM_Avr are flagged with "***". During the annotation for pyramid scoring, the content units repeated within an automatic summary were marked up; we include in the results table the average number of repeated SCUs per summary for all systems. SUM_Avr was one of the systems with the lowest amount of repetition in its summaries, with three of the other peers including significantly more repetitive information. These results confirm our intuition that the word weight update used to adjust for context is sufficient for dealing with duplication removal. This experiment also confirms that SUM_Avr is a robust summarizer with good performance.

6. RELATED WORK

Maximal Marginal Relevance (MMR) is the method for redundancy removal mentioned most often in the context of summarization research. The method was first introduced in [3] and was applied to multi-document summarization in [9]. The MMR approach was developed primarily for information retrieval and query-focused summarization, and gives a summarizer sensitivity to context by reweighting sentences using a linear combination of the similarity between the sentence and 1) the query and 2) the summary sentences already selected; the best sentence is the one that is most similar to the query and least similar to the text that is already in the summary. In [9], the technique was used to create multi-document extracts of 25 sets of 10 articles each. The evaluation was done by computing the cosine similarity between the extract and a human model extract for the same set. In this setting, extracts produced with MMR and those produced without the technique received the same evaluation score, and thus the usefulness of the technique could not be demonstrated. Many systems use the MMR idea for generic multi-document summarization (see, for example, the online DUC 2004 proceedings), where no user query is available, by setting a single similarity parameter and rejecting all sentences whose similarity with the already chosen part of the summary exceeds this predefined threshold. An evaluation of how changing this parameter influences the quality of the summaries has not been reported. In addition to this similarity parameter, the similarity measure that is used makes a difference for the success of duplication removal, as reported in [20], which focused on the study of different similarity metrics for duplication removal.
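For comparison with the probability-update approach, a generic MMR-style reranker can be sketched as follows. This is an illustration of the general idea, not the exact formulation of [3] or [9]: the cosine similarity over bag-of-words vectors and the lambda weighting shown here are common but assumed choices.

```python
import re
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector for a sentence."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def mmr_select(candidates, query, k=3, lam=0.7):
    """Pick k sentences, each time choosing the one maximizing
    lam * sim(sentence, query) - (1 - lam) * max sim(sentence, already selected)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(s):
            redundancy = max((cosine(bow(s), bow(t)) for t in selected), default=0.0)
            return lam * cosine(bow(s), bow(query)) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```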
7. CONCLUSIONS

Our analysis using the DUC datasets shows that frequency has a powerful impact on the performance of summarization systems, provided that a good composition function is used. Our results show that averaging word probabilities yields a system that performs comparably to other state-of-the-art systems and that outperforms many of the participating systems. When context is taken into account and probabilities are adjusted once a word has already appeared in the summary, content-based performance improves, but more importantly, repetition in the summary significantly decreases. These results suggest that the more complex combinations of features used by state-of-the-art systems today may not be necessary, and that the contribution of such features needs to be precisely isolated. They highlight the fact that composition plays an important role in performance, yet remains an unknown for most state-of-the-art systems, which often do not report the composition function that was used. Furthermore, they demonstrate that repetition can be reduced within the same frequency-based model. It is worth noting that the presented summarization algorithm uses frequency in a greedy way, choosing the current best sentence at each iteration. Such an approach does not take advantage of our result that human summaries tend to have high likelihood under a multinomial model; this fact could be exploited in a global optimization algorithm, possibly leading to better results.

8. REFERENCES

[1] M. Banko and L. Vanderwende. Using n-grams to understand the nature of summaries. In Proceedings of HLT/NAACL'04, 2004.
[2] R. Barzilay and K. McKeown. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3), 2005.
[3] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98), pages 335-336, 1998.
[4] J. Conroy, J. Schlesinger, J. Goldstein, and D. O'Leary. Left-brain/right-brain multi-document summarization. In Proceedings of the 4th Document Understanding Conference (DUC'04), 2004.
[5] T. Copeck and S. Szpakowicz. Vocabulary agreement among model summaries and source documents. In Proceedings of the Document Understanding Conference (DUC'04), 2004.
[6] H. Daumé III and D. Marcu. Bayesian multi-document summarization at MSE. In Proceedings of the Workshop on Multilingual Summarization Evaluation (MSE), Ann Arbor, MI, June 29 2005.
[7] D. K. Elson. Project Logline: Rhetorical categorization for multidocument news summarization. Master's thesis, Columbia University, 2005.
[8] D. K. Evans and K. McKeown. Identifying similarities and differences across English and Arabic news. In Proceedings of the International Conference on Intelligence Analysis, 2005.
[9] J. Goldstein, V. Mittal, J. Carbonell, and J. Callan. Creating and evaluating multi-document sentence extract summaries. In CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management, pages 165-172, 2000.
[10] H. Jing and K. McKeown. Cut and paste based text summarization. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'00), 2000.
[11] K. Knight and D. Marcu. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1), 2002.
[12] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Research and Development in Information Retrieval, pages 68-73, 1995.
[13] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization, ACL'04, 2004.
[14] C.-Y. Lin and E. Hovy. Automated multi-document summarization in NeATS. In Proceedings of the Human Language Technology Conference (HLT 2002), 2002.
[15] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL 2003, 2003.
[16] C.-Y. Lin and F. J. Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), 2004.
[17] H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159-165, 1958.
[18] D. Marcu and L. Gerber. An inquiry into the nature of multidocument abstracts, extracts, and their evaluation. In Proceedings of the NAACL-2001 Workshop on Automatic Summarization, 2001.
[19] A. Nenkova and R. Passonneau. Evaluating content selection in summarization: The pyramid method. In Proceedings of HLT/NAACL 2004, 2004.
[20] E. Newman, W. Doran, N. Stokes, J. Carthy, and J. Dunnion. Comparing redundancy removal techniques for multi-document summarisation. In Proceedings of STAIRS, pages 223-228, 2004.
[21] P. Over and J. Yen. An introduction to DUC 2004: Intrinsic evaluation of generic news text summarization systems. In Proceedings of DUC 2004, 2004.
[22] R. Passonneau, A. Nenkova, K. McKeown, and S. Sigelman. Pyramid evaluation at DUC 2005. In Proceedings of the Document Understanding Conference (DUC'05), 2005.
[23] D. Radev, S. Teufel, H. Saggion, and W. Lam. Evaluation challenges in large-scale multi-document summarization. In ACL, 2003.
[24] G. J. Rath, A. Resnick, and R. Savage. The formation of abstracts by the selection of sentences: Part 1: Sentence selection by man and machines. American Documentation, 2(12):139-208, 1961.
[25] N. Schenker and J. Gentleman. On judging the significance of differences by examining the overlap between confidence intervals. The American Statistician, 55(3):182-186, 2001.
[26] B. Schiffman, A. Nenkova, and K. McKeown. Experiments in multidocument summarization. In Proceedings of the Human Language Technology Conference, 2002.
[27] H. van Halteren and S. Teufel. Examining the consensus between human summaries: Initial experiments with factoid analysis. In HLT-NAACL DUC Workshop, 2003.
[28] L. Vanderwende, M. Banko, and A. Menezes. Event-centric summary generation. In Proceedings of the Document Understanding Conference (DUC'04), 2004.
[29] L. Vanderwende and H. Suzuki. Frequency-based summarizer and a language modeling extension. In MSE 2005 common data task evaluation, 2005.