WWW 2007 / Track: Data Mining / Session: Mining Textual Data

Summarizing Email Conversations with Clue Words
Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou
Department of Computer Science, University of British Columbia, Canada
{carenini, rng, xdzhou}@cs.ubc.ca

ABSTRACT
Accessing an ever increasing number of emails, possibly on small mobile devices, has become a major problem for many users. Email summarization is a promising way to address this problem. In this paper, we propose a new framework for email summarization. One novelty is the use of a fragment quotation graph to capture an email conversation. The second novelty is the use of clue words to measure the importance of sentences in conversation summarization. Based on clue words and their scores, we propose a method called CWS, which is capable of producing a summary of any length requested by the user. We provide a comprehensive comparison of CWS with various existing methods on the Enron data set. Preliminary results suggest that CWS provides better summaries than existing methods.

Categories and Subject Descriptors: H.2.8 [Database Applications]: Data Mining
General Terms: Algorithms
Keywords: Text mining, email summarization

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION
With the ever increasing popularity of email, email overload has become a major problem for many email users [17]. Users spend a lot of time reading, replying to and organizing their emails. To help users organize their email folders, many forms of support have been proposed, including spam filtering [10], email classification [18] and email visualization [14]. In this paper, we discuss a different form of support: email summarization. The goal is to provide a concise, informative summary of the emails contained in a folder, saving the user from browsing through each email one by one. The summary is intended to be multi-granularity, in that the user can specify the size of the summary (e.g., depending on how much time the user wants to spend on the folder). Email summarization can also be valuable for users reading emails on mobile devices. Given the small screen size of handheld devices, efforts have been made to re-design the user interface; however, providing a concise summary may be just as important.
Email summarization is challenging in at least the following respects. Many emails are asynchronous responses to previous messages and as such constitute a conversation, which may be hard to reconstruct in detail. A conversation may involve many users, who may have different writing styles (e.g., short vs. long sentences, formal vs. informal). Finally, hidden emails may carry important information that should be part of the summary. As defined in [3], a hidden email is an email quoted by at least one email in the folder but not itself present in the user's folders.
Several recent attempts have been made to capture conversations by email threading. Many email programs provide features to group emails into threads using headers [6]. Research studies by Rambow et al. [9], Wan et al. [15] and Lam et al. [4] go further by using features defined on the threads to generate summaries.
While a more detailed comparison is discussed later, we believe that threading at the granularity of entire emails is not sufficient and can be significantly refined. Furthermore, none of these approaches handles hidden emails.
In this paper, we claim the following contributions. In Section 3, we propose using the fragment quotation graph to capture conversations. Based on an analysis of the quotations embedded in emails, the graph provides a fine-grained representation of the referential structure of a conversation, and it is also capable of dealing with hidden emails. In Section 4, we propose an email summarization method, called ClueWordSummarizer (CWS), based on a novel concept called clue words. A clue word of a node is a word (modulo stemming) that also appears in its parent node(s) and/or child node(s) in the quotation graph. It is important to note that a clue word simultaneously takes into account (part of) the content and the structure of the quotation graph. Moreover, CWS can produce summaries of any size requested by the user. It is an open question how humans summarize email conversations. Thus, in Section 5, we present the results of a user study on summarizing 20 conversations from the Enron data set. Not only does this study provide a gold standard to evaluate CWS and other summarization methods, it also sheds light on the importance of clue words and hidden emails to human summarizers. In Section 6, we evaluate the effectiveness of CWS on the Enron data set and compare it with other summarization approaches. Our preliminary results show that both the quotation graph and clue words are valuable for summarizing email conversations.

2. RELATED WORK
Email summarization can be viewed as a special case of multi-document (MD) summarization. Radev et al. develop MEAD, which gives a score to each sentence based on its similarity to the TFIDF centroid of the whole document set and other properties such as position in a document, sentence length and inter-sentence similarity [7]. Erkan et al. [5] develop LexPageRank to rank sentences based on eigenvector centrality: they compute a sentence linkage matrix from sentence similarity and use this matrix with the well-known PageRank algorithm. Wan et al. [16] generate an affinity graph from multiple documents and use it for summarization, considering both information richness and sentence novelty based on the sentence affinity graph. However, MD summarization methods, when applied to emails, do not take into account the key differences between emails and conventional documents, including the referential structure of conversations, the existence of hidden emails and the high variability of writing styles. Section 6 compares CWS with MEAD for email summarization.
Rambow et al. apply a machine learning approach to email summarization [9]. They use RIPPER as a classifier to determine which sentences should be included in a summary; the features used for learning include linguistic features and features describing the email and the threading structure. Such an approach requires a large number of positive examples and cannot produce summaries of varying length based on the user's request. It is also not clear how this approach can handle hidden emails. Section 6 compares CWS with RIPPER.
Wan et al. study decision-making summarization for email conversations [15]; email threading is used.
Among the various sets of features explored, their experiments show that a centroid-based method is effective.
In our earlier studies, we focused on the reconstruction of hidden emails [3, 2]. The focus here is completely different: generating summaries of conversations, regardless of whether hidden emails are present. In [3, 2], we use a precedence graph to reconstruct hidden emails. The fragment quotation graph used here differs in at least two ways. First, the nodes are different, as a fragment quotation graph creates nodes for both new and hidden fragments. More importantly, the edges in a precedence graph capture the textual ordering of the nodes within one hidden email, whereas the edges in a fragment quotation graph reflect the referential relationships among multiple emails.
As for extracting conversations, Yeh et al. study how to use quotation matching to construct email threads [6]. Their experiments show a higher recall than header-based threading, which supports our use of the quotation graph as a representation of an email conversation. Shrestha et al. propose methods to automatically identify question-answer pairs in an email thread [11]; their method may be useful for building the conversation structure for the purpose of email summarization. Agrawal et al. extract social networks from newsgroups [8]. Stolfo et al. study behavior models of email users based on social network analysis of email correspondence [12]; they develop an email mining toolkit and use it to identify target emails without analyzing the email content.

3. BUILDING THE FRAGMENT QUOTATION GRAPH
A given email folder may contain multiple email conversations. To capture these different conversations, we assume that if one email quotes another, the two belong to the same conversation. We use a fragment quotation graph to represent conversations. A fragment quotation graph G = (V, E) is a directed graph, where each node u ∈ V is a text unit in the email folder and an edge (u, v) means that node u is in reply to node v. We are aware that this framework cannot represent every kind of email conversation; for example, there are cases where the original email is not quoted by the replies, and cases where the quotation is irrelevant to the topic discussed. Nonetheless, we believe that fragment quotation graphs can be applied to many practical situations.

3.1 Identifying Quoted and New Fragments
Quotation identification is itself a research problem [13]. Here we assume that there exist one or more quotation markers (e.g., ">") that are used as a prefix of every quoted line. We define the quotation depth of a line as the number of quotation markers ">" in its prefix; the quotation depth reflects the number of times the line has been quoted since the original message containing it was sent. A quoted fragment is a maximal contiguous block of quoted lines having the same quotation depth. A new fragment is a maximal contiguous block of lines that are not prefixed by the quotation markers. In other words, an email can be viewed as an alternating sequence of quoted and new fragments. For convenience, we use Mi.quote and Mi.new to denote the set of quoted and new fragments in email Mi respectively, and Mi.frag to denote the sequence of all fragments, both quoted and new, in their textual order in Mi. We denote the quotation depth of a fragment F by F.qtDepth; the quotation depth of a new fragment is defined as 0.
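The fragment identification step of Section 3.1 can be illustrated with a small sketch. The following Python code is not the authors' implementation; it assumes ">" as the only quotation marker and a simple line-based email body, and it groups maximal runs of lines with the same quotation depth into fragments.

```python
import re
from dataclasses import dataclass

# Assumed quotation marker: one or more leading ">" characters per quoted line.
QUOTE_PREFIX = re.compile(r"^\s*((?:>\s?)+)")

def quotation_depth(line: str) -> int:
    """Number of '>' markers prefixing the line (0 for a new, unquoted line)."""
    m = QUOTE_PREFIX.match(line)
    return m.group(1).count(">") if m else 0

@dataclass
class Fragment:
    depth: int            # corresponds to F.qtDepth; 0 means a new fragment
    lines: list

def split_fragments(body: str) -> list:
    """Group maximal runs of consecutive lines sharing the same quotation depth."""
    fragments = []
    for line in body.splitlines():
        if not line.strip():
            continue                      # skip blank lines for simplicity
        d = quotation_depth(line)
        text = QUOTE_PREFIX.sub("", line)
        if fragments and fragments[-1].depth == d:
            fragments[-1].lines.append(text)
        else:
            fragments.append(Fragment(depth=d, lines=[text]))
    return fragments

if __name__ == "__main__":
    email = "Sounds good, let's meet Friday.\n> Can we move the meeting?\n>> Here is the agenda."
    for f in split_fragments(email):
        print(f.depth, " ".join(f.lines))
```

On the toy email above this prints one new fragment (depth 0) and two quoted fragments (depths 1 and 2), mirroring the alternating structure described in the text.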
3.2 Creating Nodes
Given an email folder Fdr = {M1, ..., Mn}, we construct a fragment quotation graph G as follows. After the aforementioned identification of quoted and new fragments in each email Mi in Fdr, the first step is to identify the distinct fragments, each of which will be represented as a node in the graph. Note that when a user quotes an email, she can edit the quoted fragments as free text: she may quote the exact sequence verbatim, or she may delete some parts of it. For example, a new fragment from an earlier email may be NF1 = (F1, F2, F3). In a later email, a quoted fragment may appear as QF2 = (F1, F3). In yet another email, it may appear as QF3 = (F1, F4, F3), where fragment F4 was quoted from some other email. The point is that a fragment can be quoted by many emails, with each user possibly performing different edits to it. The goal of this step is to avoid as much duplication as possible among the quoted and new fragments. To do so, quoted and new fragments from all emails in Fdr are matched against each other to identify overlaps. We say that there is an overlap between two fragments if they share a sufficiently long common substring. In this process, fragments may be split. Using the examples above, when NF1 is matched against QF2, the fragments F1, F2 and F3 are identified (assuming they are longer than the overlap threshold), and NF1 and QF2 are then replaced by F1, F2 and F3. When QF3 is processed, F4 is identified as well; this also shows how hidden fragments (e.g., F4) can be identified. At the end of this process, every distinct fragment is represented as a node in the fragment quotation graph G. Note that while the processing is quadratic in the number of emails in Fdr, it is designed to extract the conversations as accurately as possible; indexing techniques similar to those developed in [2] [6] can be used to significantly reduce the processing time.

[Figure 1: A Real Example. (a) A conversation involving 6 emails E1 to E6, each shown as its sequence of new and quoted fragments; (b) the resulting fragment quotation graph over the 10 distinct fragments a, ..., j.]

Figure 1(a) shows a real example of a conversation from a benchmark data set involving 6 emails. For ease of presentation and due to space limits, we do not show the original content and abbreviate each email as a sequence of fragments. In the fragment identification step, all new and quoted fragments are identified. For instance, E3 is decomposed into 3 fragments: new fragment c and quoted fragments a and b (which have different quotation depths). Similarly, 4 fragments are identified in each of E4, E5 and E6. Then, in the node creation step, overlaps are identified, fragments are split if necessary (e.g., fragment "gh" in E5 is split into g and h when matched with E6), and duplicates are removed. At the end, the 10 distinct fragments a, ..., j give rise to the 10 nodes in the graph shown in Figure 1(b). Note that amongst all the quoted fragments, f never appears as a new fragment and is hence labeled a hidden fragment.
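A rough sketch of the overlap test behind node creation follows. It is a simplification under stated assumptions: the word-level granularity, the threshold of 10 words, and the greedy grouping are choices made here for illustration, and unlike the method described above it does not split fragments on partial overlaps.

```python
from difflib import SequenceMatcher

def overlaps(frag_a: str, frag_b: str, min_words: int = 10) -> bool:
    """Two fragments overlap if they share a sufficiently long common substring
    (here measured in words, which is an assumption of this sketch)."""
    a, b = frag_a.split(), frag_b.split()
    m = SequenceMatcher(a=a, b=b, autojunk=False).find_longest_match(0, len(a), 0, len(b))
    return m.size >= min_words

def deduplicate(fragments: list, min_words: int = 10) -> list:
    """Greedily group fragments that overlap with a group's first member;
    each resulting group would become one node of the quotation graph."""
    groups = []
    for frag in fragments:
        for group in groups:
            if overlaps(group[0], frag, min_words):
                group.append(frag)
                break
        else:
            groups.append([frag])
    return groups
```

A quadratic all-pairs comparison like this matches the complexity noted in the text; the indexing techniques of [2] [6] mentioned above would be the way to avoid it on large folders.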
3.3 Creating Edges
In general, it is difficult to determine whether one fragment is actually replying to another. In this paper, we make the following simplifying assumption: any new fragment is a potential reply to its neighboring quotations, i.e., the quoted fragments immediately preceding or following it. In that case, an edge is added between the two corresponding nodes in the fragment quotation graph. Recall that in the fragment identification step, an email is decomposed into an alternating sequence of new and quoted blocks. For example, E6 in Figure 1(a) is decomposed into quoted g, new i, quoted h and new j. In general, partly because of possible differences in quotation depth, a block may contain multiple fragments. Thus, for the general situation in which a quoted block QSp precedes a new block NS, which is in turn followed by a quoted block QSf, we create an edge (v, u) for each fragment u ∈ (QSp ∪ QSf) and each v ∈ NS.
Consider E3 in Figure 1(a): it yields the edges (c, b) and (c, a). As pointed out below, Figure 1(b) only shows the minimum equivalent graph with all redundant edges removed; thus, because of the edge (b, a), the edge (c, a) is not included in Figure 1(b). Similarly, for E6 there are two edges from node i, to g and to h, while there is only a single edge from j to h. Other edges are added for the same reason, except for the edges involving the hidden fragment f. For a hidden fragment, additional edges are created within the quoted block, following the same neighboring-quotation assumption; for instance, because of E5, edges are added from f to d and to e. We use the minimum equivalent graph, which is transitively equivalent to the original graph, as the fragment quotation graph.
Recall that a folder may contain emails involved in multiple distinct conversations. In the fragment quotation graph, each such conversation is reflected as a weakly connected component. Figure 1(b) shows the fragment quotation graph of the conversation in Figure 1(a) with all redundant edges removed. In contrast, if threading is done at the coarse granularity of entire emails, as adopted in many studies, the threading would be a simple chain from E6 to E5, E5 to E4, and so on. This example clearly shows the advantage of using fragment quotation graphs.
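The edge construction under the neighboring-quotation assumption, followed by redundant-edge removal, can be sketched as follows. This is an illustrative sketch, not the authors' code: the input format (per email, an ordered list of blocks tagged "new" or "quoted" with the fragment ids they contain) is assumed, and the transitive reduction of the networkx library stands in for the minimum equivalent graph.

```python
import networkx as nx

def build_quotation_graph(emails):
    """emails: list of emails, each an ordered list of ("new"|"quoted", [fragment ids])."""
    g = nx.DiGraph()
    for blocks in emails:
        for i, (kind, frags) in enumerate(blocks):
            g.add_nodes_from(frags)
            if kind != "new":
                continue
            neighbours = []
            if i > 0 and blocks[i - 1][0] == "quoted":
                neighbours += blocks[i - 1][1]     # quoted block just before the new block
            if i + 1 < len(blocks) and blocks[i + 1][0] == "quoted":
                neighbours += blocks[i + 1][1]     # quoted block just after the new block
            for v in frags:
                for u in neighbours:
                    g.add_edge(v, u)               # v is (potentially) in reply to u
    # Keep only the minimum equivalent graph when the graph is acyclic.
    return nx.transitive_reduction(g) if nx.is_directed_acyclic_graph(g) else g

# E3 of Figure 1: new fragment c followed by a quoted block containing b and a.
example = [[("new", ["c"]), ("quoted", ["b", "a"])]]
print(build_quotation_graph(example).edges())
```

With only E3 as input both (c, b) and (c, a) survive; once the edge (b, a) from E2 is added, the transitive reduction drops (c, a), matching the discussion of Figure 1(b) above.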
4. EMAIL SUMMARIZATION METHODS
Once the fragment quotation graph corresponding to a conversation has been extracted, the remaining task is to identify the most informative sentences to include in the summary. In this section, we first present a novel method called ClueWordSummarizer. We then briefly describe how two existing approaches, MEAD and RIPPER, can be applied to a fragment quotation graph.

4.1 Clue Words
It is well known in linguistics that coherent text tends to be lexically cohesive: words related to the current topic tend to reoccur in successive utterances. In our preliminary analysis of email folders, we found this to be true also for fragments in the quotation graph. That is, some words in a fragment reoccur in the fragments replying to it. We call those words clue words. A clue word in a node (fragment) F is a word which also appears in a semantically similar form in a parent or a child node of F in the fragment quotation graph. Note that the definition of a clue word is established by examining the textual content of a fragment; at the same time, a clue word takes into account the referential relationship between two nodes connected by an edge. Thus, we believe that clue words are important for summarizing email conversations.

Figure 2: Example of Clue Words
(a) "Below are the proposed discounts we discussed with Ken Lay this afternoon: ... If the midpoint (value between PG&E and Enron settlement offers) is acceptable from a liquidity and P+L standpoint, propose countering at a discount of $123 million (move half way to midpoint) to provoke a counter offer. Intent is to settle at midpoint discount of $161 million. The Excel file is attached."
(b) "Ken Lay called Rick Buy and I up to his office talk about settlement just now. He would like to settle for liquidity/good news. Rick Buy is going to discuss with Whalley."

Figure 2 shows a real example of two nodes from a conversation in the Enron data set. Fragments (a) and (b) are two adjacent nodes, with (b) the parent node of (a). Between the two nodes there are 5 clue words: Ken, Lay, settle, discuss and liquidity. Note that clue words do not need to reoccur verbatim: the clue words "discussed" and "settle" in (a) reoccur as "discuss" and "settlement" in (b), and there are also synonyms, such as "discuss" in (a) and "talk" in (b). From a preliminary analysis, we observe 3 major kinds of reoccurrence:
· the same root (stem) with different forms, e.g., "settle" vs. "settlement" and "discuss" vs. "discussed" as in the example above;
· synonyms/antonyms or words with similar/contrary meaning, e.g., "talk" vs. "discuss" and "peace" vs. "war";
· words that have a looser semantic link, e.g., "deadline" with "Friday morning".
In our experimentation, we observe that stemming occurs the most frequently among the three types discussed above. Thus, in this paper, we only apply stemming to the identification of clue words. We use Porter's stemming algorithm to compute the stem of each word, and use the stems to judge reoccurrence.
Several studies in the NLP literature have explored the reoccurrence of similar words within one document due to text cohesion. The idea has been formalized in the construct of lexical chains, i.e., sequences of semantically related words appearing in contiguous sentences within a document. Some approaches use lexical chains to generate single-document summaries [1]. While clue words and lexical chains both rely on lexical cohesion, the two concepts are quite different with respect to the kind of linkages considered. For lexical chains, the "chain" is based on similarities between lexical items in contiguous sentences within a single document. In contrast, for clue words, the linkage is based on the conversation structure represented by the quotation graph. In other words, the "chain" in clue words is not only "lexical" but also "conversational", and typically spans several emails (i.e., documents).
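Since only stemming is used to judge reoccurrence, the clue-word test reduces to comparing Porter stems across adjacent nodes. The following is a minimal sketch of that test; the tiny tokenizer and the example stop-word list are assumptions of the sketch rather than details taken from the paper.

```python
from nltk.stem import PorterStemmer   # Porter's algorithm, as named in the paper

_stemmer = PorterStemmer()
STOP = {"the", "a", "an", "to", "of", "and", "is", "are", "with", "we"}

def stems(text: str) -> set:
    """Lower-case, strip punctuation, drop stop words, and stem the remaining words."""
    words = [w.strip(".,:;!?()\"'").lower() for w in text.split()]
    return {_stemmer.stem(w) for w in words if w and w not in STOP}

def clue_words(fragment: str, neighbours: list) -> set:
    """Words of `fragment` whose stem also occurs in a parent or child fragment."""
    neighbour_stems = set().union(*(stems(n) for n in neighbours)) if neighbours else set()
    return {w for w in fragment.split()
            if _stemmer.stem(w.strip(".,:;!?()\"'").lower()) in neighbour_stems}

parent = "He would like to settle for liquidity."
child = "Intent is to settle at midpoint discount; discussed liquidity with Ken Lay."
print(clue_words(child, [parent]))   # 'settle' and 'liquidity' reoccur via their stems
```

As the example shows, "settle"/"settle" and "liquidity"/"liquidity" match through their stems, while a synonym pair such as "talk"/"discuss" would not be caught by this stemming-only test, which is exactly the simplification adopted in the paper.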
Figure 3: Algorithm CWS
Input: fragment quotation graph G = (V, E), summary length k.
Output: a sequence of sentences SUM = [s1; ...; sk].
1. For each node u ∈ V, let u.StemSet denote the multiset of the stems of all words in u. For every sentence s in u, do the following:
   (a) Tokenize s into a sequence of words W.
   (b) Remove all stop words from W.
   (c) For each word w in W, compute its stem wstem with Porter's stemming algorithm.
   (d) Insert wstem into u.StemSet.
2. For each sentence s in each node u ∈ V, compute ClueScore(s) as follows. For each non-stop word w, compute ClueScore(w, u) as the number of occurrences of wstem in x.StemSet over all nodes x that are parents or children of u. The ClueScore of sentence s is then ClueScore(s) = Σ_{w ∈ s} ClueScore(w, u).
3. Rank all sentences according to their ClueScore and select the top k as SUM.

4.2 Algorithm CWS
Algorithm ClueWordSummarizer (CWS) uses clue words as the main feature for email summarization. The assumption is that words which reoccur between parent and child nodes are more likely to be relevant and important to the conversation. A skeleton of algorithm CWS is presented in Figure 3. To evaluate the significance of clue words quantitatively, CWS uses ClueScore(CW, F) to represent the importance of a clue word CW in fragment F:
ClueScore(CW, F) = Σ_{P ∈ parent(F)} freq(CW, P) + Σ_{C ∈ child(F)} freq(CW, C),
where freq denotes the frequency of the clue word in a fragment. The formula generalizes to the sentence level:
ClueScore(s) = Σ_{CWi ∈ s} ClueScore(CWi, F),
where s is a sentence in fragment F. For the example in Figure 2, consider the ClueScore of the sentence "If the midpoint ... counter offer." in fragment (a): ClueScore("liquidity") = 1 and ClueScore("settlement") = 2, because "liquidity" appears once and "settlement" appears twice in fragment (b). Thus, the ClueScore of this sentence is 3. To select the sentences to include in a summary of k sentences, algorithm CWS first tokenizes each sentence in every node into a multiset of words. After the stop words are removed and the word stems are obtained, the ClueScore formulas above are applied first to words and then to sentences. CWS returns the sentences with the highest ClueScore as the summary.

4.3 MEAD: a centroid-based multi-document summarizer
Algorithm CWS described above uses the ClueScore to evaluate the importance of each sentence based on the quotation graph. Since the ClueScore is computed from the reoccurrence of clue words in the parent and child nodes, it may tend to capture local salience and miss more global salience related to the whole conversation. It is therefore reasonable to assume that additional information is needed. To satisfy this need, we explore using MEAD, a centroid-based multi-document summarizer, to generate email summaries. As the first step, MEAD computes the centroid of all emails in one conversation; a centroid is a vector of the words' average TFIDF values over all documents. MEAD compares each sentence s with the centroid and assigns it a score equal to the sum of the centroid values of the words shared by the sentence and the centroid: C(s) = Σ_{w ∈ s} C_w. In addition to the centroid, MEAD also uses other features (e.g., the sentence position and the similarity to already selected sentences) to compute the sentence score, MEADScore. All sentences are then ranked by MEADScore, and the top ones are included in the summary. The details can be found in [7].
Compared with MEAD, CWS may appear to use more "local" features. But notice that locality in CWS is defined with respect to the quotation graph: two sentences sharing clue words need not be textually proximal, but are brought together in a parent and a child node of the quotation graph. Because of this, CWS may achieve a good compromise between local and global information, which may explain its promising performance described in Section 6.
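The ClueScore ranking of Section 4.2 and Figure 3 can be condensed into a short sketch. For brevity this sketch omits stemming and stop-word removal (both are part of the actual algorithm), and the node representation (a dict with sentence lists and parent/child id lists) is an assumption made here.

```python
from collections import Counter

def word_freq(node) -> Counter:
    """Frequency of each (lower-cased) word over all sentences of a node."""
    return Counter(w.lower() for s in node["sentences"] for w in s.split())

def clue_score_word(word, node, graph) -> int:
    """Occurrences of the word in all parent and child nodes of `node`."""
    return sum(word_freq(graph[n])[word.lower()]
               for n in node["parents"] + node["children"])

def cws_summary(graph, k: int) -> list:
    """Score every sentence by the sum of its words' ClueScores and return the top k."""
    scored = []
    for node in graph.values():
        for sent in node["sentences"]:
            score = sum(clue_score_word(w, node, graph) for w in sent.split())
            scored.append((score, sent))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [s for _, s in scored[:k]]

# Toy conversation: node "a" replies to its parent node "b".
graph = {
    "a": {"sentences": ["propose countering to provoke a counter offer",
                        "intent is to settle at the midpoint"],
          "parents": ["b"], "children": []},
    "b": {"sentences": ["he would like to settle for liquidity"],
          "parents": [], "children": ["a"]},
}
print(cws_summary(graph, 1))
```

Because the score of a sentence only depends on its own node's parents and children, the method can produce a ranking (and hence a summary) of any requested length, which is the property emphasized in the text.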
4.4 Hybrid ClueScore with MEADScore
As stated above, ClueScore and MEADScore tend to represent different aspects of the importance of a sentence, so it is natural to combine the two methods. In order to avoid either score overwhelming the other, we standardize both scores as follows. We compute σ_ctr, the standard deviation of the centroid values of all words, and use centroid(w)/σ_ctr as the centroid value in the hybrid score. Similarly, we standardize the ClueScore of each word w ∈ F as ClueScore(w, F)/σ_clue, where σ_clue is the standard deviation of all positive ClueScore(w, F) values over all words in every fragment F. For ease of presentation, we use the same symbols for the standardized scores. We then use a linear combination of ClueScore and MEADScore as the hybrid score of a sentence s. Let λ ∈ [0, 1] denote the weight of ClueScore in the final score; λ = 0 and λ = 1 correspond to pure MEADScore and pure ClueScore respectively:
LinearClueMEAD(s) = λ · ClueScore(s) + (1 - λ) · MEADScore(s).

4.5 Summarization Using RIPPER
In [9], Rambow et al. propose using machine learning to extract representative sentences from the original emails to form a summary. They use the RIPPER system to induce a classifier that determines whether a given sentence should be included in the summary. RIPPER is trained on a corpus in which each sentence is described by a set of 14 features and annotated with the correct classification (i.e., whether or not it should be included in the summary). The features describing each sentence comprise linguistic features (e.g., similarity to the centroid) and features describing the email and the threading structure (e.g., the number of recipients of the email containing the sentence). They use 3 incremental feature subsets out of the 14 features, starting with 8 "basic" linguistic features and successively adding 2 ("basic+") and 4 ("basic++") additional structural features. In the end, the results with all features show that including additional features based on the emails and the thread structure does improve classification accuracy.
We adopt the framework proposed in [9] with only the following exception. Since some of the features used in [9] are based on the email thread structure while our algorithm is based on the fragment quotation graph, we slightly change some features to fit the quotation graph. For example, the number of direct responses to an email is replaced by the number of direct responses to the fragment (node) in which sentence s is contained.

5. RESULT 1: USER STUDY
In order to compare different approaches to email summarization, a gold standard is needed. In practice, for comparing extractive summarizers, we need to know which sentences a human summarizer would extract as most important from a target email corpus. Having such a gold standard also allows us to verify our assumptions about what information and algorithms an effective extractive email summarizer should rely on. In particular, we verify whether some important sentences come from hidden emails and whether ClueScore correlates with sentence importance. A few gold standards have been developed in previous work [9] [15], but unfortunately they are not public. In addition, those gold standards each involve only 2 human summarizers, and it is not clear whether personal preferences are avoided. We therefore build our own in the user study described below.
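Referring back to the hybrid score of Section 4.4, the following sketch shows one way to standardize two score sets by their standard deviations and mix them linearly. It is an illustration only: the sentence ids and numbers are made up, and for brevity the standardization is applied directly to sentence-level scores, whereas the paper standardizes at the word level before aggregating.

```python
import statistics

def standardize(scores: dict) -> dict:
    """Divide every score by the standard deviation of the score set (sketch-level)."""
    sd = statistics.pstdev(scores.values())
    return {k: (v / sd if sd > 0 else 0.0) for k, v in scores.items()}

def linear_clue_mead(clue: dict, mead: dict, lam: float) -> dict:
    """LinearClueMEAD(s) = lam * ClueScore(s) + (1 - lam) * MEADScore(s)."""
    clue, mead = standardize(clue), standardize(mead)
    return {s: lam * clue.get(s, 0.0) + (1 - lam) * mead.get(s, 0.0) for s in clue}

clue_scores = {"s1": 6.0, "s2": 1.0, "s3": 0.0}   # hypothetical ClueScores
mead_scores = {"s1": 0.2, "s2": 0.9, "s3": 0.4}   # hypothetical MEADScores
ranked = sorted(linear_clue_mead(clue_scores, mead_scores, lam=0.8).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Setting lam to 0 or 1 recovers pure MEAD or pure CWS, which is how the sweep over the ClueScore percentage in Section 6.5 is defined.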
5.1 Dataset Setup
We collected 20 email conversations from the Enron email dataset as the testbed and recruited 25 human summarizers to review them. The 20 conversations were selected as follows. From the 10 largest inbox folders in the Enron dataset, we discovered 296 email conversations. Since we are studying email summarization in a conversational context, we required that each conversation contain at least 4 emails; 38 conversations satisfied this requirement. Moreover, we wanted to study the effect of conversation structure, i.e., the fragment quotation graph, so the selected conversations also needed to represent different types of structure. According to our preliminary study, email threads can be divided into two types: single chains and thread hierarchies. In order to cover both structures, we randomly selected 4 single chains and 16 trees, which is close to their ratio among the 38 conversations.
Secondly, we recruited 25 human summarizers to review the 20 selected conversations. All 25 were undergraduate or graduate students at the University of British Columbia, with majors covering various disciplines including Arts, Law, Science and Engineering. Since many emails in the Enron dataset relate to business and legal issues, this variety among the human summarizers, especially those with business and legal backgrounds, is an asset to the user study. Each summarizer reviewed 4 distinct conversations in one hour; in this way, each conversation was reviewed by 5 different human summarizers.
For each conversation, the human summarizers were asked to generate a summary by directly selecting sentences from the original emails in that conversation. The generated summary contained about 30% of the original sentences. The selected sentences needed to represent the major ideas of the original conversation; the summarizers were told to select sentences such that, by reading the generated summary, a reader would in most cases not need to refer back to the original emails. Moreover, the summarizers were asked to classify each selected sentence as either essential or optional. Essential sentences are crucial to the conversation and have to be extracted in any case; optional sentences are not critical, but help readers understand the conversation if the given summary length permits. By distinguishing essential from optional sentences, we can separate the core information from the supporting information, since summarization accuracy changes considerably with the summary length. Given the roughly 30% of the total sentences that each summarizer includes, we can then analyze the results and identify the sentences that most human summarizers agree on.
5.2 Result of the User Study
Summarization is a subjective activity: different people make different choices, and we cannot expect all summarizers to agree on all sentences. In addition, by the design of the user study, essential sentences are more important than optional ones, so we give more weight to essential selections. We assign a GSValue to each sentence to evaluate its importance according to the human summarizers' selections. The score is designed as follows. For each sentence s, an essential selection counts 3 and an optional selection counts 1. Suppose ke summarizers classify s as essential and ko summarizers classify s as optional; then the GSValue of s is 3*ke + ko. We choose 3 as the coefficient for essential selections because we assume that two essential selections are more important than five optional selections; in this way, we reduce the side effect introduced by the 30% summary length required of each user. The GSValue of a sentence thus ranges from 0 to 15. If a sentence has a GSValue of at least 8, we take it as an overall essential sentence; a GSValue of 8 corresponds to 2 essential and 2 optional selections. Table 1 shows the possible selections corresponding to each GSValue of an overall essential sentence. Intuitively, an overall essential sentence requires that at least 4 human summarizers select it and that at least 2 of them select it as essential, so overall essential sentences are considered important by most summarizers. Out of the 741 sentences in the 20 conversations, 88 (about 12%) are overall essential. This also reveals that, in general, only about 12% of the sentences are agreed upon by most human summarizers.

Table 1: GSValues and Possible Selections for Overall Essential Sentences
GSValue   possible selections
8         2 ess + 2 opt
9         2 ess + 3 opt, or 3 ess
10        3 ess + 1 opt
11        3 ess + 2 opt
12        4 ess
13        4 ess + 1 opt
14        impossible score
15        5 ess
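The GSValue computation just described is simple enough to state directly in code. The annotation format (one "essential"/"optional" label, possibly empty, per annotator) is an assumption of this sketch.

```python
def gs_value(labels: list) -> int:
    """GSValue = 3 * (#essential selections) + 1 * (#optional selections)."""
    k_essential = labels.count("essential")
    k_optional = labels.count("optional")
    return 3 * k_essential + k_optional

def overall_essential(labels: list, threshold: int = 8) -> bool:
    """A sentence is overall essential when its GSValue is at least 8."""
    return gs_value(labels) >= threshold

# Five annotators: two essential + two optional selections give GSValue 8.
print(gs_value(["essential", "essential", "optional", "optional", ""]))       # 8
print(overall_essential(["essential", "optional", "optional", "", ""]))       # False (GSValue 5)
```

With 5 annotators the maximum is 15 (five essential selections) and 14 is unreachable, which is why Table 1 lists it as an impossible score.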
With the scoring system discussed above, we study the human summarizers' selections in detail with respect to two aspects: hidden emails and the significance of clue words.
Hidden emails - We sort all sentences by their GSValue; about 17% of the sentences in the top 30% are from hidden emails. In addition, among the 88 overall essential sentences, about 18% are from hidden emails. This clearly shows that hidden emails do carry crucial information and have to be considered by an email summarization system.
Significance of clue words - In the user study, we find that clue words are very common among the overall essential sentences in the gold standard. Since clue words are defined with respect to the fragment quotation graph, this validates our hypothesis that the conversation structure (the fragment quotation graph) plays an important role in email summarization. Having computed the ClueScore of all sentences, we also study whether ClueScore is consistent with the users' judgment. We do this by comparing, for each conversation, the average ClueScore of the overall essential sentences in the gold standard with the average ClueScore of all other sentences. Figure 4 shows how significant the ClueScore is for the overall essential sentences: it is a histogram of the distribution of these ratios, where the x-axis is the ratio (ranging from 0 to 18) and the y-axis is the number of conversations in that range. There are 3 conversations with a ratio of 1, meaning no difference; at the other extreme, there is one conversation with a ratio of 16. For the remaining conversations, the ratio falls within [2, 8], and the average ratio is 3.9. This suggests that the ClueScore differs significantly between sentences in the gold standard and those that are not. Although there exist non-gold-standard sentences with a very high ClueScore and gold-standard sentences with a very low ClueScore, in general the higher the ratio, the more likely it is that a sentence with a high ClueScore is included in the gold standard. This suggests that CWS, which uses ClueScore to rank sentences, can lead to better precision and recall.

[Figure 4: Ratio of ClueScore of Gold Standard Sentences and Non-Gold Standard Sentences (histogram: x-axis is the ratio of average ClueScores, y-axis is the number of conversations).]

6. RESULT 2: EMPIRICAL EVALUATION OF CWS
The user study provides us with a gold standard (GS) that identifies the overall essential sentences (i.e., GSValue ≥ 8) in the corpus. We apply three summarization methods to the 20 conversations and compare their results with GS, using precision, recall and F-measure to evaluate their accuracy.

6.1 Comparing CWS with MEAD and RIPPER
We start by applying the summarization approach based on RIPPER to our dataset. We try two different ways to train RIPPER and compare it with MEAD and CWS.
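The evaluation measures used throughout Section 6 can be computed as below, treating both the generated summary and the overall-essential gold standard as sets of sentence identifiers. The example ids and sets are hypothetical.

```python
def precision_recall_f1(selected: set, gold: set):
    """Precision, recall and F-measure of an extractive summary against a gold standard."""
    tp = len(selected & gold)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

gold_standard = {"s2", "s5", "s9"}       # hypothetical overall-essential sentences
summary = {"s2", "s3", "s9", "s11"}      # hypothetical CWS output at some summary length
print(precision_recall_f1(summary, gold_standard))   # (0.5, 0.667, 0.571) approximately
```

Because the gold standard covers only about 12% of the sentences, recall is uninformative for the very short (2%) summaries produced by RIPPER, which is why only precision is reported in Table 2 below.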
Thus, the gold standard sentences are highly consistent among the summarizers. The side effect is that only 12% sentences are selected as overall essential in the gold standard sentences. This is different from the gold standard in [9], where two human summarizers write their own summaries for all the conversations. The gold standard is generated by comparing the similarity of each sentence to the generated one. To verify the validity of the latter explanation, we relax the threshold for the gold standard and see whether RIPPER selects more sentences. When we reduce the GSValue threshold from 8 to 6, there are 19% gold standard sentences. The summary length of RIPPER increases to 8%. When we further reduce the threshold to 4, there are 30% Figure 4: Ratio of ClueScore of Gold Standard Sentences and Non-Gold Standard Sentences 6.1.1 Sentence-level training In sentence level training, we put all the sentences in the 20 conversations together. We use k-fold cross validation to train and test RIPPER, where k varies from 2 to 20. For each value of k, we rep eat the exp eriment 10 times and take the average. As describ ed b efore, there are 3 feature sets: "basic", "basic+" and "basic++". Figure 5 shows that additional features generally improve the accuracy. However, too many features may reduce the accuracy in some cases. This is different from the result in [9], where the full feature set ("basic++") gets the highest accuracy. For the results of RIPPER rep orted hereafter, it is based on using the feature set "basic++". 0 .1 0 .0 9 0 .0 8 0 .0 7 F - m easu r e 0 .0 6 0 .0 5 0 .0 4 0 .0 3 0 .0 2 0 .0 1 0 2 3 5 k - f o ld 10 15 20 basic basic+ basic++ Figure 5: F -measure of RIPPER with 3 Feature Sets Table 2 shows the precision of the summaries generated by all three methods. The reason why only precision is shown is that RIPPER surprisingly only selects 2% of the sentences to b e included in a summary. For such a low p ercentage, recall is not informative. Rememb er that CWS and MEAD are capable of generating summaries of any length requested by the user, we force MEAD and CWS to produce summaries of the same length, so as to provide a fair comparison. To make CWS and MEAD function prop erly when all the sentences from the 20 conversations are p ooled together, statistical standardization is applied to the ClueScore of sentences within each conversation b efore they are p ooled. In this way, we avoid selecting too many sentences from one conversation simply b ecause the ClueScore in one conversation is significantly higher than the rest. The first column of the table 97 WWW 2007 / Track: Data Mining 0.80 1 0.9 0.8 Session: Mining Textual Data MEAD CWS 0.70 0.60 0.50 0.6 Accuracy 0.5 0.4 0.3 0.2 0.1 0 Precision(M) Precision(C) Recall(M) Recall(C) F1(M) Accuracy measure and summarization method F1(C) recall 0.7 0.40 0.30 0.20 0.10 0.00 top-10% top-15% top-20% top-30% top-40% summary length (a) Recall 0.40 0.35 0.30 0.25 F-measure 0.20 0.15 0.10 0.05 0.00 top-10% top-15% top-20% summary length top-30% top-40% MEAD CWS Figure 6: Box-plot for MEAD(M) and CWS(C) with sumLen = 15% gold standard sentences, and the summary length of RIPPER increases to ab out 25%. Because the training data is randomly selected, the numb er of p ositive examples in the training data generally agrees with the gold standard p ercentage. This shows that RIPPER is sensitive to the availability of p ositive examples. 
From the above experiments, we can conclude that both CWS and MEAD have two advantages over RIPPER. First, they are guaranteed to generate a summary for each conversation according to the needs of the user. Second, CWS and MEAD show better accuracy even when they are restricted to match the summary length of RIPPER. A natural idea is to include ClueScore as an additional feature in the RIPPER feature set; our experiments show little improvement, and in some cases the results are even worse. For lack of space, we do not include the details.

6.2 Comparing CWS with MEAD
In the previous experiments, in order to match the summary length of RIPPER, we restricted the summary length of CWS and MEAD to a very low 2%. Below we compare CWS and MEAD at various lengths. The default percentage we pick is 15%, as this roughly corresponds to the gold standard percentage. Figure 6 shows the box-plot of the effectiveness of CWS and MEAD when the summary length is 15%; it describes the statistical distribution of the two methods across all 20 conversations. The x-axis represents the different evaluation measures (precision, recall and F-measure) for MEAD and CWS; e.g., Precision(M) denotes the precision of MEAD with 15% summary length. The y-axis shows the values in decimals. The line in the middle of each box is the median of the values; the upper and lower edges of the box show the 75th and 25th percentiles of the data, and the range between them is the interquartile range. The dotted line outside the box shows the range of the data outside the interquartile range; its length is at most the interquartile range. The data points beyond the two tails are outliers, drawn as circles. The figure shows that both the median and the box of CWS are higher than those of MEAD, and hence CWS appears to be more accurate than MEAD at a summary length of 15%. We also compute Student's t-test for the accuracy of MEAD and CWS: the p-value is marginally significant at 0.077 for precision, significant at 0.049 for recall and marginally significant at 0.053 for F-measure. Thus, we have at least 90% confidence that the means of precision, recall and F-measure of MEAD and CWS differ when sumLen = 15%.

[Figure 6: Box-plot for MEAD(M) and CWS(C) with sumLen = 15%.]

In the experiments above, we only compared the accuracy of MEAD and CWS at a fixed summary length. In the following, we compare both methods at different summary lengths. Figure 7 shows the recall and F-measure of both methods with the summary length varying from 8% to 40%. In Figures 7(a) and 7(b), the recall and F-measure of both MEAD and CWS improve as the summary length increases; RIPPER may thus be "missing the boat" by not being able to produce summaries of varying lengths. When the summary length is no more than 30%, CWS has a higher recall and F-measure than MEAD, while MEAD appears higher in recall and F-measure when the summary length is greater than 30%. The largest difference, greater than 0.1, occurs at a summary length of 15% for all accuracy measures, including precision, recall and F-measure. We can also see that the highest average F-measure occurs at a summary length of about 30%. Note that when the requested summary length is larger, the summarization task is easier, and both methods appear to do fine. For shorter summaries, which pose a harder summarization problem, CWS dominates, especially when the summary length is close to the size of the gold standard. Recall that CWS focuses on local importance (with respect to the quotation graph), while MEAD focuses on global importance; this result may reflect the relative value of local and global importance at different summary lengths.

[Figure 7: Accuracy of CWS and MEAD with various sumLen: (a) recall, (b) F-measure.]
6.3 Effectiveness of the Fragment Quotation Graph
As a sanity check, we compare CWS and MEAD with two randomized selection algorithms. First, we randomly select k% of the sentences from each conversation and compute the precision, recall and F-measure; this method is labeled "Random-sent" in Figure 8. Clearly, CWS dominates this random approach; for that matter, combining Figure 8 with Figure 7(b) indicates that MEAD also dominates it. The second random approach is more intelligent. We create a random graph Grandom to replace the fragment quotation graph, in the hope of evaluating the utility of the fragment quotation graph itself. The random graph contains exactly the same set of nodes as the fragment quotation graph, but its edges are randomly generated; we guarantee that the minimum equivalent graph of Grandom contains the same number of edges as the fragment quotation graph. For each fragment quotation graph we generate 3 random graphs and use the average as the accuracy.

[Figure 8: F-measure of CWS and Two Random Alternatives ("CWS", "Random-graph", "Random-sent").]

Figure 8 shows the accuracy of CWS under the fragment quotation graph and under the random graph, labeled "Random-graph". The gap between "Random-graph" and "Random-sent" indicates the contribution of node identification and ClueScore; in all cases, the gap is significant. The gap between "Random-graph" and CWS indicates the importance of the edges of the fragment quotation graph; in all cases, CWS gives a higher value. Interestingly, combining Figure 8 with Figure 7(b) indicates that "Random-graph" outperforms MEAD in some cases. This set of experiments clearly shows the value of the fragment quotation graph for summarizing conversations.
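The "Random-graph" baseline can be sketched as follows. The retry loop used here to enforce the edge-count constraint is an assumption of the sketch (the paper does not say how the constraint was implemented), and networkx's transitive reduction again stands in for the minimum equivalent graph.

```python
import random
import networkx as nx

def random_graph_like(real: nx.DiGraph, seed: int = 0) -> nx.DiGraph:
    """Same node set as `real`, random edges, and a minimum equivalent graph
    with the same number of edges as `real` (assumed already reduced)."""
    rng = random.Random(seed)
    nodes = list(real.nodes())
    target = real.number_of_edges()
    candidates = [(u, v) for u in nodes for v in nodes if u != v]
    while True:
        g = nx.DiGraph()
        g.add_nodes_from(nodes)
        g.add_edges_from(rng.sample(candidates, min(target, len(candidates))))
        if nx.is_directed_acyclic_graph(g):
            reduced = nx.transitive_reduction(g)
            if reduced.number_of_edges() == target:
                return reduced

# Small example: a three-edge quotation graph and one random counterpart.
real = nx.DiGraph([("c", "b"), ("b", "a"), ("d", "c")])
print(random_graph_like(real).edges())
```

Running CWS on such randomized graphs (averaged over a few seeds, as done with 3 random graphs per conversation above) isolates how much of CWS's accuracy comes from the real edge structure rather than from node identification and ClueScore alone.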
6.4 Examining Individual Conversations
In addition to measuring the average accuracy, Figure 9 shows the accuracy of CWS in each conversation when the summary length is 15%; the x-axis is the 20 conversations and the y-axis is the accuracy in terms of precision, recall and F-measure. There are 5 conversations for which the summary contains no overall essential sentence; for 4 of them, MEAD cannot select any overall essential sentence either. Table 3 shows the total number of sentences, the number of gold standard sentences and their GSValues for these 5 conversations. We can see that they either have a smaller number of sentences or have few gold standard sentences (dasovich-j.3 has only one), and that even their gold standard sentences have relatively small GSValues. We can conclude that even human summarizers are not consistent in their summaries of those conversations. This shows that email summarization is a challenging task when the summary length is short, due to characteristics of emails such as their length.

Table 3: Zero Selection Folders
folder name      # of sent.   # of GS sent.   GSValues
buy-r.0          25           4               9, 9, 8, 8
buy-r.4          15           1               8
dasovich-j.3     40           1               9
nemec-g.6        23           2               11, 8
shackleton-s.2   50           5               13, 10, 9, 9, 9

[Figure 9: Accuracy of CWS on Individual Folders (precision, recall and F-measure per conversation).]

6.5 Hybrid Methods
In this section, we study whether the linear combination of ClueScore and MEADScore can improve the accuracy. Figure 10 shows how the F-measure changes with the percentage of ClueScore, from 0% (pure MEAD) to 100% (pure CWS). It shows that when the summary length is at most 20%, the linear combination cannot improve on CWS (pure 100% CWS is the maximum). In contrast, when the summary length is greater than 20%, there is a benefit to combining MEAD and CWS (the maximum lies between 0% and 100%); the accuracy usually reaches its maximum when the percentage of ClueScore is 80%. In other words, CWS can benefit from a little MEAD when the summary is relatively long.

[Figure 10: Average F-measure of the Linear Combination of CWS and MEAD, for summary lengths from top-10% to top-50%.]

7. CONCLUSIONS AND FUTURE WORK
In this paper, we study how to generate accurate email summaries. We analyze the characteristics of emails and study the email conversation structure, which we argue has not been sufficiently investigated in previous research on email summarization. We build a novel structure, the fragment quotation graph, to represent the conversation structure; this graph includes hidden emails and can represent the conversation in more detail than a simple threading structure. Based on the fragment quotation graph, we also develop a new summarization approach, CWS, to select important sentences from an email conversation.
Our experiments with the Enron email dataset not only indicate that hidden emails have to be considered for email summarization, but also show that CWS generates more accurate summaries than the other methods compared. The CWS approach relies on a simple algorithm and is very easy to implement, yet it appears to work better than other existing approaches. Since CWS is built on the fragment quotation graph, we believe this can be at least partially attributed to the use of the fragment quotation graph; our experiments on the random graphs also support this. Others have also argued that the conversation structure is very important for email summarization; we claim that it deserves even more attention. Furthermore, we believe it is promising to combine CWS with other methods.
Our future plans include improving the fragment quotation graph generation with more sophisticated linguistic analysis, for example to decide whether or not an edge should be added. In order to verify the generality of our findings, we are also working on evaluating our methods on different real-life data sets; creating the gold standard for a large real-life data set requires a lot of effort. Last but not least, we want to explore how to combine CWS with several machine learning algorithms. The low selection rate of RIPPER does not mean that its features are not useful; in a preliminary study, we find that different classifiers have very different selection rates and accuracy. We will look into this issue in the near future.

8. REFERENCES
[1] Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings of the Intelligent Scalable Text Summarization Workshop (ISTS'97), ACL, Madrid, Spain, 1997.
[2] Giuseppe Carenini, Raymond T. Ng, and Xiaodong Zhou. Scalable discovery of hidden emails from large folders. In ACM SIGKDD'05, pages 544-549, 2005.
[3] Giuseppe Carenini, Raymond T. Ng, Xiaodong Zhou, and Ed Zwart. Discovery and regeneration of hidden emails. In ACM SAC'05: Proceedings of the 2005 ACM Symposium on Applied Computing, pages 503-510, 2005.
[4] Derek Lam, Steven L. Rohall, Chris Schmandt, and Mia K. Stern. Exploiting e-mail structure to improve summarization. In CSCW'02 Poster Session, 2002.
[5] Gunes Erkan and Dragomir R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR), 22:457-479, 2004.
[6] Jen-Yuan Yeh and Aaron Harnly. Email thread reassembly using similarity matching. In Third Conference on Email and Anti-Spam (CEAS), July 27-28, 2006.
[7] Dragomir R. Radev, Hongyan Jing, Malgorzata Stys, and Daniel Tam. Centroid-based summarization of multiple documents. Information Processing and Management, 40:919-938, December 2004.
[8] Rakesh Agrawal, Sridhar Rajagopalan, Ramakrishnan Srikant, and Yirong Xu. Mining newsgroups using networks arising from social behavior. In WWW'03: Proceedings of the 12th International Conference on World Wide Web, pages 529-535, 2003.
[9] Owen Rambow, Lokesh Shrestha, John Chen, and Christy Lauridsen. Summarizing email threads. In HLT/NAACL, May 2-7, 2004.
[10] Gordon Rios and Hongyuan Zha. Exploring support vector machines and random forests for spam detection. In First Conference on Email and Anti-Spam (CEAS), July 30-31, 2004.
[11] Lokesh Shrestha and Kathleen McKeown. Detection of question-answer pairs in email conversations. In Proceedings of COLING'04, pages 889-895, August 23-27, 2004.
[12] Salvatore J. Stolfo, Shlomo Hershkop, Chia-Wei Hu, Wei-Jen Li, Olivier Nimeskern, and Ke Wang. Behavior-based modeling and its application to email analysis. ACM Transactions on Internet Technology, 6(2):187-221, 2006.
[13] Jie Tang, Hang Li, Yunbo Cao, and ZhaoHui Tang. Email data cleaning. In ACM SIGKDD'05, pages 489-498, 2005.
[14] Gina Danielle Venolia and Carman Neustaedter. Understanding sequence and reply relationships within email conversations: a mixed-model visualization. In CHI'03: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 361-368, 2003.
[15] Stephen Wan and Kathleen McKeown. Generating overview summaries of ongoing email thread discussions. In Proceedings of COLING'04, the 20th International Conference on Computational Linguistics, August 23-27, 2004.
[16] Xiaojun Wan and Jianwu Yang. Improved affinity graph based multi-document summarization. In HLT/NAACL, June 2006.
[17] Steve Whittaker and Candace Sidner. Email overload: exploring personal information management of email. In CHI'96: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 276-283, 1996.
[18] Jihoon Yang and Sung-Yong Park. Email categorization using fast machine learning algorithms. In Discovery Science, pages 316-323, 2002.