CollabSum: Exploiting Multiple Document Clustering for Collaborative Single Document Summarizations*

Xiaojun Wan, Jianwu Yang and Jianguo Xiao
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
{wanxiaojun, yangjianwu, xiaojianguo}@icst.pku.edu.cn

ABSTRACT
Almost all existing methods summarize single documents separately, without any interaction between documents, under the assumption that the documents are independent of each other. This paper proposes a novel framework called CollabSum for collaborative single document summarization that exploits the mutual influences of multiple documents within a cluster context. In this study, CollabSum is implemented by first employing a clustering algorithm to obtain appropriate document clusters and then exploiting a graph-ranking based algorithm for collaborative document summarization within each cluster. Both the within-document and cross-document relationships between sentences are incorporated in the algorithm. Experiments on the DUC2001 and DUC2002 datasets demonstrate the encouraging performance of the proposed approach. Different clustering algorithms have been investigated, and we find that the summarization performance correlates positively with the quality of the document clusters.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - abstracting methods; I.2.7 [Artificial Intelligence]: Natural Language Processing - text analysis

General Terms: Algorithms, Experimentation, Performance

Keywords: CollabSum, Single document summarization, Collaborative summarization, Graph-ranking algorithm

*This work was supported by the National Science Foundation of China (60642001).

1. INTRODUCTION
Document summarization is the process of automatically creating a compressed version of a given document that delivers the main topic of the document. Automated document summarization has drawn much attention for a long time because it has become increasingly important in many text applications. For example, current search engines usually provide a short summary for each result document to help users browse the results and to improve the search experience. News portals provide concise headlines describing the hot news topics of each day, and they also produce weekly news reviews to save users' time and improve service quality.

A document summary can be either query-relevant or generic. A query-relevant summary should be closely related to the given query, while a generic summary should reflect the main topic of the document without any additional clues or prior knowledge. In this paper, we focus on generic single document summarization.

Very often, all single documents in a document set are required to be summarized. Almost all previous methods for single document summarization produce a summary for a specified document based only on the information contained in that document. One common assumption of existing methods is that the documents are independent of each other; hence the summarization task is conducted separately for each document, without interaction. However, some documents within an appropriate cluster context actually influence each other and contain useful clues that can help to extract summaries from each other. For example, two documents about the same topic provide additional knowledge with which salient information can be better evaluated and extracted from each of them. This idea is borrowed from human perception: a user will better understand a topic expressed in a document after reading another document about the same topic.

This study proposes a novel framework called CollabSum for collaborative document summarization that makes use of additional information from multiple documents within an appropriate cluster context. The cluster context can be obtained by applying a clustering algorithm to the document set, and we have investigated how the cluster context influences summarization performance by employing different clustering algorithms. CollabSum employs a graph-ranking based algorithm for collaborative summarization of each document in a specified cluster, and both the cross-document and within-document relationships between sentences are incorporated in the algorithm, where the within-document relationships reflect the local information existing in the specified document and the cross-document relationships reflect the global information existing in the cluster context. We perform experiments on the DUC2001 and DUC2002 datasets and the results demonstrate the good effectiveness of CollabSum.
The use of cross-document relationships between sentences can much improve the performance of single document summarization. We find that the summarization performance is positively correlated with the quality of the cluster context, and that existing clustering algorithms can yield cluster contexts appropriate for collaborative document summarization.

The rest of this paper is organized as follows: Section 2 briefly introduces related work. The proposed CollabSum is described in detail in Section 3. We set up the experiments in Section 4 and give the results in Section 5. Section 6 discusses the results, and lastly we conclude the paper in Section 7.

2. RELATED WORK
Single document summarization has been widely explored in the natural language processing and information retrieval communities. A series of workshops and conferences on automatic text summarization (e.g. DUC (http://duc.nist.gov) and NTCIR (http://research.nii.ac.jp/ntcir/index-en.html)), as well as special topic sessions in ACL, COLING, and SIGIR, have advanced the technology and produced a couple of experimental online systems. Generally speaking, single document summarization methods fall into two categories: extraction-based methods and abstraction-based methods [11, 12, 14]. Extraction simply selects existing sentences, while abstraction requires sentence compression and reformulation. In this paper, we focus on extraction-based methods. Extraction-based methods usually assign each sentence a saliency score and then rank the sentences in the document.
The score is usually computed from a combination of statistical and linguistic features, including term frequency [18], sentence position [9], cue words [6], stigma words [6], topic signature [17], lexical chains [25], etc. Machine learning methods have also been employed to extract sentences, including classification-based methods [1, 15], clustering-based methods [22], HMM-based methods [5], CRF-based methods [24], etc. Other methods include maximal marginal relevance (MMR) [4], latent semantic analysis (LSA) [8], and the relevance measure [8]. In [27], the mutual reinforcement principle is employed to iteratively extract key phrases and sentences from a document. Moreover, a method based on text segmentation is proposed by McDonald and Chen [19], in which text segments instead of sentences are ranked.

Most recently, graph-ranking based methods, including TextRank [20, 21] and LexPageRank [7], have been proposed for document summarization. Similar to PageRank [3] or HITS [13], these methods first build a graph based on the similarity relationships between the sentences in a document; the importance of a sentence is then determined by recursively taking into account the global information on the graph, rather than relying only on local sentence-specific information. The basic idea underlying graph-based ranking is that of "voting" or "recommendation": when a sentence links to another one, it is casting a vote for that other sentence. The more votes that are cast for a sentence, the more important the sentence is. Moreover, the importance of the sentence casting the vote determines how important the vote itself is. The computation of sentence importance is usually expressed in a recursive form, which can be transformed into the problem of solving for the principal eigenvector of the transition matrix.

However, all the above methods summarize each single document independently. In particular, only the sentences within the same document cast votes for each other in the graph-ranking based methods. We believe that the sentences in other topic-related documents can also cast votes for the sentences in the specified document, so both the cross-document and within-document relationships between sentences are incorporated in the proposed CollabSum.

3. THE PROPOSED COLLABSUM
3.1 Overview
Given a document set in which each document needs to be summarized, CollabSum first employs a clustering algorithm (e.g. the agglomerative algorithm, the divisive algorithm, the k-means algorithm, etc.) [10, 26] to group the documents into a few clusters. The documents within each cluster are expected to be topic-related, and each cluster can be considered as a context for any document in the cluster. Given a document cluster, CollabSum incorporates both the within-document relationships (local information) and the cross-document relationships (global information) between sentences into the graph-ranking based algorithm to summarize each single document within the cluster. Figure 1 gives the framework of the proposed approach.

1. Document Clustering: Group the documents in the document set into a few clusters using the clustering algorithm.
2. Document Summarization: For each cluster, perform the following steps to produce summaries for the single documents in the cluster:
   1) Affinity Graph Building: Build a global affinity graph G based on all sentences in the documents of the given cluster D={d1, d2, ..., dl}, where l is the number of documents. Let S={s1, s2, ..., sn} denote the sentence set for the cluster, where n is the number of sentences.
   2) Informativeness Score Computation: Based on the global affinity graph G, the graph-ranking based algorithm is employed to compute the informativeness score IFScore(si) for each sentence si, where IFScore(si) quantifies the informativeness of the sentence si.
   3) Within-Document Redundancy Removing: For any single document dk to be summarized, the greedy algorithm is employed to remove redundancy among the informative sentences. Finally, the sentences that are both informative and novel are chosen for the summary.

Figure 1: The framework of CollabSum
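To make the control flow in Figure 1 concrete, the following is a minimal sketch of the framework in Python. It is an illustration, not the authors' implementation: the helpers cluster_documents, build_affinity_matrices, row_normalize, final_scores and greedy_select are hypothetical names standing in for the steps of Sections 3.2-3.4, with illustrative versions sketched after those sections.

```python
# A minimal sketch of the CollabSum framework (Figure 1); the helper
# functions are hypothetical placeholders sketched in later sections.

def collabsum(documents, sentences_per_summary):
    """documents: list of documents, each a list of sentence strings."""
    summaries = {}
    # Step 1: document clustering over the raw document texts.
    clusters = cluster_documents([" ".join(doc) for doc in documents])
    for cluster in clusters:                       # cluster = doc indices
        cluster_docs = [documents[i] for i in cluster]
        # Step 2.1: global affinity graph over all cluster sentences,
        # split into within-document and cross-document parts.
        M_intra, M_inter = build_affinity_matrices(cluster_docs)
        sim = M_intra + M_inter                    # full affinity matrix M
        # Step 2.2: informativeness scores via graph ranking.
        scores = final_scores(row_normalize(M_intra), row_normalize(M_inter))
        # Step 2.3: greedy redundancy removal, one document at a time.
        offset = 0
        for doc_index, doc in zip(cluster, cluster_docs):
            ids = range(offset, offset + len(doc))
            summaries[doc_index] = greedy_select(
                ids, scores, sim, sentences_per_summary)
            offset += len(doc)
    return summaries
```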
For the first step of the above framework, different clustering algorithms will yield different clusters: the documents in a high-quality cluster are usually highly topic-related (i.e. an appropriate cluster context), while the documents in a low-quality cluster are usually not topic-related (i.e. an inappropriate cluster context). The quality of a cluster influences the reliability of the contextual information used to evaluate the importance of the sentences in the cluster. A number of clustering algorithms will be investigated in the experiments.

For the second step of the above framework, step 1) builds a global affinity graph reflecting the relationships among all sentences in the document set of the given cluster. Step 2) computes the informativeness score of each sentence based on the global affinity graph; the informativeness of a sentence indicates how much information about the main topic the sentence contains. Step 3) removes redundant information from the summary and keeps the sentences in the summary as novel as possible. Steps 1) and 2) operate on all documents in the cluster in order to find highly informative sentences from a global perspective, while step 3) operates only on each single document in order to remove redundancy from a local perspective. A summary is expected to include the sentences that are both highly informative and highly novel. Note that the summarization tasks are conducted in a batch mode for each cluster. Steps 1), 2) and 3) are described in the next sections respectively.
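The paper does not commit to one clustering algorithm for the first step, so as one concrete possibility, here is a minimal sketch of document clustering with TF-IDF vectors and k-means; the use of scikit-learn and a fixed number of clusters are illustrative assumptions, not the authors' setup.

```python
# A minimal sketch of step 1 (document clustering), assuming
# scikit-learn; agglomerative or divisive clustering, which the
# paper also investigates, could be swapped in here.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(documents, num_clusters=10):
    """Group raw document strings into topic-related clusters,
    returning a list of clusters, each a list of document indices."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)        # document-term matrix
    labels = KMeans(n_clusters=num_clusters).fit_predict(X)
    clusters = [[] for _ in range(num_clusters)]
    for doc_index, label in enumerate(labels):
        clusters[label].append(doc_index)
    return clusters
```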
3.2 Affinity Graph Building
Given the sentence collection S={si | 1 ≤ i ≤ n} of a specified cluster, the affinity weight sim(si, sj) between a pair of sentences si and sj is calculated using the Cosine measure [2]. The weight associated with term t is calculated with the tft·isft formula, where tft is the frequency of term t in the sentence and isft is the inverse sentence frequency of term t, i.e. isft = 1 + log(N/nt), where N is the total number of sentences in a background corpus and nt is the number of sentences containing term t. If sentences are considered as nodes, the sentence collection can be modeled as an undirected graph by generating a link between two sentences whenever their affinity weight exceeds 0, i.e. an undirected link between si and sj (i ≠ j) with affinity weight sim(si, sj) is constructed if sim(si, sj) > 0; otherwise no link is constructed. Thus, we construct an undirected graph G reflecting the relationships between sentences by their content similarity.

The links (edges) between sentences in the graph can be categorized into two classes: within-document links and cross-document links. Given a link between a pair of sentences si and sj, if si and sj come from the same document, the link is a within-document link; if si and sj come from different documents, the link is a cross-document link. The within-document links reflect the local information in a document, while the cross-document links reflect the global information in the cluster context, which is exploited by CollabSum to make use of the mutual influences between different documents in the cluster. The graph G contains both kinds of links between sentences and is called the global affinity graph. We use an adjacency (affinity) matrix M to describe G, with each entry corresponding to the weight of a link in the graph. M = (Mi,j)n×n is defined as follows:

$$M_{i,j} = \begin{cases} \mathrm{sim}(s_i, s_j), & \text{if } i \neq j \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

Then M is normalized to M̃ as follows to make the sum of each row equal to 1:

$$\tilde{M}_{i,j} = \begin{cases} M_{i,j} \Big/ \sum_{j=1}^{n} M_{i,j}, & \text{if } \sum_{j=1}^{n} M_{i,j} \neq 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

Similar to the above process, another two affinity graphs, Gintra and Ginter, are also built: the within-document affinity graph Gintra includes only the within-document links between sentences (the entries of the cross-document links are set to 0), while the cross-document affinity graph Ginter includes only the cross-document links between sentences (the entries of the within-document links are set to 0). The corresponding adjacency (affinity) matrices of Gintra and Ginter are denoted Mintra and Minter respectively. Mintra and Minter can be extracted from M, and we have M = Mintra + Minter. Similar to Equation (2), Mintra and Minter are respectively normalized to M̃intra and M̃inter, making the sum of each row equal to 1.
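As a worked illustration of Section 3.2, the sketch below builds the affinity matrices with plain NumPy. The naive tokenization and the use of the cluster's own sentences as the background corpus for the isf statistic are assumptions made for brevity; the paper leaves both choices open.

```python
# A minimal sketch of affinity graph building (Section 3.2).
import re
import numpy as np

def build_affinity_matrices(docs_sentences):
    """docs_sentences: list of documents, each a list of sentence strings.
    Returns (M_intra, M_inter) with M = M_intra + M_inter, per Eq. (1)."""
    sentences, doc_ids = [], []
    for doc_id, doc in enumerate(docs_sentences):
        for sent in doc:
            sentences.append(re.findall(r"\w+", sent.lower()))
            doc_ids.append(doc_id)
    n = len(sentences)
    vocab = {t: k for k, t in enumerate(sorted({t for s in sentences for t in s}))}
    # isf_t = 1 + log(N / n_t); here the cluster's own sentences play
    # the role of the background corpus (an assumption).
    n_t = np.zeros(len(vocab))
    for s in sentences:
        for t in set(s):
            n_t[vocab[t]] += 1
    isf = 1.0 + np.log(n / n_t)
    # tf*isf sentence vectors.
    V = np.zeros((n, len(vocab)))
    for i, s in enumerate(sentences):
        for t in s:
            V[i, vocab[t]] += 1.0
        V[i] *= isf
    # Cosine similarity between all sentence pairs; zero the diagonal.
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                        # guard empty sentences
    M = (V / norms) @ (V / norms).T
    np.fill_diagonal(M, 0.0)
    # Keep within-document entries in M_intra, cross-document in M_inter.
    same_doc = np.equal.outer(doc_ids, doc_ids)
    return np.where(same_doc, M, 0.0), np.where(same_doc, 0.0, M)

def row_normalize(M):
    """Eq. (2): scale each row of M to sum to 1 (zero rows stay zero)."""
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums != 0)
```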
3.3 Informativeness Score Computation
Based on the global affinity graph G, the informativeness score IFScoreall(si) of sentence si can be deduced from the scores of all other sentences linked with it, which can be formulated in a recursive form as follows [7, 20, 21, 28]:

$$\mathrm{IFScore}_{all}(s_i) = d \cdot \sum_{\text{all } j \neq i} \mathrm{IFScore}_{all}(s_j) \cdot \tilde{M}_{j,i} + \frac{1-d}{n} \qquad (3)$$

And the matrix form is:

$$\vec{r} = d \cdot \tilde{M}^{T} \vec{r} + \frac{1-d}{n} \vec{e} \qquad (4)$$

where $\vec{r} = [\mathrm{IFScore}_{all}(s_i)]_{n \times 1}$ is the vector of informativeness scores, $\vec{e}$ is a unit vector with all elements equal to 1, and d is the damping factor, usually set to 0.85. For implementation, the initial informativeness scores of all sentences are set to 1 and the iterative algorithm in Equation (3) is adopted to compute the new informativeness scores of the sentences. Convergence of the iteration is usually achieved when the difference between the informativeness scores computed at two successive iterations falls below a given threshold (0.0001 in this study) for every sentence.

Similarly, the informativeness score of sentence si can be deduced based on either the within-document affinity graph Gintra or the cross-document affinity graph Ginter, as follows:

$$\mathrm{IFScore}_{intra}(s_i) = d \cdot \sum_{\text{all } j \neq i} \mathrm{IFScore}_{intra}(s_j) \cdot (\tilde{M}_{intra})_{j,i} + \frac{1-d}{n} \qquad (5)$$

$$\mathrm{IFScore}_{inter}(s_i) = d \cdot \sum_{\text{all } j \neq i} \mathrm{IFScore}_{inter}(s_j) \cdot (\tilde{M}_{inter})_{j,i} + \frac{1-d}{n} \qquad (6)$$

The final informativeness score IFScore(si) of sentence si can be IFScoreall(si), IFScoreintra(si) or IFScoreinter(si), or the linear combination of IFScoreintra(si) and IFScoreinter(si) as follows:

$$\mathrm{IFScore}(s_i) = \lambda \cdot \mathrm{IFScore}_{intra}(s_i) + (1-\lambda) \cdot \mathrm{IFScore}_{inter}(s_i) \qquad (7)$$

where λ ∈ [0,1] is a weighting parameter specifying the relative contributions of the within-document and the cross-document relationships between sentences to the final informativeness score. If λ=0, IFScore(si) is equal to IFScoreinter(si); if λ=1, IFScore(si) is equal to IFScoreintra(si); and if λ=0.5, the cross-document and within-document relationships are assumed to be equally important. We will investigate all the above methods for informativeness score computation. Note that previous graph-ranking based methods do not consider the cross-document links and thus have IFScore(si) = IFScoreintra(si).

3.4 Within-Document Redundancy Removing
For each single document dk to be summarized, we extract from the global affinity graph G a sub-graph Gdk containing only the sentences within dk and the corresponding edges between them. We assume the document dk has m (m ≤ n) sentences.
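A minimal sketch of the score iteration of Equations (3)-(7) and of a greedy redundancy filter follows, again assuming NumPy and the matrices from the previous sketch. Because the description of the greedy algorithm is truncated in this excerpt, the MMR-style penalty used below is an assumed stand-in, not the paper's exact procedure.

```python
# A minimal sketch of Sections 3.3-3.4, assuming NumPy and the
# row-normalized matrices produced by the previous sketch.
import numpy as np

def compute_ifscores(M_norm, d=0.85, tol=1e-4):
    """Iterate Eq. (3)/(4): r = d * M^T r + (1 - d)/n, to convergence."""
    n = M_norm.shape[0]
    r = np.ones(n)                     # initial scores are all 1
    while True:
        r_new = d * (M_norm.T @ r) + (1.0 - d) / n
        if np.max(np.abs(r_new - r)) < tol:
            return r_new
        r = r_new

def final_scores(M_intra_norm, M_inter_norm, lam=0.5):
    """Eq. (7): combine the intra- and inter-document scores of
    Eqs. (5)/(6); lam plays the role of the weighting parameter."""
    intra = compute_ifscores(M_intra_norm)
    inter = compute_ifscores(M_inter_norm)
    return lam * intra + (1.0 - lam) * inter

def greedy_select(doc_sentence_ids, scores, sim, k):
    """Pick k sentences of one document that are informative and novel.
    The penalty (discounting by similarity to already chosen sentences)
    is a hypothetical stand-in for the paper's greedy algorithm."""
    candidates = list(doc_sentence_ids)
    adjusted = {i: float(scores[i]) for i in candidates}
    chosen = []
    while candidates and len(chosen) < k:
        best = max(candidates, key=lambda i: adjusted[i])
        chosen.append(best)
        candidates.remove(best)
        for i in candidates:           # penalize redundancy w.r.t. `best`
            adjusted[i] -= sim[i, best] * float(scores[best])
    return sorted(chosen)
```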