CollabSum: Exploiting Multiple Document Clustering for Collaborative Single Document Summarizations*

Xiaojun Wan, Jianwu Yang and Jianguo Xiao
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
{wanxiaojun, yangjianwu, xiaojianguo}@icst.pku.edu.cn

ABSTRACT
Almost all existing methods summarize single documents separately, without any interaction between documents, under the assumption that the documents are independent of each other. This paper proposes a novel framework called CollabSum for collaborative single document summarization that exploits the mutual influences of multiple documents within a cluster context. In this study, CollabSum is implemented by first employing a clustering algorithm to obtain appropriate document clusters and then exploiting a graph-ranking based algorithm for collaborative document summarization within each cluster. Both the within-document and cross-document relationships between sentences are incorporated in the algorithm. Experiments on the DUC2001 and DUC2002 datasets demonstrate the encouraging performance of the proposed approach. Different clustering algorithms have been investigated, and we find that the summarization performance correlates positively with the quality of the document clusters.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - abstracting methods; I.2.7 [Artificial Intelligence]: Natural Language Processing - text analysis

General Terms: Algorithms, Experimentation, Performance

Keywords: CollabSum, Single document summarization, Collaborative summarization, Graph-ranking algorithm

*This work was supported by the National Science Foundation of China (60642001).

1. INTRODUCTION
Document summarization is the process of automatically creating a compressed version of a given document that delivers the main topic of the document. Automated document summarization has drawn much attention for a long time because it has become increasingly important in many text applications. For example, current search engines usually provide a short summary for each result document to help users browse the results and to improve the search experience. News portals provide concise headlines describing the hot news topics of each day, and they also produce weekly news reviews to save users' time and improve service quality.

A document summary can be either query-relevant or generic. A query-relevant summary should be closely related to the given query, while a generic summary should reflect the main topic of the document without any additional clues or prior knowledge. In this paper, we focus on generic single document summarization.

Very often, all single documents in a document set are required to be summarized. Almost all previous methods for single document summarization produce a summary for a specified document based only on the information contained in that document. One common assumption of existing methods is that the documents are independent of each other; hence the summarization task is conducted separately for each document, without interaction. However, some documents within an appropriate cluster context actually influence each other and contain useful clues that can help to extract summaries from each other. For example, two documents about the same topic provide additional knowledge with which salient information can be better evaluated and extracted from each of them. This idea is borrowed from human perception: a user will better understand a topic expressed in a document after reading another document about the same topic.

This study proposes a novel framework called CollabSum for collaborative document summarization that makes use of additional information from multiple documents within an appropriate cluster context. The cluster context can be obtained by applying a clustering algorithm to the document set, and we have investigated how the cluster context influences summarization performance by employing different clustering algorithms. CollabSum employs a graph-ranking based algorithm for collaborative summarization of each document in a specified cluster, and both the cross-document and within-document relationships between sentences are incorporated in the algorithm, where the within-document relationships reflect the local information existing in the specified document and the cross-document relationships reflect the global information existing in the cluster context. We perform experiments on the DUC2001 and DUC2002 datasets and the results demonstrate the good effectiveness of CollabSum.
The use of cross-document relationships between sentences can much improve the performance of single document summarization. We find that the summarization performance is positively correlated with the quality of the cluster context, and that existing clustering algorithms can yield cluster contexts appropriate for collaborative document summarization.

The rest of this paper is organized as follows: Section 2 briefly introduces related work. The proposed CollabSum is described in detail in Section 3. We set up the experiments in Section 4 and give the results in Section 5. Section 6 discusses the results, and lastly we conclude the paper in Section 7.

2. RELATED WORK
Single document summarization has been widely explored in the natural language processing and information retrieval communities. A series of workshops and conferences on automatic text summarization (e.g. DUC (http://duc.nist.gov) and NTCIR (http://research.nii.ac.jp/ntcir/index-en.html)), as well as special topic sessions in ACL, COLING, and SIGIR, have advanced the technology and produced a couple of experimental online systems. Generally speaking, single document summarization methods fall into two categories: extraction-based methods and abstraction-based methods [11, 12, 14]. Extraction simply selects existing sentences, while abstraction requires sentence compression and reformulation. In this paper, we focus on extraction-based methods. Extraction-based methods usually assign each sentence a saliency score and then rank the sentences in the document.
The score is usually computed from a combination of statistical and linguistic features, including term frequency [18], sentence position [9], cue words [6], stigma words [6], topic signature [17], lexical chains [25], etc. Machine learning methods have also been employed to extract sentences, including classification-based methods [1, 15], clustering-based methods [22], HMM-based methods [5], CRF-based methods [24], etc. Other methods include maximal marginal relevance (MMR) [4], latent semantic analysis (LSA) [8], and the relevance measure [8]. In [27], the mutual reinforcement principle is employed to iteratively extract key phrases and sentences from a document. Moreover, a method based on text segmentation is proposed by McDonald and Chen [19], in which text segments instead of sentences are ranked.

Most recently, graph-ranking based methods, including TextRank [20, 21] and LexPageRank [7], have been proposed for document summarization. Similar to PageRank [3] or HITS [13], these methods first build a graph based on the similarity relationships between the sentences in a document; the importance of a sentence is then determined by recursively taking into account the global information on the graph, rather than relying only on local sentence-specific information. The basic idea underlying graph-based ranking is that of "voting" or "recommendation": when a sentence links to another one, it is casting a vote for that other sentence. The more votes that are cast for a sentence, the more important the sentence is. Moreover, the importance of the sentence casting the vote determines how important the vote itself is. The computation of sentence importance is usually expressed in a recursive form, which can be transformed into the problem of solving for the principal eigenvector of the transition matrix.

However, all the above methods summarize each single document independently. In particular, only the sentences within the same document cast votes for each other in the graph-ranking based methods. We believe that the sentences in other topic-related documents can also cast votes for the sentences in the specified document, so both the cross-document and within-document relationships between sentences are incorporated in the proposed CollabSum.

3. THE PROPOSED COLLABSUM
3.1 Overview
Given a document set in which each document needs to be summarized, CollabSum first employs a clustering algorithm (e.g. the agglomerative algorithm, the divisive algorithm, the k-means algorithm, etc.) [10, 26] to group the documents into a few clusters. The documents within each cluster are expected to be topic-related, and each cluster can be considered as a context for any document in the cluster. Given a document cluster, CollabSum incorporates both the within-document relationships (local information) and the cross-document relationships (global information) between sentences into the graph-ranking based algorithm to summarize each single document within the cluster. Figure 1 gives the framework of the proposed approach.

1. Document Clustering: Group the documents in the document set into a few clusters using the clustering algorithm.
2. Document Summarization: For each cluster, perform the following steps to produce summaries for the single documents in the cluster:
   1) Affinity Graph Building: Build a global affinity graph G based on all sentences in the documents of the given cluster D={d1, d2, ..., dl}, where l is the number of documents. Let S={s1, s2, ..., sn} denote the sentence set for the cluster, where n is the number of sentences.
   2) Informativeness Score Computation: Based on the global affinity graph G, the graph-ranking based algorithm is employed to compute the informativeness score IFScore(si) for each sentence si, where IFScore(si) quantifies the informativeness of the sentence si.
   3) Within-Document Redundancy Removing: For any single document dk to be summarized, the greedy algorithm is employed to remove redundancy among the informative sentences. Finally, the sentences that are both informative and novel are chosen for the summary.

Figure 1: The framework of CollabSum
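To make the control flow in Figure 1 concrete, the following is a minimal sketch of the framework in Python. It is an illustration, not the authors' implementation: the helpers cluster_documents, build_affinity_matrices, row_normalize, final_scores and greedy_select are hypothetical names standing in for the steps of Sections 3.2-3.4, with illustrative versions sketched after those sections.

```python
# A minimal sketch of the CollabSum framework (Figure 1); the helper
# functions are hypothetical placeholders sketched in later sections.

def collabsum(documents, sentences_per_summary):
    """documents: list of documents, each a list of sentence strings."""
    summaries = {}
    # Step 1: document clustering over the raw document texts.
    clusters = cluster_documents([" ".join(doc) for doc in documents])
    for cluster in clusters:                       # cluster = doc indices
        cluster_docs = [documents[i] for i in cluster]
        # Step 2.1: global affinity graph over all cluster sentences,
        # split into within-document and cross-document parts.
        M_intra, M_inter = build_affinity_matrices(cluster_docs)
        sim = M_intra + M_inter                    # full affinity matrix M
        # Step 2.2: informativeness scores via graph ranking.
        scores = final_scores(row_normalize(M_intra), row_normalize(M_inter))
        # Step 2.3: greedy redundancy removal, one document at a time.
        offset = 0
        for doc_index, doc in zip(cluster, cluster_docs):
            ids = range(offset, offset + len(doc))
            summaries[doc_index] = greedy_select(
                ids, scores, sim, sentences_per_summary)
            offset += len(doc)
    return summaries
```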
For the first step of the above framework, different clustering algorithms will yield different clusters: the documents in a high-quality cluster are usually highly topic-related (i.e. an appropriate cluster context), while the documents in a low-quality cluster are usually not topic-related (i.e. an inappropriate cluster context). The quality of a cluster influences the reliability of the contextual information used to evaluate the importance of the sentences in the cluster. A number of clustering algorithms will be investigated in the experiments.

For the second step of the above framework, step 1) builds a global affinity graph reflecting the relationships among all sentences in the document set of the given cluster. Step 2) computes the informativeness score of each sentence based on the global affinity graph; the informativeness of a sentence indicates how much information about the main topic the sentence contains. Step 3) removes redundant information from the summary and keeps the sentences in the summary as novel as possible. Steps 1) and 2) operate on all documents in the cluster in order to find highly informative sentences from a global perspective, while step 3) operates only on each single document in order to remove redundancy from a local perspective. A summary is expected to include the sentences that are both highly informative and highly novel. Note that the summarization tasks are conducted in a batch mode for each cluster. Steps 1), 2) and 3) are described in the next sections respectively.
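The paper does not commit to one clustering algorithm for the first step, so as one concrete possibility, here is a minimal sketch of document clustering with TF-IDF vectors and k-means; the use of scikit-learn and a fixed number of clusters are illustrative assumptions, not the authors' setup.

```python
# A minimal sketch of step 1 (document clustering), assuming
# scikit-learn; agglomerative or divisive clustering, which the
# paper also investigates, could be swapped in here.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(documents, num_clusters=10):
    """Group raw document strings into topic-related clusters,
    returning a list of clusters, each a list of document indices."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)        # document-term matrix
    labels = KMeans(n_clusters=num_clusters).fit_predict(X)
    clusters = [[] for _ in range(num_clusters)]
    for doc_index, label in enumerate(labels):
        clusters[label].append(doc_index)
    return clusters
```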
3.2 Affinity Graph Building
Given the sentence collection S={si | 1 ≤ i ≤ n} of a specified cluster, the affinity weight sim(si, sj) between a pair of sentences si and sj is calculated using the Cosine measure [2]. The weight associated with term t is calculated with the tft·isft formula, where tft is the frequency of term t in the sentence and isft is the inverse sentence frequency of term t, i.e. isft = 1 + log(N/nt), where N is the total number of sentences in a background corpus and nt is the number of sentences containing term t. If sentences are considered as nodes, the sentence collection can be modeled as an undirected graph by generating a link between two sentences whenever their affinity weight exceeds 0, i.e. an undirected link between si and sj (i ≠ j) with affinity weight sim(si, sj) is constructed if sim(si, sj) > 0; otherwise no link is constructed. Thus, we construct an undirected graph G reflecting the relationships between sentences by their content similarity.

The links (edges) between sentences in the graph can be categorized into two classes: within-document links and cross-document links. Given a link between a pair of sentences si and sj, if si and sj come from the same document, the link is a within-document link; if si and sj come from different documents, the link is a cross-document link. The within-document links reflect the local information in a document, while the cross-document links reflect the global information in the cluster context, which is exploited by CollabSum to make use of the mutual influences between different documents in the cluster. The graph G contains both kinds of links between sentences and is called the global affinity graph. We use an adjacency (affinity) matrix M to describe G, with each entry corresponding to the weight of a link in the graph. M = (Mi,j)n×n is defined as follows:

$$M_{i,j} = \begin{cases} \mathrm{sim}(s_i, s_j), & \text{if } i \neq j \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

Then M is normalized to M̃ as follows to make the sum of each row equal to 1:

$$\tilde{M}_{i,j} = \begin{cases} M_{i,j} \Big/ \sum_{j=1}^{n} M_{i,j}, & \text{if } \sum_{j=1}^{n} M_{i,j} \neq 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

Similar to the above process, another two affinity graphs, Gintra and Ginter, are also built: the within-document affinity graph Gintra includes only the within-document links between sentences (the entries of the cross-document links are set to 0), while the cross-document affinity graph Ginter includes only the cross-document links between sentences (the entries of the within-document links are set to 0). The corresponding adjacency (affinity) matrices of Gintra and Ginter are denoted Mintra and Minter respectively. Mintra and Minter can be extracted from M, and we have M = Mintra + Minter. Similar to Equation (2), Mintra and Minter are respectively normalized to M̃intra and M̃inter, making the sum of each row equal to 1.
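As a worked illustration of Section 3.2, the sketch below builds the affinity matrices with plain NumPy. The naive tokenization and the use of the cluster's own sentences as the background corpus for the isf statistic are assumptions made for brevity; the paper leaves both choices open.

```python
# A minimal sketch of affinity graph building (Section 3.2).
import re
import numpy as np

def build_affinity_matrices(docs_sentences):
    """docs_sentences: list of documents, each a list of sentence strings.
    Returns (M_intra, M_inter) with M = M_intra + M_inter, per Eq. (1)."""
    sentences, doc_ids = [], []
    for doc_id, doc in enumerate(docs_sentences):
        for sent in doc:
            sentences.append(re.findall(r"\w+", sent.lower()))
            doc_ids.append(doc_id)
    n = len(sentences)
    vocab = {t: k for k, t in enumerate(sorted({t for s in sentences for t in s}))}
    # isf_t = 1 + log(N / n_t); here the cluster's own sentences play
    # the role of the background corpus (an assumption).
    n_t = np.zeros(len(vocab))
    for s in sentences:
        for t in set(s):
            n_t[vocab[t]] += 1
    isf = 1.0 + np.log(n / n_t)
    # tf*isf sentence vectors.
    V = np.zeros((n, len(vocab)))
    for i, s in enumerate(sentences):
        for t in s:
            V[i, vocab[t]] += 1.0
        V[i] *= isf
    # Cosine similarity between all sentence pairs; zero the diagonal.
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                        # guard empty sentences
    M = (V / norms) @ (V / norms).T
    np.fill_diagonal(M, 0.0)
    # Keep within-document entries in M_intra, cross-document in M_inter.
    same_doc = np.equal.outer(doc_ids, doc_ids)
    return np.where(same_doc, M, 0.0), np.where(same_doc, 0.0, M)

def row_normalize(M):
    """Eq. (2): scale each row of M to sum to 1 (zero rows stay zero)."""
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums != 0)
```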
3.3 Informativeness Score Computation
Based on the global affinity graph G, the informativeness score IFScoreall(si) of sentence si can be deduced from the scores of all other sentences linked with it, which can be formulated in a recursive form as follows [7, 20, 21, 28]:

$$\mathrm{IFScore}_{all}(s_i) = d \cdot \sum_{\text{all } j \neq i} \mathrm{IFScore}_{all}(s_j) \cdot \tilde{M}_{j,i} + \frac{1-d}{n} \qquad (3)$$

And the matrix form is:

$$\vec{r} = d \cdot \tilde{M}^{T} \vec{r} + \frac{1-d}{n} \vec{e} \qquad (4)$$

where $\vec{r} = [\mathrm{IFScore}_{all}(s_i)]_{n \times 1}$ is the vector of informativeness scores, $\vec{e}$ is a unit vector with all elements equal to 1, and d is the damping factor, usually set to 0.85. For implementation, the initial informativeness scores of all sentences are set to 1 and the iterative algorithm in Equation (3) is adopted to compute the new informativeness scores of the sentences. Convergence of the iteration is usually achieved when the difference between the informativeness scores computed at two successive iterations falls below a given threshold (0.0001 in this study) for every sentence.

Similarly, the informativeness score of sentence si can be deduced based on either the within-document affinity graph Gintra or the cross-document affinity graph Ginter, as follows:

$$\mathrm{IFScore}_{intra}(s_i) = d \cdot \sum_{\text{all } j \neq i} \mathrm{IFScore}_{intra}(s_j) \cdot (\tilde{M}_{intra})_{j,i} + \frac{1-d}{n} \qquad (5)$$

$$\mathrm{IFScore}_{inter}(s_i) = d \cdot \sum_{\text{all } j \neq i} \mathrm{IFScore}_{inter}(s_j) \cdot (\tilde{M}_{inter})_{j,i} + \frac{1-d}{n} \qquad (6)$$

The final informativeness score IFScore(si) of sentence si can be IFScoreall(si), IFScoreintra(si) or IFScoreinter(si), or the linear combination of IFScoreintra(si) and IFScoreinter(si) as follows:

$$\mathrm{IFScore}(s_i) = \lambda \cdot \mathrm{IFScore}_{intra}(s_i) + (1-\lambda) \cdot \mathrm{IFScore}_{inter}(s_i) \qquad (7)$$

where λ ∈ [0,1] is a weighting parameter specifying the relative contributions of the within-document and the cross-document relationships between sentences to the final informativeness score. If λ=0, IFScore(si) is equal to IFScoreinter(si); if λ=1, IFScore(si) is equal to IFScoreintra(si); and if λ=0.5, the cross-document and within-document relationships are assumed to be equally important. We will investigate all the above methods for informativeness score computation. Note that previous graph-ranking based methods do not consider the cross-document links and thus have IFScore(si) = IFScoreintra(si).

3.4 Within-Document Redundancy Removing
For each single document dk to be summarized, we extract from the global affinity graph G a sub-graph Gdk containing only the sentences within dk and the corresponding edges between them. We assume the document dk has m (m ≤ n) sentences.
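A minimal sketch of the score iteration of Equations (3)-(7) and of a greedy redundancy filter follows, again assuming NumPy and the matrices from the previous sketch. Because the description of the greedy algorithm is truncated in this excerpt, the MMR-style penalty used below is an assumed stand-in, not the paper's exact procedure.

```python
# A minimal sketch of Sections 3.3-3.4, assuming NumPy and the
# row-normalized matrices produced by the previous sketch.
import numpy as np

def compute_ifscores(M_norm, d=0.85, tol=1e-4):
    """Iterate Eq. (3)/(4): r = d * M^T r + (1 - d)/n, to convergence."""
    n = M_norm.shape[0]
    r = np.ones(n)                     # initial scores are all 1
    while True:
        r_new = d * (M_norm.T @ r) + (1.0 - d) / n
        if np.max(np.abs(r_new - r)) < tol:
            return r_new
        r = r_new

def final_scores(M_intra_norm, M_inter_norm, lam=0.5):
    """Eq. (7): combine the intra- and inter-document scores of
    Eqs. (5)/(6); lam plays the role of the weighting parameter."""
    intra = compute_ifscores(M_intra_norm)
    inter = compute_ifscores(M_inter_norm)
    return lam * intra + (1.0 - lam) * inter

def greedy_select(doc_sentence_ids, scores, sim, k):
    """Pick k sentences of one document that are informative and novel.
    The penalty (discounting by similarity to already chosen sentences)
    is a hypothetical stand-in for the paper's greedy algorithm."""
    candidates = list(doc_sentence_ids)
    adjusted = {i: float(scores[i]) for i in candidates}
    chosen = []
    while candidates and len(chosen) < k:
        best = max(candidates, key=lambda i: adjusted[i])
        chosen.append(best)
        candidates.remove(best)
        for i in candidates:           # penalize redundancy w.r.t. `best`
            adjusted[i] -= sim[i, best] * float(scores[best])
    return sorted(chosen)
```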