A New Web Page Summarization Method

Qian Diao
Application Research Lab, Microprocessor Technology Lab, Intel Corporation
SC12-303, 3600 Juliette Lane, Santa Clara, CA 95054
qian.diao@intel.com

Jiulong Shan
Intel China Research Center
8/F, Raycom Infotech Park A, No. 2, KeXueYuan South Road, Zhong Guan Cun, Beijing 100080, China
jiulong.shan@intel.com

ABSTRACT
In this paper, we present a novel multi-webpage summarization algorithm. It adds a graph-based ranking algorithm into the framework of the Maximum Marginal Relevance (MMR) method, so as to not only capture the main topic of the web pages but also eliminate the redundancy among the sentences of the summary. The experimental results indicate that the new approach performs better than the previous methods.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods

General Terms: Algorithms, Performance, Experimentation

Keywords: Multi-document Summarization, Graph Based Ranking Algorithm, Maximum Marginal Relevance

1. INTRODUCTION
Web search engines find and rank documents by maximizing relevance to the user's query, yet these systems still require users to read hundreds of closely ranked documents to locate the pieces of text relevant to their information-seeking goals. For this problem, an integration of IR and query-oriented summarization would be helpful in a true IR or topic-detection context. However, the challenges for the summarization system are great, because many of the input documents are likely to repeat much the same information while differing in certain parts, since they are retrieval results for the same query. This makes multi-webpage summarization significantly different from single-document summarization: anti-redundancy methods are needed, because the degree of redundancy is significantly higher in a group of topically related articles than in an individual article, as each article tends to describe the same topic. Therefore, a good summarization tool should contain the key shared relevant information among all the documents only once, plus other information unique to individual documents that is directly relevant to the user's query [5]. To achieve this goal, we propose a new method that combines two previous methods, the graph-based ranking algorithm and the Maximum Marginal Relevance (MMR) algorithm, so as to consider the similarity between the user's query and the main topic of the documents while minimizing the redundancy in the summary result.

2. ALGORITHM
Graph-based ranking algorithms [1][2] compute sentence importance through the concept of eigenvector centrality in a graph representation of the sentences. The sentence scoring function, given in Equation 1, is similar to the PageRank algorithm [3], where the similarity Sim can be computed by any metric, such as the cosine similarity between each sentence pair:

    Rank(S_i) = (1 - d) + d \sum_{S_j \in Neighbour(S_i)} \frac{Sim(V_{S_j}, V_{S_i})}{\sum_{S_k \in Neighbour(S_j)} Sim(V_{S_j}, V_{S_k})} Rank(S_j)    (1)

Although the algorithm captures the main content by considering the global patterns of similarity between sentences, it suffers from sentence redundancy: a sentence that is more similar to the others receives a higher score, but it also has a bigger chance of being redundant with the meanings of those other sentences. The experimental result in the next section (Figure 1) illustrates the problem.
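As an illustration of how Equation 1 can be computed, the Python sketch below builds a cosine-similarity sentence graph and applies power iteration until the scores settle. This is a minimal sketch under our own assumptions rather than the authors' implementation: the bag-of-words cosine similarity, the damping factor d = 0.85, the fixed iteration count, and the function names are ours.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two sentences as bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def graph_rank(sentences, d=0.85, iters=50):
    """Equation 1: PageRank-style ranking over a sentence-similarity graph."""
    n = len(sentences)
    sim = [[cosine_sim(si, sj) if i != j else 0.0
            for j, sj in enumerate(sentences)]
           for i, si in enumerate(sentences)]
    rank = [1.0] * n
    for _ in range(iters):
        new_rank = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                out = sum(sim[j])  # sum of S_j's similarities to its neighbours
                if sim[j][i] > 0.0 and out > 0.0:
                    total += sim[j][i] / out * rank[j]
            new_rank.append((1 - d) + d * total)
        rank = new_rank
    return rank
```

In practice a convergence threshold could replace the fixed iteration count; the fixed count simply keeps the sketch short.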
MMR [4] is a classical method for summarization. Its scoring function is shown in Equation 2, where \lambda is an empirical weight, Q is the query, and R is the set of sentences already selected for the summary:

    MMR(S_i) = \lambda Sim(V_{S_i}, V_Q) - (1 - \lambda) \max_{S_j \in R} Sim(V_{S_i}, V_{S_j})    (2)

Although MMR reduces sentence redundancy through the term -(1 - \lambda) \max_{S_j \in R} Sim(V_{S_i}, V_{S_j}), it considers only the similarity Sim(V_{S_i}, V_Q) between the query and the sentences, which is not always optimal and may miss the main topic of the articles when there is no query input or when Sim(V_{S_i}, V_Q) = 0.

We therefore add the graph-based ranking into the framework of MMR through Equations 3 and 4, so as to take the global patterns among sentences into account while still eliminating redundancy:

    Score(S_i) = \lambda \, rank(S_i) (0.001 + Sim(V_{S_i}, V_Q)) - (1 - \lambda) \max_{S_k \in R} Sim(V_{S_i}, V_{S_k})    (3)

    rank(S_i) = \frac{Rank(S_i)}{\max_{i=1..N} Rank(S_i)}    (4)

Here the graph-based ranking score Rank(S_i) is scaled to [0, 1] by the normalization in Equation 4. This ensures that the MMR framework works normally even when Rank(S_i) has a different value scale from \max_{S_k \in R} Sim(V_{S_i}, V_{S_k}). In Equation 3, a small constant (0.001) is added to keep the first term from being zero when Sim(V_{S_i}, V_Q) = 0; that is, when there is no query input or the query shows no similarity with a sentence, the summary will still capture the main topic through rank(S_i).
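To show how Equations 3 and 4 could drive sentence selection, the sketch below greedily builds the summary set R on top of the graph_rank and cosine_sim helpers from the previous sketch. Again this is an assumed reading rather than the authors' code: the value lam = 0.7, the sentence budget k, and all identifiers are hypothetical choices of ours.

```python
def summarize(sentences, query, lam=0.7, k=2):
    """Greedy selection with Equation 3: normalized graph rank times query
    similarity, penalized by similarity to sentences already in R."""
    ranks = graph_rank(sentences)
    max_rank = max(ranks)
    norm_ranks = [r / max_rank for r in ranks]   # Equation 4: scale Rank to [0, 1]

    selected = []                                # the set R of chosen sentences
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cosine_sim(sentences[i], sentences[j])
                              for j in selected), default=0.0)
            return (lam * norm_ranks[i] * (0.001 + cosine_sim(sentences[i], query))
                    - (1 - lam) * redundancy)    # Equation 3
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

In the experiments below the summary length is fixed at 2% of the original article length rather than a sentence count, so k would be derived from that length budget in practice.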
3. EXPERIMENTS
We built the test data by randomly choosing two data sets from [6]. The first data set, at http://lada.si.umich.edu:8080/clair/nie1/nie.cgi?CID=20051226234457, includes 3 articles. The query for our approach and for the MMR method is the search query "Lee Google Microsoft China" plus the first 11 words of each of the three articles, 37 words in total; after eliminating redundancy, we have 18 different query words. The second data set, at http://lada.si.umich.edu:8080/clair/nie1/nie.cgi?CID=20051226234447, includes 4 articles. The query is the first 11 words of the articles, 34 words in total; after eliminating the redundant words, we have 29 different query words.

Graph-based ranking algorithm:
  "Google and Microsoft had been set to go to trial next month in a Washington state case, brought by Microsoft, that accused Google and Lee of violating a non-compete agreement that Lee had signed with Microsoft." (1:6)
  "Microsoft's case was due to go to trial next month." (2:6)
MMR:
  "Google, Microsoft settle dispute over China exec Dec. 23, 2005." (1:1)
  "Major rivals in online ad market were set to go to trial next month in case of defecting talent." (1:2)
Our approach:
  "Google, Microsoft settle dispute over China exec Dec. 23, 2005." (1:1)
  "SEATTLE Microsoft said late Thursday it had reached a settlement with rival Google and former employee Kai-Fu Lee, ending a legal battle that had exposed behind-the-scenes rancor between the companies." (3:2)

Figure 1. Summary result of the three approaches, 2% original article length. (a:b) means the bth sentence in the ath article.

NewsInEssence:
  "Google, Microsoft settle dispute over China exec Dec. 23, 2005" (1:1)
  "Google and Microsoft had been set to go to trial next month in a Washington state case, brought by Microsoft, that accused Google and Lee of violating a non-compete agreement that Lee had signed with Microsoft." (1:6)
Our approach:
  "Google, Microsoft settle dispute over China exec Dec. 23, 2005." (1:1)
  "SEATTLE Microsoft said late Thursday it had reached a settlement with rival Google and former employee Kai-Fu Lee, ending a legal battle that had exposed behind-the-scenes rancor between the companies." (3:2)

Figure 2. Summary on data set 1, 2% original article length.

NewsInEssence:
  "Get the perfect stock allocation; gauge your risk; all you need to put your portfolio in order." (2:2)
  "NEW YORK (MONEY Magazine) - The critical determinant of how well your portfolio performs over time is not the specific stocks or funds you pick; study after study shows that it's your mix of stocks, bonds and cash." (2:3)
  "In fact, when it comes to the three main choices you have in your retirement account -- how much to contribute, how to allocate your money between stocks and bonds and which funds to choose -- a recent study by Putnam Investments shows that investing prowess isn't what matters." (3:8)
Our approach:
  "MONEY Magazine: How average Joes can retire rich Dec. 16, 2005." (3:1)
  "Good time to reassess financial goals." (4:1)
  "MARTINEZ Ten years ago, Pamela Newman rang in the year as a waitress." (4:3)

Figure 3. Summary on data set 2, 2% original article length.

In Figure 1, the second sentence (2:6) selected by the graph-based ranking algorithm is redundant with the first sentence (1:6): both state that the case would "go to trial next month". In the MMR result, both selected sentences come from article 1, so information from the other articles is missed and the topic described by the summary is incomplete. In our result, the two selected sentences have little redundancy and represent the topics of two articles simultaneously.

We also made a simple comparison with a published system, NewsInEssence [6]. The result in Figure 2 indicates that our approach collects information from more diverse web pages. For example, sentence (3:2) not only gives a detailed explanation of sentence (1:1), but also provides the new information that Google is a rival of Microsoft and that there is behind-the-scenes rancor between the companies.

In Figure 3, our approach selects the first sentences of two different articles. The first sentence of an article is usually the most important one, and the snippets many web search engines show for returned hits are the first sentences of the pages. Therefore, the results indicate that our approach has a better chance of selecting main-topic sentences.

4. REFERENCES
[1] Erkan, G. and Radev, D. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR), 2004.
[2] Mihalcea, R. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proc. of ACL'04, Barcelona, Spain, July 2004.
[3] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, Stanford, CA, 1998.
[4] Carbonell, J. and Goldstein, J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of SIGIR'98, Australia, 1998.
[5] Goldstein, J., Mittal, V., Carbonell, J., and Callan, J. Creating and evaluating multi-document sentence extract summaries. In Proc. of CIKM'00, 2000, pp. 165-172.
[6] NewsInEssence, interactive multi-source news summarization, http://lada.si.umich.edu:8080/clair/nie1/nie.cgi