SIGIR 2007 Proceedings Poster

An Effective Snippet Generation Method Using the Pseudo Relevance Feedback Technique

Youngjoong Ko (1), Hongkuk An (2), and Jungyun Seo (2)
(1) Dept. of Computer Engineering, Dong-A Univ., 840, Hadan 2-dong, Saha-gu, Busan, 604-714, Korea
(2) Computer Science & Program of Integrated Biotechnology, Sogang Univ., Sinsu-dong, Mapo-gu, Seoul, 121-742, Korea
yjko@dau.ac.kr, {anhg99, seojy}@sogang.ac.kr

ABSTRACT
A (page or Web) snippet is a document excerpt that allows a user to judge whether a document is relevant without accessing it. This paper proposes an effective snippet generation method: the pseudo relevance feedback technique and text summarization techniques are applied to the extraction of salient sentences in order to generate good-quality snippets. In the experiments, the proposed method performed much better than other methods, including the snippets of Google and Naver.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods

General Terms
Algorithms, Performance, Experimentation

Keywords
Snippet, Pseudo Relevance Feedback, Text Summarization

1. INTRODUCTION
Today's Web search engines use text summaries to help users make relevance decisions. Most summaries, such as those of Google, are built around the query terms from the user's search: query-based summaries. Since Web search engines still require users to read many closely ranked documents to reach their information-seeking goals, good query-based summaries are helpful for Information Retrieval and topic detection. Snippet summaries are the query-based summaries shown by Web search engines; they indicate which query terms appear in a document and show the words around those query terms [1]. This paper investigates the quality of the snippets produced by commercial Web search engines and proposes a new, effective snippet generation method. The proposed method is based on pseudo relevance feedback, a well-known approach to query expansion in Information Retrieval; here, the technique is applied to a query-based summarization task to create high-quality snippets. In the experiments, the proposed method showed significantly improved performance on data sets from two commercial search engines: Google and Naver (the dominant Web search engine in Korea).

2. THE QUERY-BASED SUMMARIZATION USING PSEUDO RELEVANCE FEEDBACK
The idea of pseudo relevance feedback is to take the results initially returned for a query and to use information about whether or not those results are relevant to perform a new query. To apply the technique to query-based summarization, the relevant sentences must first be found automatically. In query-based summarization, words or phrases from sections of text are compared to the user's query terms, which mostly reflect the user's interest. From this point of view, a sentence that includes a query term can be regarded as a relevant sentence, and all the terms of the relevant sentences become candidates for query expansion. The Term Selection Value (TSV) of each candidate term is then estimated by a relevance weighting function from the probabilistic model [2], and the top k candidate terms are selected for query expansion. The importance score of each sentence is estimated from the sum of the TSVs of the expanded query terms it contains and from the location of the sentence; the salient sentences are extracted according to this score, and the snippet of each document is created from them.
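As a rough illustration of this relevance step, the short Python sketch below treats every sentence that shares a term with the initial query as pseudo-relevant and collects the remaining terms of those sentences as expansion candidates. It is a minimal sketch under simplifying assumptions, not the authors' implementation: sentences are assumed pre-segmented and pre-tokenized, the function names are ours, and the paper's restriction of candidates to noun terms (which would require a POS tagger) is omitted.

```python
def partition_sentences(sentences, query_terms):
    """Sentences sharing a term with the query count as pseudo-relevant;
    the rest count as non-relevant. (Illustrative helper, not from the paper.)"""
    query = set(query_terms)
    relevant = [s for s in sentences if query & set(s)]
    non_relevant = [s for s in sentences if not (query & set(s))]
    return relevant, non_relevant

def expansion_candidates(relevant, query_terms):
    """Every term of a pseudo-relevant sentence, minus the query terms
    themselves, becomes a candidate for query expansion."""
    terms = {t for sentence in relevant for t in sentence}
    return terms - set(query_terms)
```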
2.1 Query Expansion Using Statistical Relevance Weighting for Summarization
All the noun terms from the relevant sentences become candidate terms for query expansion. To choose salient terms among them, the relevance weight (TSV) of each term is estimated by the following function, which depends on the distribution of candidate terms over the relevant and non-relevant sentences. It is based on the basic probabilistic model proposed by Robertson and Sparck Jones [2,3]:

w_t = TSV_t = \log \frac{p(1-q)}{q(1-p)} = \log \frac{(r+0.5)(S-s+0.5)}{(R-r+0.5)(s+0.5)}    (1)

where
w_t: the relevance weight of candidate term t
p: the probability that candidate term t occurs within the set of relevant sentences for the initial query Q
q: the probability that candidate term t occurs within the set of non-relevant sentences for the initial query Q
R: the number of relevant sentences for the initial query Q
r: the number of relevant sentences for the initial query Q that contain term t
S: the number of non-relevant sentences for the initial query Q
s: the number of non-relevant sentences for the initial query Q that contain term t

After all the candidate terms are sorted by their TSVs, the top k candidate terms are selected as expanded query terms. In this paper, k is set to six by a simple experiment.

2.2 The Importance Score Estimation of Each Sentence for Generating a Snippet
All the sentences that include initial query terms become candidates for extraction into the query-based summary used as a snippet. This section describes how salient sentences are extracted from this set of candidate sentences. The importance score of each candidate sentence is estimated from the TSVs of the expanded query terms and from the location of the sentence within its document:

RWscore(S_i) = \sum_{w_j \in S_i} w_j    (2)

score(S_i) = \alpha \cdot \frac{RWscore(S_i)}{RWscore_{Max}} + (1-\alpha)\left(1 - \frac{i-1}{N}\right)    (3)

In formula (2), RWscore(S_i) denotes the importance score calculated as the sum of the relevance weights (TSVs) of the expanded query terms occurring in a candidate sentence S_i. By formula (3), the final importance score is the linear combination of the normalized RWscore(S_i) and a location score; RWscore_{Max} denotes the maximum RWscore value within the document, N is the total number of sentences in the document, and S_i denotes the sentence at position i. The parameter \alpha is set to 0.4 by a simple experiment. As a result, the top m candidate sentences are selected as the summary used for the snippet.
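To make formulas (1) through (3) concrete, here is a minimal, self-contained Python sketch of the whole scoring pipeline. It is an illustration under simplifying assumptions rather than the authors' implementation: sentences are pre-tokenized lists of terms, candidate terms are not restricted to nouns, and all function names (tsv, expand_query, select_snippet_sentences) are ours.

```python
import math

def tsv(r, R, s, S):
    """Relevance weight of a candidate term, formula (1), with the usual
    0.5 smoothing of Robertson and Sparck Jones [2]."""
    return math.log(((r + 0.5) * (S - s + 0.5)) / ((R - r + 0.5) * (s + 0.5)))

def expand_query(relevant, non_relevant, query_terms, k=6):
    """Rank candidate terms from the relevant sentences by TSV and keep the
    top k (k = 6 in the paper) as expanded query terms."""
    R, S = len(relevant), len(non_relevant)
    candidates = {t for sent in relevant for t in sent} - set(query_terms)
    weights = {}
    for t in candidates:
        r = sum(1 for sent in relevant if t in sent)      # relevant sentences with t
        s = sum(1 for sent in non_relevant if t in sent)  # non-relevant sentences with t
        weights[t] = tsv(r, R, s, S)
    top_k = sorted(weights, key=weights.get, reverse=True)[:k]
    return {t: weights[t] for t in top_k}

def select_snippet_sentences(sentences, query_terms, expanded, alpha=0.4, m=2):
    """Formulas (2) and (3): sum the TSVs of expanded query terms in each
    candidate sentence, normalize, and combine with a location score."""
    N = len(sentences)
    query = set(query_terms)
    # Only sentences containing an initial query term are candidates.
    candidates = [i for i, sent in enumerate(sentences) if query & set(sent)]
    if not candidates:
        return []
    rw = {i: sum(w for t, w in expanded.items() if t in sentences[i])
          for i in candidates}                             # formula (2)
    rw_max = max(rw.values())
    rw_max = rw_max if rw_max > 0 else 1.0                 # guard against division by zero
    score = {i: alpha * rw[i] / rw_max + (1 - alpha) * (1 - i / N)  # formula (3);
             for i in candidates}                          # i is 0-based here
    best = sorted(score, key=score.get, reverse=True)[:m]
    return sorted(best)                                    # top m, in document order

if __name__ == "__main__":
    # Toy pre-tokenized document; a real system would also POS-tag and keep
    # only noun terms as expansion candidates, as the paper does.
    doc = [
        ["snippet", "generation", "helps", "users", "judge", "relevance"],
        ["this", "sentence", "is", "unrelated", "filler"],
        ["pseudo", "relevance", "feedback", "expands", "the", "query"],
        ["query", "terms", "locate", "the", "salient", "sentences"],
    ]
    query = ["relevance", "query"]
    rel = [s for s in doc if set(query) & set(s)]
    nonrel = [s for s in doc if not (set(query) & set(s))]
    expanded = expand_query(rel, nonrel, query, k=6)
    print(select_snippet_sentences(doc, query, expanded, alpha=0.4, m=2))
```

Note that the location term (1 - (i-1)/N) in formula (3) favors sentences near the top of the document, and with alpha = 0.4 the location score carries slightly more weight (0.6) than the normalized relevance score.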
3. EVALUATION

3.1 Data Sets and Performance Measure
Two data sets were constructed from the Google and Naver search engines. Four categories (science & technology, sports & entertainment, politics & economy, and society & culture) were selected, and 20 queries were extracted from each category. For each query, the top 10 Web documents returned by each search engine were collected, so the data sets initially comprised 80 queries and 1,600 documents. Some of these documents were then removed because the number of their sentences containing query terms equaled the number of sentences in the search engine's snippet, leaving no room for different extraction methods to produce different summaries. The final data set is composed of 80 queries and 1,126 documents.

Three people selected the most salient sentence from each document by a voting mechanism; they were not members of our research group, and their majors were not computer-related. Each query and its related documents were provided to them. For a fair comparison, the number of sentences in a summary extracted by the proposed method is set equal to the number of sentences in the corresponding snippet from Google or Naver. Accuracy is used as the performance measure: if an extracted summary or a search engine's snippet includes the most salient sentence selected by the judges, it is regarded as a correct summary; otherwise, it is regarded as incorrect.

In addition, the proposed query expansion technique using pseudo relevance feedback is compared with the title method, a variation of pseudo relevance feedback that adds the document title to the initial query [4]. In the title method, the importance score of each sentence is measured by the similarity between the expanded query with title, T, and each sentence S_i:

Sim(T, S_i) = \sum_j t_j \cdot s_j    (4)

where t_j denotes the presence or absence of the j-th term in the expanded query with title T, and s_j denotes the presence or absence of the j-th term in the sentence S_i. The final title method uses a formula similar to formula (3), with Sim(T,S_i) in place of RWscore(S_i).

3.2 Experimental Results
Table 1 shows that the proposed method outperforms the snippets of Google and Naver, and that the proposed query expansion method is also much superior to the title method on both data sets.

Table 1. Performance results on the Google and Naver data sets (number of correct summaries and accuracy)

Data set          Total # of docs   Search engine snippet   Title method   Proposed method
Google Data Set   560               114 (20.4%)             317 (56.6%)    376 (67.1%)
Naver Data Set    566               337 (59.5%)             325 (57.4%)    389 (68.7%)

4. CONCLUSION
This paper presented an effective snippet generation method that uses the pseudo relevance feedback technique. The effectiveness of the method was verified by experiments on the Google and Naver data sets. However, its execution time can make it difficult to apply in practical IR systems. Since most commercial search engines analyze input queries to provide users with rankings of popular queries, the snippets for these popular queries can be generated in advance with the proposed method as a batch process.

5. ACKNOWLEDGEMENT
This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs funded by the Ministry of Commerce, Industry and Energy of Korea.

6. REFERENCES
[1] McDonald, D.M. and Chen, H. Summary in Context: Searching Versus Browsing. ACM Transactions on Information Systems, 24(1), (Jan. 2006), 111-141.
[2] Robertson, S.E. and Sparck Jones, K. Relevance Weighting of Search Terms. Journal of the American Society for Information Science, 27(3), (May-June 1976), 129-146.
[3] Harman, D. Toward Interactive Query Expansion. In Proceedings of the 11th ACM SIGIR Conference, (1988), 321-331.
[4] Goldstein, J., Kantrowitz, M., Mittal, V., and Carbonell, J. Summarizing Text Documents: Sentence Selection and Evaluation Metrics. In Proceedings of the 22nd ACM SIGIR Conference, (1999), 121-128.