Improving Personalized Web Search using Result Diversification Filip Radlinski Cornell University Ithaca, NY, USA Susan Dumais Microsoft Research Redmond, WA, USA filip@cs.cornell.edu sdumais@microsoft.com ABSTRACT We present and evaluate methods for diversifying search results to improve personalized web search. A common personalization approach involves reranking the top N search results such that documents likely to be preferred by the user are presented higher. The usefulness of reranking is limited in part by the number and diversity of results considered. We propose three methods to increase the diversity of the top results and evaluate the effectiveness of these methods. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval General Terms: Algorithms, Measurement Keywords: Personalized web search, Result diversity 1. INTRODUCTION Personalizing search results for individual users is increasingly being recognized as an important future direction for web search [3, 5, 6, 7]. Providing results specific to individual users is particularly important because different users expect different information even given the same query [4]. One proposed approach involves providing a user profile to the search engine, which can then use it to bias search results toward the user's interests. However, this requires the search engine to perform the personalization at additional computational expense, and requires that the user trusts the search engine with her profile. In this work, we focus on an alternative approach that runs entirely client-side, where the client requests a larger number of search results and reranks them such that documents more likely to interest the user are presented higher [3, 5]. The primary limitation of client side reranking is that the system can only rerank the top N results. While reranking may allow effective personalization when web pages of particular interest to the user are present, it cannot be effective if all top N results are similar. Anagnostopoulos et al. recently proposed a method to sample search results to avoid homogeneity [1]. We propose an alternative of using query-query reformulations to understand the variety of user intents and improve the effectiveness of client-side reranking. By observing how large numbers of users modify their search queries, we see which kinds of results tend to be missing from the top of search results, from the user's Copyright is held by the author/owner(s). SIGIR'06, August 6­11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008. perspective. For example, by looking at logs from a large web search engine, we observed that the query "windows" is often followed by specializations such as "windows xp" or clarifications such as "house windows". This suggests that if we want to personalize results for a user who issued the query "windows", we may also want to consider results for both of these reformulations. We believe that analyzing query-query reformulations adds interesting diversity within the result set by focusing on user intents that are not well represented in the original results. We also believe that such a method could be used to diversify general web search results, although we do not address this question here. We first present our general strategy for diversifying search results using query-query reformulations, then propose how to select the reformulations to consider. Next, we describe a method to measure result diversity. Finally, we present an evaluation of the effectiveness of our approach. 2. DIVERSIFICATION METHODS Assume that we want to personalize search by reranking 100 results using query reformulations to introduce diversity into those results. Given a query q , we generate a set of k 10 related queries R(q ). We then take k+0 results from each 1 query in R(q ) and from q . This gives D, a set of 100 results to rerank. For our experiments we used k {0, 2, 4, 9, 19}. When k=0, the top 100 results from the original query are considered for reranking, and when k=19, the top 5 results from q and from 19 reformulations are considered. We now describe the data and algorithms used to select R(q ). To obtain query-query reformulations, we analyzed a large sample of the query logs from a popular web search engine over about 6 weeks. For each query qi we measured ni , the number of times the query was observed, and pi , the empirical probability that qi was followed by any other query within a thirty minute time window. For a pair of queries (qi , qj ), let nij be the number of times qi was followed by qj . n pij = niij is the empirical probability of qi being followed by qj . pij is the related symmetric measure pj = pij pj i . i We developed three methods for generating R(q ). The Most Frequent (MF) method sets R(qi ) to the queries qj with highest nij . These are the queries that most often follow qi . The Maximum Result Variety (MRV) method greedily selects queries that are both frequent reformulations (using pij ) and different from other queries that have already been selected (using pk ). We used a weighted combination j of these two factors, arg maxqj pij - (1 - ) maxqk R(qi ) pk , j with = 0.5. MRV is motivated by the MMR measure of Carbonell and Goldstein [2] and aims to select a set of queries that are related to qi yet different from each other. 691 Measure of diversity, diversity(D) Finally, the Most Satisfied (MS) method sets R(qi ) as the set of queries qj with minimum pj and pij > 0.001 and nij 2. This method finds queries that tend not to be further reformulated yet occur with some minimum frequency. 1.00 MF: Fixed MRV: Fixed MS: Fixed MF: User MRV: User 0.90 0.80 3. DIVERSITY EVALUATION Let match(di , u) measure how well document di matches the interests of the user u. The maximum match in D, div ersity (D) = maxdi D match(di , u), reflects the extent to which at least one result is very similar to a user's interests. We used the average value of div ersity (D) across all users as a measure of diversity for each method. We also looked at measuring diversity using the top 2-5 match scores (rather than just the maximum). We measure match(di , u) using the relevance-feedback approach for reranking developed by Teevan et al. [5]. They proposed a modification of the standard BM25 weighting scheme, in which relevance information is obtained from a local representation of a user's interests: wt = log (rt + 0.5)(N - nt + 0.5) , (nt + 0.5)(R - rt + 0.5) 0.70 0.60 0 2 4 9 k, the number of reformulations in R(q) 19 Figure 1: Evaluation of MF, MRV and MS diversity metho ds on a fixed query set (Fixed) as well as on queries taken from users' web browser cache (User). original query. Even with this small initial dip, the linear correlation between k and diversity score is strong and significant (r = 0.90, p = 0.037). Incorporating results from reformulations leads to higher computational cost for each query. However we expect the load to increase sub-linearly in k since results for common reformulations will be in cache, and a user will reformulate less often if they are satisfied with their first query. where N is the number of documents in the corpus, R is the number for which we have relevance feedback, and nt and rt are the number of documents in N and R that contain the term t. We computed these weights for both individual words (unigrams) and for pairs of adjacent words (bigrams): X matchunigram (di , u) = wt tdi 5. CONCLUSIONS In this poster, we presented a number of methods to collect diverse results for a given query using past query reformulations. Our evaluation suggests that this is a promising method to improve personalized reranking of search results. We will next evaluate these diversification techniques in an end-to-end web search personalization system. We would like to thank Eric Horvitz for useful discussions, Ed Cutrell for assistance with statistical analysis, Robert Ragno for assistance with collecting data and the volunteers in the user study. matchbigram (di , u) = X ti ,tj di wti ,tj To compute these measures, we estimate the four parameters for both unigrams and bigrams. N and nt were computed from a sample of 1.5 billion web pages. R and rt were computed for each user using a full text index of the files, emails and web pages to represent their interests as in [5]. 6. REFERENCES [1] A. Anagnostopoulos, A. Z. Broder, and D. Carmel. Sampling search engine results. In International World Wide Web Conference, pages 245­256, 2005. [2] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Poster at SIGIR, 1998. [3] K. Sugiyama, K. Hatano, and M. Yoshikawa. Adaptive web search based on user profile constructed without any effort from users. In International World Wide Web Conference, pages 675­684, 2004. [4] J. Teevan, S. T. Dumais, and E. Horvitz. Beyond the commons: Investigating the value of personalizing web search. In Workshop on New Tech. for Personalized Information Access (PIA), pages 84­92, 2005. [5] J. Teevan, S. T. Dumais, and E. Horvitz. Personalizing search via automated analysis of interests and activities. In SIGIR 2005, pages 449­456, 2005. [6] Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In SIGIR 2002, pages 81­88, 2002. [7] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In International World Wide Web Conference, pages 22­32, 2005. 4. EXPERIMENT AND RESULTS We evaluated these methods for 33 volunteers. Figure 1 shows the bigram match results for the diversification methods and five values of k. The MS method did not generate enough reformulations for some of the user-specific queries so we omit it for simplicity. The unigram match and alternative diversity measures follow the same general trends. The evaluation was performed on two types of queries. The lower three curves show the results for a set of 30 fixed queries chosen from the search engine log. The queries varied in frequency, topic, typical reformulation patterns, etc. The upper two curves show the results for the most recent queries in each user's browser history, averaging 76 queries per user. The main effect of query type (fixed, user) is reliable (F (1, 32) = 4.82, p = 0.022), showing that the match score for queries of interest to the user are higher. The main effect of diversification method is marginally reliable (F (1, 32) = 3.30, p = 0.079), with MRV leading to somewhat higher diversity scores. The main effect of k is reliable (F (4,128) = 3.82, p = 0.006), showing that diversity scores increase as the number of reformulations considered increases. Interestingly for the MF method the first few reformulations reduce the result diversity. This suggests that the most frequent reformulations are not very different in topic from the 692