WWW 2007 / Track: Search / Session: Personalization

A Large-scale Evaluation and Analysis of Personalized Search Strategies

Zhicheng Dou (Nankai University, Tianjin 300071, China; douzc@yahoo.com.cn; work was done while visiting Microsoft Research Asia)
Ruihua Song (Microsoft Research Asia, Beijing 100080, China; rsong@microsoft.com)
Ji-Rong Wen (Microsoft Research Asia, Beijing 100080, China; jrwen@microsoft.com)

ABSTRACT

Although personalized search has been proposed for many years and many personalization strategies have been investigated, it is still unclear whether personalization is consistently effective on different queries, for different users, and under different search contexts. In this paper, we study this problem and provide some preliminary conclusions. We present a large-scale evaluation framework for personalized search based on query logs, and then evaluate five personalized search strategies (two click-based and three profile-based) using 12-day MSN query logs. By analyzing the results, we reveal that personalized search yields significant improvements over common web search on some queries but has little effect on others (e.g., queries with small click entropy), and that it can even harm search accuracy in some situations. Furthermore, we show that straightforward click-based personalization strategies perform consistently and considerably well, while profile-based ones are unstable in our experiments. We also reveal that both long-term and short-term contexts are very important for improving search performance in profile-based personalized search strategies.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - information filtering, search process; H.3.4 [Information Storage and Retrieval]: Systems and Software - performance evaluation (efficiency and effectiveness)

General Terms: Algorithms, Experimentation, Human Factors, Theory

Keywords: Click-through, Personalization, Personalized Search, Query Log, Re-ranking

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION

One criticism of search engines is that, when queries are issued, most return the same results to all users. In fact, the vast majority of queries to search engines are short [27, 12] and ambiguous [16, 7], and different users may have completely different information needs and goals for the same query [12, 26, 23, 34]. For example, a biologist may use the query "mouse" to get information about rodents, while a programmer may use the same query to find information about computer peripherals. When such a query is submitted to a search engine, it takes effort for a user to pick out the information he/she actually wants. Even on the query "free mp3 download", users' selections vary although almost all of them are looking for websites from which to download free mp3s: one may select "www.yourmp3.net", while another may prefer "www.seekasong.com". Personalized search is considered a solution to this problem because it provides different search results based on the preferences of each user. Various personalization strategies, including [21, 22, 26, 31, 14, 9, 35, 30, 19], have been proposed, and personalized web search systems have been developed, but they are far from optimal.
One problem with current personalized search is that most proposed methods are applied uniformly to all users and queries. In fact, we believe queries should not be handled in the same manner, because: (1) Personalization may lack effectiveness on some queries, and there is no need to personalize them. This has also been observed by [34]. For the query "mouse" mentioned above, personalization based on a user interest profile could achieve greater relevance for individual users than common web search; beyond all doubt, personalization brings significant benefit to users in this case. In contrast, for the query "Google", which is a typical navigational query as defined in [3, 17], almost all users consistently select results leading to Google's homepage, so no personalized strategy can provide significant benefit. (2) Different strategies may have different effects on different queries. For the query "free mp3 download", the typical user-interest-profile-based personalization such as the method proposed in [6], which led to better results for the query "mouse", may achieve poor results, because the results for "free mp3 download" are mostly classified into a single topic category and profile-based personalization is too coarse to filter out the desired results. In such a case, simply leveraging the pages visited by this user in the past may achieve better performance. Furthermore, blindly applying one personalization strategy to some queries may harm the user experience. For example, when a sports fan submits the query "office", he/she may not be seeking information on sports, but rather help with Microsoft Office software or any number of other office-related topics. In this situation, if interest-based personalization is applied, many irrelevant results could erroneously be moved to the front and the user may become confused. (3) Personalization strategies may provide different effectiveness for different search histories and under different contexts. For example, it could be difficult to learn the interests of users who have done few searches. Furthermore, as Shen et al. [25] noted, users often search for documents to satisfy short-term information needs, which may be inconsistent with their general interests. In such cases, long-term user profiles may be useless and short-term query context may be more helpful. In short, a specific personalized search strategy may show great improvement over non-personalized search on some queries, for some users, and under some search contexts, but it can also be unnecessary and even harmful in other situations. Until now, little investigation has been done into how personalization strategies perform under different situations. In this paper, we draw some conclusions on this problem and make the following contributions: (1) We develop a large-scale personalized search evaluation framework based on query logs. In this framework, different personalized re-ranking strategies are simulated, and search accuracy is approximately and automatically evaluated using the real user clicks recorded in query logs. The framework enables us to evaluate personalization on a large scale. (2) We propose two click-based personalized search strategies and three profile-based personalized search strategies.
We evaluate all five approaches in the evaluation framework using 12-day query logs from the MSN search engine (http://search.msn.com) and provide an in-depth analysis of the results. (3) We reveal that personalization has different effectiveness on different queries, users, and search contexts. Personalization brings significant search accuracy improvements on queries with large click entropy, and has little effect on queries with small click entropy. Personalization strategies can even harm search accuracy on some queries. Therefore, we conclude that not all queries should be personalized equally. (4) We show that click-based personalization strategies perform consistently and considerably well, though they only work on repeated queries. We find that our profile-based strategies are unstable because of their straightforward implementation, and that they become more unstable as a user's search history grows. We reveal that both long-term and short-term contexts are very important for improving search performance in profile-based personalization. The remaining sections are organized as follows. In Section 2, we discuss related work. We present a re-ranking framework and introduce how to use it to evaluate personalization strategies in Section 3. In Section 4, we give several personalization strategies at both the person and group levels. In Section 5, we introduce the dataset used in our experiments and detailed data statistics. We compare and analyze the results of these strategies in Section 6. We conclude our work in Section 7.

2. RELATED WORK

There have been several prior attempts to personalize web search. One approach is to ask users to specify general interests. The user interests are then used to filter search results by checking the content similarity between returned web pages and user interests [22, 6]. For example, [6] used ODP (Open Directory Project, http://dmoz.org/) entries to implement personalized search based on user profiles corresponding to topic vectors from the ODP hierarchy. Unfortunately, studies have shown that the vast majority of users are reluctant to provide any explicit feedback on search results and their interests [4]. Many later works on personalized web search focused on how to automatically learn user preferences without any user effort [22, 19, 29, 26]. User profiles are built in the form of user interest categories or term lists/vectors. In [19], user profiles were represented by a hierarchical category tree based on ODP with corresponding keywords associated with each category; the profiles were automatically learned from search history. In [29], user preferences were built as vectors of distinct terms and constructed by accumulating past preferences, including both long-term and short-term preferences. Tan et al. [31] used statistical language modeling to mine contextual information from long-term search history. In this paper, user profiles are represented as weighted topic categories, similar to those in [28, 6, 22], and these profiles are also automatically learned from users' past clicked web pages. Many personalized web search strategies based on the hyperlink structure of the web have also been investigated. Personalized PageRank, a modification of the global PageRank algorithm, was first proposed for personalized web search in [20]. In [10], multiple Personalized PageRank scores, one for each main topic of ODP, were used to enable "topic-sensitive" web search.
Jeh and Widom [14] gave an approach that scales well with the size of hub vectors to realize personalized search based on Topic-Sensitive PageRank. The authors of [32] extended the well-known HITS algorithm by artificially increasing the authority and hub scores of pages marked relevant by the user in previous searches. Most recently, [17] developed a method to automatically estimate a user's hidden interests based on the Topic-Sensitive PageRank scores of the user's past clicked pages. In most of the above personalized search strategies, only the information provided by the user himself/herself is used to create user profiles. There are also strategies that incorporate the preferences of a group of users to accomplish personalized search. In these approaches, the search histories of users whose interests are similar to the test user's are used to refine the search. Collaborative filtering is a typical group-based personalization method and has been used in personalized search in [29] and [30]. In [29], user profiles are constructed based on a modified collaborative filtering algorithm [15]. In [30], the authors proposed a novel method, CubeSVD, to apply personalized web search by analyzing the correlations among users, queries, and web pages contained in click-through data. In this paper, we also introduce a method that incorporates the click histories of a group of users to personalize web search.

Some researchers have also found that personalization has varying effectiveness on different queries. For instance, Teevan et al. [34] suggested that not all queries should be handled in the same manner: for less ambiguous queries, current web search ranking might be sufficient, and personalization is thus unnecessary. In [6] and [5], test queries were divided into three types: clear queries, semi-ambiguous queries, and ambiguous queries. The authors concluded that personalization significantly increased output quality for ambiguous and semi-ambiguous queries, but that for clear queries one should prefer common web search. In [31], queries were divided into fresh queries and recurring queries. The authors found that recent history tended to be much more useful than remote history, especially for fresh queries, while the entire history was helpful for improving the search accuracy of recurring queries. This also suggests that not all queries should be personalized in the same way. These conclusions inspired our detailed analysis.

3. EXPERIMENT METHODOLOGY

The typical evaluation method used in existing personalized search research is to conduct user studies [23, 26, 6, 34, 28, 29, 19, 5, 31]. Usually, a certain number of people participate in the evaluated personalized search system over several days. The user profiles are manually specified by the participants themselves [6] or automatically learned from search histories. To evaluate the performance of personalized search, each participant is required to issue a certain number of test queries and to judge whether each result is relevant. The advantage of this approach is that relevance can be explicitly specified by participants. Unfortunately, the method also has drawbacks: the limited number of participants and test queries may bias the accuracy and reliability of the evaluation. We propose a framework that enables large-scale evaluation of personalized search. In this framework, we use click-through data recorded in query logs to simulate user experience in web search. In general, when a user issues a query, he/she usually checks the documents in the result list from top to bottom, clicks one or more documents that look relevant, and skips the documents he/she is not interested in. If a specific personalization method can re-rank the "relevant" documents closer to the front of the result list, the user will be more satisfied with the search. Therefore, we utilize clicking decisions as relevance judgments to evaluate search accuracy. Since click-through data can be collected at low cost, large-scale evaluation becomes possible under this framework. Furthermore, since click-through data reflect real-world distributions of queries, users, and user selections, evaluating personalized search with click-through data could be more accurate than user surveys.
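To make the click-as-judgment simulation concrete, the following is a minimal Python sketch that groups logged clicks into per-(user, query) relevance sets. The record fields mirror the log attributes described below (Cookie GUID, Browser GUID, clicked page, rank), but the exact field and function names are our own illustration, not the MSN log schema.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class ClickRecord:
    cookie_guid: str   # long-lived user identifier
    browser_guid: str  # simple session identifier
    query: str
    clicked_url: str
    rank: int          # rank of the clicked result in the original list

def relevance_judgments(log: list[ClickRecord]) -> dict[tuple[str, str], set[str]]:
    """Treat each (user, query) pair's clicked URLs as its 'relevant' set."""
    judged = defaultdict(set)
    for rec in log:
        judged[(rec.cookie_guid, rec.query)].add(rec.clicked_url)
    return judged
```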
One potential concern about this evaluation framework is that the original user selections may be influenced by the initial result rankings [2], and thus it could be unfair to evaluate a reordering of the original search results using the original click data. Our framework may fail to credit the re-ranking of documents that are relevant but were not clicked by users, and this may bias the evaluation. However, the framework is still effective for evaluating approximate search accuracy, and it is the best method we could adopt to enable large-scale evaluation of personalized search. We will investigate more stable methodologies in future work.

In the evaluation framework, we use MSN query logs to simulate and evaluate personalized re-ranking. In MSN query logs, each user is identified by a "Cookie GUID", which remains the same on a machine as long as the cookie is not cleared. For each query, the MSN search engine logs the query terms and records all click-through information, including the clicked web pages and their ranks. A "Browser GUID", which remains the same until the browser is closed, is also recorded for each query. We use it as a simple identifier of a session, which comprises a series of queries made by a single user within a small range of time and is usually meant to capture a single user's attempt to fulfill a single information need [27, 12].

3.1 Re-ranking Evaluation Framework

In this evaluation framework, we first download search results from the MSN search engine, then use a personalization strategy to re-rank the results. The click-through data recorded in the test set is then used to evaluate the re-ranking performance. In more detail, query re-ranking and evaluation are completed in the following steps: (1) Download the top 50 search results from the MSN search engine for the test query. We denote the downloaded web pages by U and the rank list containing the rankings of these web pages by tau_1. (2) Compute a personalized score for each web page x_i in U using the personalization algorithm, and generate a new rank list tau_2 over U by sorting the personalized scores in descending order. The personalization strategies are introduced in Section 4. (3) Combine the rankings in tau_1 and tau_2 using Borda's rank fusion method [13, 8] and sort the web pages by the combined rankings. The final rank list, denoted tau, is the final personalized search result list for the query. Notice that we use a rank-based fusion method because we are unable to obtain relevance scores from the MSN search engine. (4) Use the measurements introduced in Section 3.2 to evaluate the personalization performance on tau. In this paper, we assume the results downloaded from the MSN search engine are consistent with those returned to the user when the query was originally submitted. We use the most recent MSN query logs, from August 2006, and downloaded the search results in the early days of September 2006 so that changes in the search results can be ignored (we also tried rebuilding the search results from the query logs, but this failed because of the sparseness of queries and user clicks). We downloaded only the top 50 search results because most users never look beyond the top 50 entries in the test set.
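To illustrate step (3), here is a minimal sketch of Borda-style rank fusion over the two rank lists tau_1 and tau_2. Giving the two lists equal weight is our simplifying assumption; the paper only states that a rank-based fusion method [13, 8] is used.

```python
def borda_fuse(ranking_a: list[str], ranking_b: list[str]) -> list[str]:
    """Fuse two rankings of the same items with equal-weight Borda counts.

    Each item earns points inversely related to its position in each list;
    items are re-sorted by total points (higher is better).
    """
    assert set(ranking_a) == set(ranking_b), "both lists must rank the same pages"
    n = len(ranking_a)
    points = {url: 0.0 for url in ranking_a}
    for ranking in (ranking_a, ranking_b):
        for pos, url in enumerate(ranking):
            points[url] += n - pos  # top position earns n points, last earns 1
    return sorted(points, key=points.get, reverse=True)

# Example: tau1 is the original MSN ranking, tau2 the personalized one.
tau1 = ["a.com", "b.com", "c.com"]
tau2 = ["c.com", "a.com", "b.com"]
print(borda_fuse(tau1, tau2))  # ['a.com', 'c.com', 'b.com']
```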
3.2 Evaluation Measurements

We use two measurements to evaluate the personalized search accuracy of the different strategies: the rank scoring metric introduced in [30, 15] and the average rank metric introduced in [23].

3.2.1 Rank Scoring

The rank scoring metric proposed by Breese et al. [15] evaluates the effectiveness of collaborative filtering systems that return an ordered list of recommended items. Sun et al. [30] used it to evaluate personalized web search accuracy, and we also use it in this paper. The expected utility of a ranked list of web pages is defined as

R_s = \sum_{j} \frac{\delta(s, j)}{2^{(j-1)/(\alpha-1)}}

where j is the rank of a page in the list, \delta(s, j) is 1 if page j is clicked in test query s and 0 otherwise, and \alpha is set to 5 as in [15]. The final rank scoring reflects the utilities of all test queries:

R = 100 \frac{\sum_s R_s}{\sum_s R_s^{Max}}    (1)

Here, R_s^{Max} is the maximum possible utility, obtained when all clicked pages appear at the top of the ranked list. A larger rank scoring value indicates better personalized search performance.

3.2.2 Average Rank

The average rank metric is used to measure the quality of personalized search in [23, 28]. The average rank of a query s is defined as

AvgRank_s = \frac{1}{|P_s|} \sum_{p \in P_s} R(p)

where P_s denotes the set of clicked web pages for test query s, and R(p) denotes the rank of page p. The final average rank over the test query set S is computed as

AvgRank = \frac{1}{|S|} \sum_{s \in S} AvgRank_s    (2)

A smaller average rank value indicates better placement of relevant results, i.e., better result quality. In fact, the rank scoring and average rank metrics behave similarly when evaluating personalization performance, and our experimental results show that they are consistent.
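Both metrics are easy to reproduce from the click data. Below is a small Python sketch of Equations (1) and (2) under the paper's setting alpha = 5; the function names and the list-of-ranks input format are ours.

```python
def rank_score(clicked_ranks: list[int], alpha: int = 5) -> float:
    """Expected utility R_s of one query; ranks are 1-based click positions."""
    return sum(1.0 / 2 ** ((j - 1) / (alpha - 1)) for j in clicked_ranks)

def max_rank_score(num_clicks: int, alpha: int = 5) -> float:
    """R_s^Max: the utility if all clicked pages sat at the top of the list."""
    return rank_score(list(range(1, num_clicks + 1)), alpha)

def final_rank_scoring(queries: list[list[int]], alpha: int = 5) -> float:
    """Equation (1): 100 * sum(R_s) / sum(R_s^Max) over all test queries."""
    total = sum(rank_score(q, alpha) for q in queries)
    best = sum(max_rank_score(len(q), alpha) for q in queries)
    return 100.0 * total / best

def average_rank(queries: list[list[int]]) -> float:
    """Equation (2): mean over queries of the mean rank of clicked pages."""
    return sum(sum(q) / len(q) for q in queries) / len(queries)

# A query whose two clicks sit at ranks 1 and 3:
print(final_rank_scoring([[1, 3]]))  # below 100, since rank 3 loses utility
print(average_rank([[1, 3]]))        # 2.0
```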
4. PERSONALIZATION STRATEGIES

As described in Section 2, personalized search methods can be categorized into person-level and group-level methods. In this paper, we propose several re-ranking methods at both levels. These strategies re-rank search results by computing a personalized score S(q, p, u) for each page p in the results returned to user u for query q, as introduced in Section 3. The following subsections introduce the strategies.

4.1 Person-level Re-ranking

4.1.1 Person-level Re-ranking Based on Historical Clicks

We suppose that, for a query q submitted by a user u, the web pages frequently clicked by u in the past are more relevant to u than those seldom clicked by u. Thus, the personalized score of page p can be computed as:

S^{P-Click}(q, p, u) = \frac{|Clicks(q, p, u)|}{|Clicks(q, \cdot, u)| + \beta}    (3)

Here, |Clicks(q, p, u)| is the number of past clicks on web page p by user u for query q, |Clicks(q, \cdot, u)| is the total number of clicks for query q by u, and \beta is a smoothing factor (\beta = 0.5 in this paper). Notice that |Clicks(q, p, u)| actually decides the ranking of the page, while |Clicks(q, \cdot, u)| and \beta are only used for normalization. A disadvantage of this approach is that the re-ranking fails when the user has never issued the query before. In our dataset, about one-third of the test queries are repeated by the same user, and this approach only benefits those queries. We denote this approach by P-Click.
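A minimal sketch of the P-Click score of Equation (3) follows. The nested Counter layout for past click counts is our own assumption about storage, not the paper's implementation.

```python
from collections import Counter

# click_counts[(user, query)] is a Counter mapping URL -> past click count.
click_counts: dict[tuple[str, str], Counter] = {}

def p_click_score(query: str, page: str, user: str, beta: float = 0.5) -> float:
    """Equation (3): past clicks on this page, normalized and smoothed."""
    history = click_counts.get((user, query), Counter())
    total = sum(history.values())           # |Clicks(q, ., u)|
    return history[page] / (total + beta)   # |Clicks(q, p, u)| / (total + beta)

click_counts[("u1", "free mp3 download")] = Counter({"www.yourmp3.net": 3})
print(p_click_score("free mp3 download", "www.yourmp3.net", "u1"))  # 3 / 3.5
```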
4.1.2 Person-level Re-ranking Based on User Interests

As introduced in Section 2, much current research uses interest profiles to personalize search results [22, 19, 6]. In this paper, we also propose a personalization method based on a user interest profile (we denote this method by L-Profile). A user's profile c_l(u) is represented as a weight vector over the 67 pre-defined topic categories provided by KDD Cup 2005 [18]. When a user submits a query, each returned web page is also mapped to a weighted category vector, and the similarity between the user profile vector and the page category vector is used to re-rank the search results:

S^{L-Profile}(q, p, u) = \frac{c_l(u) \cdot c(p)}{\|c_l(u)\| \|c(p)\|}    (4)

Here c(p) is the category vector of web page p, generated by a tool that uses the query and web page classification method introduced in [24]. Given a web page p, the tool returns the top 6 categories to which p belongs, with corresponding confidences. Each component c(p)_i of c(p) is the classification confidence returned by the tool, i.e., the probability that page p should be classified into category i. If category i is not among the 6 categories returned by the tool, we set c(p)_i = 0. A user's profile c_l(u) is automatically learned from his/her past clicked web pages:

c_l(u) = \sum_{p \in P(u)} P(p|u) w(p) c(p)

Here P(u) is the collection of web pages visited by user u in the past, and P(p|u) can be thought of as the probability that user u clicks web page p, i.e.,

P(p|u) = \frac{|Clicks(\cdot, p, u)|}{|Clicks(\cdot, \cdot, u)|}

where |Clicks(\cdot, \cdot, u)| is the total number of clicks made by u and |Clicks(\cdot, p, u)| is the number of clicks on web page p made by u. w(p) is the impact weight of page p when generating user profiles. We assume that web pages visited by many users are less important for building user profiles, thus

w(p) = \log \frac{|U|}{|U(p)|}

where |U| is the total number of users and |U(p)| is the number of users who have ever visited web page p.

In the L-Profile method, the user's profile c_l(u) is accumulated from all of the user's visited web pages; this kind of profile is called a long-term profile in previous work [29, 26, 31]. In fact, as investigated by [26], a short-term user profile is more useful for improving search in the current session. In this paper, we use the clicks on the previous queries of the current session to build the user's short-term profile c_s(u):

c_s(u) = \frac{1}{|P_s(q)|} \sum_{p \in P_s(q)} c(p)

where P_s(q) is the collection of pages visited for previous queries in the current session. The personalized score of page p using the short-term profile is computed as:

S^{S-Profile}(q, p, u) = \frac{c_s(u) \cdot c(p)}{\|c_s(u)\| \|c(p)\|}    (5)

We denote this approach by S-Profile. We can also fuse the long-term and short-term personalized scores using a simple linear combination:

S^{LS-Profile}(q, p, u) = \lambda S^{L-Profile}(q, p, u) + (1 - \lambda) S^{S-Profile}(q, p, u)    (6)

We denote this approach by LS-Profile. The methods L-Profile, S-Profile, and LS-Profile are collectively called profile-based methods in this paper.
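The sketch below assembles the long-term profile c_l(u) and the fused score of Equation (6) with NumPy. The 67-dimensional category vectors are taken as plain arrays; the KDD Cup 2005 classifier that produces c(p), and all names here, are stand-ins, not the authors' code.

```python
import numpy as np

NUM_CATEGORIES = 67  # KDD Cup 2005 topic categories

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def long_term_profile(pages: list[np.ndarray],
                      click_probs: list[float],
                      impact_weights: list[float]) -> np.ndarray:
    """c_l(u) = sum over past clicked pages of P(p|u) * w(p) * c(p)."""
    profile = np.zeros(NUM_CATEGORIES)
    for c_p, p_click, w in zip(pages, click_probs, impact_weights):
        profile += p_click * w * c_p
    return profile

def ls_profile_score(c_l: np.ndarray, c_s: np.ndarray, c_p: np.ndarray,
                     lam: float = 0.3) -> float:
    """Equation (6): linear fusion of the two cosine scores (lambda = 0.3)."""
    return lam * cosine(c_l, c_p) + (1 - lam) * cosine(c_s, c_p)
```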
4.2 Group-level Re-ranking

We use a K-Nearest-Neighbor collaborative filtering algorithm to test group-based personalization. Because of the data sparsity in our dataset, applying traditional CF methods directly to web search is inadequate. Instead, we compute user similarity based on long-term user profiles:

Sim(u_1, u_2) = \frac{c_l(u_1) \cdot c_l(u_2)}{\|c_l(u_1)\| \|c_l(u_2)\|}

The K nearest neighbors of a user u_a are obtained from this user similarity:

S_u(u_a) = \{u_s \mid rank(Sim(u_a, u_s)) \le K\}

We then use the historical clicks made by similar users to re-rank the search results:

S^{G-Click}(q, p, u) = \frac{\sum_{u_s \in S_u(u)} Sim(u_s, u) |Clicks(q, p, u_s)|}{\sum_{u_s \in S_u(u)} |Clicks(q, \cdot, u_s)| + \beta}    (7)

where \beta is a smoothing factor as in Equation (3). We denote this approach by G-Click.
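A compact sketch of the G-Click score of Equation (7): select the K most similar users by profile cosine, then pool their similarity-weighted clicks. It reuses the click-count layout from the P-Click sketch above; the data structures are again our assumption.

```python
import numpy as np
from collections import Counter

def g_click_score(query: str, page: str, user: str,
                  profiles: dict[str, np.ndarray],
                  click_counts: dict[tuple[str, str], Counter],
                  k: int = 50, beta: float = 0.5) -> float:
    """Equation (7): similarity-weighted clicks of the K nearest users."""
    def cos(a, b):
        d = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / d) if d > 0 else 0.0

    me = profiles[user]
    sims = {v: cos(me, p) for v, p in profiles.items() if v != user}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]

    num = sum(sims[v] * click_counts.get((v, query), Counter())[page]
              for v in neighbors)
    den = sum(sum(click_counts.get((v, query), Counter()).values())
              for v in neighbors) + beta
    return num / den
```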
5. DATASET

In this section, we introduce the dataset used in our experiments.

5.1 Statistics about the Dataset

We collected MSN query logs covering 12 days in August 2006. Because the entire log set is too large, we randomly sampled 10,000 distinct users (identified by "Cookie GUID") from the users in the United States on August 19, 2006. These users and their click-through logs were extracted as our dataset. In addition, queries without any clicks (about 34.6% of all queries) were excluded from the dataset because they are useless in our experiments. The entire dataset is split into two parts: a training set containing the log data of the first 11 days, and a test set containing the log data of the last day. Table 1 summarizes the statistics. Notice that all 10,000 users have search activities in the training set because the users were sampled from the logs of the training days. Because the users are randomly sampled, this dataset should reflect the characteristics of the entire logs; it also shows characteristics similar to those in existing reports [27, 37, 11, 36, 1]. We present detailed statistics of the dataset in the following sections.

Table 1: Basic statistics of the dataset.

Item              | ALL    | Training | Test
#days             | 12     | 11       | 1
#users            | 10,000 | 10,000   | 1,792
#queries          | 55,937 | 51,334   | 4,639
#distinct queries | 34,203 | 31,777   | 3,465
#clicks           | 93,566 | 85,642   | 7,924
#clicks/#queries  | 1.6727 | 1.6683   | 1.7081
#sessions         | 49,839 | 45,981   | 3,865

5.2 Statistics about Queries

In our dataset, more than 80% of distinct queries are issued only once in the 12-day period, and about 90% of distinct query strings are issued by only one user, while the 3% most popular distinct queries are issued by more than 47% of users. These statistics are similar to those given in [27, 37, 11], and they indicate that information needs on the Web are quite diverse. Furthermore, we find that query frequency can be characterized by a Zipf distribution, consistent with the findings of [37]. Figure 1(a) plots the distribution of query frequency: queries are sorted by query count in descending order, so the first query is the most popular one and the last is the least popular. Figure 1(b) plots the distribution of the number of users issuing each query.

Figure 1: Query popularity distributions. (a) Distribution of query frequency (log scale). (b) Distribution of the number of users per query (log scale).

5.3 Statistics about Test Users

As Table 1 shows, 1,792 users have search activities on the test day. Figure 2 plots the distributions of historical (i.e., within the training days) query days and query counts for users in the test set. Because the users were sampled on one of the training days, each user has at least one day of query history. About 30% of the users in the test set have more than 5 days of query history, and about 50% of them submitted more than 10 queries in the training days.

Figure 2: Distributions of user search frequency in the training days for test users. (a) Distribution of user historical query days. (b) Distribution of user historical query times.

5.4 Statistics about Query Repetition

In our dataset, 2,130 (about 46%) of all 4,639 queries in the test set are repeated ones that were also submitted in the training days, either by the same user or by different users. Furthermore, 1,535 queries (72% of the repeated ones and 33% of all test queries) are repeated by the same user. These results are consistent with those given in [33] and are helpful for personalized search.

5.5 Statistics about Sessions

In our dataset, we use the "Browser GUID" as a simple session identifier: a user opens a browser, issues one or more queries, and closes the browser, and the whole process is considered one session. Figure 3 shows the distribution of the number of queries per session. About 30% of sessions contain at least two queries, indicating that users sometimes submit several queries to fulfill an information need.

Figure 3: Distribution of query number per session.

5.6 Distribution of Query Click Entropies

As found by [34], for queries that show little variation among individuals, personalization may be unnecessary. In this paper, we define the click entropy of a query as

ClickEntropy(q) = \sum_{p \in P(q)} -P(p|q) \log_2 P(p|q)    (8)

Here ClickEntropy(q) is the click entropy of query q, P(q) is the collection of web pages clicked for query q, and P(p|q) is the percentage of clicks on web page p among all clicks for q, i.e.,

P(p|q) = \frac{|Clicks(q, p, \cdot)|}{|Clicks(q, \cdot, \cdot)|}

Click entropy is a direct indication of query click variation. If all users click only the same single page for query q, then ClickEntropy(q) = 0. A small click entropy means that the majority of users agree with each other on a small number of web pages; in such cases, there is no need for personalization. A large click entropy indicates that many web pages were clicked for the query. This may mean: (1) a user has to select several pages to satisfy the query, which means the query is an informational query [3, 17]; personalization can help filter the pages that are more relevant to the user by making use of historical selections. (2) Different users make different selections for this query, which means the query is an ambiguous query; in such cases, personalization can be used to provide different web pages to each individual. Figure 4(a) shows the click entropy distribution. More than 65% of queries have a low click entropy between 0 and 0.5; many of these queries are submitted by only one user who clicks only one web page. Figure 4(b) shows the click entropy distribution for queries asked more than five times, and Figure 4(c) for queries submitted by at least three users. From these figures, we find that the majority of even the more popular queries have low click entropies.

Figure 4: Distribution of query click entropy. (a) All queries. (b) Queries with query times > 5. (c) Queries with user number > 2.
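Equation (8) is straightforward to compute from per-query click counts. A minimal sketch (names are ours) that also shows the navigational-versus-ambiguous contrast discussed above:

```python
import math
from collections import Counter

def click_entropy(url_clicks: Counter) -> float:
    """Equation (8): entropy of the click distribution over pages for one query."""
    total = sum(url_clicks.values())
    return -sum((n / total) * math.log2(n / total)
                for n in url_clicks.values() if n > 0)

print(click_entropy(Counter({"google.com": 100})))         # 0.0, navigational
print(click_entropy(Counter({"a.com": 50, "b.com": 50})))  # 1.0, ambiguous
```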
6. RESULTS AND DISCUSSIONS

In this section, we give a detailed evaluation and analysis of the five personalized search strategies. The original web search ranking without any personalization, used as the baseline for comparison, is denoted by WEB. We let K = 50 for the G-Click method and lambda = 0.3 for the LS-Profile method; both settings are empirical. In our experiments, we found that for 676 of the 4,639 test queries, the clicked web pages no longer appear in the downloaded search results, because the MSN results for these queries have changed. We exclude these queries when reporting the experimental results in the following sections. Furthermore, for 57% (2,256/3,963) of the remaining queries, users selected only top-ranked results. In other words, the original WEB ranking is already the best possible for these queries, and personalization cannot provide improvements. We call the other 1,707 queries, for which users selected more than just the top results, not-optimal queries.

Table 2: Overall performance of the personalization strategies. R.S. denotes the rank scoring metric and A.R. the average rank metric.

method     | all R.S. | all A.R. | not-optimal R.S. | not-optimal A.R.
WEB        | 69.4669  | 3.9240   | 47.2623          | 7.7879
P-Click    | 70.4350  | 3.7338   | 49.0051          | 7.3380
L-Profile  | 66.7378  | 4.5466   | 45.8485          | 8.3861
S-Profile  | 66.7822  | 4.4244   | 45.1679          | 8.3222
LS-Profile | 68.5958  | 4.1322   | 46.6518          | 8.0445
G-Click    | 70.4168  | 3.7361   | 48.9728          | 7.3433

6.1 Overall Performance of Strategies

Table 2 shows the overall effectiveness of the personalization strategies on the test queries. We find: (1) The click-based personalization methods G-Click and P-Click outperform WEB overall. For instance, on the not-optimal queries, P-Click has a significant (p < 0.01) 3.68% improvement over WEB and G-Click a significant (p < 0.01) 3.62% improvement (using the rank scoring metric). P-Click and G-Click also achieve significant improvements (1.39% and 1.37%) over WEB on all test queries, including both not-optimal and optimal queries. These results show that click-based personalization can generally improve web search performance. (2) P-Click and G-Click show no significant difference in overall performance. In our experiments, we sample 10,000 users and select the 50 most similar users for each test user in the G-Click approach (we also tried selecting 20 and 100 users, with no significant difference). Because of high user-query sparsity, the selected similar users may have little search history for the queries submitted by the test user, so group-level personalization yields no significant improvement over person-level personalization. Given more days of logs and more users, G-Click might perform better. (3) The profile-based methods L-Profile, S-Profile, and LS-Profile perform less well on average. We computed the rank scoring of each method for every single test query and plot the distributions of the rank scoring increment over WEB for each personalization strategy in Figure 5. Although L-Profile, S-Profile, and LS-Profile improve search accuracy on many queries, they harm performance on even more queries, which makes them worse on average. This indicates that the straightforward implementation of the profile-based strategies employed in this paper does not work well, or at least is not as stable as the click-based ones. We analyze why our profile-based methods are unstable in Section 6.5.

Figure 5: Distributions of the rank scoring increment over the WEB method for (a) P-Click, (b) L-Profile, (c) S-Profile, (d) LS-Profile, and (e) G-Click. The count of test queries in each rank scoring increment range is plotted on the y-axis with a log scale.

6.2 Performance on Different Click Entropies

Figure 6 shows the average search accuracy improvements of the different personalization strategies on test queries with different click entropies. We use only the queries asked by at least three users, to make the click entropy more reliable. We see that the improvement from personalized search increases as the click entropy of the query grows, especially when the click entropy >= 1.5. For the click-based methods P-Click and G-Click, the improvement is very limited on queries with click entropies between 0 and 0.5: G-Click, the best performer on these queries, obtains only a non-significant 0.37% improvement over WEB in rank scoring. This means users vary little on these queries and the search engine already does well on them. On queries with click entropy >= 2.5, the picture is quite different: both P-Click and G-Click perform impressively, with a significant (p < 0.01) 23.37% improvement for G-Click and a significant (p < 0.01) 23.68% improvement for P-Click over WEB in rank scoring. The profile-based methods L-Profile, S-Profile, and LS-Profile worsen search performance when click entropy < 1.5, while L-Profile and LS-Profile achieve better performance on queries with click entropy >= 1.5 (we are unsure why L-Profile also worsens search accuracy when click entropy >= 2.5 and will provide additional analysis in future work). All these results indicate that on queries with small click entropy (i.e., queries that are less ambiguous or more navigational), personalization offers little headroom and is thus unnecessary.

Figure 6: Search accuracy improvements over the WEB method on queries with different click entropies, measured by (a) rank scoring and (b) average rank. Only queries asked by at least three users are included. Note that in (b), a smaller average rank means higher search accuracy.
6.3 Performance on Repeated Queries

In Section 5.4, we found that about 46% of test queries are repeated by the same or different users, and 33% are repeated by the same user. This means users often revisit queries and results they have seen before. Teevan et al. [33] also observed that re-finding behavior is common, and showed that repeat clicks can often be predicted from a user's previous queries and clicks. In this paper, the P-Click and G-Click methods are based on historical clicks, and this high repetition ratio in the real world is what makes these click-based personalization strategies work well. Table 3(a) shows the personalization performance on the not-optimal queries repeated by either the same user or different users, and Table 3(b) on the queries repeated by the same user. The methods P-Click and G-Click have significant improvements over WEB on queries repeated by the same user: when a user re-submits a query, his/her selections are highly consistent with the past, and personalization based on his/her past clicks performs well. These results suggest that, privacy permitting, search engines should record users' query and click histories and use them to improve future searches, and should provide convenient ways for users to review their search histories, as some current search engines already do.

Table 3: Performance on repeated queries. In sub-table (a), Y means the query is repeated by either the same user or different users, and N means it is not. In sub-table (b), Y means the query is repeated by the same user, and N means the query is submitted by the user for the first time. All queries are not-optimal queries.

(a) Performance on repeated queries

method     | Y R.S.  | Y A.R. | N R.S.  | N A.R.
WEB        | 46.6285 | 8.0620 | 47.4013 | 7.7002
P-Click    | 55.9090 | 6.1663 | 47.4013 | 7.7002
L-Profile  | 47.7405 | 8.2953 | 45.4091 | 8.4141
S-Profile  | 46.7600 | 8.0695 | 44.7980 | 8.4003
LS-Profile | 46.8138 | 8.1340 | 46.6142 | 8.0169
G-Click    | 55.7377 | 6.1886 | 47.4013 | 7.7002

(b) Performance on user-repeated queries

method     | Y R.S.  | Y A.R. | N R.S.  | N A.R.
WEB        | 45.7215 | 8.0522 | 47.4858 | 7.7387
P-Click    | 59.4750 | 5.2090 | 47.4858 | 7.7387
L-Profile  | 48.0128 | 8.2575 | 45.5346 | 8.4100
S-Profile  | 45.5959 | 8.1306 | 45.1058 | 8.3579
LS-Profile | 45.8936 | 8.1679 | 46.7618 | 8.0215
G-Click    | 59.1086 | 5.2500 | 47.5025 | 7.7332

6.4 Performance on Variant Search Histories

Do users who use a search engine frequently benefit more from personalized search? Do profile-based personalized search strategies perform better as the search history grows? To answer these questions, we plot the rank scoring improvements on queries issued by users with different search frequencies in Figure 7.

Figure 7: Rank scoring increments over the WEB method on all test queries, grouped by the users' historical query counts.

We find: (1) With the click-based methods P-Click and G-Click, users with more search activity in the training days do not consistently benefit more than users who search less. This is because users who frequently use the search engine may have more varied information needs.
They may repeat old queries, but they may also submit many fresh queries, which makes our click-based methods P-Click and G-Click average out similarly for users with different search frequencies (note that the P-Click and G-Click curves are very close to each other). (2) The L-Profile method, which uses a user's long-term interest profile, performs better when a user has more queries, especially as the number of queries grows from 0 to 70. This is because we can capture users' long-term interests more accurately when their search histories are long enough. At the same time, the performance of L-Profile becomes more unstable as the user accumulates still more queries, especially beyond 80 queries. This is because the longer histories contain more noise and the users have more varied information needs. It tells us that as a user's search history grows, we should analyze the user's real information need more carefully and select only the appropriate parts of the search history to build user profiles. Tan et al. [31] found that the best performance of the profile-based personalized search methods they proposed is achieved when using click-through data of past searches that are related to the current query; we think this is for the same reasons. (3) The methods S-Profile and LS-Profile are less sensitive to the historical query count, and LS-Profile is more stable than L-Profile.

6.5 Analysis of Profile-based Strategies

From Table 2, we are surprised to find that the profile-based personalization strategies perform suboptimally, which is inconsistent with existing investigations [28]. We attribute this to the rough implementation of our strategies: the experimental results indicate that the straightforward implementation we employ does not work well. From Section 6.4, we see it is difficult to build an appropriate user profile even when the user history is rich. The search history inevitably contains a lot of noisy information that is irrelevant to the current search, and such noise can harm the performance of personalization, as indicated by [31]. In our experiments, we simply use all historical user searches to learn user profiles, without distinguishing relevant from irrelevant parts, which may make the personalization unstable. We also apply no normalization or smoothing when generating user profiles. Since these strategies are far from optimal, we will investigate further and try to improve their performance in future work. Although the profile-based approaches perform badly in our experiments, we can still observe something interesting: the method LS-Profile is more stable than L-Profile and S-Profile, as shown in Table 2, Figure 5, Figure 6, and Figure 7. This means that incorporating both long-term interest and short-term context gains better performance than using either alone. In other words, both long-term and short-term search contexts are very important for personalizing search results, and combining the two types of context makes the prediction of the user's real information need more reliable.

7. CONCLUSIONS

In this paper, we investigate whether personalization is consistently effective under different situations. We develop an evaluation framework based on query logs to enable large-scale evaluation of personalized search, and use 12 days of MSN query logs to evaluate five personalized search strategies. We find that all proposed methods achieve significant improvements over common web search on queries with large click entropy, while on queries with small click entropy they perform similarly to, or even worse than, common web search. These results tell us that personalized search has different effectiveness on different queries, and thus not all queries should be handled in the same manner.
Click entropy can serve as a simple measurement of whether a query should be personalized, and we strongly encourage the investigation of more reliable ones. Experimental results also show that the click-based personalization strategies work well: they are straightforward and stable, though they only work on repeated queries. We suggest that search engines keep search histories and provide convenient and secure ways for users to review them. The profile-based personalized search strategies proposed in this paper are not as stable as the click-based ones: they improve search accuracy on some queries but harm it on many others. Since these strategies are far from optimal, we will continue working to improve them in the future. We also find that, for profile-based methods, both long-term and short-term contexts are important for improving search performance, and an appropriate combination of the two can be more reliable than either alone.

8. ACKNOWLEDGMENTS

We are grateful to Dwight Daniels for edits and comments on the writing of the paper. Comments from the four anonymous referees were invaluable in preparing the final version.
9. REFERENCES
[1] S. M. Beitzel, E. C. Jensen, A. Chowdhury, D. Grossman, and O. Frieder. Hourly analysis of a very large topically categorized web query log. In Proceedings of SIGIR '04, pages 321-328, 2004.
[2] J. Boyan, D. Freitag, and T. Joachims. Evaluating retrieval performance using clickthrough data. In Proceedings of the AAAI Workshop on Internet Based Information Systems, 1996.
[3] A. Broder. A taxonomy of web search. SIGIR Forum, 36(2):3-10, 2002.
[4] J. M. Carroll and M. B. Rosson. Paradox of the active user. In Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, pages 80-111, 1987.
[5] P. A. Chirita, C. Firan, and W. Nejdl. Summarizing local context to personalize global web search. In Proceedings of CIKM '06, 2006.
[6] P. A. Chirita, W. Nejdl, R. Paiu, and C. Kohlschütter. Using ODP metadata to personalize search. In Proceedings of SIGIR '05, pages 178-185, 2005.
[7] S. Cronen-Townsend and W. B. Croft. Quantifying query ambiguity. In Proceedings of HLT '02, pages 94-98, 2002.
[8] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of WWW '01, pages 613-622, 2001.
[9] P. Ferragina and A. Gulli. A personalized search engine based on web-snippet hierarchical clustering. In WWW '05: Special Interest Tracks and Posters, pages 801-810, 2005.
[10] T. H. Haveliwala. Topic-sensitive PageRank. In Proceedings of WWW '02, 2002.
[11] B. J. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real life information retrieval: a study of user queries on the web. SIGIR Forum, 32(1):5-17, 1998.
[12] B. J. Jansen, A. Spink, and T. Saracevic. Real life, real users, and real needs: a study and analysis of user queries on the web. Information Processing and Management, 36(2):207-227, 2000.
[13] J. C. de Borda. Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, 1781.
[14] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of WWW '03, pages 271-279, 2003.
[15] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of UAI '98, pages 43-52, 1998.
[16] R. Krovetz and W. B. Croft. Lexical ambiguity and information retrieval. Information Systems, 10(2):115-141, 1992.
[17] U. Lee, Z. Liu, and J. Cho. Automatic identification of user goals in web search. In Proceedings of WWW '05, pages 391-400, 2005.
[18] Y. Li, Z. Zheng, and H. K. Dai. KDD Cup-2005 report: facing a great challenge. SIGKDD Explor. Newsl., 7(2):91-99, 2005.
[19] F. Liu, C. Yu, and W. Meng. Personalized web search by mapping user queries to categories. In Proceedings of CIKM '02, pages 558-565, 2002.
[20] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Computer Science Department, Stanford University, 1998.
[21] J. Pitkow, H. Schütze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, and T. Breuel. Personalized search. Commun. ACM, 45(9):50-55, 2002.
[22] A. Pretschner and S. Gauch. Ontology based personalized search. In Proceedings of ICTAI '99, pages 391-398, 1999.
[23] F. Qiu and J. Cho. Automatic identification of user interest for personalized search. In Proceedings of WWW '06, pages 727-736, 2006.
[24] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, and Q. Yang. Q2C@UST: our winning solution to query classification in KDDCUP 2005. SIGKDD Explor. Newsl., 7(2):100-110, 2005.
[25] X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval using implicit feedback. In Proceedings of SIGIR '05, pages 43-50, 2005.
[26] X. Shen, B. Tan, and C. Zhai. Implicit user modeling for personalized search. In Proceedings of CIKM '05, pages 824-831, 2005.
[27] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6-12, 1999.
[28] M. Speretta and S. Gauch. Personalized search based on user search histories. In Proceedings of WI '05, pages 622-628, 2005.
[29] K. Sugiyama, K. Hatano, and M. Yoshikawa. Adaptive web search based on user profile constructed without any effort from users. In Proceedings of WWW '04, pages 675-684, 2004.
[30] J.-T. Sun, H.-J. Zeng, H. Liu, Y. Lu, and Z. Chen. CubeSVD: a novel approach to personalized web search. In Proceedings of WWW '05, pages 382-390, 2005.
[31] B. Tan, X. Shen, and C. Zhai. Mining long-term search history to improve search accuracy. In Proceedings of KDD '06, pages 718-723, 2006.
[32] F. Tanudjaja and L. Mui. Persona: A contextualized and personalized web search. In Proceedings of HICSS '02, volume 3, page 53, 2002.
[33] J. Teevan, E. Adar, R. Jones, and M. Potts. History repeats itself: repeat queries in Yahoo's logs. In Proceedings of SIGIR '06, pages 703-704, 2006.
[34] J. Teevan, S. T. Dumais, and E. Horvitz. Beyond the commons: Investigating the value of personalizing web search. In Proceedings of PIA '05, 2005.
[35] J. Teevan, S. T. Dumais, and E. Horvitz. Personalizing search via automated analysis of interests and activities. In Proceedings of SIGIR '05, pages 449-456, 2005.
[36] S. Wedig and O. Madani. A large-scale analysis of query logs for assessing personalization opportunities. In Proceedings of KDD '06, pages 742-747, 2006.
[37] Y. Xie and D. R. O'Hallaron. Locality in search engine queries and its implications for caching. In Proceedings of INFOCOM '02, 2002.