Community-Based Snippet-Indexes for Pseudo-Anonymous Personalization in Web Search

Oisín Boydell
Adaptive Information Cluster
School of Computer Science and Informatics
University College Dublin, Dublin 4, Ireland
oisin.boydell@ucd.ie

Barry Smyth
Adaptive Information Cluster
School of Computer Science and Informatics
University College Dublin, Dublin 4, Ireland
barry.smyth@ucd.ie

ABSTRACT
We describe and evaluate an approach to personalizing Web search that involves post-processing the results returned by some underlying search engine so that they reflect the interests of a community of like-minded searchers. To do this we leverage the search experiences of the community by mining the title and snippet texts of results that have been selected by community members in response to their queries. Our approach seeks to build a community-based snippet index that reflects the evolving interests of a group of searchers. This index is then used to re-rank the results returned by the underlying search engine by boosting the ranking of key results that have been frequently selected for similar queries by community members in the past.

Categories and Subject Descriptors: H.3.3 Information Search and Retrieval: search process, selection process
General Terms: Human Factors, Design
Keywords: personalization, Web search, community

1. INTRODUCTION
Dealing with the type of vague queries that are commonplace in Web search is an important and challenging problem. It has been well documented that typical Web queries contain an average of only 2-3 terms [1], for example, and queries like "jordan pictures" offer no clues about whether the searcher is likely to be looking for images of the racing team, the middle eastern state, the basketball star, or the celebrity. Approaches which attempt to personalize the selection and ranking of search results offer a solution [2]. By learning about the personal preferences of the searcher and/or the context of their search it may be possible to prioritise certain results that are more likely to be relevant.

The work described in this paper has been inspired by previous research on Collaborative Web Search (CWS) [3] which highlighted the high degree of query repetition and result selection regularity that naturally exists within community-based search scenarios. For reasons of space it is not possible to discuss in detail the origins of such communities of searchers although the interested reader is referred to the work of [3] for a more complete treatment of this issue. Suffice to say that such communities can be readily identified, whether they are a formal community of users (e.g., the employees of a company operating in a specific business sector) or an ad-hoc group of searchers (e.g., the visitors to a Web site specialising in wildlife and endangered species). CWS maintains a community search profile by recording the queries submitted and the results selected by community members. When faced with a new target query, CWS promotes results that have consistently been selected for this and similar queries by the community in the past.

However, a major limitation of using community profiles based solely on queries and previously-selected results is that results in a community's history which are relevant to a new target query can only be identified as such if there are overlapping terms between the target and previous relevant queries. The work presented here proposes a more elaborate community model to improve promotion quality by maintaining a community-based snippet index as a way to drive promotions. Thus, instead of simply storing the queries and the URLs of the pages selected, we produce a local index based on the terms that are contained in the title and (query-sensitive) snippets for selected results.

The use of snippets for document indexing in IR was suggested as early as 1958 [4], and more recently work by [5] has looked at generating an alternative index using generic document summaries which can then be queried in parallel to a full content index or used as a source for pseudo-relevance feedback. Our approach is different in that we use query-sensitive document snippets which summarise a document for a particular community of searchers. The work of [6] on document transformation suggests modifying indexed document content according to previous selection behaviour in order to bring documents closer to the queries that led to their selection. Again our approach is different in that we create a new personalized index for a specific community without altering the existing full text index, and this enables our approach to be applied as a personalized meta search engine on top of existing Web search engines. We also use query-sensitive snippets which provide a far richer set of terms than the query terms alone.

This material is based on works supported by Science Foundation Ireland under Grant No. 03/IN.3/I361.
Copyright is held by the author/owner(s). SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.

2. COMMUNITY-BASED SNIPPET INDEXING
Consider some user $u$, a member of some community $C$. A new target query $q_T$ from $u$ is initially answered by a traditional meta-search engine to produce a result-list, $R_M$. In parallel, $q_T$ is used to query a local document index that has been constructed from the title and snippet texts of results that have been selected by the community in the past. This produces a new list of results, $R_C$, that are more closely aligned with community interests, and $R_M$ and $R_C$ are combined and returned to the user as $R_T$.

We use $(C, u, q_T)$ to denote a search for query $q_T$ by user $u$ in community $C$. Consider a result $r$ selected in response to such a search.
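As a rough illustration of the search flow outlined above, the following Python sketch indexes selected results under their title and snippet terms, queries that index alongside a meta-search, and merges the two lists. It is a toy under stated assumptions, not the system described in this paper: the names `SnippetIndex`, `record_selection`, `community_search` and `fake_meta_search` are hypothetical, the example URLs are made up, and a plain count of matching query terms stands in for the actual ranking function. Community results are placed ahead of the meta-search results, as in the current implementation described in Section 2.1.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    # (Assumption: no stemming or stop-word removal in this toy version.)
    return re.findall(r"[a-z0-9]+", text.lower())


class SnippetIndex:
    """Hypothetical community index: URL -> set of title/snippet terms."""

    def __init__(self):
        self.surrogates = defaultdict(set)

    def record_selection(self, url, title, snippet):
        # A selected result is indexed under the terms of its title and snippet.
        self.surrogates[url].update(tokenize(title + " " + snippet))

    def search(self, query, limit=5):
        # Assumption: rank by the number of query terms found in the surrogate.
        terms = set(tokenize(query))
        scored = []
        for url, surrogate in self.surrogates.items():
            overlap = len(terms & surrogate)
            if overlap > 0:
                scored.append((overlap, url))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [url for _, url in scored[:limit]]


def community_search(query, meta_search, index):
    r_m = meta_search(query)   # R_M: results from the underlying meta-search
    r_c = index.search(query)  # R_C: results from the community snippet index
    # R_T: community results first, then meta-search results not already listed.
    return r_c + [url for url in r_m if url not in r_c]


def fake_meta_search(query):
    # Stand-in for a real meta-search engine; returns a fixed list.
    return [
        "http://example.org/nba",
        "http://example.org/f1",
        "http://example.org/travel",
    ]


index = SnippetIndex()
index.record_selection(
    "http://example.org/f1",
    "Jordan Grand Prix",
    "Pictures of the Jordan racing team cars",
)
print(community_search("jordan pictures", fake_meta_search, index))
```

In this example the page previously selected by the community moves to the top of the combined list, while the remaining meta-search results keep their original order.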
We can reasonably assume that the snippet for a selected result $r$, $s(r, q_T)$, must contain terms which are of special interest to the user in relation to their query. Therefore, $s(r, q_T)$ can be used to represent the document corresponding to $r$ for $(C, u, q_T)$. In this sense $s(r, q_T)$ is a surrogate for $r$ in the context of $(C, u, q_T)$, and thus we propose that $r$ can be indexed using the terms contained within $s(r, q_T)$. Accordingly, our approach to collaborative Web search involves constructing a community-based index by indexing each selected result document by its snippet terms. In general then, given that a result $r$ might actually be selected for a number of different queries, $q_1, \ldots, q_n$, it will come to be indexed under a number of different snippets, $s(r, q_1), \ldots, s(r, q_n)$. Thus, for a given community of searchers each document will come to be represented by its surrogate, $S^C(r)$, as shown in Equation 1.

$$S^C(r) = \bigcup_{i} s(r, q_i) \tag{1}$$

Documents that are broadly relevant to a community's interests are likely to be retrieved for a wide variety of queries and are likely to be selected for many of these queries. As a result the document surrogate will cover a significant portion of the document's contents and the snippet index will reflect this by associating the document with a broad set of index terms. In contrast, we might consider other documents that are only relevant to a community through some small part of their contents. These are more likely to be retrieved for a much more restricted set of query terms and their snippets will also be drawn from a limited subset of their content, and so their index terms will also be very limited.

2.1 Community-Based Promotion
In the current implementation we use Lucene (http://lucene.apache.org) to perform the basic indexing and retrieval on the community-based snippet index. At retrieval time, Lucene queries the snippet index for $C$ using $q_T$ to produce $R_C$, which is ranked using a function based on each result's TF-IDF score boosted by relative hits count and query similarity according to Equation 2. Here $r_j$ is a result in $R_C$ and $q_1, \ldots, q_n$ are the queries for which $r_j$ was previously selected. $Rel(r_j, q_i)$ is the relative hits count, which is the number of times $r_j$ was previously selected for $q_i$ as a proportion of the total hits for $q_i$. $QuerySim(q_T, q_i)$ is a simple term-based query similarity metric based on Jaccard's coefficient.

$$Relevance(r_j, q_T, q_1, \ldots, q_n) = TFIDF(r_j, q_T) \cdot \left(1 + \sum_{i=1}^{n} \left( Rel(r_j, q_i) \cdot QuerySim(q_T, q_i) \right)\right) \tag{2}$$

In our current implementation, the final result list returned to the searcher, $R_T$, is the union of $R_C$ and $R_M$ with the $R_C$ results returned before the $R_M$ results. Thus the precision of the $R_C$ results is critical, and for this reason we use two techniques to filter the $R_C$ results to enhance their precision; these are in addition to standard stop-word removal and stemming during indexing and retrieval. First, we threshold the proportion of query terms that must be present in the document surrogate for that document to be retrieved as part of $R_C$. This eliminates results that match on only a few of the query terms and can help to prevent superficial results from being retrieved. By default we set this threshold to allow for the retrieval of results that match at least 50% of the query terms. Second, we also limit the total number of results returned in $R_C$ to ensure that community promotions do not overpower the traditional meta-search results. Normally we set this limit at 5-10 promotions.

3. CONCLUSION
We have described an approach to personalizing Web search at the level of communities of like-minded searchers. The approach works by using the search behaviour of community members (their search queries, the results they select, and the snippets and titles of those results) to populate a local index. Each selected result document is represented by a surrogate that is made up of the various snippet texts that have been associated with each of its selections. These surrogates reflect a biased view of the document in terms of the community's implicit preferences. When responding to a new search query, previous community selections retrieved from the local index are used to complement the results returned by a standard meta-search. The former are promoted based on their overlap with the target search query and their relevance to the community estimated from their selection histories. Preliminary results from a live user trial show that using a community-based snippet index provides search results with a higher precision than both the original CWS system and standard Web search.

Finally, it is worth remarking on the privacy benefits of our approach to personalized Web search. Within any particular community, the search patterns of an individual cannot be identified and so their identity can remain anonymous. At the same time, personalized recommendations can still be made to the benefit of the individual searcher. Of course, whether users in general will perceive this as a reasonable privacy-personalization trade-off in practice remains to be seen.

4. REFERENCES
[1] S. Lawrence and C. L. Giles. Context and Page Analysis for Improved Web Search. IEEE Internet Computing, July-August:38-46, 1998.
[2] J. Pitkow, H. Schütze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, and T. Breuel. Personalized search. Communications of the ACM, 45(9):50-55, 2002.
[3] B. Smyth, E. Balfe, J. Freyne, P. Briggs, M. Coyle, and O. Boydell. Exploiting query repetition and regularity in an adaptive community-based web search engine. User Modeling and User-Adapted Interaction, 14(5):383-423, 2004.
[4] H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2:159-165, 1958.
[5] T. Sakai and K. Sparck-Jones. Generic summaries for indexing in information retrieval. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 190-198, New York, NY, USA, 2001. ACM Press.
[6] C. Kemp and K. Ramamohanarao. Long-term learning for web search engines. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, 2002.