User Modeling for Full-Text Federated Search in Peer-to-Peer Networks Jie Lu Jamie Callan Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA {jielu, callan}@cs.cmu.edu ABSTRACT User modeling for information retrieval has mostly been studied to improve the effectiveness of information access in centralized repositories. In this paper we explore user modeling in the context of full-text federated search in peer-to-peer networks. Our approach models a user's persistent, long-term interests based on past queries, and uses the model to improve search efficiency for future queries that represent interests similar to past queries. Our approach also enables queries representing a user's transient, ad-hoc interests to be automatically recognized so that search for these queries can rely on a relatively large search radius to avoid sacrificing effectiveness for efficiency. Experimental results demonstrate that our approach can significantly improve the efficiency of full-text federated search without degrading its accuracy. Furthermore, the proposed approach does not require a large amount of training data, and is robust to a range of parameter values. Peer-to-peer networks contain three types of functional units (with each peer being a single functional unit or a combination of multiple functional units), namely information providers (search engines), information consumers (users), and services that provide functionality to facilitate efficient and effective search. One type of service that is essential to federated search in P2P networks is the directory service, which is responsible for routing queries and responses between consumers and providers. Hierarchical P2P architecture is a popular and effective peer-to-peer architecture used in large-scale operational and research systems. A hierarchical P2P network typically uses a two-level hierarchy of an upper-level of "hubs" for directory services and a lower-level of "leaves" for information providers and consumers. Each hub provides directory service to a region of the network. Multiple hubs work collectively to cover the whole network. Leaves can only connect to hubs. Hubs connect with leaves and other hubs. We focus on full-text federated search in peer-to-peer networks, which conducts search over the full text of documents and returns results in relevance-based document rankings ("full-text ranked retrieval"). Full-text ranked retrieval aims to find content that satisfies the user's information need ("informational search"), which is in contrast to the simple Boolean retrieval commonly used for known-item search in file-sharing P2P networks. The process of full-text federated search in a hierarchical P2P network works as follows [9]. When a consumer has an information request, it issues a query to initially selected hub(s). A Time-ToLive (TTL) field in each query message determines the maximum number of times it may be relayed ("search radius"). A hub that receives the query uses its resource selection algorithm to rank and select one or more neighboring providers as well as hubs based on their full-text resource representations (typically the aggregations of the bag-of-words representations of individual documents) and routes the query to them ("full-text provider/hub selection"). A provider that receives the query uses its full-text document retrieval algorithm to generate a relevance-based ranking of its documents and responds with a list of the topranked documents. A hub is responsible for collecting the ranked lists returned by multiple providers, using its result merging algorithm to merge them into a single, integrated ranked list, and returning it to the consumer. Finally, a consumer merges results returned by multiple hubs. The performance of federated search largely depends on how efficiently and effectively the information providers with relevant contents can be located ("resource location"). Because most current P2P networks provide very limited information to consumers about the available contents and their placement in the network, the resource location conducted by each consumer to initiate the search is typically no more than a random selection from a list of known directory services (hubs). Since there is no guarantee that the (arbitrarily) selected hub(s) can directly locate Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Search process. General Terms Algorithms, Performance, Experimentation, Design Keywords User modeling, Query clustering, Peer-to-peer, Federated search 1. INTRODUCTION User modeling for information retrieval has been a very active research topic in recent years. Most studies focus on using user models for query expansion/reformulation, result re-ranking, or document/link recommendation to improve the effectiveness of information access in centralized repositories. There have been very few studies of user modeling for federated search of multiple distributed collections in the absence of a central authority. Because peer-to-peer (P2P) networks have emerged as an attractive solution to information retrieval in distributed environments when search using centralized repositories or indexes is either impossible or impracticable, we are interested in investigating how user modeling can improve the performance of federated search in peer-to-peer networks. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6­11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008...$5.00. 332 relevant resources, a relatively large search radius is usually required to reach the hubs that cover relevant contents. In this paper, we study how user modeling can be used to improve the quality of initial hub selection in order to start the search in the right neighborhood so that less effort (i.e., a smaller search radius) is required to find information providers with relevant contents and therefore higher search efficiency can be achieved. We propose an automatic mechanism for each information consumer to adaptively model the persistent, long-term interests of the user it represents based on past queries, and use the model to guide initial hub selection for future queries that represent interests similar to past queries. Our approach also automatically recognizes queries representing the user's transient, ad-hoc interests, for which the model based on search history cannot support effective initial hub selection, so it can fall back on a default search strategy of a large search radius to avoid sacrificing accuracy for efficiency. Experimental results demonstrate that our approach can significantly improve the efficiency of full-text federated search without degrading accuracy. In the following section we describe related work on user modeling for information filtering, Web retrieval and federated search in P2P networks. Our approach to modeling user interests for full-text federated search is presented in Section 3, followed by the evaluation resources in Section 4. We present experimental results to evaluate the effectiveness of our approach in Section 5 and explore the impact of different parameter settings in Section 6. Section 7 concludes. (over multiple search sessions) [19], or a combination of both [3, 18]. A user's long-term interests can also be modeled by classifying documents visited by the user into different ontologybased categories using complete document contents or snippets surrounding query terms [4, 17]. However, a predefined ontology and additional training data are required to learn the classifier. Collaborative Web search assists a user in searching for information by utilizing others' expert knowledge or search experience. Typically query clustering is used to mine search engine query logs in modeling users' interests. To measure the similarity between queries for clustering, the overlap between query terms [21], the number of retrieved documents clicked in common [5, 21], the distance between the clicked documents based on their URLs or categories [1, 21], or a combination of the above [21] have been used. The contents of the clicked documents are not usually used, for efficiency reasons. For federated search in peer-to-peer networks, there have also been a few attempts to utilize user interests to move a peer closer to those that more frequently provided relevant contents to its past queries in order to improve resource location for future queries [11, 14, 16]. The success of the approach relies on the existence of two properties in the network: i) similar contents are located near to each other ("content-based locality"), and ii) a consumer's queries are closely related to the persistent interests of the user it represents. Since contents relevant to a query tend to be similar, the first property guarantees that relevant contents are mostly near to each other and therefore resource location can be efficient. The network can provide this property by regulating its content placement using distributed hash tables [12] or dynamic topology evolution [2, 7, 10]. The search behaviors of information consumers generally exhibit the second property. However, since existing methods do not explicitly distinguish between queries that express the different interests of the user (e.g., sports versus music), they cannot tell which resources are more relevant to which interests, resulting in less efficient resource location when a resource relevant to one interest is selected to answer queries for other interests. Furthermore, because no method has been provided to explicitly separate transient information needs from those that are related conceptually to the user's persistent, longterm interests, the search performance of transient information needs (either efficiency or accuracy) is likely to be poor. 2. RELATED WORK User modeling represents a user's information needs (interests) with a user model (also called "user profile"). A user model can be hand-crafted, or learned automatically from queries and their retrieved documents based on explicit (e.g., relevance judgment), implicit (e.g., mouse click, time on page), or pseudo (e.g., the top ranks) relevance feedback. Commonly used representations for a user model include a set of attributes with values, a structured representation such as a weighted concept hierarchy based on a predefined ontology, or a bag-of-words representation such as a vector of weighted terms or a unigram language model. A user's interests can be represented as a whole using one representation, or as a set of topics (determined by clustering or classification) with one representation for each topic. Which technique and representation to use in generating user models depends on the objectives and requirements of the particular applications. In information filtering, a user model is specified by the user explicitly and modified over time by the system based on the long-term observations of the document stream and periodic relevance feedback from the user ("adaptive information filtering") [13]. Typically adaptive information filtering systems exploit machine learning techniques to handle positive and negative relevance feedback provided by the user. Although explicit relevance feedback from the user is usually available for information filtering, Web retrieval and federated search in distributed environments are unlikely to have such luxury. Personalized Web search uses user models to adapt search to the needs of individuals through query expansion, result re-ranking, or link recommendation. A user model with a bag-of-words representation can be generated from queries and the text of the top-ranked or visited documents to represent the user's short-term interests (within a single search session) [15], long-term interests 3. APPROACH DESCRIPTION To eliminate the randomness of initial hub selection at each information consumer when initiating search in peer-to-peer networks, we propose to model the user's persistent, long-term interests based on past queries, and use the model to conduct initial hub selection for new queries according to the hubs' resource location effectiveness for old queries with similar interests ("interest-based hub selection"). Compared with existing methods of learning from search history [11, 14, 16], our approach distinguishes between different interests and measures the hubs' performance for each interest instead of modeling all interests as a single group. Our objective is that with effective initial hub selection to initiate search, a small search radius (and therefore little query routing) is sufficient to locate most relevant contents, improving search efficiency without sacrificing accuracy. The effectiveness of interest-based hub selection depends on whether the hubs capable of locating relevant contents efficiently and effectively for past queries perform well for future queries that express similar interests. A hierarchical P2P network 333 in which the information providers having similar contents connect to the same hubs can best support effective interest-based hub selection since contents relevant to similar interests tend to be similar to each other. The network can provide this property by using distributed hash tables [12] or dynamic topology evolution [2, 7, 10] to regulate its content placement. In addition to queries representing persistent information needs, a user may also issue queries that are not conceptually related to his/her long-term interests, to satisfy transient, ad-hoc information needs. Because interest-based hub selection at a consumer depends on the limited (and often biased) information the consumer has learned about the hubs as a byproduct of past search, it is unlikely to provide any clue about which hubs can best locate relevant contents for transient information needs not related to search history. Therefore, for queries expressing transient information needs, the consumer must resort to a more extensive search using a larger search radius (TTL) to route queries to the hubs that directly cover relevant contents, trading efficiency for accuracy. In other words, different search strategies are required for different types of queries to optimize the overall search performance. Our goal is to develop an approach that enables each consumer to distinguish between queries representing different persistent interests for effective interest-based initial hub selection at the consumer, and to recognize transient information needs so that full-text hub selection at the hubs can be fully utilized to guarantee accuracy. User modeling for full-text federated search in a peer-to-peer network takes place at each individual information consumer due to the lack of a centralized server to monitor search activities in the network. Similar to the approach taken in [20], query clustering is used to group past queries in identifying a user's different interests. Each query cluster represents a topic of interest. The interest-dependent performance is measured for each hub that provided search results to this consumer, which is dynamically updated whenever new results are available. For a hub that covers contents related to multiple topics of interest, its performance for each topic is measured independently of the other topics. The optimal hubs for a new query are determined based on their performance for clusters of similar past queries. In the following subsection, we present the design for the two main tasks of our approach, namely clustering queries and learning about the hubs' performance for each cluster. Section 3.2 describes in detail the implementation of our approach. on whether and what type of feedback is available. With explicit relevance feedback from the user, documents relevant to the query are selected. When feedback is implicit in the form of mouse clicks, the clicked documents are treated as relevant documents. The top-ranked merged documents are chosen in the last resort when neither explicit nor implicit feedback is available. After stopwords are removed and stemming is conducted, the contents of the chosen documents are used to generate a maximum likelihood unigram language model to represent the corresponding query. The representation of a query cluster is the aggregation of its members' language models. The similarity between a query and a cluster is measured by the KullbackLeibler divergence between their representations. Our choice of the clustering algorithm is guided by several characteristics of query clustering in peer-to-peer networks. First, because the sets of queries used for clustering are highly dynamic, the clustering algorithm should be incremental. Second, since the size of the query log at each individual information consumer is much smaller compared with the query logs of Web search engines, the clustering algorithm should be able to work well with limited data. Third, the algorithm should not require the number of clusters or the maximum size of each cluster to be set manually as it is unreasonable to assume that these parameters can be determined in advance. Based on the above considerations, we use a greedy non-hierarchical clustering algorithm that incrementally updates existing clusters to include new queries when their representations are similar to the old ones, or creates new clusters when they are sufficiently different in order to capture the user's new interests. Neither the number of clusters nor the size of each cluster is predetermined. In previous research on using search history to improve federated search performance in P2P networks [11, 14, 16], search performance is measured by the number of documents returned for each query. For the known-item search that is common in P2P networks sharing music, videos, and software, this appears to be an appropriate measure since typically the search either returns relevant documents or returns no document at all. In contrast, full-text federated search is very likely to return non-relevant documents, so the number of documents returned is no longer a good measure of search performance. Because the top-ranked documents are more likely to be relevant than most lower-ranked documents, when no feedback is available, the information about how many documents returned by a hub appear among the overall top-ranked merged documents at a consumer is a more reliable indicator of the hub's performance for a query. Therefore, our approach uses this information as a surrogate for relevance feedback to measure each hub's performance on resource location for interest-based hub selection. 3.1 Design Query clustering requires a representation for each query/cluster, and a similarity measure between queries and clusters. Because the small number of query terms does not provide a reliable basis for clustering queries effectively, a commonly used method to measure query similarity in Web retrieval is to count the number of commonly retrieved documents for the queries [5, 21]. This method may work well if the task is to group queries that are very similar. However, to group queries by interest, it is quite likely that two queries that express similar interests in a general topic (e.g., music) may not have any retrieved document in common even though the vocabularies of their retrieved documents may have significant overlap. Therefore, it is more appropriate to measure query similarity based on the contents of the documents returned for each query. Which retrieved documents to choose in generating a representation for the corresponding query depends 3.2 Implementation Figure 3.1 shows an algorithmic description of our approach to user modeling for full-text federated search. Below we discuss several details that are important to its effectiveness. When a query is issued, its query terms are used as its representation in determining which existing query clusters it is most similar to ("classification") for interest-based initial hub selection. However, the chosen query clusters are not necessarily the clusters that the query should join because we need a more reliable representation of the information need for effective query clustering. Therefore, incremental query clustering is conducted 334 after the search results are obtained for the query so that the full contents of the top-ranked merged documents (assuming no feedback is available) can be used to generate the query representation for clustering. Our experimental results demonstrate that a small number of the top-ranked merged documents (e.g., 5-10) are sufficient to provide a reliable representation for query clustering. To distinguish between the two representations of the same query, we refer to the former one used for classification as its "TermRepresentation" and the latter one used for clustering as its "DocRepresentation". We refer to queries that are related conceptually to the user's persistent interests as "characteristic queries" since they are characteristic of his/her long-term information needs. By contrast, queries that represent transient, ad-hoc interests of the user are referred to as "uncharacteristic queries". User modeling allows the consumer to use past search experience to reduce the search radius and improve search efficiency for characteristic queries without reducing accuracy. But a default, larger search radius is still required for uncharacteristic queries to reach more hubs so as to locate sufficient relevant contents. Therefore, for each new query, the consumer needs a classification threshold Tclassify to distinguish uncharacteristic queries from characteristic ones in order to apply different search strategies accordingly. Each query cluster is required to reach a certain size Smin before it is regarded as a topic of interest. This is designed to avoid classifying queries to clusters of uncharacteristic queries formed by chance and to make the description of the topic represented by each cluster more reliable. Instead of classifying a new characteristic query to the most similar cluster, a weighted k-nearest neighbor approach is used to increase the robustness of our approach, where the value of k is determined by Tclassify and the weight is related to the similarity between the query and the cluster. Among all the clusters whose K-L divergence-based distance measures to a query's representation are small enough (determined by a clustering threshold Tcluster), the query chooses to join the largest cluster in order to minimize the "noise" introduced by small clusters of uncharacteristic queries. The total number of query clusters can be limited in order to control the amount of resources dedicated by an information consumer to process and store the language models used to represent the clusters. Although in most cases a consumer may not find it necessary to limit the number of query clusters (the average size of the representation for a query cluster is 69KB in our experiments), associating each cluster with a time stamp and removing infrequently used clusters can reduce clusters of uncharacteristic queries and effectively model the user's interest shift. In our implementation, when the number of query clusters exceeds Nmax, clusters among the r least recently used clusters are removed in an ascending order of cluster size until the number of query clusters drops to Nmax. PROCESSQUERY(q) /* Compare new query to existing query clusters */ characteristic = false initialize M[] = 0 qt = TermRepresentation(q) for each cluster ci if KL(ci, qt)