SIGIR 2007 Proceedings Session 21: Collection Representation in Distributed IR

Federated Text Retrieval From Uncooperative Overlapped Collections

Milad Shokouhi
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
milad@cs.rmit.edu.au

Justin Zobel
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
jz@cs.rmit.edu.au

ABSTRACT

In federated text retrieval systems, the query is sent to multiple collections at the same time. The results returned by the collections are gathered and ranked by a central broker that presents them to the user. It is usually assumed that the collections have little overlap. In practice, however, collections may share many documents as exact or near duplicates, potentially leading to high numbers of duplicates in the final results. Given the natural bandwidth restrictions and efficiency issues of federated search, sending queries to redundant collections leads to unnecessary costs. We propose a novel method for estimating the rate of overlap among collections based on sampling. Using the estimated overlap statistics, we then propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results. We show experimentally that, although our estimates of overlap are not exact, our techniques can significantly improve search effectiveness when collections overlap.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.4 [Information Storage and Retrieval]: Distributed Systems; H.3.7 [Information Storage and Retrieval]: Digital Libraries

General Terms
Algorithms, Design, Experimentation

Keywords
Resource selection, federated search, distributed information retrieval, overlap estimation, overlapped collections

SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. Copyright 2007 ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION

In federated information retrieval (FIR), the query is sent simultaneously to several collections. Each collection evaluates the query and returns its results to the broker. As there is no need to directly access the indexes of the collections, FIR methods can search the hidden web. FIR can also provide a search service over the latest version of the documents in each collection without consuming costly resources for crawling and indexing.

For each query, the broker selects the collections that are most likely to return relevant documents. This is the collection selection problem [Callan et al., 1995; Nottelmann and Fuhr, 2004; Hawking and Thomas, 2005; Si and Callan, 2003a]. To select suitable collections, the broker acquires some knowledge about the contents of each collection, creating the collection representation problem [Baillie et al., 2006; Callan and Connell, 2001]. The broker's knowledge of each collection may vary from detailed vocabulary statistics to a limited number of sampled documents. Once the selected collections return their results, the broker merges them and presents them to the user; this final step is the result merging problem [Callan et al., 1995; Si and Callan, 2003b]. Our paper focuses on the first problem: collection selection.

FIR techniques assume that the degree of overlap among collections is either zero or negligible [Si and Callan, 2003b]. However, many collections, such as bibliographic databases and news resources, have a significant degree of overlap. In such scenarios, selecting collections that are likely to return the same results not only wastes costly resources but also degrades search effectiveness by introducing duplicate documents into the final results. We propose a method that estimates the degree of overlap among collections by sampling from each collection using random queries. In addition, we introduce two collection selection techniques that use the estimated overlap statistics to maximize the number of unique relevant documents in the final results. Our experiments on three testbeds suggest that, compared to state-of-the-art methods, our techniques return fewer duplicate documents. They also significantly outperform the current alternatives in terms of final search precision.

2. RELATED WORK

Many collection selection techniques are based on the assumption that the federated search environment is cooperative; that is, collections provide the broker with their index statistics and other useful information. CORI [Callan et al., 1995], GlOSS [Gravano et al., 1994], and CVV [Yuwono and Lee, 1997] are examples of such methods, among which CORI has been suggested to be the most effective [Craswell et al., 2000; Powell and French, 2003], although there is also evidence that it is poor [D'Souza et al., 2004]. In practice, federated search systems may be uncooperative. In this case, collections do not make their index statistics available to the broker. Therefore, the broker samples documents from each collection and uses the samples for collection selection.
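The CORI belief score mentioned above combines, for each query term, a document-frequency component T with a scarcity component I. The sketch below follows the form of the score as it is commonly presented in the literature, with the usual default constants; the function name and inputs are ours, and this is an illustration rather than code from any of the cited systems.

```python
import math

# Sketch of a CORI-style belief score for one query term against one
# collection, as commonly presented in the literature (constants are the
# usual defaults; this is an illustration, not code from the paper).

def cori_term_belief(df, cw, avg_cw, cf, num_collections, b=0.4):
    """df: document frequency of the term in the collection; cw: collection
    word count; avg_cw: mean word count over all collections; cf: number of
    collections containing the term; num_collections: total collections."""
    t = df / (df + 50 + 150 * cw / avg_cw)                     # frequency part
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1 - b) * t * i                                 # belief in [b, 1)
```

A collection's score for a whole query is typically the mean belief over the query terms, so collections in which the query terms are frequent, and which hold terms that are rare across collections, rank higher.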
Recent collection selection algorithms have been developed for uncooperative environments. Si et al. [2002] proposed a language-modelling framework for collection selection and result merging and showed that it can slightly outperform CORI in most cases. ReDDE [Si and Callan, 2003a] estimates the number of relevant documents in each collection from its sampled documents: collections are ranked by the number of their sampled documents that are ranked highly by a central retrieval model. Nottelmann and Fuhr [2003] introduced a decision-theoretic framework (DTF) for collection selection that tries to minimize the overall costs of federated search, including money, time, and retrieval quality. However, the effectiveness of DTF, in particular for short queries, has been found to be inferior to that of CORI [Nottelmann and Fuhr, 2003]. Nottelmann and Fuhr [2004] showed that combining DTF with CORI can increase its general effectiveness. In a similar approach, Si and Callan [2004] proposed a unified utility maximization framework (UUM) for resource selection. As in ReDDE, UUM runs queries on an index of all sampled documents. It uses training queries to learn the probabilities of relevance of the sampled documents from their central scores. Using these probabilities, UUM selects collections that are likely to maximize either the final precision or the final recall. RUM [Si and Callan, 2005] is an enhanced version of UUM that also considers the search effectiveness of resources; that is, collections are ranked according to the number of relevant documents that they are expected to return, rather than the number of relevant documents that they contain. Training queries are used to model the search effectiveness of collections. Si and Callan [2005] showed that the results produced by RUM are at least as good as those of UUM. Hawking and Thomas [2005] suggested a hybrid approach that combines federated search with centralized techniques.
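The ReDDE-style estimate described above can be sketched in a few lines. The function and variable names here are ours, and the central ranking is simply given as input: collections are scored by how many of their sampled documents appear near the top of a central ranking over all samples, scaled up by the ratio of estimated collection size to sample size.

```python
# Sketch of ReDDE-style collection scoring (our naming, not the authors' code):
# count each collection's sampled documents in the top of a central ranking,
# then scale by estimated collection size over sample size.

def redde_scores(central_ranking, owner, est_size, sample_size, top_k=150):
    """central_ranking: doc ids ordered by a central model over all samples;
    owner: doc id -> collection id; est_size, sample_size: per-collection."""
    counts = dict.fromkeys(est_size, 0)
    for doc_id in central_ranking[:top_k]:
        counts[owner[doc_id]] += 1                 # sampled docs ranked highly
    return {c: counts[c] * est_size[c] / sample_size[c] for c in counts}
```

The scaling step reflects the assumption that each sampled document stands in for |C|/|S| documents of its collection, so a large collection with a few highly ranked samples can outscore a small collection with many.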
In their method, the link anchor text available in a set of crawled documents is used to describe collections that are not crawled. Collections are ranked according to the similarities of the crawled pages that refer to them. They showed that their technique can outperform ReDDE and CORI on some tasks.

Collection representation. The broker needs a description of each collection before collection selection; collections are then selected according to the similarities of their descriptions to the query. In cooperative environments, collections provide the broker with their own descriptions. In uncooperative environments, where collections communicate with the broker only in response to queries, query-based sampling (QBS) [Callan and Connell, 2001] can be used to obtain collection descriptions. In QBS, probe queries are submitted to collections, and the documents returned are downloaded and used as descriptions. Several strategies have been suggested for selecting the probe queries. Callan and Connell [2001] chose the probe queries from a set of dictionary words or from previously sampled documents. Gravano et al. [2003] suggested that probe queries be selected from the nodes of a hierarchical classification tree. We found that selecting the probe queries from query logs can lead to better samples and significantly improves search effectiveness [Shokouhi et al., 2007]. Hedley et al. [2004] selected the initial probe queries from the search interfaces of collections. Query-based sampling may be static or adaptive. In static QBS, a fixed number of documents (say 300) is downloaded from each collection [Callan and Connell, 2001]. In adaptive QBS [Baillie et al., 2006; Caverlee et al., 2006; Shokouhi et al., 2006a], sampling terminates according to the size of the collection or the number of new terms in the most recent samples.

Result merging.
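The static QBS loop described above can be sketched as follows. Here `search_collection` is a stand-in for the remote collection's query interface, and growing the probe-term pool from newly downloaded documents is one simple choice among the strategies cited; all names are ours.

```python
# Sketch of static query-based sampling: issue one-term probe queries and keep
# the returned documents until a fixed sample size is reached.
# `search_collection` is a stand-in for the remote collection's search API.

def query_based_sample(search_collection, seed_terms, target=300, per_query=4):
    sample = {}                       # doc id -> document text
    terms = list(seed_terms)          # pool of candidate probe terms
    used = set()                      # probes already issued
    while terms and len(sample) < target:
        probe = terms.pop(0)
        if probe in used:
            continue
        used.add(probe)
        for doc_id, text in search_collection(probe, limit=per_query):
            sample[doc_id] = text
            # grow the probe-term pool from newly seen documents
            terms.extend(t for t in text.split()
                         if t not in used and t not in terms)
    return sample
```

An adaptive variant would replace the fixed `target` with a stopping rule based on the estimated collection size or on how few new terms the latest downloads contribute.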
Once the selected collections return their results, the broker merges them into a single list and presents them to the user. CORI [Callan et al., 1995] and SSL [Si and Callan, 2003b] are two well-known merging algorithms. We use SSL in our experiments, as it has been found to be more effective than CORI [Si and Callan, 2003b].

Duplicate detection in FIR. Federated search techniques typically assume that collections are independent and overlap-free. Only two previous studies have explored the problem of overlapped collections. We introduced a method for detecting duplicate and near-duplicate documents among the returned results [Bernstein et al., 2006]. In this approach, collections send a hash vector with each document they return to the broker. The broker detects duplicate and near-duplicate documents by comparing the returned hash vectors, and discards them during merging. The major drawback of this technique is that it may not be applicable in uncooperative environments, as it expects collections to use the same hash functions and to return a hash value per document. COSCO is an overlap-aware collection selection method proposed by Hernandez and Kambhampati [2005] for uncooperative environments; it uses a large number of training queries to measure the degree of overlap among collections. For collection selection, each query is matched against the previous training queries. Then, according to the overlap statistics stored for the training queries, collections are selected in a way that is expected to minimize the number of duplicate documents in the final results. If the terms of a new query do not appear in the previous training queries, COSCO cannot effectively estimate the overlap and select suitable collections. Also, Hernandez and Kambhampati [2005] test COSCO on a set of bibliographic datasets; its effectiveness for heterogeneous datasets and typical web pages has not been investigated.
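In the cooperative scheme above, each returned result carries a hash that the broker compares during merging. A minimal sketch of that broker-side step, with the hash vector simplified to a single content digest (so only exact copies are caught, whereas the cited scheme also catches near-duplicates):

```python
import hashlib

# Sketch of broker-side duplicate filtering during result merging.  The
# cooperative scheme described above sends a hash vector per document so that
# near-duplicates can also be detected; this sketch reduces it to one
# exact-match content digest per document.

def merge_without_duplicates(ranked_lists):
    """ranked_lists: one (score, text) list per collection, best first."""
    pool = sorted((item for lst in ranked_lists for item in lst), reverse=True)
    merged, seen = [], set()
    for score, text in pool:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:        # drop duplicates, keep best-scoring copy
            seen.add(digest)
            merged.append((score, text))
    return merged
```

Sorting raw scores across collections assumes the scores are comparable, which in practice they are not; that is precisely why score-normalizing merging methods such as SSL are used.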
3. OVERLAP ESTIMATION

In uncooperative environments, the documents downloaded by query-based sampling are the only source of information about collections, so it is necessary to extract as much information as possible from the samples. Considering the efficiency restrictions and bandwidth limits, it is advantageous if any extra information is extracted from documents that have already been downloaded. Methods for estimating the size of collections, such as sample-resample [Si and Callan, 2003a] and capture-history [Shokouhi et al., 2006b], already have this characteristic: they can estimate the size of a collection with a small number of probe queries and sampled documents. In this section, we introduce a novel method for estimating the degree of overlap among collections. Our technique uses the documents downloaded by query-based sampling to estimate the rate of overlap and does not require any additional information.

Suppose that we have two collections C1 and C2, and that there are K overlap documents between them. We gather one sample from each collection using query-based sampling: S1 from C1 and S2 from C2. Let m1 and m2 denote the documents from S1 and S2 that are in K; that is, m1 and m2 are the subsets of sampled documents that are drawn from the overlapped documents between C1 and C2:

    m_1 = S_1 \cap K    and    m_2 = S_2 \cap K

Assuming that the samples are random, we can estimate the sizes of m1 and m2 as follows:

    |m_x| = |S_x| \times \frac{K}{|C_x|},    x \in \{1, 2\}    (1)

Here, |C_x| is the size of collection C_x, which can be estimated by the capture-history technique [Shokouhi et al., 2006b]. From another perspective, m1 and m2 can be regarded as two random samples from the population of overlapped documents. The probability that a given document from m1 is also available in m2 is \theta = |m_2|/K; therefore, the probability that a given document from m1 is not available in m2 is 1 - \theta. The expected number of documents in m1 that are available in m2 is

    E(D) = \sum_{i=0}^{|m_1|} i \, P(i)

where P(i) is the probability that exactly i documents in m1 are also available in m2. (Note that, by symmetry, E(D) is also the expected number of documents in m2 that are available in m1.) Since P(i) follows a binomial distribution, we have

    E(D) = \sum_{i=0}^{|m_1|} i \binom{|m_1|}{i} \theta^i (1-\theta)^{|m_1|-i}
         = \sum_{i=1}^{|m_1|} \frac{|m_1|!}{(i-1)! \, (|m_1|-i)!} \, \theta^i (1-\theta)^{|m_1|-i}
         = |m_1| \theta \sum_{i=1}^{|m_1|} \frac{(|m_1|-1)!}{(i-1)! \, (|m_1|-i)!} \, \theta^{i-1} (1-\theta)^{(|m_1|-1)-(i-1)}

Substituting j = i - 1 gives

    E(D) = |m_1| \theta \sum_{j=0}^{|m_1|-1} \frac{(|m_1|-1)!}{j! \, ((|m_1|-1)-j)!} \, \theta^{j} (1-\theta)^{(|m_1|-1)-j}

By the binomial theorem, (x + y)^n = \sum_{l=0}^{n} \binom{n}{l} x^l y^{n-l}, so the sum equals (\theta + 1 - \theta)^{|m_1|-1} = 1, which gives

    E(D) = |m_1| \theta = \frac{|m_1| |m_2|}{K}    (2)

Equation (2) is the expected number of documents in m1 that are common with m2 (and, symmetrically, of documents in m2 that are common with m1); it depends only on |m_1|, |m_2|, and K, and not on the collection sizes. Given the number of duplicate documents D observed within the two samples, substituting Eq. (1) into Eq. (2) allows K to be estimated as

    \hat{K} = \frac{|C_1| |C_2|}{|S_1| |S_2|} \times D    (3)

Once the number of duplicate documents shared by collections is estimated, it can be used by collection selection algorithms to maximize the number of unique relevant documents in the final results. In the following sections, we introduce two methods that use the estimated overlap statistics for collection selection.

4. THE `RELAX' SELECTION METHOD

A federated search environment in the presence of overlapped documents can be represented by a graph G. In this graph, each vertex is a collection and each edge indicates the existence of overlap documents between a pair of collections.

Algorithm 1 Relax resource selection
 1: Relax-Selection(G, w, s)
 2:   Initialize-Graph(G, S)
 3:   S <- {}
 4:   Q <- V[G]
 5:   while Q != {} do
 6:     u <- Extract-Max(Q)
 7:     S <- S + {u}
 8:     for each vertex v in Adj[u] do
 9:       d[v] <- d[v] - w_e(u, v)    // relaxing
10:     end for
11:   end while

We aim to minimize the number of duplicate documents in the final merged list. For this purpose, the selection algorithm has to avoid simultaneously selecting collections that are likely to return the same answers. In our Relax collection selection technique, the number of relevant documents in each collection (vertex) is estimated first. As in ReDDE [Si and Callan, 2003a] and UUM [Si and Callan, 2004], we rank all the sampled documents from all collections for each query. Assuming that the top documents returned in this central ranking are relevant (we used the top 150), the number of relevant sampled documents for each collection is counted. Then, according to the estimated size of the collection and the number of sampled documents, the total number of relevant documents in collection C is computed as

    \hat{R} = \frac{r \times |C|}{|S|}

Here, \hat{R} is the estimated number of relevant documents in collection C, |C| is the estimated size of the collection, r is the number of sampled documents from C that are ranked in the top results by the central retrieval model, and |S| is the number of documents sampled from C.

Relax uses the estimated values \hat{R} as the weights of the collections. In the next step, the weight w_e(u, v) of an edge between two collections u and v is calculated from the approximate number of common relevant documents between u and v:

    w_e(u, v) = |\hat{R}_u \cap \hat{R}_v| = \frac{|\hat{R}_u \cup \hat{R}_v| \times \hat{K}}{|C_u \cup C_v|}    (4)

Here, |C_u \cup C_v| is the total number of documents in both collections, \hat{R}_u and \hat{R}_v are respectively the estimated numbers of relevant documents in collections u and v, and \hat{K} is the estimated number of common documents between u and v, calculated by Eq. (3). At each stage, the collection with the highest estimated number of relevant documents is selected, and the weights of the other collections are updated by subtracting the estimated number of relevant documents they share with the selected collection (that is, by relaxing).

In summary, our Relax selection method (Algorithm 1) is as follows:

1. Documents are downloaded from each collection using query-based sampling.
2. The size of each collection and the number of overlapped relevant documents between each pair of collections are estimated.
3. The federated environment is represented by a graph, where each vertex is a collection and the weight of each edge is computed from the number of common relevant documents between the connected pair (Fig. 1).
4. The collection with the highest estimated number of relevant documents is selected.
5. Relax updates the graph by relaxing all collections and removing unnecessary edges.
6. Stop if there are no more edges or enough collections have been chosen; otherwise, go to step 4.

[Figure 1: The Relax selection on a sample graph. Each vertex (Cn) represents a federated collection. (A) The graph initialization, where R is the estimated number of relevant documents in each collection. (B) The graph after initialization, where C4 is selected as the most relevant collection according to its R value. The weight w_e(u, v) of an edge between u and v is computed from the estimated number of documents common to u and v. (C)-(E) The status of the graph after selecting each collection (vertex).]

Figure 1 shows a simple example with four overlapped collections. At the first step (A), the number of relevant documents R in each collection is estimated. At the next stage (B), the collection with the highest number of relevant documents (C4) is selected. The graph is relaxed by subtracting the estimated number of common relevant documents between the top collection and each connected collection, and after each update the edges connected to the most recently selected collection are removed. This process continues until there is no edge left in the graph (C)-(E). That is, Relax selects collections according to the number of their unvisited relevant documents.

5. OVERLAP FILTERING FOR REDDE

Another strategy for avoiding duplicates in the final results is to remove collections with a high degree of overlap from the resource selection ranking. That is, the degree of overlap between collection pairs is estimated first. Then, for each query, collections are ranked using a resource selection method such as ReDDE [Si and Callan, 2003a]. Each collection in the ranked list is compared with the collections at higher ranks, and is pruned from the original ranked list if it has a large estimated overlap with at least one of the collections ranked above it.

[Figure 2: The overlap among collection pairs in the Qprobed-280 testbed.]
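The overlap estimator of Eq. (3) and the greedy relaxation loop of Algorithm 1 can be sketched together as follows. This is a sketch under our own naming, not the paper's code: collection sizes are assumed to come from an estimator such as capture-history, and the edge weights are assumed to be precomputed as in Eq. (4).

```python
# Sketch of Eq. (3) and the Relax selection loop (our naming, not the paper's
# code).  Collection sizes would come from an estimator such as capture-history.

def estimate_overlap(sample1, sample2, size1, size2):
    """Eq. (3): K ~= |C1||C2| * D / (|S1||S2|), with D the number of
    duplicate documents observed across the two downloaded samples."""
    d = len(set(sample1) & set(sample2))
    return size1 * size2 * d / (len(sample1) * len(sample2))

def relax_select(relevant, edge_weight, budget):
    """relevant: collection -> estimated number of relevant documents;
    edge_weight: (u, v) -> estimated shared relevant documents (Eq. 4).
    Greedily pick the heaviest vertex, then relax its neighbours."""
    weights, selected = dict(relevant), []
    while weights and len(selected) < budget:
        u = max(weights, key=weights.get)
        selected.append(u)
        del weights[u]
        for v in weights:             # the relaxing step of Algorithm 1
            shared = edge_weight.get((u, v), edge_weight.get((v, u), 0.0))
            weights[v] = max(0.0, weights[v] - shared)
    return selected
```

On a graph shaped like the Figure 1 example (R values 16, 15, 8, 20), the loop selects C4 first and then proceeds by the remaining, unvisited relevant documents of each neighbour.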
Our filtering method for ReDDE, which we refer to as F-ReDDE, can be summarized as follows:

1. The overlaps among collections are estimated as described for Relax selection.
2. Collections are ranked using a resource selection algorithm such as ReDDE.
3. Each collection is compared with the previously selected collections, and is removed from the list if it has a high overlap (greater than a threshold, which we call tau) with any of them.

We empirically chose tau = 30% and leave methods for finding the optimal value as future work. The effectiveness of this method is expected to depend strongly on the underlying collection selection technique, and the optimal value of tau may vary between testbeds. In the following sections, we compare the effectiveness of our selection methods with that of other techniques in the presence of overlap among collections.

6. TESTBEDS

The effectiveness of FIR methods tends to vary substantially across testbeds [D'Souza et al., 2004; Si and Callan, 2003a]. Unfortunately, in the standard FIR testbeds [Powell and French, 2003; Si and Callan, 2003b], there is no overlap among collections. We therefore create three new testbeds with overlapping collections based on the documents in the TREC GOV dataset. We do not claim that our strategies for creating these testbeds are perfect, or that the testbeds entirely reflect the characteristics of web collections. However, considering the datasets available for information retrieval experiments, we believe that the suggested testbeds are acceptable.

Qprobed-280. For this testbed, we used the 360 most frequent queries, in a query log provided by a major search engine, that have a highly ranked answer in the .gov domain. Each selected query is passed as a probe query to an index of the TREC GOV documents. For each probe query, a random number of documents (between 5 000 and 20 000) are downloaded as a collection; queries that return fewer than 5 000 answers are discarded. In total, 280 collections with an average size of 12 194 documents were generated. The largest and smallest collections in this testbed contain 19 860 and 5 001 documents respectively. The documents in each collection match the same query and are likely to have somewhat similar topicality. Figure 2 depicts the degree of overlap among collections in this testbed: 74 492 collection pairs have less than 10% overlap, while only 79 pairs have more than 90% of their documents in common. The overall rate of overlap among collections is low; only 1.1% of collection pairs in this testbed have more than 50% overlap.

Qprobed-300. Starting from the first collection in the previous testbed, every twentieth collection is merged into a single large collection. The same procedure is applied to every twentieth collection starting from each of the next 13 collections (collections 2, 3, ..., 14) of the Qprobed-280 testbed. In total, the testbed comprises 300 collections. Figure 3 illustrates the degree of overlap among collection pairs: about 1.9% of collection pairs have more than 50% overlap, which is higher than in the Qprobed-280 testbed. The collections in this testbed vary in size from 5 001 to 180 985 documents, with an average of 20 908 documents.

[Figure 3: The overlap among collection pairs in the Qprobed-300 testbed.]

Sliding-115. We used a sliding window of 30 000 documents over the TREC GOV documents to generate 112 collections. Each collection has a random percentage of overlap P% (25 < P < 100) with the previous collection. Then, starting from the first collection, every tenth collection is collapsed into a single large collection. The same procedure is applied to every tenth collection starting from the second and third collections, forming two additional large collections. In total, there are 115 collections. Figure 4 illustrates the degree of overlap among collection pairs in this testbed: about 2.5% of collection pairs have more than 50% overlap.

[Figure 4: The overlap among collection pairs in the Sliding-115 testbed.]

We expect that the impact of using overlap statistics becomes more significant as the overall overlap in the testbeds increases. The experimental results reported in the following sections confirm this hypothesis.

7. RESULTS

The accuracy of our method for estimating the rate of overlap among collection pairs is measured using an average estimation error (AEE) metric, defined as

    AEE = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \frac{|\hat{D}(i,j) - D(i,j)|}{D(i,j)},    where    D(i,j) = \frac{|C_i \cap C_j|}{|C_i|}    (5)

Here, n is the total number of collections and D(i, j) is the proportion of the documents in collection i that are common with collection j. The value |C_i \cap C_j| is equivalent to K in Eq. (3), and |C_i| is the size of collection i.

In our preliminary experiments, the initial estimates of D(i, j) suggested that the degree of overlap among collections is usually overestimated. This observation is easily explained: document retrieval models are biased towards returning some popular documents for many queries [Garcia et al., 2004], and we have found evidence that the samples produced by query-based sampling are not random [Shokouhi et al., 2006b]. Therefore, the number of common documents between collection samples is often higher than in the random scenario, causing the overestimation of overlap in Eq. (3). Thus, we divide the estimated overlaps by the largest overlap value for normalization. In the rest of this paper, and for all of our experiments, we use the normalized overlap values.

The AEE values computed for the Qprobed-300, Qprobed-280, and Sliding-115 testbeds are respectively 0.91, 0.93, and 0.70. Figure 5 shows the accuracy of the estimates for overlapped collections in the Sliding-115 testbed; the horizontal axis represents the collection pairs sorted by their actual overlap. It can be seen that the estimated values and the actual overlap rates correlate. The errors in the estimates are largely due to two factors: (1) the query-based sampling technique does not produce random samples from collections [Shokouhi et al., 2006b], and (2) the samples are so small that they may capture no duplicate documents from which to estimate the degree of overlap. The trends for the other two testbeds were similar and we do not present them here.

[Figure 5: The accuracy of overlap estimation for collection pairs in the Sliding-115 testbed.]

In the rest of this section, we show that, although our estimates of overlap are not perfectly accurate, our methods can significantly improve search effectiveness in the presence of overlap among collections.

Search effectiveness. To make our results comparable with previous work [Nottelmann and Fuhr, 2003; 2004; Si and Callan, 2003a;b; 2004; 2005], we run traditional static query-based sampling on each collection: probe queries are selected randomly from the set of sampled documents, and sampling terminates once 300 documents have been downloaded. ReDDE is one of the most successful collection selection techniques that does not require training data, so we use it as the baseline for our experiments; we also compare the results with CORI. For simplicity, we assume that all collections use the INQUERY [Callan et al., 1992] retrieval model and return at most 100 answers per query. We apply SSL [Si and Callan, 2003b], the best current FIR merging method, to merge the results, and compare methods by their precision values at early recall points (P@n). Differences between ReDDE and the other methods that are statistically significant under the t-test at the 90% and 95% confidence levels are marked in the result tables. Duplicate documents in the final merged lists are considered irrelevant.

Table 1 shows the precision values obtained by running TREC topics 551-600 on the Qprobed-280 testbed. The cutoff values represent the number of collections selected per query. For cutoffs 5 and 10, there is little difference between the methods, and only at P@5 does Relax have a small advantage over the alternatives. When 20 collections are selected, the gaps become more apparent: Relax significantly outperforms ReDDE at precision 5 and 10, F-ReDDE is at least as good as ReDDE, and CORI produces a better P@5 value than ReDDE.

Table 1: Precision values obtained by collection selection methods on the Qprobed-280 testbed. TREC topics 551-600 are used as queries. Cutoff values show the number of collections selected per query. For each query, the duplicate documents in the final merged lists are considered irrelevant.

             Cutoff=5                   Cutoff=10                  Cutoff=20
             P@5   P@10  P@15  P@20    P@5   P@10  P@15  P@20    P@5   P@10  P@15  P@20
    CORI     0.163 0.149 0.134 0.113   0.167 0.142 0.121 0.110   0.195 0.140 0.125 0.108
    F-ReDDE  0.167 0.132 0.126 0.109   0.167 0.144 0.119 0.104   0.171 0.142 0.123 0.112
    ReDDE    0.167 0.132 0.126 0.109   0.167 0.144 0.119 0.104   0.163 0.142 0.122 0.112
    Relax    0.187 0.142 0.114 0.107   0.183 0.161 0.126 0.112   0.208 0.161 0.122 0.108

Table 2 shows the precision values produced by the selection methods on the Qprobed-300 testbed. The differences between ReDDE and Relax are often significant; the other methods have no significant advantage over each other, although ReDDE is usually the best among them.

Table 2: Precision values obtained by collection selection methods on the Qprobed-300 testbed. TREC topics 551-600 are used as queries. Cutoff values show the number of collections selected per query. For each query, the duplicate documents in the final merged lists are considered irrelevant.

             Cutoff=5                   Cutoff=10                  Cutoff=20
             P@5   P@10  P@15  P@20    P@5   P@10  P@15  P@20    P@5   P@10  P@15  P@20
    CORI     0.163 0.142 0.125 0.115   0.171 0.118 0.106 0.106   0.187 0.128 0.102 0.098
    F-ReDDE  0.171 0.146 0.134 0.120   0.167 0.149 0.125 0.105   0.167 0.128 0.108 0.091
    ReDDE    0.171 0.149 0.136 0.120   0.163 0.144 0.126 0.103   0.167 0.132 0.111 0.093
    Relax    0.204 0.177 0.140 0.128   0.187 0.153 0.130 0.121   0.179 0.157 0.126 0.111

On the Sliding-115 testbed (Table 3), Relax produces the best results, significantly outperforming ReDDE in 5 of 12 cases. F-ReDDE also produces better results than ReDDE in general; however, the differences are smaller and significant in only two cases. The precision values for CORI at cutoff=5 are specified in italics to indicate that they are significantly worse than those of ReDDE.

Table 4: The average number of duplicate documents returned by each method per query for the Qprobed-280 testbed. Cutoff values (CO) represent the number of collections selected.

             CO=5   CO=10  CO=20
    CORI     20.58  27.66  35.94
    F-ReDDE  15.22  25.95  33.04
    ReDDE    15.22  26.57  34.49
    Relax    13.51  22.57  32.04

Table 5: The average number of duplicate documents returned by each method per query for the Qprobed-300 testbed. Cutoff values (CO) represent the number of collections selected.

             CO=5   CO=10  CO=20
    CORI     20.28  29.73  42.20
    F-ReDDE  33.20  49.95  55.32
    ReDDE    32.81  50.69  56.00
    Relax    30.51  39.51  48.08
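The average estimation error of Eq. (5) used in these experiments can be sketched as follows (our naming; pairs with zero true overlap would be excluded to avoid division by zero):

```python
# Sketch of the AEE metric of Eq. (5): the mean relative error of estimated
# pairwise overlap proportions D(i, j) = |Ci intersect Cj| / |Ci|.

def average_estimation_error(true_d, est_d):
    """true_d, est_d: ordered pair (i, j) -> overlap proportion, i != j.
    Only pairs with nonzero true overlap should be included."""
    errors = [abs(est_d[pair] - true_d[pair]) / true_d[pair] for pair in true_d]
    return sum(errors) / len(errors)
```

Since the metric averages relative rather than absolute errors, overestimating a small overlap is penalized heavily, which matches the overestimation behaviour discussed above.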
This weakness of CORI is consistent with the observations of Si and Callan [2003a], who suggest that CORI is poorer than ReDDE on testbeds with skewed distributions of collection sizes. The differences between CORI and ReDDE become negligible at larger cutoff points. As the extent of overlap among collections in the testbeds increases, the impact of using an overlap-aware collection selection method becomes more apparent. While there is no significant difference between methods on the Qprobed-280 testbed, the overlap-aware methods outperform the FIR baselines on the other two testbeds. This confirms our hypothesis that, as the overlap grows, the need for overlap-aware selection methods becomes more noticeable.

Table 6: The average number of duplicate documents returned by each method per query for the Sliding-115 testbed. Cutoff values (CO) represent the number of collections selected.

         CO=5   CO=10  CO=20
CORI      8.10  11.29  20.39
F-ReDDE  15.95  19.12  24.00
ReDDE    18.50  20.50  28.02
Relax    16.22  20.68  22.77

Number of duplicates. For each cutoff value, we compare the average number of duplicate documents returned by the methods per query. Table 4 suggests that when the rate of overlap among collections is low, the numbers of duplicate documents returned by CORI, F-ReDDE, and ReDDE are usually comparable. Relax performs better than ReDDE, reducing the number of duplicate documents by 11% and 15% for cutoff values 5 and 10 respectively. Tables 5 and 6 suggest that, compared to ReDDE, the overlap-aware selection methods can significantly reduce the chance of encountering a duplicate document in the final results. On both testbeds, CORI returns the lowest number of duplicate documents. This is due to the poor performance of CORI on testbeds with a skewed distribution of collection sizes: compared to the other methods, CORI selects the three large collections in the Sliding-115 testbed for fewer queries.
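The per-query duplicate counts reported in Tables 4-6 amount to counting result-list entries that repeat an earlier document. A minimal sketch of that bookkeeping, assuming exact-duplicate matching on document identifiers (the function name is illustrative, not from the paper):

```python
def count_duplicates(merged_results):
    """Count entries in a merged result list that repeat an earlier document.

    merged_results is the ranked list of document identifiers produced
    after merging the selected collections' results. Every reappearance
    of an already-seen identifier counts once, mirroring the evaluation
    rule that treats duplicates in the final list as irrelevant.
    """
    seen = set()
    duplicates = 0
    for doc_id in merged_results:
        if doc_id in seen:
            duplicates += 1
        else:
            seen.add(doc_id)
    return duplicates
```

Detecting near-duplicates rather than exact copies would need content fingerprints in place of identifiers, as in Bernstein et al. [2006].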
A similar observation can be made for the 14 large collections of the Qprobed-300 testbed: CORI selects them for fewer queries than the other methods do. As these collections cause the highest overlap in the testbeds, missing them during collection selection significantly reduces the number of duplicate documents in the final results. However, Tables 2 and 3 show that the effectiveness of CORI on these testbeds is poor even when the number of duplicate documents is low. This is mainly because the large collections missed by CORI have a high density of relevant documents. The average numbers of duplicates returned by ReDDE and F-ReDDE are similar on the Qprobed-300 testbed. On the Sliding-115 testbed, F-ReDDE returns 13% and 14% fewer duplicates when 5 and 20 collections are selected, respectively.

Table 3: The precision values obtained by collection selection methods on the Sliding-115 testbed. TREC topics 551-600 are used as queries. Cutoff values show the number of collections selected per query. For each query, duplicate documents in the final merged lists are considered irrelevant.

         Cutoff=5                  Cutoff=10                 Cutoff=20
         P@5   P@10  P@15  P@20   P@5   P@10  P@15  P@20   P@5   P@10  P@15  P@20
CORI     0.091 0.075 0.063 0.056  0.187 0.139 0.108 0.095  0.179 0.145 0.140 0.128
F-ReDDE  0.166 0.143 0.122 0.107  0.170 0.141 0.125 0.110  0.166 0.141 0.137 0.122
ReDDE    0.154 0.135 0.111 0.101  0.175 0.131 0.125 0.110  0.162 0.143 0.131 0.114
Relax    0.171 0.154 0.122 0.115  0.187 0.156 0.133 0.115  0.154 0.152 0.140 0.133

On the latter two testbeds, the number of duplicates returned by Relax is substantially smaller than for ReDDE. Relax returns 12% and 18% fewer documents than ReDDE for cutoff values 5 and 20 respectively on the Sliding-115 testbed. On the Qprobed-300 testbed, the numbers of duplicates for Relax at CO=10 and CO=20 are respectively 22% and 14% less than those for ReDDE. Overall, Relax produces the highest precision values.
It also returns fewer duplicate documents than all methods but CORI. The F-ReDDE approach works well on some testbeds but on others is not significantly better than the alternatives. This might be due to a poor choice of the threshold parameter, which was set based on our preliminary experiments.

8. CONCLUSIONS

We have introduced a novel technique for estimating the degree of overlap among uncooperative collections. Our method uses the sampled documents downloaded for collection selection and does not require any additional information. We have also proposed two overlap-aware collection selection techniques that consider the overlap statistics of resources during collection selection. Our experimental results show that, in the presence of overlap, our techniques can significantly outperform previous collection selection methods in terms of search effectiveness. They also lead to a smaller number of duplicate documents in the final merged results. Several open directions remain for future work. For the experiments in this paper, we used static query-based sampling and downloaded 300 documents from each collection. It would be interesting to investigate the impact of sample size and sampling strategies on the accuracy of overlap estimation and on the final search effectiveness. Moreover, we arbitrarily set the threshold parameter to 30% according to preliminary experiments, but our results suggest that its best value varies across testbeds. Finally, although in theory the methods discussed in this paper are applicable to avoiding near-duplicate documents, they have not been tested in such a scenario.

References

Baillie, M., Azzopardi, L., and Crestani, F. (2006). Adaptive query-based sampling of distributed collections. In Proc. SPIRE Conf., pages 316-328, Glasgow, UK.

Bernstein, Y., Shokouhi, M., and Zobel, J. (2006). Compact features for detection of near-duplicates in distributed retrieval. In Proc. SPIRE Conf., pages 110-121, Glasgow, UK.

Callan, J. and Connell, M.
(2001). Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2):97-130.

Callan, J., Croft, W. B., and Harding, S. (1992). The INQUERY retrieval system. In Proc. Int. Conf. on Database and Expert Systems Applications, pages 78-83, Valencia, Spain.

Callan, J., Lu, Z., and Croft, W. B. (1995). Searching distributed collections with inference networks. In Proc. ACM SIGIR Conf., pages 21-28, Seattle, Washington.

Caverlee, J., Liu, L., and Bae, J. (2006). Distributed query sampling: a quality-conscious approach. In Proc. ACM SIGIR Conf., pages 340-347, Seattle, Washington.

Craswell, N., Bailey, P., and Hawking, D. (2000). Server selection on the World Wide Web. In Proc. ACM Conf. on Digital Libraries, pages 37-46, San Antonio, Texas.

D'Souza, D., Zobel, J., and Thom, J. (2004). Is CORI effective for collection selection? An exploration of parameters, queries, and data. In Proc. Australian Document Computing Symposium, pages 41-46, Melbourne, Australia.

Garcia, S., Williams, H., and Cannane, A. (2004). Access-ordered indexes. In Proc. Australasian Computer Science Conf., pages 7-14, Darlinghurst, Australia.

Gravano, L., Garcia-Molina, H., and Tomasic, A. (1994). The effectiveness of GlOSS for the text database discovery problem. In Proc. ACM SIGMOD Conf., pages 126-137, Minneapolis, Minnesota.

Gravano, L., Ipeirotis, P., and Sahami, M. (2003). QProber: A system for automatic classification of Hidden-Web databases. ACM Transactions on Information Systems, 21(1):1-41.

Hawking, D. and Thomas, P. (2005). Server selection methods in hybrid portal search. In Proc. ACM SIGIR Conf., pages 75-82, Salvador, Brazil.

Hedley, Y., Younas, M., James, A., and Sanderson, M. (2004). A two-phase sampling technique for information extraction from hidden web databases. In Proc. ACM Workshop on Web Information and Data Management, pages 1-8, Washington, DC.

Hernandez, T. and Kambhampati, S. (2005).
Improving text collection selection with coverage and overlap statistics. In Int. Conf. on World Wide Web, pages 1128-1129, Chiba, Japan.

Nottelmann, H. and Fuhr, N. (2003). Evaluating different methods of estimating retrieval quality for resource selection. In Proc. ACM SIGIR Conf., pages 290-297, Toronto, Canada.

Nottelmann, H. and Fuhr, N. (2004). Combining CORI and the decision-theoretic approach for advanced resource selection. In Proc. European Conf. on Information Retrieval, pages 138-153, Sunderland, UK.

Powell, A. L. and French, J. (2003). Comparing the performance of collection selection algorithms. ACM Transactions on Information Systems, 21(4):412-456.

Shokouhi, M., Scholer, F., and Zobel, J. (2006a). Sample sizes for query probing in uncooperative distributed information retrieval. In Proc. Asia Pacific Web Conf., pages 63-75, Harbin, China.

Shokouhi, M., Zobel, J., Scholer, F., and Tahaghoghi, S. (2006b). Capturing collection size for distributed non-cooperative retrieval. In Proc. ACM SIGIR Conf., pages 316-323, Seattle, Washington.

Shokouhi, M., Zobel, J., Tahaghoghi, S., and Scholer, F. (2007). Using query logs to establish vocabularies in distributed information retrieval. Information Processing and Management, 43(1):169-180.

Si, L. and Callan, J. (2003a). Relevant document distribution estimation method for resource selection. In Proc. ACM SIGIR Conf., pages 298-305, Toronto, Canada.

Si, L. and Callan, J. (2003b). A semisupervised learning method to merge search engine results. ACM Transactions on Information Systems, 21(4):457-491.

Si, L. and Callan, J. (2004). Unified utility maximization framework for resource selection. In Proc. ACM CIKM Conf., pages 32-41, Washington, DC.

Si, L. and Callan, J. (2005). Modeling search engine effectiveness for federated search. In Proc. ACM SIGIR Conf., pages 83-90, Salvador, Brazil.

Si, L., Jin, R., Callan, J., and Ogilvie, P. (2002).
A language modeling framework for resource selection and results merging. In Proc. ACM CIKM Conf., pages 391-397, McLean, Virginia.

Yuwono, B. and Lee, D. L. (1997). Server ranking for distributed text retrieval systems on the internet. In Proc. Int. Conf. on Database Systems for Advanced Applications (DASFAA), pages 41-50, Melbourne, Australia.