Capturing Collection Size for Distributed Non-Cooperative Retrieval

Milad Shokouhi, Justin Zobel, Falk Scholer, S.M.M. Tahaghoghi
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia

ABSTRACT

Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in their use of the resources required to perform the estimation.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Selection Process; H.3.4 [Systems and Software]: Distributed Systems; H.3.7 [Digital Libraries]: Collection

General Terms
Algorithms

Keywords
Capture-History, Sample-Resample, Capture-Recapture, Collection Size Estimation

1. INTRODUCTION

Large numbers of independent information collections do not permit their content to be indexed by third parties, and enforce search through their own search interfaces. Distributed Information Retrieval (DIR) systems aim to provide a unified search interface for multiple independent collections. In such systems, a central broker receives user queries, sends these in parallel to the underlying collections, and collates the received answers into a single list for presentation to the user. The broker selects the collections to use for each query on the basis of information that it has previously accumulated on each collection. This information, usually referred to as collection summaries [Ipeirotis and Gravano, 2004] or representation sets [Si and Callan, 2003a], typically includes global statistics about the vocabularies of each collection. In cooperative DIR environments, collection summaries are published, and brokers use them for collection selection. In practice, however, many collections are non-cooperative, and do not publish information that can be used for collection selection. For such non-cooperative collections, brokers must gather information for themselves, generally by sending probe queries to each collection and analyzing the answers, a process known as query-based sampling (QBS) [Callan and Connell, 2001]. The size of a collection is a key indicator used in the most effective of the collection selection algorithms, such as KL-Divergence [Si et al., 2002] and ReDDE [Si and Callan, 2003b].
In a non-cooperative environment, a broker must estimate collection size; QBS mechanisms such as the sample-resample method [Si and Callan, 2003b] have been proposed as suitable for this purpose, but their accuracy has not been fully investigated. However, if the estimation is inaccurate, the effectiveness of these methods is significantly undermined. In this paper, we describe experiments on collections of real data that show that sample-resample is in fact unsuitable for appraising collection size in non-cooperative environments, producing widely varying estimates ranging from 50% to 800% of the actual value. To fulfill the need for accurate estimates, we present two new methods based on the capture-recapture approach [Sutherland, 1996], a simple statistical technique that uses the observed probability of two samples containing the same document. Our refined methods, which we call multiple capture-recapture and capture-history, make rich use of sampling patterns. Our experiments show that they provide a closer estimate of collection size than previous methods, and require less information than the sample-resample technique.

The remainder of this paper is organised as follows. In Section 2, we review related work. In Section 3, we introduce our approach and provide simple examples. Experimental issues, and biases that occurred in implementation, are discussed in Section 4. Section 5 contains the experimental results of the different approaches; we conclude in Section 6.

2. RELATED WORK

In cooperative environments, brokers have comprehensive information about each collection, including its size [Powell and French, 2003]. However, there is limited previous work on collection size estimation in non-cooperative environments, where brokers do not have access to information on the collections. Using estimation as a way to identify a collection's size was initially suggested by Liu et al. [2002], who introduce the capture-recapture method. This approach is based on the number of overlapping documents in two random samples taken from a collection: assuming that the actual size of the collection is N, if we sample a random documents from the collection and then sample (after replacing these documents) b documents, the size of the collection can be estimated as

    \hat{N} = \frac{ab}{c}

where c is the number of documents common to both samples. However, Liu et al. [2002] do not examine how to implement the proposed approach in practice. This is non-trivial; for example, it is unclear what the sample size should be, as it is not possible to accurately estimate the size of a collection of more than 1 000 000 documents using just two samples of 1 000 documents each. It is also far from obvious how a random sample might be chosen from a non-cooperative collection. Callan and Connell [2001] propose that QBS be used for this purpose. Here, an initial query is first selected from a standard list of common terms. For this and each subsequent query, a certain number of returned answer documents are downloaded, and the next query is selected at random from the current set of downloaded documents. QBS stops after downloading a predefined number of documents, usually 300 [Callan and Connell, 2001]. We have found that QBS rarely produces a random sample from the collection, which undermines the initial hypothesis of the capture-recapture method.
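For reference, a minimal sketch of the two-sample estimator whose randomness assumption is at stake follows (our own illustration, not code from the paper; the simulated collection and sampling routine are stand-ins):

    import random

    def capture_recapture(sample_a, sample_b):
        """Two-sample capture-recapture estimate: N_hat = a * b / c, where a and b
        are the sample sizes and c is the number of documents common to both samples."""
        c = len(set(sample_a) & set(sample_b))
        if c == 0:
            return None  # no overlap observed; the estimate is undefined
        return len(set(sample_a)) * len(set(sample_b)) / c

    # Toy check against a simulated collection of 10 000 docids.
    collection = list(range(10_000))
    sample_a = random.sample(collection, 1_000)
    sample_b = random.sample(collection, 1_000)
    print(capture_recapture(sample_a, sample_b))  # close to 10 000 when samples are random

When the samples are produced by a biased process such as QBS, the overlap c tends to be inflated, pushing the estimate away from the true size; the extent of this effect is examined below.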
Figure 1 shows the distribution of document ids (docids) in two experiments on a subset of the WT10g collection described in Section 4; this collection has 11 341 documents. We extracted one hundred samples from the collection, once by selecting documents at random, and once using QBS. Each sample contained 300 documents. The horizontal axis shows how often the same document was sampled, while the vertical axis shows the number of documents sampled a given number of times. On the left, 1 005 documents did not appear in the samples, while, for example, 330 documents were repeated 6 times. The distribution of documents in our random sampling is similar to the Poisson distribution. In contrast, we observe that, for QBS, many documents appear in multiple samples; indeed, 3 documents appeared in all the samples, while 5 927 documents (52% of the total) did not appear in any of the samples. None of the documents in our random samples appeared in more than 11 samples; for ease of comparison, we omit data points related to documents that appeared in more than 11 QBS samples.

Clearly, the samples produced by QBS are far from random. Given the biases inherent in effective search engines (by design, some documents are preferred over others), this result is unsurprising. It does, however, mean that size estimation methods cannot assume that QBS samples are random. Naïve capture-recapture would be drastically inaccurate; the degree of inaccuracy in other methods is explored in our experiments.

An alternative to the capture-recapture methods is to use the distribution of terms in the sampled documents, as in the sample-resample (SRS) method [Si and Callan, 2003b]. SRS was introduced as a part of the ReDDE collection selection algorithm. ReDDE applies SRS to estimate the relevant document distribution among multiple collections. It selects suitable collections for each query based on a rank calculated as:

    \widehat{DistRank}(j) = \frac{\widehat{Rel}_q(j)}{\sum_i \widehat{Rel}_q(i)}    (1)

where \widehat{Rel}_q(j) is the estimated number of relevant documents in collection j, and is given by:

    \widehat{Rel}_q(j) = \sum_{d_i \in C_{j,samp}} P(rel \mid d_i) \times \frac{\hat{N}}{N_{C_{j,samp}}}    (2)

Here, P(rel | d_i) is the probability of relevance of document d_i; N_{C_{j,samp}} is the size of the sample obtained by query-based sampling for collection C_j; and \hat{N} is the estimated size of that collection, calculated by the SRS method. As in the KL-Divergence method (which we do not discuss here because of space limitations), the estimated size of the collection has a direct impact on its rank for collection selection, and inaccurate estimations will produce inferior results. ReDDE has proven effective at matching queries to suitable collections without training [Hawking and Thomas, 2005; Si and Callan, 2003a;b]. UUM [Si and Callan, 2004] has been reported to produce slightly higher precision than ReDDE for almost all kinds of queries; however, it needs prior training using a set of queries. Similarly, while methods based on anchor text [Hawking and Thomas, 2005] produce better results for homepage and named-page finding tasks, they are not as good for general queries; moreover, they require a huge crawled dataset that needs to be generated before collection selection starts. The performance of the KL-Divergence and ReDDE algorithms has been reported to be similar, with the latter outperforming the former marginally in most cases [Hawking and Thomas, 2005; Si and Callan, 2003a].
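Equations 1 and 2 are easier to see as a small routine; the following sketch is our own illustration (the dictionaries and the relevance-probability function are hypothetical stand-ins for ReDDE's internals):

    def redde_ranks(samples, est_sizes, p_rel):
        """Rough sketch of Equations 1 and 2.

        samples   -- {collection id: list of sampled docids}
        est_sizes -- {collection id: estimated collection size N_hat}
        p_rel     -- function mapping a sampled docid to P(rel | d_i) for the query
        """
        rel = {}
        for j, sample in samples.items():
            scale = est_sizes[j] / len(sample)                  # N_hat / N_{Cj,samp}
            rel[j] = sum(p_rel(d) for d in sample) * scale      # Equation 2
        total = sum(rel.values())
        return {j: (r / total if total else 0.0) for j, r in rel.items()}  # Equation 1

The scale factor makes the dependence on the estimated collection size explicit: an inflated or deflated \hat{N} directly inflates or deflates the collection's rank.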
Assuming that QBS [Callan and Connell, 2001] produces good random samples, the distribution of terms in the samples should be similar to that in the original collection. For example, if the document frequency of a particular term t in a sample of 300 documents is d_t, and the document frequency of the term in the collection is D_t, the collection size can be estimated by SRS as

    \hat{N} = \frac{300 \, D_t}{d_t}

This method involves analyzing the terms in the samples and then using these terms as queries to the collection. The approach relies on the assumption that the document frequency of the query terms will be accurately reported by each search engine. Even when collections do provide the document frequency, these statistics are not always reliable [Anagnostopoulos et al., 2005]. In addition, as we have seen, the assumption of randomness in QBS is questionable. Compared to capture-recapture, the additional costs of SRS are significant: the documents must be fetched and a great many queries must be issued. In practice, SRS uses the documents that are downloaded as collection representation sets. In the absence of such documents, SRS cannot estimate collection size unless it downloads new samples. For capture-recapture, only the document identifiers or docids are required for estimating the size.

Figure 1: Distributions of docids in random (left) and query-based (right) samples.

Si and Callan [2003b] investigated the accuracy of SRS on relatively small collections extracted from the TREC newswire data. However, they did not investigate how well SRS estimates the size of larger collections. We test our methods on collections extracted from the TREC WT10g [Bailey et al., 2003] and GOV [Craswell and Hawking, 2002] data sets, which are more similar to current Web collections. SRS has been used in other papers for estimating the size of collections. Karnatapu et al. [2004] extend the work of Si and Callan [2003b] to find independent terms from the samples, focusing on improving sample quality rather than collection size estimation; this work has the same limitations as that of Si and Callan [2003b]. In other papers [Gravano et al., 2003; Ipeirotis and Gravano, 2002; Ipeirotis et al., 2001], researchers have used QBS and SRS for classifying hidden-web collections. Hawking and Thomas [2005] also used SRS to estimate the size of collections. In all these papers, the size of the collection has been used as an important parameter in collection selection. A sketch of the SRS estimate is given below.
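In code, the SRS estimate reduces to a couple of lines. The sketch below is our own illustration (term names and frequencies are invented), and it assumes the collection's reported document frequencies can be trusted, which, as noted above, is often not the case:

    def srs_estimate(sample_df, collection_df, sample_size=300):
        """Sketch of the sample-resample estimate.

        For a single probe term t, N_hat = D_t * sample_size / d_t, where d_t is the
        document frequency of t in the sample and D_t the frequency reported by the
        collection.  With several resample terms the counts are pooled (cf. Equation 11
        in Section 4.2): N_hat = sample_size * sum(D_i) / sum(d_i).
        """
        d_sum = sum(sample_df.values())                       # d_i, measured in the sample
        D_sum = sum(collection_df[t] for t in sample_df)      # D_i, reported by the collection
        return sample_size * D_sum / d_sum

    # e.g. a term appearing in 12 of 300 sampled documents, with a reported
    # collection document frequency of 4 000, yields 300 * 4000 / 12 = 100 000.
    print(srs_estimate({"health": 12}, {"health": 4_000}))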
3. CAPTURE-RECAPTURE METHODS

In previous work on using estimated collection size, the accuracy of the estimate has not been investigated in detail. We propose the use of empirical evidence, using collections of various sizes and characteristics, to evaluate how well such approaches estimate collection size. However, accurate estimation of collection size remains an open research question. The challenge is to develop methods for estimating collection size that are robust in the presence of bias, and that do not rely on estimates of term frequencies returned by collections. We present and evaluate two alternative approaches for estimating the size of collections. These are inspired by the mark-recapture techniques used in ecology to estimate the population of a particular species of animal in a region [Sutherland, 1996]. We adapt the mark-recapture approach to collection size estimation, using the number of duplicate documents observed within different samples to estimate the size of the collection.

3.1 Multiple Capture-Recapture Method

In the standard capture-recapture technique, a given number of animals is captured, marked, and released. After a suitable time has elapsed, a second set is captured; by inspecting the intersection of the two sets, the population size can be estimated. Assume we have a collection of some unknown size, N. For a numerically-valued discrete random variable X, with sample space \Omega and distribution function m(x), the expected value E(X) is:

    E(X) = \sum_{x} x \, m(x)    (3)

where x is a possible value for X in the sample space. We first collect a sample, A, with k documents from the collection. We then collect a second sample, B, of the same size. If the samples are random, the likelihood that any document in B was previously in A is k/N. The likelihood of observing i duplicate documents between two random samples of size k is:

    m(i) = \binom{k}{i} \left(\frac{k}{N}\right)^i \left(1 - \frac{k}{N}\right)^{k-i}    (4)

The possible values for the number of duplicate documents are 0, 1, 2, ..., k, thus:

    E(X) = \sum_{i=0}^{k} i \cdot m(i) = \sum_{i=0}^{k} i \binom{k}{i} \left(\frac{k}{N}\right)^i \left(1 - \frac{k}{N}\right)^{k-i} = \frac{k^2}{N}    (5)

This method can be extended to a larger number of samples to give multiple capture-recapture (MCR). Using T samples, the total number of pairwise duplicate documents D should be:

    D = \binom{T}{2} E(X) = \frac{T(T-1)}{2} E(X) = \frac{T(T-1)k^2}{2N}    (6)

This result is similar to the traditional capture-recapture method for two samples of size k. By gathering T random samples from the collection and counting duplicates within each sample pair, the expected size of the collection is:

    \hat{N} = \frac{T(T-1)k^2}{2D}    (7)
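The MCR estimator of Equation 7 can be written directly from the counts it needs; a minimal sketch follows (our own illustration; the input format is an assumption):

    from itertools import combinations

    def mcr_estimate(samples):
        """Multiple capture-recapture estimate (Equation 7).

        samples -- a list of T sets of docids, each of size k, ideally drawn
                   uniformly at random (with replacement between samples).
        """
        T = len(samples)
        k = len(samples[0])
        # D: duplicate documents summed over all pairs of samples
        D = sum(len(a & b) for a, b in combinations(samples, 2))
        if D == 0:
            return float("inf")  # no overlap seen; the collection looks unboundedly large
        return T * (T - 1) * k * k / (2 * D)

With genuinely random samples, the estimate approaches the true size as T grows, which is the behaviour reported for Figure 2 below.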
3.2 The Schumacher-Eschmeyer Method

Capture-recapture is one of the oldest methods used in ecology for estimating population size. An alternative, introduced by Schumacher and Eschmeyer [1943], uses T consecutive random samples with replacement, and considers the capture history. We call this the CH method. Here,

    \hat{N} = \frac{\sum_{i=1}^{T} K_i M_i^2}{\sum_{i=1}^{T} R_i M_i}    (8)

where K_i is the total number of documents in sample i, R_i is the number of documents in sample i that were already marked, and M_i is the number of marked documents gathered so far, prior to the most recent sample. For example, assume that five consecutive random samples, each containing ten documents, are gathered with replacement from a collection of unknown size. The values for the parameters in Equation 8 are calculated at each step as shown in Table 1. The estimated size of the collection after observing the fifth sample would be 25410/101, or approximately 251 documents.

Table 1: Using the capture-history method to estimate the size of a collection with 5 samples of 10 documents each.

    Sample        1      2      3      4      5
    K_i          10     10     10     10     10
    R_i           0      1      1      0      2
    M_i           0     10     19     28     36
    K_i M_i^2     0   1000   3610   7840  12960
    R_i M_i       0     10     19      0     72
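A minimal sketch of the CH estimator in Equation 8 follows (our own illustration; in practice the per-sample counts would be derived from the docids returned by probe queries):

    def capture_counts(samples):
        """Derive the K_i, R_i and M_i statistics of Table 1 from a sequence of docid samples."""
        marked, K, R, M = set(), [], [], []
        for sample in samples:
            sample = set(sample)
            K.append(len(sample))
            M.append(len(marked))            # marked before this sample
            R.append(len(sample & marked))   # recaptured in this sample
            marked |= sample
        return K, R, M

    def ch_estimate(K, R, M):
        """Schumacher-Eschmeyer (capture-history) estimate, Equation 8."""
        num = sum(k * m * m for k, m in zip(K, M))
        den = sum(r * m for r, m in zip(R, M))
        return num / den if den else float("inf")

    # The worked example of Table 1 (5 samples of 10 documents each):
    print(ch_estimate(K=[10] * 5, R=[0, 1, 1, 0, 2], M=[0, 10, 19, 28, 36]))
    # -> 25410 / 101, roughly 251 documents

Running the last lines reproduces the worked estimate of roughly 251 documents given above.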
Figure 2 illustrates the accuracy of these algorithms for estimating the size of a collection of 301 681 documents, using 160 samples of 100 documents. Documents are selected randomly. We observe that both methods produce close estimates after about 80 samples.

Figure 2: Performance of the CH and MCR algorithms (T = 160, k = 100, N = 301 681) when documents are selected at random.

In practice, it is not possible to select random documents from non-cooperative collections. These are accessible only via queries; we have no access to the index, and it is meaningless to request documents with particular docids. In the following section, we show how we can apply these algorithms to the estimation of the size of collections in non-cooperative distributed environments, and propose a correction to the CH method to compensate for the bias inherent in sampling via QBS.

4. CAPTURE METHODS FOR TEXT

The CH and MCR methods both require random samples of the collection to produce correct results. In practice, however, generating random samples by random queries is subject to biases. For example, some documents are more likely to be retrieved for a wide range of queries, and some might never appear in the results [Garcia et al., 2004]. Moreover, long documents are more likely to be retrieved, and there could be other biases in the collection ranking functions. We tested the algorithms on collections with different ranking functions and found similar estimations; longer documents are more likely to be returned, with a similar skew for both the Okapi BM25 and cosine [Baeza-Yates and Ribeiro-Neto, 1999] ranking functions. Figure 3 illustrates the performance of the CH algorithm for estimating the size of the same collection discussed in Figure 2. As can be seen, the size of the collection is significantly underestimated across a range of ranking parameters. To partially overcome the bias, we could keep only the first document returned for each query; however, as previously noted by Agichtein et al. [2003], this would require thousands of queries, and is therefore impractical.

Figure 3: Effect of search engine type on estimates of collection size using the CH method (Okapi BM25 with k1 = 0.3; cosine with pivots of 0.3 and 0.5; samples of 100 documents each).

4.1 Data Collections

We develop our approach and evaluate competing techniques using distinct training and test sets. Each set comprises different collections of varying sizes, as detailed in Table 2. The first three training collections each represent a subset of TREC WT10g [Bailey et al., 2003]. GOV-7 is a two-gigabyte collection extracted from the TREC crawl of the .gov domain. DATELINE 509 is a subset of TREC newswire data created from Associated Press articles [D'Souza et al., 2004]. The other training collections are different subsets of TREC WT10g, with different sizes. In the test set, GOV-123456 is a 12 GB subset of the TREC .GOV collection, and LATimes consists of news articles extracted from TREC Disk 5 [Voorhees and Harman, 2000]. GOV-4 was also extracted from the TREC .GOV collection. The rest are different subsets of the TREC WT10g collection. The largest collections contain more than eight hundred thousand documents. Considering that the largest crawled server in the TREC GOV2 collection has fewer than 720 000 documents, this upper limit seems reasonable for our experiments.

Table 2: Properties of data collections used for training and testing.

    Training Collections   Size      Testing Collections   Size
    WT10g-456              817 025   GOV-123456            807 774
    WT10g-4                304 035   WT10g-12              589 094
    WT10g-6                218 489   WT10g-1               301 681
    GOV-7                  133 834   WT10g-3               290 175
    WT10g5-125k            127 375   LATimes               138 896
    WT10g5-75k              75 227   GOV-4                 136 176
    DATELINE 509            30 507   WT10g1-55k             55 658

4.2 Compensating for Selection Bias

We propose that the capture methods be modified to compensate for the biases discussed earlier. To calculate the amount of bias, we compare the estimated and actual collection sizes using the training set. We create random samples by passing 5 000 single query terms to each collection and collecting the top N answers for each query. The choice of 5 000 is dictated by the daily limit of the Yahoo search engine developer kit (http://developer.yahoo.net) and seems a practical number of queries for sampling common collections. We selected 10 as a suitable value for N.

The query terms themselves should be chosen with care. Since it is hard to choose specific queries for each collection, query terms should be general enough to return answers for almost all types of collections. The queries should ideally be independent of each other. To make the system more efficient, we should also avoid query terms that are unlikely to return any answers. We experimented with selecting query terms at random from the Excite search engine query logs [Jansen et al., 2000], and from the index of 290 175 web pages. We found that using the query log led to poor results; we conjecture that this is due to the limited breadth of popular query topics, so random terms from query logs are less likely to be independent. In preliminary experiments, we found that index terms with low document frequency failed to match any documents in some collections. To avoid this, we eliminate terms occurring in fewer than 20 documents.

Our initial results suggested that, in all experiments, the CH and MCR algorithms underestimate the actual collection sizes at roughly predictable rates, due to the biases discussed earlier. To achieve a better estimation, we approximated the correlation between the estimated and actual collection sizes by the regression equations below; we refer to the corrected methods as MCR-Reg and CH-Reg respectively:

    \log(\hat{N}_{MCR}) = 0.5911 \times \log(N) + 1.5767,    R^2 = 0.8226    (9)
    \log(\hat{N}_{CH})  = 0.6429 \times \log(N) + 1.4208,    R^2 = 0.9428    (10)

where \hat{N} is the estimated size obtained by the methods, and N is the actual collection size. The R^2 values indicate how well the regression fits the data.

We used 25 single-word resample queries for SRS to estimate the size of collections:

    \hat{N}_{SRS} = \frac{\sum_i |S| \, D_i}{\sum_i d_i}    (11)

where |S| is the sample size, and D_i and d_i are the document frequencies of query term i in the collection and the sample respectively.
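Since Equations 9 and 10 express the raw estimate as a function of the true size, applying the correction amounts to inverting the fitted line. The sketch below is our own reading of that step (the paper does not give an implementation, and the base of the logarithm is an assumption):

    import math

    # Fitted on the training collections: log(N_hat_raw) = slope * log(N) + intercept.
    REGRESSION = {
        "MCR": (0.5911, 1.5767),   # Equation 9, R^2 = 0.8226
        "CH":  (0.6429, 1.4208),   # Equation 10, R^2 = 0.9428
    }

    def corrected_size(raw_estimate, method="CH"):
        """Invert the fitted line to map a raw (biased) MCR or CH estimate onto a
        corrected size, i.e. the MCR-Reg / CH-Reg value.  Note: the paper does not
        state the base of the logarithm; the base used here must match the fit."""
        slope, intercept = REGRESSION[method]
        return math.exp((math.log(raw_estimate) - intercept) / slope)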
5. COMPARISON AND RESULTS

We probed the test collections using varying numbers of single-term queries. The answers for each query are a small sample picked from the collection. Using the number of duplicate documents within these samples, we estimate the size of each collection. We tested the performance of the MCR-Reg and CH-Reg algorithms for different numbers of queries, from 100 to 5 000, on selected collections from the test set. Increasing the number of queries generally improves the estimation slightly. Although the performance might continue to increase beyond this range, as discussed previously, 5 000 appears to be a practical upper limit. We omit the results here for brevity. The bars in Figure 4 depict the performance of the three algorithms on various collections for 5 000 queries. The CH method produces accurate approximations, while MCR usually underestimates the collection size. The SRS method produces good approximations for some collections, but considerably overestimates the size for other collections.

Figure 4: Performance of size estimation algorithms.

To evaluate the estimates of the different methods, we define the estimation error as:

    Error = \frac{Estimated\ Size - Actual\ Size}{Actual\ Size} \times 100    (12)

A negative estimation error indicates underestimation of the real size. The lower the absolute estimation error, the more accurate the estimation algorithm. Table 3 shows the estimation error of each algorithm for the collections of Figure 4. The CH-Reg method has the lowest error rate; it gives the best approximation for all collections except WT10g-1 and LATimes. SRS has a small advantage over CH-Reg for only the LATimes collection. The CH-Reg method produces accurate approximations for collections of different sizes. There is no significant difference between MCR-Reg and SRS; the former generally produces better results and is more robust. Since MCR-Reg produces poorer results than CH-Reg, we do not discuss it further in this paper.

Table 3: Estimation errors for size estimation methods. All values are percentages.

    Name          CH-Reg   MCR-Reg    SRS
    GOV-123456     -33.7     -59.9   542.4
    WT10g-12         9.6      35.0   -62.0
    WT10g-1         29.2      -9.7   -56.3
    WT10g-3          8.1      66.0   -50.6
    LATimes         77.6      52.5   -45.2
    GOV-4          -29.7     -56.7   885.8
    WT10g1-55k     -29.9     -78.8   -40.8

The SRS method requires lexicon statistics (document frequencies) to estimate the size of a collection. Therefore, while the MCR and CH methods only require the document identifiers (or, in practice, the URL of the answer document), SRS requires that many documents be downloaded. In collection selection algorithms, these downloaded documents can also be used as collection representation sets. Table 4 shows the number of docids returned from each collection for 5 000 queries.

Table 4: The total number of docids returned for 5 000 queries in each experiment.

    Collection    Total number of docids   Unique docids
    GOV-123456                    43 491          36 699
    WT10g-12                      45 367          38 805
    WT10g-1                       43 276          35 294
    WT10g-3                       49 900          38 686
    LATimes                       30 359          25 968
    GOV-4                         36 605          24 329
    WT10g1-55k                    33 756          17 920

Although answering 5 000 queries seems feasible for many current search engines, downloading the answer documents might be impractical. Thus, we also compare the methods using a smaller number of queries (that is, a smaller sample size for SRS). A typical QBS sample size has been reported to be 300 documents [Callan and Connell, 2001]; the same number was suggested by Si and Callan [2003b] for SRS. Therefore, we compare the performance of the CH method with that of SRS using 300-document samples. On average, 60 queries were required to gather 300 documents by query-based sampling from our test collections (each query returns at most 10 documents). We used the Lemur toolkit (www.lemurproject.org) for all experiments reported in this paper. Assuming that the cost of viewing a result page and downloading an answer document are the same, and considering that we used 25 resample queries, this involves 385 interactions: 60 sampling queries, 300 result pages, and 25 resample queries. This is exactly equal to what has been reported by Si and Callan [2003b]. The CH method downloads only the result page for each query.
Since indexing costs are negligible for small numbers of queries and documents, the cost of a typical SRS method for our test collections is at least as high as the cost of a CH method that uses 385 single-term queries. Table 5 compares the performance of the alternative schemes. For 385 queries (300-document samples for SRS), the CH method outperforms SRS in 5 of 7 instances. SRS produces more accurate estimations for the GOV collections. For the smallest collection, they both produce similarly poor estimates. Interestingly, for 5 of 7 collections the CH method remains accurate with fewer queries, producing better results with 140 queries than the SRS method based on 100 documents. On average, 15 probe queries were sufficient to extract 100 documents from the test collections. Taking the costs of the 25 resampling queries and downloading 100 documents into account, the total cost is equivalent to a CH method using 140 queries. Generally, the CH method with 140 queries produces more accurate estimates than SRS with 385 queries. The average of the absolute estimation errors on the test collections for 385 queries is 57.42% for SRS and 44.85% for the CH method. For 140 queries, these values are 66.14% and 41.28% respectively. Moreover, the SRS method is limited to collections that have a search interface providing the document frequency of query terms, and is dependent on the accuracy of this information. The CH methods do not rely on values provided by the collection search interface, and produce better approximations using about one-third as many queries.

Table 5: Estimates given by the CH and SRS methods.

    Collection    SRS (300 documents)   CH-Reg (385 queries)   SRS (100 documents)   CH-Reg (140 queries)   Actual Size
    GOV-123456                430 034                338 845               361 505                858 368       807 774
    WT10g-12                  102 527                381 403                93 158                146 895       589 094
    WT10g-1                    85 261                215 005                63 045                266 686       301 681
    WT10g-3                    89 235                167 757                71 847                 41 896       290 175
    LATimes                    57 005                 91 766                52 190                122 168       138 896
    GOV-4                     123 090                 62 699                74 721                103 717       136 176
    WT10g1-55k                 18 764                 18 934                20 404                 12 506        55 658

Impact of size estimation on collection selection. We analysed the impact of size estimation on collection selection algorithms on five well-known DIR testbeds: trec123-100col-bysource, relevant, nonrelevant and representative. The details and characteristics of these testbeds are discussed elsewhere [Si and Callan, 2003b; 2004]. We used ReDDE [Si and Callan, 2003b] as the collection selection algorithm in our experiments. We only selected one collection for each query; therefore, the type of merging algorithm does not affect the final results. ReDDE applies the estimated collection size values to approximate the number of relevant documents in each collection. For our experiments, we estimated the size of collections with both the SRS and CH-Reg methods. We also compared the results with a capture-history method that does not use Equation 10 for correction (CH). Except for the relevant testbed, our evaluation results showed that using the CH and CH-Reg methods led to higher precision values than SRS. For brevity, we only show the results for the nonrelevant testbed, in Table 6. We use the t-test to measure the statistical differences between the methods; improvements over the corresponding SRS values are significant at the 0.90 confidence level, but none are significant at the 0.95 level. These results show that using a more accurate size estimation algorithm such as capture-history, even without correction, can improve the effectiveness of collection selection and thus of the final ranking.

Table 6: The impact of size estimation algorithms on the precision at different recall points for the Non-relevant testbed. TREC topics 51-100 used as queries. One collection is selected per query.

              P@5      P@10     P@15     P@20
    SRS       0.1224   0.1082   0.0980   0.0959
    CH        0.1755   0.1551   0.1483   0.1459
    CH-Reg    0.1918   0.1653   0.1578   0.1531

6. CONCLUSIONS

We have described new methods for estimating the size of collections in non-cooperative distributed environments. These methods are an application of capture-recapture and capture-history methods to the problem of size estimation; by introducing an explicit (though ad hoc) compensation for the consistent biases in sampling from search engines, we have considerably improved the estimates that these methods provide. We have presented evaluation results across several collections; to our knowledge these are the first measurements of such size estimation methods. These experiments show that the capture-history method significantly outperforms the other algorithms in most cases and is robust to variations in collection size. In contrast to the sample-resample methods used in previous work, our proposed algorithm does not rely on the document frequency of query terms being returned by search engines, and does not need to download and index the returned results of probe queries, significantly reducing resource and bandwidth consumption. We also showed that the capture-history method is less expensive and more accurate than sample-resample, producing more accurate estimates while using fewer probe queries. In addition, we evaluated the impact of using a better size estimator on collection selection. The results show that the effectiveness of collection selection algorithms can be improved by the use of more accurate estimations of collection size.

Several interesting research problems remain. In particular, the choice of ranking measure produces biases that could be factored into the estimation process. Our experiments used the Okapi BM25 and cosine similarity measures; similarity functions used in commercial search tools incorporate measures such as PageRank [Brin and Page, 1998] and weighting for anchor text [Craswell et al., 2001], as well as making use of other forms of evidence such as click-through data. Therefore, our techniques are not, in their present form, suitable for estimating the size of the vast collections held by web search engines. We plan to explore robustness across more diverse data and search engine types. The results we have presented in this work indicate that our methods represent progress towards this goal.

References

Agichtein, E., Ipeirotis, P. G., and Gravano, L. (2003). Modeling query-based access to text databases. In International Workshop on Web and Databases, pages 87-92, San Diego, California.
Anagnostopoulos, A., Broder, A. Z., and Carmel, D. (2005). Sampling search-engine results. In Proceedings of the 14th International Conference on the World Wide Web, pages 245-256, Chiba, Japan.
Baeza-Yates, R. A. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA.
Bailey, P., Craswell, N., and Hawking, D. (2003). Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing and Management, 39(6):853-871.
Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International Conference on the World Wide Web, pages 107-117, Brisbane, Australia.
Callan, J. and Connell, M. (2001). Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2):97-130.
Craswell, N. and Hawking, D. (2002). Overview of the TREC-2002 Web Track. In Proceedings of TREC-2002, Gaithersburg, Maryland.
Craswell, N., Hawking, D., and Robertson, S. (2001). Effective site finding using link anchor information. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 250-257, New Orleans, Louisiana.
D'Souza, D., Thom, J., and Zobel, J. (2004). Collection selection for managed distributed document databases. Information Processing and Management, 40(3):527-546.
Garcia, S., Williams, H. E., and Cannane, A. (2004). Access-ordered indexes. In Proceedings of the 27th Australasian Computer Science Conference, pages 7-14, Darlinghurst, Australia.
Gravano, L., Ipeirotis, P. G., and Sahami, M. (2003). QProber: A system for automatic classification of Hidden-Web databases. ACM Transactions on Information Systems, 21(1):1-41.
Hawking, D. and Thomas, P. (2005). Server selection methods in hybrid portal search. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 75-82, Salvador, Brazil.
Ipeirotis, P. G. and Gravano, L. (2002). Distributed search over the hidden Web: Hierarchical database sampling and selection. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 394-405, Hong Kong, China.
Ipeirotis, P. G. and Gravano, L. (2004). When one sample is not enough: improving text database selection using shrinkage. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 767-778, Paris, France.
Ipeirotis, P. G., Gravano, L., and Sahami, M. (2001). Probe, count, and classify: categorizing Hidden Web databases. ACM SIGMOD Record, 30(2):67-78.
Jansen, B. J., Spink, A., and Saracevic, T. (2000). Real life, real users, and real needs: a study and analysis of user queries on the Web. Information Processing and Management, 36(2):207-227.
Karnatapu, S., Ramachandran, K., Wu, Z., Shah, B., Raghavan, V., and Benton, R. (2004). Estimating size of search engines in an uncooperative environment. In Workshop on Web-based Support Systems, pages 81-87, Beijing, China.
Liu, K., Yu, C., and Meng, W. (2002). Discovering the representative of a search engine. In Proceedings of the 11th ACM CIKM International Conference on Information and Knowledge Management, pages 652-654, McLean, Virginia.
Powell, A. L. and French, J. (2003). Comparing the performance of collection selection algorithms. ACM Transactions on Information Systems, 21(4):412-456.
Schumacher, F. X. and Eschmeyer, R. W. (1943). The estimation of fish populations in lakes and ponds. Journal of the Tennessee Academy of Science, 18:228-249.
Si, L. and Callan, J. (2003a). The effect of database size distribution on resource selection algorithms. In Proceedings of the SIGIR 2003 Workshop on Distributed Information Retrieval, pages 31-42, Toronto, Canada.
Si, L. and Callan, J. (2003b). Relevant document distribution estimation method for resource selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 298-305, Toronto, Canada.
Si, L. and Callan, J. (2004). Unified utility maximization framework for resource selection. In Proceedings of the 13th ACM CIKM Conference on Information and Knowledge Management, pages 32-41, Washington, D.C.
Si, L., Jin, R., Callan, J., and Ogilvie, P. (2002). A language modeling framework for resource selection and results merging. In Proceedings of the 11th ACM CIKM International Conference on Information and Knowledge Management, pages 391-397, New York, NY.
Sutherland, W. J. (1996). Ecological Census Techniques. Cambridge University Press.
Voorhees, E. M. and Harman, D. (2000). Overview of the sixth Text REtrieval Conference (TREC-6). Information Processing and Management, 36(1):3-35.