Dynamic Test Collections: Measuring Search Effectiveness on the Live Web

Ian Soboroff
National Institute of Standards and Technology
Gaithersburg, Maryland
ian.soboroff@nist.gov

ABSTRACT

Existing methods for measuring the quality of search algorithms use a static collection of documents. A set of queries and a mapping from the queries to the relevant documents allow the experimenter to see how well different search engines or engine configurations retrieve the correct answers. This methodology assumes that the document set, and thus the set of relevant documents, is unchanging. In this paper, we abandon the static collection requirement. We begin with a recent TREC collection created from a web crawl and analyze how the documents in that collection have changed over time. We determine how decay of the document collection affects TREC systems, and present the results of an experiment using the decayed collection to measure a live web search system. We employ novel measures of search effectiveness that are robust despite incomplete relevance information. Lastly, we propose a methodology of "collection maintenance" which supports measuring search performance both for a single system and between systems run at different points in time.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software--Performance evaluation; H.3.5 Online Information Services

General Terms: Experimentation, Measurement

Keywords: retrieval test collections

(This paper is authored by employees of the United States Government and is in the public domain. SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.)

1. INTRODUCTION

Search is the most popular Internet application after email. The proliferation of information available on the web makes search a critical application. The emergence of the web as the world's dominant information environment has created a surge of interest in search, and consequently important advances in search technology. However, it is difficult to measure the effectiveness of web search algorithms because our current methodologies assume that the document collection does not change.

The dominant evaluation procedure is known as the Cranfield or test collection methodology. A test collection consists of a set of documents, a set of information need descriptions (possibly including actual queries), and a mapping of needs to the documents that are relevant to them. In response to each information need, a query is formulated and documents are retrieved from the collection using two (or more) search algorithms. The results of each search are examined to see which documents are relevant and which are not. If significant and noticeable differences in effectiveness are observed, and the differences are consistent across multiple test collections, we can conclude that one search algorithm is better than another.

The Cranfield methodology is so named after the first formalized measurements of search systems conducted by Cleverdon at the College of Aeronautics at Cranfield [12]. It was subsequently refined by many, most notably by Sparck Jones and van Rijsbergen in 1975, and in recent years in the scope of the Text REtrieval Conferences (TREC) [21]. In TREC parlance, the information needs are called "topics", and the mapping of topics to relevant documents is called the "relevance judgments" or "qrels". A "run" is the set of documents retrieved by some search algorithm for each of the topics in the test collection.

The Cranfield paradigm makes several assumptions in order to simplify and operationalize the measurement of search effectiveness. The assumption that we are chiefly concerned with in this paper is that the document collection is static with respect to the runs being measured. A second assumption that we address is that the relevance judgments are complete: all documents are judged with respect to all topics. The TREC pooling process has shown that judgments need not be complete in order to accurately measure the relative performance of two or more systems [22, 19]. Since we are interested in the problem of collection decay, where our document collection and relevance judgments evolve out from under us, we will focus on measures which do not rely on the completeness assumption at all, such as bpref [7].

The Cranfield paradigm further assumes that the information needs (and thus the relevance judgments) are also static, so that, for example, if one wishes to measure the quality of a "find more like this" facility, it is assumed that the initial set of search results does not change the user's definition of what is relevant. Other assumptions include the notion that the search process can be abstracted away from such vital system details as how queries are created and how results are presented to or indeed used by the end user. In the present work we retain these assumptions as part of the experimental design.

The phenomenon of change and decay on the web has been well studied. Cho and Garcia-Molina tracked 720,000 web pages daily over the course of four months in order to specify design choices for an incremental crawler [9]. Fetterly et al. expanded on that study, tracking more than 150 million web pages weekly over 11 weeks and also looking at content changes within pages [14]. Ntoulas et al. crawled 154 different complete sites weekly, and examined change in linkage, changes in page content, and new pages being created [15]. With respect to information disappearing on the web, Bar-Yossef et al. looked closely at soft and hard 404 errors, and proposed models of web decay based on a Markov chain model of dead link propagation inspired by PageRank [3]. These studies all examine general web crawls in order to understand change on the web as a whole. Brewington and Cybenko looked at the change rates of pages specifically requested by users of a web clipping service [5]. This last work is the closest to what we have done here, except that we are concerned with pages that are relevant to particular search topics. We are also particularly concerned with change specifically as it affects measurements of search quality.

2. PROBLEM DEFINITION

There are many challenges to measuring search effectiveness on the web using the test collection methodology. We believe the key challenge is the requirement of a static document collection. Holding the document collection fixed allows for straightforward reproducibility of results, but is a limiting requirement when we wish to measure search on the web. We propose allowing the document collection to change, while keeping the topics and the relevance judgments initially fixed.
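To keep the later discussion concrete, the following is a minimal Python sketch of the two objects the methodology manipulates most, a run and the relevance judgments (qrels), read from the standard whitespace-separated TREC file layouts. The function and variable names are illustrative, not part of any existing tool.

```python
from collections import defaultdict

def load_qrels(path):
    """Parse TREC-style qrels: 'topic iteration docno judgment' per line."""
    qrels = defaultdict(dict)            # topic -> {docno: judgment}
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, judgment = line.split()
            qrels[topic][docno] = int(judgment)
    return qrels

def load_run(path):
    """Parse a TREC run file: 'topic Q0 docno rank score tag' per line."""
    ranked = defaultdict(list)           # topic -> [(rank, docno), ...]
    with open(path) as f:
        for line in f:
            topic, _q0, docno, rank, _score, _tag = line.split()
            ranked[topic].append((int(rank), docno))
    # return docnos in rank order per topic
    return {t: [d for _, d in sorted(pairs)] for t, pairs in ranked.items()}
```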
Specifically, consider that we have a set of topics and relevance judgments that were constructed for a collection of web pages, and we wish to measure a set of live web search algorithms using them. We would rather avoid making any new relevance judgments if at all possible. If the documents are allowed to change as they do on the live web, we must account for several possible cases: judged documents will change over time, new pages will appear that are not in the collection, and the runs being measured may be collected at different points in time.

First, documents in the collection which have already been judged are likely to have changed, or may no longer exist at all. A relevant document which no longer exists on the web is certainly no longer relevant. If it still exists but its content has changed, we might compare the new document's similarity to the judged document using standard IR similarity measures or a near-duplicate-document measure [6, 10, 4]. In this paper, we choose a simpler strategy which highlights the limits of our approach, and assume that a changed document is no longer relevant until we devote resources to judging it anew. For documents judged not relevant, we assume that the page remains irrelevant to the topic even if it changes. We call a document valid if it has not changed since it was initially judged, or if we have re-examined the document and applied a new relevance judgment. A valid topic is one that has valid documents.

Needless to say, new web pages have come into existence since the initial relevance judgments were compiled, and some of these may be relevant. This can be more or less of a problem for evaluation depending on the timeliness of information desired by the searcher. Rather than make any guess about the relevance of new unjudged documents, we monitor how they are retrieved and how they might affect our determination of the relative effectiveness of the search engines being measured.

Lastly, the runs which we want to compare may have been executed at different times or on different web crawls. This is likely to be the case when we want to compare live web search engines, but consider also that we may wish to examine several parameter settings of an engine or group of engines on a single large web snapshot, using existing relevance judgments which predate the snapshot. In this paper, we measure a single search engine and are careful to collect our runs within a short period of time, but in general one can use our methodology to compare runs done at different times or compare multiple engines. In these cases, to maximize fairness the set of relevance judgments should be constrained to the intersection of valid documents, so that runs are compared over documents which they all have an equal opportunity to rank.

In the next section, we examine the decay of relevance information in an existing test collection. We then illustrate the effect of collection decay over time on our retrieval effectiveness measures using a set of TREC runs. We also present a small experiment measuring retrieval runs from a live web search engine using the decayed relevance judgments. The experiment motivates a maintenance regimen for test collections in order to measure search in dynamic collections.
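A minimal sketch of the validity rule just described, assuming we already know, for each judged document, the date of its first observed change (absent if it was never seen to change). All names and structures are illustrative.

```python
def decay_qrels(qrels, first_change, as_of):
    """Apply the pessimistic rule above: a judged-relevant document that has
    changed by `as_of` reverts to unjudged, while a judged-non-relevant
    document stays non-relevant even if it changes.

    qrels        -- {topic: {docno: judgment}} as loaded earlier
    first_change -- {docno: date of first observed change}
    as_of        -- the date at which the runs are being evaluated
    """
    decayed = {}
    for topic, judgments in qrels.items():
        kept = {}
        for docno, rel in judgments.items():
            changed = first_change.get(docno)
            if rel > 0 and changed is not None and changed <= as_of:
                continue                    # expired: drop until re-judged
            kept[docno] = rel
        if kept:
            decayed[topic] = kept
    return decayed
```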
3. DATA

In this paper we use the GOV2 collection from the TREC terabyte track [11]. The goal of the terabyte track is to scale information retrieval experiments beyond the gigabyte range typical today, and to study how that scaling affects the experimental methodology. The GOV2 collection is a fairly exhaustive crawl of US federal and state government web pages collected in the winter of 2003-4, and contains about 25 million web pages or about 468GB of text. (There is an associated 800GB of image and binary data which are not typically searched in retrieval experiments, but which are used when making relevance judgments.) According to Cho and Garcia-Molina [9], .gov pages tend to be much more static than those in other domains. Fetterly et al. [14] confirmed this and also found that, whereas longer pages generally change more often, this is not the case in .gov. Thus, the rates of change that we observe here are likely to be slower than on the web in general.

In TREC 2004 and 2005, two sets of fifty topics were created for the GOV2 collection. These topics are general informational searches, such as might be done by someone compiling a research report. They consist of a short title (often used as a query), a sentence-length description, and a narrative paragraph which defines what the user expects the search system to return. There are 99 topics numbered 701-800; topic 703 was dropped from the 2004 evaluation because no relevant documents were found for it. In those TREC cycles, research teams submitted dozens of runs consisting of the top 10,000 ranked results for each topic according to their search engines. The top hits from two runs from each group were collected into a pool for each topic and judged by the NIST assessor who created the topic. This process yielded 103,368 relevance judgments for these 99 topics. We use this combined topic set in order to maximize the number of usable topics after time is taken into account.

To gather the history of each judged page since the crawl was done, we consulted the Internet Archive (http://www.archive.org/). Using their Wayback Machine service, we downloaded page revisions since February 15th, 2004, the end date of the GOV2 crawl. Only 56,693 of the judged pages were present in the Wayback archives; we will presume, for lack of better information, that the others disappeared immediately after the GOV2 crawl. We obtained a total of 199,137 page versions, an average of 3.5 revisions per page. Of these, 15,676 page versions were reported as present in the archives but were not available due to system downtime. Since in this study we work with the timestamps alone, we did not worry about the content of these missing versions.

Figure 1 illustrates the "lifetimes" of topics 701-750 (the TREC 2004 topics), when we assume that a document becomes irrelevant the first time it changes. Each topic's line shows the number of unchanged relevant documents remaining each day. The longest line (topic 739) extends for 369 days. That is in fact the longest lifetime of all 99 topics and represents the extent of historical information available from the Internet Archive at the time of writing. Coverage is more complete for the first 280-300 days. Some topics are more volatile, with their relevant documents disappearing quickly, while others exhibit a more gradual drop-off. At the end of the change history, there are on average 38 relevant documents remaining per topic. The distribution of times between changes for both relevant and irrelevant pages is shown in Figure 2.
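The lifetime curves of Figure 1 (and the half-life figure quoted below) can be computed from the snapshot timestamps alone. A sketch, with illustrative structures:

```python
from datetime import timedelta

def topic_lifetime(rel_docs, first_change, crawl_end, days=369):
    """Per-topic curve as in Figure 1: the number of judged-relevant documents
    still unchanged d days after the end of the crawl.

    rel_docs     -- docnos judged relevant for one topic
    first_change -- {docno: date of first observed change}; documents never
                    observed to change are simply absent, and documents missing
                    from the archive entirely are mapped to crawl_end (they are
                    presumed to have disappeared immediately, as above).
    """
    curve = []
    for d in range(days + 1):
        cutoff = crawl_end + timedelta(days=d)
        alive = sum(1 for doc in rel_docs
                    if doc not in first_change or first_change[doc] > cutoff)
        curve.append(alive)
    return curve

def half_life(curve):
    """First day on which fewer than half the original relevant documents remain."""
    return next((d for d, n in enumerate(curve) if n < curve[0] / 2), None)
```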
The longest gap between changes that we observed was 387 days. 4.5% of gaps represent same-day changes; another 9.4% are one-day gaps. The average gap between changes is 62.23 days and the median is 49. This supports previous findings that .gov pages change slowly. There are also peaks around 60 and 120 days which are due to default page revisit policies in the Internet Archive crawls. According to the first-change heuristic, the number of relevant documents decreases below 50% of the original after 156 days for the average topic.

Figure 1: Timelines of the number of valid relevant documents for topics 701-750 of the TREC 2004 Terabyte test collection.

Figure 2: Histogram of time gaps between page changes.

4. MEASURES FOR DECAYED COLLECTIONS

The question of measures is critical when working with dynamic collections. In a sense, the relevance judgments are always incomplete, even more so than in a static test collection. Traditional retrieval metrics such as mean average precision (MAP), precision at the top 10 documents retrieved (P@10), and mean reciprocal rank (MRR) of the first relevant document depend completely on the ranks of the relevant documents which have been retrieved by the system; unjudged retrieved documents are considered to be irrelevant. In our situation here, where so many of the documents are unjudged due either to being outside the collection or to having changed since they were judged, such a measure would mostly indicate the sparsity of our relevance data rather than provide any comparative measure of the runs. Instead, we use a relatively new measure, bpref, to compare the runs, and consider carefully which documents we should try to judge in order to improve the picture.

The bpref measure, proposed by Buckley and Voorhees in 2004 [7], computes a preference relation of whether judged relevant documents are retrieved ahead of judged irrelevant documents. Thus, it is based on the relative ranks of judged documents only. The bpref measure is defined as

    bpref = \frac{1}{R} \sum_{r} \left( 1 - \frac{|n \text{ ranked higher than } r|}{\min(R, N)} \right)

where R is the number of judged relevant documents, N is the number of judged irrelevant documents, r is a relevant retrieved document, and n is a member of the first R irrelevant retrieved documents. (This definition of bpref corrects a bug in [7] and follows the actual implementation in trec_eval version 8.0; see the file bpref_bug in the trec_eval distribution for details.) Bpref can be thought of as the inverse of the fraction of judged irrelevant documents that are retrieved before relevant ones.
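A direct reading of this definition as code may help. The following is a minimal sketch, not the trec_eval implementation itself; it takes one topic's ranked list plus the sets of judged relevant and judged non-relevant documents, ignores unjudged documents entirely, and handles edge cases (such as topics with no judged non-relevant documents) simply, which may differ in detail from trec_eval.

```python
def bpref(ranked, judged_rel, judged_nonrel):
    """bpref for a single topic, per the definition above.

    ranked        -- docnos in retrieval order
    judged_rel    -- set of docnos judged relevant (R = size of this set)
    judged_nonrel -- set of docnos judged non-relevant (N = size of this set)
    Unjudged retrieved documents are skipped rather than counted as wrong.
    """
    R, N = len(judged_rel), len(judged_nonrel)
    if R == 0:
        return 0.0
    denom = min(R, N)
    nonrel_seen = 0            # judged non-relevant documents ranked so far
    total = 0.0
    for doc in ranked:
        if doc in judged_nonrel:
            nonrel_seen += 1
        elif doc in judged_rel:
            # only the first R judged non-relevant documents can penalize r
            penalty = min(nonrel_seen, R) / denom if denom else 0.0
            total += 1.0 - penalty
    return total / R
```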
Bpref and mean average precision are very highly correlated when used with complete judgments. But as judgments are degraded (in Buckley and Voorhees' study, by taking random samples of the judgments and the collection), rankings of systems by bpref still correlate highly with the original ranking, whereas rankings of systems by MAP do not [7].

To better understand how bpref and MAP behave as the collection decays, we examined TREC runs from the 2004 terabyte track at one-week intervals through the collection change history. This is a different approach than Buckley and Voorhees took, in that we are observing the real-world "downsampling" of the collection over time. Furthermore, we are able to compare bpref and MAP on the TREC terabyte collections, which were not available to them. Figure 3 shows the MAP and bpref scores for each of the 70 runs from TREC 2004 when measured against the qrels that remain valid each week. The absolute value of the score is not important, but the shape of each curve is. As the collection decays, MAP decreases. Bpref fluctuates somewhat, and actually increases as we lose more and more relevance information. This much is consistent with the findings of Buckley and Voorhees. At the end of the graphs, bpref drops sharply because we have only 103 judged documents in total remaining to be retrieved, out of 28,102 in week 1. This is a smaller percentage than Buckley and Voorhees examined.

Figure 3: MAP and bpref scores for the TREC 2004 runs, scored according to the qrels remaining after each week. Each subgraph shows a single run. Past week 41, insufficient judged documents were retrieved to use even bpref.

Even though bpref does show some fluctuations as relevance information decays, the relative ordering of systems according to bpref remains fairly close to the order in week 1. Figure 4 shows the correlation of weekly system rankings to the original ranking using Kendall's tau. Whereas the correlation using MAP falls below 0.9 at 18 weeks, the lowest correlation for bpref during the entire period is 0.91 at week 41. Thus, when the systems are compared using the bpref measure, we arrive at a consistent ordering despite severe decay in relevance data.

Figure 4: Kendall's tau correlation of weekly system rankings to the week 1 ranking. bpref agrees more closely than MAP with the original ranking as the collection degrades.

Note that using TREC runs to illustrate the effect of collection decay is anachronistic because the runs were performed on the GOV2 collection as it was initially compiled, and the decay we observe happens after this point. In an operational setting, the runs always come from a document collection that is more recent than the relevance judgments. In the next section we conduct just such an experiment.
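The ranking-stability analysis behind Figure 4 is straightforward to reproduce. A sketch, assuming a per-week table of system scores for a single measure (e.g., per-run bpref averaged over topics, computed with the earlier sketch) and using scipy.stats.kendalltau; the names here are illustrative.

```python
from scipy.stats import kendalltau

def weekly_rank_correlation(scores_by_week):
    """scores_by_week: {week: {run_id: score}} for one measure.
    Returns {week: Kendall's tau between that week's system ranking
    and the week-1 ranking}."""
    runs = sorted(scores_by_week[1])                   # fix a common run order
    week1 = [scores_by_week[1][r] for r in runs]
    taus = {}
    for week in sorted(scores_by_week):
        current = [scores_by_week[week][r] for r in runs]
        tau, _pvalue = kendalltau(week1, current)      # tau depends only on rank order
        taus[week] = tau
    return taus
```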
5. MEASURING SEARCH ENGINES

We ran a small experiment to test the methodology for measuring "live web" search using a popular web search engine. This experiment tests whether shorter or longer queries give better performance for the engine in question. We hypothesize that shorter queries will be more effective, since most search engines combine terms in a noisy-AND formula.

As stated above, the 99 TREC terabyte topics include a short title field and a sentence-length description field. For the short queries, we used the title field. For the long queries, we added description-field words which were not present in the title. Stop words were removed from both long and short queries. In some cases, the description text includes phrases in quotation marks; we retained these quoted phrases, since the search engine allows this as an operator, but removed most other punctuation. Long queries were limited to nine terms (counting quoted phrases as a single term). All queries included a search restriction requiring that hits come from .gov sites.

For each query, we attempted to retrieve the top 100 documents. This was a compromise between the limitations of the search engine API and the need for our rankings to go deep enough to find judged documents. For some queries the search engine returned fewer than 100 documents. For each search result, we checked to see if the URL corresponded to a document in the GOV2 collection, since these are the only documents for which we have judgments. If we were trying to measure multiple search engines or a single search engine over time, we would need to restrict ourselves to the intersection of retrieved documents between the engines in order to ensure a fair comparison.

We also observed a somewhat amusing phenomenon: if you search for TREC topics after the TREC conference cycle is completed, you tend to find the TREC topic file high up in the ranking. Fortunately, these pages are not in the collection and are thus ignored.

5.1 Determining valid topics

We first look at how many documents returned by the search engine occur in the collection, how many of those were judged, and how many were judged relevant. Table 1 shows that less than a third of retrieved documents are in the collection, one-half to two-thirds of these were judged, and one-third to one-half of these last are relevant. The long-query run finds somewhat fewer GOV2 documents and only half as many relevant documents as the short-query run, a clue that our hypothesis may turn out to be correct. These searches were conducted after the epoch of our historical data on the relevance judgments, but for simplicity we will assume that the revision data we have is current.

Recall that we assume pessimistically that the first change to a relevant page reverts the page to an unjudged status. We generate the set of relevance judgments that corresponds to pages that have remained unchanged. The rows labeled "in decayed qrels" in Table 1 indicate how many judged and relevant documents were retrieved against this reduced set. The short-query run retrieved no judged documents for topics 717, 770, 779, 793, and 796. The long-query run retrieved no judged documents for topics 702 (in fact, the search engine returned only one document for this query, and that hit was the TREC topic file), 705, 739, 750, 758, 763, 770, 779, 780, 787, and 800. We are left with 85 valid topics with which to compare the two runs, a good-sized topic set.

Table 1: Number of retrieved documents present in the collection, judged, and relevant for the two runs.

Run                    Retr.   In GOV2      Judged       Rel
short-q                9375    2616 (28%)   1800 (19%)   888 (9%)
  in decayed qrels                          1019 (11%)   107 (1%)
long-q                 9080    2318 (26%)   951 (10%)    425 (5%)
  in decayed qrels                          584 (6%)     58 (0.6%)
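A sketch of the bookkeeping behind Table 1 and the valid-topic count, reusing the illustrative structures from the earlier sketches. Calling it once with the original qrels and once with the decayed qrels gives the two rows of Table 1 for a run.

```python
def run_coverage(run, gov2_docnos, qrels):
    """Counts of retrieved documents that are in the collection, judged, and
    judged relevant, plus the topics this run cannot be scored on.

    run         -- {topic: [docno, ...] in rank order}
    gov2_docnos -- set of all docnos in the GOV2 collection
    qrels       -- {topic: {docno: judgment}} (original or decayed)
    """
    in_gov2 = judged = relevant = 0
    no_judged = []
    for topic, ranked in run.items():
        judgments = qrels.get(topic, {})
        judged_here = 0
        for doc in ranked:
            if doc in gov2_docnos:
                in_gov2 += 1
            if doc in judgments:
                judged += 1
                judged_here += 1
                if judgments[doc] > 0:
                    relevant += 1
        if judged_here == 0:
            no_judged.append(topic)
    return {"retrieved_in_gov2": in_gov2, "judged": judged,
            "relevant": relevant, "topics_with_no_judged": sorted(no_judged)}
```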
5.2 Per-topic results and analysis

Over the 85 topics, the short-query run has an average bpref of 0.0304, and the long-query run scores 0.0161, supporting our hypothesis that short queries perform better than long queries for this search engine. However, despite having 85 valid topics, we still have very little data with which to measure these two runs. The short-query run has on average only 11 judged and 1.2 relevant documents per topic; the long-query run, 6.7 judged and 0.7 relevant. Furthermore, the short-query run finds no known relevant documents for 40 of these topics, and the long-query run finds none for 53. Even though bpref is designed for our degraded-collection scenario, by using topics with only one or two judged documents we are forcing bpref into its corner case, and thus we should read it with care.

In Table 2, we focus on the 27 topics for which both runs found at least one relevant document. The average within this subset still supports the hypothesis, but if we look closer at the per-topic results we should be cautious in accepting that conclusion.

Table 2: Per-topic measures for each run. "#rel" is the number of relevant documents in the decayed qrels. "rel.ret" is the number of relevant documents retrieved for that topic.

Topic   #rel   short-q rel.ret   short-q bpref   long-q rel.ret   long-q bpref
701     49     1                 0.0171          1                0.0196
708     50     5                 0.0936          6                0.1164
710     41     1                 0.0190          1                0.0184
712     49     1                 0.0196          2                0.0400
713     68     1                 0.0147          2                0.0277
719     95     5                 0.0465          3                0.0316
720     118    2                 0.0155          1                0.0084
721     82     1                 0.0115          1                0.0115
722     42     1                 0.0232          1                0.0238
725     34     8                 0.1696          1                0.0216
731     38     2                 0.0492          2                0.0492
732     141    1                 0.0067          1                0.0070
736     75     3                 0.0389          3                0.0395
741     18     2                 0.0988          2                0.0988
746     38     1                 0.0263          3                0.0789
752     75     5                 0.0645          2                0.0219
761     41     6                 0.1374          1                0.0238
766     22     3                 0.1364          1                0.0413
767     102    3                 0.0291          2                0.0193
771     46     2                 0.0359          2                0.0388
776     23     1                 0.0302          1                0.0397
777     39     2                 0.0388          2                0.0440
782     33     4                 0.1010          3                0.0735
791     10     3                 0.2300          1                0.1000
797     37     1                 0.0219          1                0.0263
798     8      1                 0.0000          2                0.0312
799     33     3                 0.0854          2                0.0588
Avg            2.6               0.0578          1.9              0.0411

The number of retrieved documents judged to be relevant is between two and three on average, and so the bpref value is based on very few pairs of relevant and irrelevant documents. While the short-query run does find more relevant documents on average, and has most of the highest bpref scores per topic, the long-query run actually beats the short-query run on 13 of the topics. The short-query run wins on 11 topics, but by larger bpref margins. Although the sample is small, we observe that a one-sided Wilcoxon test is not significant (p = 0.22), but a one-sided paired t-test is (p = 0.04). We conclude from this experiment, somewhat cautiously, that short queries do work better than long queries for this search engine; short queries tended to return more relevant documents, but it is hard to measure the quality of the ranking from so few documents. The situation would improve if there were 3 to 5 more judged relevant documents for each run in each topic, as we show in the next section.
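The paired comparison reported above can be reproduced with SciPy. A sketch, assuming two lists of per-topic bpref scores aligned by topic:

```python
from scipy.stats import ttest_rel, wilcoxon

def compare_runs(short_bpref, long_bpref):
    """One-sided paired tests of 'short queries score higher than long queries'
    over per-topic bpref values (lists of equal length, aligned by topic).
    The alternative= keyword requires a reasonably recent SciPy."""
    _w, wilcoxon_p = wilcoxon(short_bpref, long_bpref, alternative="greater")
    _t, ttest_p = ttest_rel(short_bpref, long_bpref, alternative="greater")
    return {"wilcoxon_p": wilcoxon_p, "paired_t_p": ttest_p}
```

Note that Wilcoxon's default handling discards zero differences, so topics where the two runs tie exactly do not contribute to that test.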
6. MAINTAINING TEST COLLECTIONS

In their recent paper on building test collections incrementally, Carterette and Allan propose a method for choosing which documents to judge by giving priority to documents whose relevance will expose the greatest difference among the systems. Using the MAP measure, they compute the potential difference in MAP if an unjudged document were to become relevant. To avoid bias, the process continues until adding more relevant documents stops affecting the relative performance of the systems [8]. Fundamentally, this approach builds on the results of Zobel [22], who suspected that the TREC relevance judgments were incomplete, but found that discovering more relevant documents did not affect the relative performance of systems very much. Carterette and Allan's approach takes a set of judgments which are too coarse to compare systems with, and moves them to the point observed by Zobel in an optimal way.

We do the same thing here with two changes. First, we would rather work with bpref than MAP, since bpref requires many fewer relevance judgments to achieve stability. Choosing bpref also means that we don't need to optimize the document selection process as strongly as we would for MAP, since bpref only considers relative ranks of judged documents. Second, we identify three types of unjudged documents. Because we are maintaining an existing collection rather than building a new one, we have documents which are in the original collection and were unjudged. Since our starting point here is a TREC collection built by pooling a large number of system outputs, we should probably opt not to examine these documents. We also have unjudged documents that lie outside the original collection; these should be candidates for examination. When selecting out-of-collection documents to judge, it is important to avoid bias in favor of one run or another. The TREC pooling process, while not efficient, places a high priority on avoiding bias towards particular systems. We can maximize impact on the measure and avoid bias by selecting documents retrieved highly by both runs which have a high coefficient of variance in the rank retrieved. Lastly, we have previously judged documents which have changed. We have chosen to invalidate their relevance judgments, but in particular previously-relevant documents would be a good place to start recovering relevance information. This would also follow if we had chosen to invalidate relevance judgments according to document similarity measures. In the case of previously-judged documents, we can choose documents to judge in an unbiased way by selecting them in order of most-recent change.

In Figure 5, we simulate the re-judging process assuming that documents regain their old relevance value, and show the effect on bpref scores. Each point on the x-axis represents recovering, for each run, one more document with the next-latest change timestamp, assigning it its original relevance value, and recomputing the bpref score. We can see that for most topics, our conclusion is unchanged: either the runs are indistinguishable in effectiveness, or their initial effectiveness ranking is preserved. For some inconclusive topics we gain enough information to distinguish them after re-judging very few "expired" documents.

It is also clear that we should first focus on those topics with the fewest retrieved relevant documents. Once these topics have been stabilized, maintenance effort should be directed at reviving the 14 topics we were forced to discard because they retrieved no judged documents. Priority should be given to topics with more than 20 retrieved documents which are unjudged due to page revisions, since in the 27-topic subset we see 5-20 judged irrelevant documents retrieved for every judged relevant one. Further, we should choose topics where a large number of those revisions affected relevant documents, to ensure that we will recover relevant documents without needing to examine too many irrelevant ones.

Figure 5: Change in bpref in the 27-topic subset as changed documents are re-judged according to their original relevance value.
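The simulation behind Figure 5 is simple to express given the earlier bpref sketch. A rough sketch, with illustrative structures:

```python
def simulate_rejudging(ranked, expired, valid_rel, valid_nonrel):
    """Re-judge expired documents one at a time, most recent change first,
    restoring their original judgment, and record bpref after each step.

    ranked       -- one run's ranked docnos for the topic
    expired      -- [(first_change_date, docno, original_judgment), ...] for
                    judged documents invalidated by the first-change heuristic
    valid_rel    -- set of still-valid relevant docnos
    valid_nonrel -- set of still-valid non-relevant docnos
    """
    rel, nonrel = set(valid_rel), set(valid_nonrel)
    scores = [bpref(ranked, rel, nonrel)]        # bpref from the earlier sketch
    for _when, docno, judgment in sorted(expired, reverse=True):
        (rel if judgment > 0 else nonrel).add(docno)
        scores.append(bpref(ranked, rel, nonrel))
    return scores
```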
7. CONCLUSION

We have shown that static test collections can be used to measure search in a changing document collection such as the live web by tracking changes in judged documents, applying heuristics to determine the decay of relevance information, and carefully re-examining the "old" relevant documents as well as the unjudged documents retrieved in each experiment. Test collections with large sets of relevance judgments remain usable for a long time; the documents we use here were nearly two years old when these experiments were run. As judged documents change, measures such as bpref which work with incomplete information can be used with little or no additional relevance assessment.

We propose the following approach for maintaining a test collection of topics and relevance judgments atop a changing web. First, one must consider how the initial test collection was created. The collection we address in this paper was built as part of a collaborative process which typically involves ten or twenty research teams using different systems with various tuned parameter settings, and which may also include manually collected search results in addition to automatic system rankings. Equivalent approaches include pooling the outputs of a smaller but highly diverse range of retrieval methods [18], or iterative search-and-judge procedures [13]. Test collections built using these methods avoid bias towards any particular search strategy by looking broadly and deeply into the collection for relevant documents. Alternatively, the test collection might come out of an industry search environment, for example a search engine company, or an organization attempting to tune an intranet search engine. In this case there may be only one or two search algorithms contributing documents to judge, and one should be concerned about bias. If the relevance judgments are derived from a single retrieval strategy, then a new approach will retrieve many unjudged documents. The procedure of Carterette and Allan [8] may be useful here.

In either case, as the document collection changes over time, there are several indicators to watch:

The rate at which judged documents change as newer web crawls are done, combined with a heuristic for deciding when a page's relevance judgment no longer applies. Our heuristic is based on the rate of change; alternatively it might be based on a similarity or fingerprinting comparison. The rate of change should be observed per topic, rather than per document.

The number of topics with only one or two retrieved documents that have valid relevance judgments. These topics will need some maintenance in the form of additional judgments in order to be useful. Alternatively, we may decide that topics which fail to retrieve any judged documents can be retired or redone from scratch.

The number of retrieved documents whose relevance values expired due to changes in the page. If we are still retrieving these pages they are good candidates for re-judging, particularly if they used to be relevant.

The number of retrieved documents which lie outside the collection used to create the initial relevance judgments. These documents are unjudged and would not have been judged initially, and this number will only grow over time. If the researcher has several different retrieval algorithms at hand, these can be pooled and judged using the processes described in any of the above-mentioned papers.

The total number of valid topics. If fewer than 30 topics are usable due to relevance decay, then unusable topics should be patched. One can look to guidelines for topic set size such as [20, 17], but keep in mind that the experimental conditions may necessitate more topics. More topics are always better. Sanderson and Zobel suggest that having many topics judged less is better than having fewer topics judged more completely [16].
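These indicators lend themselves to a periodic report over whatever runs are at hand. A rough sketch, under the same illustrative structures as the earlier sketches (the 30-topic threshold follows the guideline above; everything else is an assumption of this sketch, not a prescribed tool):

```python
def maintenance_report(runs, gov2_docnos, original_qrels, decayed_qrels):
    """Rough summary of the maintenance indicators listed above."""
    weak_topics, valid_topics = set(), set()
    expired_retrieved = outside_collection = 0
    for run in runs:                                  # run: {topic: [docno, ...]}
        for topic, ranked in run.items():
            decayed = decayed_qrels.get(topic, {})
            original = original_qrels.get(topic, {})
            judged_retrieved = [d for d in ranked if d in decayed]
            if judged_retrieved:
                valid_topics.add(topic)
            if len(judged_retrieved) <= 2:
                weak_topics.add(topic)                # needs additional judgments
            for doc in ranked:
                if doc in original and doc not in decayed:
                    expired_retrieved += 1            # judged once, invalidated by a change
                elif doc not in gov2_docnos:
                    outside_collection += 1           # never judged, never could have been
    return {"valid_topics": len(valid_topics),
            "topics_needing_judgments": sorted(weak_topics),
            "expired_retrieved": expired_retrieved,
            "retrieved_outside_collection": outside_collection,
            "below_30_valid_topics": len(valid_topics) < 30}
```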
By following these indicators through frequent, repeated experiments, a test collection may be maintained over the live web or other dynamic collection and its usable lifetime extended considerably. As topics decay, one can re-examine past documents, spend resources to judge new documents, or retire topics in favor of developing new ones. Maintenance is cheaper in terms of topic development and relevance assessment time, and permits the comparison of runs from different versions of the collection.

8. FUTURE WORK

The results presented here are somewhat preliminary, and there are a number of improvements and future directions we would like to explore. We plan to conduct a fuller examination of the bpref measure, and system rankings in general, in dynamic collections. We would also like to study explicit measures for detecting bias in selecting documents for relevance assessment. The notion of incremental and maintained test collections makes such measures critically important. Lastly, we have not fully considered the ramifications of comparing runs done at different points in time on different versions of the collection. When a difference is discovered, is it due to algorithmic advantage or differences in the statistical distribution of features in the collection?
9. REFERENCES

[1] Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), Melbourne, Australia, August 1998. ACM Press.
[2] Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), Sheffield, UK, July 2004. ACM Press.
[3] Ziv Bar-Yossef, Andrei Z. Broder, Ravi Kumar, and Andrew Tomkins. Sic transit gloria telae: Towards an understanding of the web's decay. In Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), pages 328-337, New York, NY, May 2004.
[4] Yaniv Bernstein and Justin Zobel. A scalable system for identifying co-derivative documents. In Proceedings of the Eleventh Symposium on String Processing and Information Retrieval (SPIRE 2004), Padova, Italy, October 2004.
[5] Brian E. Brewington and George Cybenko. How dynamic is the web? In Proceedings of the 9th International WWW Conference, pages 257-276, Amsterdam, The Netherlands, May 2000.
[6] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8/13):1157-1166, 1997.
[7] Chris Buckley and Ellen M. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004) [2], pages 25-32.
[8] Ben Carterette and James Allan. Incremental test collections. In Proceedings of the Fourteenth ACM Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany, November 2005.
[9] Junghoo Cho and Hector Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Databases (VLDB 2000), pages 200-209, September 2000.
[10] Abdur Chowdhury, Ophir Frieder, David Grossman, and Mary Catherine McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems, 20(2):171-191, April 2002.
[11] Charles L. A. Clarke, Ian Soboroff, and Nick Craswell. Overview of the TREC 2004 terabyte track. In E. M. Voorhees and L. P. Buckland, editors, Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), Gaithersburg, MD, November 2004.
[12] C. W. Cleverdon. The Cranfield tests on index language devices. In Aslib Proceedings, volume 19, pages 173-192, 1967.
[13] Gordon V. Cormack, Christopher R. Palmer, and Charles L. A. Clarke. Efficient construction of large test collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98) [1], pages 282-289.
[14] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. Software Practice and Experience, 34:213-237, 2004.
[15] Alexandros Ntoulas, Junghoo Cho, and Christopher Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), pages 1-12, New York, May 2004.
[16] Mark Sanderson and Justin Zobel. Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), pages 162-169, Salvador, Brazil, August 2005. ACM Press.
[17] Ian Soboroff. On evaluating web search with very few relevant documents. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004) [2], pages 530-531.
[18] Ian Soboroff and Stephen Robertson. Building a filtering test collection for TREC 2002. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003), pages 243-250, Toronto, Canada, July 2003. ACM Press.
[19] Ellen M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98) [1].
[20] Ellen M. Voorhees and Chris Buckley. The effect of topic set size on retrieval experiment error. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), pages 316-323, Tampere, Finland, August 2002. ACM Press.
[21] Ellen M. Voorhees and Donna K. Harman, editors. TREC: Experiments in Information Retrieval Evaluation. MIT Press, 2005.
[22] Justin Zobel. How reliable are the results of large-scale retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98) [1], pages 307-314.