Minimal Test Collections for Retrieval Evaluation

Ben Carterette (carteret@cs.umass.edu), James Allan (allan@cs.umass.edu), Ramesh Sitaraman (ramesh@cs.umass.edu)
Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003

ABSTRACT
Accurate estimation of information retrieval evaluation metrics such as average precision requires large sets of relevance judgments. Building sets large enough for evaluation of real-world implementations is at best inefficient, at worst infeasible. In this work we link evaluation with test collection construction to gain an understanding of the minimal judging effort that must be done to have high confidence in the outcome of an evaluation. A new way of looking at average precision leads to a natural algorithm for selecting documents to judge and allows us to estimate the degree of confidence by defining a distribution over possible document judgments. A study with annotators shows that this method can be used by a small group of researchers to rank a set of systems in under three hours with 95% confidence.

Categories and Subject Descriptors: H.3 Information Storage and Retrieval; H.3.4 Systems and Software: Performance Evaluation

General Terms: Algorithms, Measurement, Experimentation, Theory

Keywords: information retrieval, evaluation, test collections, algorithms, theory

1. INTRODUCTION
Information retrieval system evaluation requires test collections: corpora of documents, sets of topics, and relevance judgments indicating which documents are relevant to which topics [15]. Ideal measures of retrieval performance should reward systems that rank relevant documents highly (precision) and that retrieve many relevant documents (recall). Stable, fine-grained evaluation metrics take both of these into account, but they require large sets of judgments to accurately measure system performance.

The TREC conferences were set up by NIST with several goals. Most relevant to this work is the goal of building test collections that could be used by information retrieval practitioners to build and evaluate their retrieval systems. The test collections need to be reusable; that is, they need to not only provide accurate evaluation of the set of retrieval systems submitted to TREC, but also to evaluate future retrieval systems that may be developed. To that end, NIST uses a process of pooling to build sets of relevance judgments: top results from many system runs on the same topics are pooled, and the entire pool is judged [11]. This gives a large set of judgments that is sufficient for future use [16].

Reusability is not always a major concern. For example, someone setting up a search engine for a collection simply wants to know which retrieval system is best for the types of queries he expects. TREC-style topics and judgments may not suit his purpose, and he may have neither the time nor money to collect the thousands of relevance judgments necessary to do a TREC-style evaluation on his own topics.
Another case is a researcher performing a user study or a preliminary investigation of a new retrieval task. She wants her retrieval system to be the best possible, but she has no relevance judgments for her topics. Reusability is not a concern; she just wants to make the minimum set of judgments that will tell her which system is best.

A third example is a large, highly dynamic collection such as the web. From month to month, topics and documents become obsolete; new documents enter the collection; collection statistics change. Meanwhile new algorithms must constantly be evaluated. A comparison of algorithms one month may result in a set of judgments that is not usable the next month simply because too many documents have disappeared or become less relevant as new documents and topics have appeared [9].

For any of these tasks, building a database of tens of thousands of relevance judgments is quite inefficient. Our goal in this work is to show that it is possible and not difficult to evaluate a set of retrieval systems with high confidence using a minimal set of judgments. We present a novel perspective on average precision that leads to a natural algorithm for building a test collection. We show that average precision (AP) is normally distributed over possible sets of relevance judgments, allowing us to estimate our confidence in AP. This is used as a probabilistic stopping condition for our algorithm. In this way, evaluation and test collection construction are explicitly linked. We then implement the algorithm and run a study with annotators. With only 15 hours of annotator time and under three hours real time we can achieve a ranking that is correct and in which we have 95% confidence. Simulation experiments test further aspects of our algorithm.

2. PREVIOUS WORK
The method used by NIST to build test collections for TREC is pooling [11]. At TREC (Text REtrieval Conference) each year, sites submit retrieval runs on a corpus without relevance judgments. The top N documents from each submitted system are pooled and judged, and that set of relevance judgments is used to evaluate all systems. The pooling method has been shown to be sufficient for research purposes [16, 13].

Work building on the pooling method shows how sets of TREC judgments can be obtained more efficiently. Cormack et al. and Zobel independently investigated dynamic orderings of documents such that systems that have yielded more relevant documents are given greater priority in a pool [4, 16]. Soboroff et al. investigated random assignment of relevance to documents in a pool and found that this could actually give a decent ranking of systems [10]. More recently, Sanderson and Joho found that systems could be ranked reliably from a set of judgments obtained from a single system or from iterating relevance feedback runs [7], and Carterette and Allan proposed an algorithm based on paired comparisons of systems that was able to achieve high rank correlation with a very small set of judgments [3].

All of these works have focused on building collections from TREC systems. In this work we move away from the TREC environment to a smaller experimental environment that a research lab might find itself in.
3. INTUITION AND THEORY
Precision is the ratio of relevant documents retrieved to documents retrieved at a given rank. Average precision is the average of the precisions at the ranks of relevant documents. For a set R of relevant documents,

AP = \frac{1}{|R|} \sum_{d \in R} \mathrm{prec@rank}(d).   (1)

Average precision is a standard information retrieval evaluation metric. It has been shown to be stable [1]; that is, it reliably identifies a difference between two systems when one exists.

Let x_i be a Boolean indicator of the relevance of document i. Numbering documents in the order they were retrieved, we can rewrite AP as a sum over ranks 1 to n, the total number of documents in the corpus:

AP = \frac{1}{|R|} \sum_{r=1}^{n} x_r \frac{1}{r} \sum_{i=1}^{r} x_i = \frac{1}{|R|} \sum_{r=1}^{n} \sum_{i=1}^{r} \frac{1}{r} x_r x_i.

If we order documents arbitrarily, we replace the coefficient \frac{1}{r} with a_{ij}:

AP = \frac{1}{|R|} \sum_{i=1}^{n} \sum_{j \ge i} a_{ij} x_i x_j, \qquad a_{ij} = \frac{1}{\max\{\mathrm{rank}(i), \mathrm{rank}(j)\}}.

To see why this is true, consider a toy example: a list of 3 documents with relevant documents x_2, x_3 at ranks 1 and 3. Average precision will be \frac{1}{2}(\frac{1}{1} x_2^2 + \frac{1}{2} x_2 x_1 + \frac{1}{3} x_2 x_3 + \frac{1}{2} x_1^2 + \frac{1}{3} x_1 x_3 + \frac{1}{3} x_3^2) = \frac{1}{2}(1 + \frac{2}{3}) because x_1 = 0, x_2 = 1, x_3 = 1.

The difference in average precision between two systems is then

\Delta AP = AP_1 - AP_2 = \frac{1}{|R|} \sum_{i=1}^{n} \sum_{j \ge i} c_{ij} x_i x_j, \qquad c_{ij} = a_{ij} - b_{ij},   (2)

where b_{ij} is the corresponding coefficient for the second system's ranking.

Suppose we believe that ΔAP > 0 and we wish to find the set of documents that would prove it. Intuitively, a document that supports ΔAP > 0 is relevant and has positive "weight", that is, \sum_j c_{ij} x_j > 0. A document that supports the opposite hypothesis ΔAP < 0 has negative weight. If we can show that the sum of the weights of relevant documents is greater than the maximum possible sum of the weights of negative documents, we can conclude that ΔAP > 0.

Let S be the set of judged relevant documents and T be the set of unjudged documents. The following inequality is a sufficient stopping condition:

\sum_{i,j \in S} c_{ij} \;>\; \sum_{\substack{i \in S, j \in T \text{ or } i,j \in T\\ c_{ij} < 0}} |c_{ij}| \;\Rightarrow\; \Delta AP > 0.   (3)

The left-hand side (LHS) is ΔAP calculated over judged relevant documents only. The right-hand side (RHS) is an upper bound on the amount ΔAP would decrease if unjudged documents were judged relevant. Before any documents have been judged, the LHS is 0 and the RHS is the sum of all negative coefficients. It is intuitively clear that we want to increase the LHS by finding relevant documents and decrease the RHS by finding nonrelevant documents.

3.1 An Optimal Algorithm
The stopping condition of Eq. 3 suggests an algorithm: select documents to maximize the left-hand side and minimize the right-hand side.

Suppose we have no relevance judgments. If we were to judge document i relevant, we would add c_{ii} (the difference in reciprocal ranks of i) to the LHS. If i were nonrelevant, we would subtract c_{i1} + c_{i2} + \cdots + c_{iN} = \sum_j c_{ij} from the RHS. It seems intuitively clear that we want to pick the document that will have the greatest expected effect on either side. Give each document a "relevant weight" w_i^R (the amount it would add to the LHS if relevant) and a "nonrelevant weight" w_i^N (the amount it would subtract from the RHS if nonrelevant); the document to judge should be the one that maximizes \max\{p_i w_i^R, (1 - p_i) w_i^N\}, where p_i = P(x_i = 1). This algorithm is optimal for our stopping condition. We sketch part of a proof in the appendix.
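To make the selection rule concrete, here is a minimal Python sketch; it is ours, not the authors' R implementation, and the function name, data layout, and the exact form of the two weights are our assumptions.

import numpy as np

def select_next_document(c, judgments, p):
    """Pick the next document to judge for one pair of systems.

    c          -- n x n array of coefficients c_ij = a_ij - b_ij
    judgments  -- dict {doc index: 0 or 1} for documents judged so far
    p          -- array of P(x_i = 1); unjudged documents might use 0.5
    """
    n = c.shape[0]
    relevant = [j for j, r in judgments.items() if r == 1]
    unjudged = [i for i in range(n) if i not in judgments]

    best, best_score = None, -np.inf
    for i in unjudged:
        # Gain to the LHS if i turns out to be relevant: its own term
        # plus pairwise terms with documents already judged relevant.
        w_rel = c[i, i] + sum(c[i, j] for j in relevant)
        # Reduction of the RHS bound if i turns out to be nonrelevant:
        # negative coefficients involving i can no longer contribute.
        w_non = sum(abs(c[i, j]) for j in unjudged if c[i, j] < 0)
        score = max(p[i] * w_rel, (1 - p[i]) * w_non)
        if score > best_score:
            best, best_score = i, score
    return best

A production version would update the weights incrementally after each judgment rather than recomputing them from scratch.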
3.2 AP is Normally Distributed
Let p_i = P(x_i = 1) and q_i = 1 - p_i. Then

E[AP] = \frac{1}{\sum_i p_i} \Big( \sum_i a_{ii} p_i + \sum_i \sum_{j>i} a_{ij} p_i p_j \Big) + \epsilon   (4)

Var[AP] = \Big( \frac{1}{\sum_i p_i} \Big)^2 \Big( \sum_i a_{ii}^2 p_i q_i + \sum_i \sum_{j>i} a_{ij}^2 p_i p_j (1 - p_i p_j) + 2 \sum_i \sum_{j \ne i} a_{ii} a_{ij} p_i p_j q_i + 2 \sum_i \sum_{k > j \ne i} a_{ij} a_{ik} p_i p_j p_k q_i \Big) + \epsilon

The error term \epsilon = O(c^{-n}) for some c > 1, so we may safely ignore it. (E[AP] and Var[AP] are actually sums over exponentially many terms; these approximations are intuitive, and the error is negligible.) As n \to \infty,

\frac{AP - E[AP]}{\sqrt{Var[AP]}} \to N(0, 1).

This means that with any incomplete test collection, AP is distributed normally over all possible assignments of relevance to all unjudged documents. We have shown this by simulation (Fig. 1), and have sketched a formal proof using martingale limit theory. Eq. 4 is normally distributed independent of how a_{ij} is defined, so ΔAP is distributed normally as well.

Figure 1: Average precision is normally distributed. We simulated two ranked lists of 100 documents. Setting p_i = .5, we randomly generated 5000 sets of relevance judgments and calculated ΔAP for each set. The histogram conforms to a standard normal distribution. The Anderson-Darling goodness of fit test [12] concludes that we cannot reject the hypothesis that the sample came from a normal distribution.

Given a set of relevance judgments, we use the normal cumulative density function (cdf) to find P(ΔAP ≤ 0). If P(ΔAP ≤ 0) < .05, at least 95% of the possible assignments of relevance would conclude that ΔAP > 0, and we can conclude that ΔAP > 0 with 95% confidence. Our previous stopping condition is replaced by the probabilistic stopping condition P(ΔAP ≤ 0) < 1 - α OR P(ΔAP ≤ 0) > α.

Note that we must make some assumption about p_i. We make the "neutral" assumption that if i is unjudged, p_i = 0.5. Using prior information about rates of relevance or the ranks at which documents were retrieved could give better results, but we do not explore that here.

3.3 Application to MAP
Mean average precision is the average of APs computed for a set of topics T. Because topics are independent,

E[\Delta MAP] = \frac{1}{|T|} \sum_{t \in T} E[\Delta AP_t], \qquad Var[\Delta MAP] = \frac{1}{|T|^2} \sum_{t \in T} Var[\Delta AP_t].

Then if the standardized ΔAP is distributed as N(0, 1), the standardized ΔMAP is as well. The algorithm for determining a difference ΔMAP follows directly from the algorithm for average precision. We treat each (topic, document) pair as a unique "document" and proceed exactly the same way.
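As an illustration of how the normal approximation turns a partial set of judgments into a confidence estimate, consider the following sketch (ours, not the paper's implementation). It keeps only the first two variance terms and drops the 1/\sum_i p_i normalization, which cancels when evaluating the tail probability at zero.

import numpy as np
from scipy.stats import norm

def confidence_delta_ap(c, p):
    """Approximate P(dAP > 0) for one topic given P(x_i = 1).

    c -- n x n array with c_ij = a_ij - b_ij (only i <= j is used)
    p -- length-n array; judged docs get p_i = 1 or 0, unjudged 0.5
    """
    n = len(p)
    q = 1 - p
    iu = np.triu_indices(n, k=1)           # index pairs with i < j
    # First-order approximations in the spirit of Eq. (4); the
    # covariance cross-terms are omitted for brevity in this sketch.
    e = np.sum(np.diag(c) * p) + np.sum(c[iu] * p[iu[0]] * p[iu[1]])
    v = np.sum(np.diag(c) ** 2 * p * q) + \
        np.sum(c[iu] ** 2 * p[iu[0]] * p[iu[1]] * (1 - p[iu[0]] * p[iu[1]]))
    if v <= 0:
        return 1.0 if e > 0 else 0.0        # everything relevant is judged
    return 1 - norm.cdf(0, loc=e, scale=np.sqrt(v))

A confidence above .95 for a pair of systems would satisfy the probabilistic stopping condition described above.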
4. EXPERIMENTAL EVALUATION
Having developed the algorithm from first principles, we wish to show that it can be used in a research setting on a new corpus. Previous work has shown simulated results of efficient judging algorithms on TREC ad hoc collections [3, 4, 7, 10, 16]; rather than repeat those experiments, we will evaluate the algorithm as it might be used in a research or industrial environment. The outline of our experiment is as follows: we ran eight retrieval systems on a set of baseline topics for which we had full sets of judgments. We asked six annotators to develop new topics; these were run on the same eight systems. The same six annotators then judged documents selected by the algorithm.

4.1 Baseline Topics
The baseline topics were used to estimate system performance. It is generally assumed in information retrieval that 50 topics are a sufficient sample for reliable comparison of retrieval systems. This has been shown recently by Sanderson and Zobel [8] and earlier by Zobel [16]. We expect, therefore, that performance of systems on our new topics will track performance on the baseline topics. The baseline topics were the 2005 Robust/HARD track topics [14] and ad hoc topics 301 through 450. We used only topic titles for queries.

4.2 Corpora
We indexed two different corpora. One was the Aquaint corpus, consisting of about 1 million articles from the New York Times News Service, Associated Press Worldstream News Service, and Xinhua News Service from 1996 through 2000. The second was TREC disks 4 and 5, with about 500,000 documents from the Financial Times, Federal Register, LA Times, and Foreign Broadcast Information Service, covering 1989 through 1996. We ran the Robust queries on the Aquaint corpus and the ad hoc queries on the TREC corpus, so even though some topics were duplicated, the retrieved results are different.

4.3 Retrieval Systems
We used six freely-available retrieval systems: Indri, Lemur, Lucene, mg, SMART, and Zettair (respectively available at http://www.lemurproject.org/indri, http://www.lemurproject.org, http://lucene.apache.org, http://www.cs.mu.oz.au/mg, ftp://ftp.cs.cornell.edu/pub/smart, and http://www.seg.rmit.edu.au/zettair). With these six systems we obtained eight retrieval runs on all 200 baseline queries. The runs use different retrieval models (Okapi weighting, TFIDF weighting, language modeling), different stemmers (Krovetz, Porter), and different stop lists. One run used pseudo-relevance feedback. Since we did not tune the systems, we want to avoid identifiable claims about relative performance. We number the systems 1 through 8, with 1 having the lowest mean average precision on all 200 baseline topics and 8 the highest.

Results of the eight retrieval runs are shown in Table 1. The table shows that the ranking can vary somewhat by topic set, though the bins the systems fall into remain the same. Significant differences between pairs of systems are always preserved.

Table 1: True MAPs of eight systems over 200 topics; broken out into four sets of 50 topics; and expected MAP, with 95% confidence intervals, over 60 new topics. The final column is expected MAP rescaled to the range [.1, .3] for easy comparison to performance on other topic sets. Horizontal lines indicate "bin" divisions determined by statistical significance.

System   All 200   Robust 05   301-350   351-400   401-450   60 New       60 New-norm
1        .109      .103        .129      .093      .112      .162±.007    .118
2        .123      .110        .148      .108      .124      .158±.007    .100
3        .168      .144        .174      .165      .189      .174±.008    .187
4        .169      .162        .179      .150      .184      .175±.008    .192
5        .172      .141        .166      .174      .208      .179±.008    .218
6        .213      .194        .233      .177      .249      .185±.008    .246
7        .215      .203        .231      .182      .244      .187±.008    .260
8        .296      .321        .300      .262      .301      .194±.008    .300

4.4 Query Formulation
Six volunteer annotators were given an interface to query and browse the Aquaint corpus and asked to come up with 10 topics each. They were asked to provide a title query and a description of what should and should not be considered relevant. They were asked to create topics that were not too easy (too many relevant documents found in the browsing interface), but not too hard (no relevant documents).

We wanted to minimize drift in annotators' definitions of relevance. We kept the time between defining topics and judging documents as short as possible so that annotators would not forget what they were thinking. All judging was done in a 3-hour block with a break for lunch so that definitions of relevance would be unlikely to change drastically. Our annotators were experienced in the field of information retrieval; we hoped their greater understanding of the definition of relevance would also help keep drift down. Finally, by asking annotators to explicitly state what should be considered relevant and displaying that to them while they judged documents, we hoped to reduce drift further.
Some of the topics (along with the number of judgments made and the number of relevant documents found) are shown in Table 2. Apart from two topics about "intelligent design", all topics were unique.

Table 2: Selected topics developed by annotators.

No.   Title query                           Judged   Rel
6     environmental conservation policies   39       25
13    journalistic plagiarism               31       10
24    sea piracy                            55       29
31    intelligent design                    43       16
36    indian nuclear tests and visas        49       7
44    africa aids orphans                   92       75
54    environmental impact of US army       27       2
59    aquifer water levels                  12       0

4.5 Algorithm Implementation
The algorithm was implemented in R [5]. Using vector arithmetic, weights could be calculated in linear time; the complexity of the algorithm is O(S^2 n^2). We used only the top 100 documents retrieved by each system, partially for computational reasons, but also because this is the usual cutoff for collecting judgments for TREC. If a document was not ranked in the top 100 by a system, its reciprocal rank for that system was defined to be 0.

When comparing pairs of systems, we used the document with the maximum weight in the pair. To extend that to ranking a set of systems, we use the document with the maximum weight over all pairs of systems.

The implementation was very fast. There was no perceivable lag between an annotator submitting a judgment and receiving the next document. Strictly speaking, the algorithm requires that judgments be made one at a time in the determined order. This would require that annotators spend a lot of time idle, waiting for a document from one of their topics to be served. Instead, we separated the topics into six sets of ten and ran the algorithm independently on each set. This may give slightly suboptimal performance, but it is a better use of annotator time.

4.6 Experimental Process
The 60 topics developed were run on the same eight retrieval systems against the Aquaint corpus. A server was set up to feed documents to annotators for judging. They used a web interface to judge documents. Each annotator judged documents from their own 10 topics. They spent about 2.5 hours judging (15 total hours of annotator time), with a break for lunch after the first hour.

5. RESULTS
In 2.5 hours we obtained 2200 relevance judgments: about 4.5 per system per topic on average, about 2.5 per minute per annotator, and about 14.7 per minute overall. The TREC pooling approach with a depth of 100 would have yielded 18,537 documents for judgment. NIST annotators spent a total of 280.5 hours judging the 37,798 documents in the pool for Robust/HARD systems, a rate of 2.2 judgments per minute.

Of the documents judged, 38.5% were relevant. Judgments by annotators on the first 271 documents were compared to judgments by one of the authors of this paper; annotator agreement was 88%. The fastest annotator judged twice as many documents as the slowest. Some topics were more difficult to interpret than others, and some topics retrieved documents that were longer on average; that affected the number of documents an annotator could judge. In Section 6 we simulate a single annotator judging all topics.

We rank systems by the expected value of MAP,

E\,MAP = \frac{1}{|T|} \sum_{t \in T} E[AP_t],

with E[AP_t] computed as in Eq. (4), where p_i = 1 if document i has been judged relevant, 0 if judged nonrelevant, and .5 otherwise. E MAP ranges from 0 to 1. A ranking by E MAP is directly comparable to a ranking by MAP, though E MAP has a compressed range compared to MAP.
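The following sketch (ours; the function name, data layout, and the use of 1/\sum p_i as a stand-in for 1/|R| are assumptions) illustrates how E MAP can be computed from a partial set of judgments and used to rank systems.

import numpy as np

def emap(run_coeffs, judgments, default_p=0.5):
    """Expected MAP of one system over a set of topics.

    run_coeffs -- dict {topic: n x n array of a_ij for this system's ranking}
    judgments  -- dict {topic: {doc index: 0 or 1}} judged so far
    """
    total = 0.0
    for topic, a in run_coeffs.items():
        n = a.shape[0]
        p = np.full(n, default_p)
        for doc, rel in judgments.get(topic, {}).items():
            p[doc] = float(rel)          # 1 if judged relevant, 0 if not
        iu = np.triu_indices(n, k=1)
        expected_sum = np.sum(np.diag(a) * p) + np.sum(a[iu] * p[iu[0]] * p[iu[1]])
        total += expected_sum / max(p.sum(), 1.0)   # rough 1/|R| approximation
    return total / len(run_coeffs)

# Ranking a hypothetical dict `systems` of per-system coefficients:
# ranking = sorted(systems, key=lambda s: emap(systems[s], judgments), reverse=True)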
E MAP is shown in Table 1 with the MAPs on the baseline topics for comparison. The ranking is within the range of normal variation we would expect from performance on different topic sets. In fact, apart from the inversion of the first two systems, the ranking is identical to the canonical ranking. Table 1 also shows E MAP rescaled to the range [0.1, 0.3] for comparison to the other topic sets.

Using the normal cdf, we calculate the confidence that ΔMAP > 0 for each pair of systems. We estimate the confidence in the ranking as the average of the confidences in each pair. With no judgments, confidence is .5; with a full set of judgments, confidence would be 1. This can be interpreted as the expected confidence in ΔMAP if any two systems are selected from the set at random. By that measure, our algorithm has 96% confidence that this is the correct ranking.

Figure 2 shows how confidence increases as more judgments are made. A correct ranking is found very fast: after only 50 judgments, the systems are binned correctly; after 250 judgments, the systems are ranked correctly. Confidence after 250 judgments is only 71%, but after 1000 judgments, the systems are ranked correctly with 90% confidence. It took about one hour real time (6 hours of annotator time) to reach that point.

Figure 2: Confidence increases as more judgments are made.

Table 3: Each cell shows the difference in true MAP and, in parentheses, P(ΔMAP ≤ 0) for each pair of systems after 2200 judgments for the Robust 2005 topics. Pairs in which we have 95% confidence that additional judgments will not change the outcome are bolded.

      2            3            4            5             6            7            8
1     .007 (.480)  .040 (0)     .059 (0)     .037 (0)      .090 (0)     .100 (0)     .217 (0)
2                  .034 (0)     .052 (0)     .031 (0)      .084 (0)     .093 (0)     .211 (0)
3                               .018 (.061)  .003 (.144)   .050 (.002)  .060 (0)     .177 (0)
4                                            -.021 (.793)  .032 (.073)  .041 (.029)  .159 (0)
5                                                          .053 (.018)  .063 (.003)  .180 (0)
6                                                                       .010 (.185)  .127 (0)
7                                                                                    .117 (0)

6. DISCUSSION
Here we present some results of simulations of our algorithm to attempt to evaluate its performance in general settings. The simulation is done by running the client/server setup described previously, but using judgments from the qrels produced by NIST instead of judging manually. Documents not judged by NIST were assumed nonrelevant. Table 3 shows true ΔMAP and P(ΔMAP ≤ 0) for a simulation using the Robust 2005 topics. Some questions we explore are:

· To what degree are the results dependent on the algorithm rather than the evaluation metric?
· How many judgments are required to differentiate a single pair of ranked lists with 95% confidence?
· How does confidence vary as more judgments are made?
· Are test collections produced by our algorithm reusable?

6.1 Comparing E MAP and MAP
The main advantage of E MAP over standard MAP is that it takes advantage of the information provided by nonrelevant judgments. MAP is calculated using only relevant documents, and thus it only helps to find relevant documents (except insofar as a nonrelevant judgment indicates that the document has been judged and does not need to be judged again). More value is extracted from each judgment. Another advantage is that E MAP is normally distributed, which provides all the power that the normal distribution provides. The caveat is that we must make an assumption about the probability of a document's relevance.

We ran the following simulation: after several documents had been judged, we calculated both E MAP and MAP on all systems, ranked them, and compared the ranking to the "true" ranking by MAP computed using all relevance judgments. We did this after 1, 2, 4, 8, ..., 1024 judgments. To compare rankings we use a rank correlation measure called Kendall's tau [6]. Kendall's tau is calculated using pairwise inversions between elements in two lists. It ranges from -1 to 1, with -1 indicating that the two lists are reversed, 0 that half the pairs of elements are inverted, and 1 that the two lists are identical. Since we have only 8 systems, the range of values is limited.

The result is shown in Figure 3. We see that the correlation between E MAP and true MAP increases very fast; after only 32 judgments it is about .85. It takes regular MAP 256 judgments to get to the same point. The bpref metric also uses nonrelevance information [2]; this result tracks with bpref.

Figure 3: As the number of judgments increases, Kendall's tau correlation to the true ranking increases. Correlation between expected MAP and true MAP increases faster than correlation between MAP and true MAP.
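For rankings as small as the eight systems here, Kendall's tau can be computed directly from pairwise inversions, as in the following sketch (ours; ties are not handled).

from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {sys: i for i, sys in enumerate(rank_a)}
    pos_b = {sys: i for i, sys in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: the pair is ordered the same way in both lists.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# With 8 systems there are 28 pairs, so a single inversion already costs
# 2/28 in tau: kendall_tau(list(range(1, 9)), [2, 1, 3, 4, 5, 6, 7, 8]) ~= 0.93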
6.2 How Many Judgments?
The number of judgments that must be made in comparing two systems or a set of systems depends on how similar the systems are. "Similarity" could be measured by the true difference in average precision, or by a ranking distance metric.

Figure 4 shows absolute difference in true AP for Robust 2005 topics vs. the number of judgments needed to reach 95% confidence for pairs of ranked lists on individual topics. As the figure shows, if the difference in AP is greater than .25, it will generally take fewer than 25 judgments to distinguish performance. But as the difference in AP gets closer to 0, the number of judgments rises fast. The axes can be flipped: if many documents have been judged but 95% confidence is not in sight, the lists must be very similar. The correlation between difference in AP and number of judgments to 95% confidence is -0.509.

Figure 4: Absolute difference in true AP and the number of judgments it takes to achieve 95% confidence that a difference exists. Each point represents a pair of ranked lists. In 98% of the pairs, the sign of the difference was predicted correctly.

A simple distance measure is the sum over all documents of the absolute difference in reciprocal rank, i.e.

d = \sum_i \left| \frac{1}{r_{1i}} - \frac{1}{r_{2i}} \right|,

where r_{1i} and r_{2i} are the ranks of document i in the two lists. The correlation between distance and number of judgments is 0.218. Difference in AP and our distance metric have a correlation of 0.219, indicating that lists that are farther apart tend to have a larger difference in AP, though the relationship is weak.

6.3 Confidence over Time
Figure 5 shows a long simulation of a single annotator judging documents for the Robust/HARD queries. Confidence reaches .95 very rapidly, about as fast as it did in Figure 2, and after that continues to approach 1, the point at which the ranking would not change with more judgments. It takes 16,000 judgments to reach 1, less than half the number in the Robust 2005 qrels. The curve from Figure 2 is included for comparison.

Figure 5: A long simulation showing confidence approaching 1. The original experiment is shown for comparison.

Figure 6 shows a comparison to a pooling method that we will call "incremental pooling". In this method all documents in a pool of depth k are judged. Since the order in which they are judged affects the estimated confidence, we must impose an ordering on the pool; we judge the documents in rank order. This could be seen as a weaker version of our algorithm, which will tend to judge high-ranked documents unless they are found at a similar position in every list. The pool of depth 10 for our Robust/HARD runs contains 2228 documents. There were 569 documents in the pooled set that were not in our algorithmic set (and vice versa). This means that our algorithm tends to pick documents from the top of the ranked lists, which is expected since those have the greatest effect on MAP. It also shows that high confidence can be achieved more rapidly by drawing documents from outside the pool.

Figure 6: Comparison between our algorithm and incremental pooling. We have more confidence in the rankings by the algorithm's documents than by pooling's.
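One plausible reading of judging the pool "in rank order" is sketched below (ours; the paper does not specify the exact interleaving): every rank-1 document across runs is judged first, then every rank-2 document, and so on down to the pool depth, skipping duplicates.

def incremental_pool(runs, depth=10):
    """Order a depth-k pool for judging, rank by rank across runs.

    runs -- list of ranked lists (each a list of doc ids, best first)
    """
    seen, order = set(), []
    for rank in range(depth):
        for run in runs:
            if rank < len(run) and run[rank] not in seen:
                seen.add(run[rank])
                order.append(run[rank])
    return order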
6.4 Reusability of Test Collections
We showed above that E MAP can produce a good ranking with a very small set of judgments. This suggests that a test collection created by this algorithm may be able to evaluate new systems that were not used in its creation. To test this, we removed one system from our set of eight and simulated our algorithm to build test collections of 500, 1000, 1500, and 2000 judgments from the remaining seven. Then we replaced the eighth system and ranked all eight by E MAP, setting p_i to be the ratio of relevant documents in the test collection.

Table 4 shows rank confidence averaged over eight trials, removing a different system for each trial. Confidence actually increases when adding the 8th system back in, likely because of the better estimate of p_i. Furthermore, the 8th system is always placed in the correct spot in the ranking or swapped with the next (statistically indistinguishable) system. This suggests that the test collections can be reused reliably in at least some cases. In general it depends on how many documents in the new system have been judged and on the estimates of p_i.

Table 4: Reusability of test collections. Judgments are collected from seven systems, then all eight are ranked using those judgments. Rank confidence increases when the eighth is replaced.

No. judgments   Rank confidence (7 systems)   Rank confidence (8 systems)
500             .863                          .881
1000            .921                          .930
1500            .945                          .954
2000            .955                          .969

7. CONCLUSION
A new perspective on average precision leads to an algorithm for selecting the documents that should be judged to evaluate retrieval systems in minimal time. Using actual annotators, we implemented the algorithm and showed that it can be used to rank retrieval systems with a high degree of confidence and a minimal number of judgments. After only six hours of annotation time, we had achieved a ranking with 90% confidence. This is applicable to a variety of retrieval environments when little relevance information is available.

A clear direction for future work is extending the analysis to other evaluation metrics for different tasks. Some initial exploration in this direction suggests that it would be very easy to analyze precision, for instance. Another direction is estimating probabilities of relevance. Uniformly using the same value is not optimal; better estimates based on ranks provide better results.

Acknowledgments
This work was supported in part by the Center for Intelligent Information Retrieval, in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR001-06-C-0023, and in part by an NSF award under grant number CNS-0519894. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

8. REFERENCES
[1] C. Buckley and E. M. Voorhees. Evaluating Evaluation Measure Stability. In Proceedings of SIGIR, pages 33-40, 2000.
[2] C. Buckley and E. M. Voorhees. Retrieval Evaluation with Incomplete Information. In Proceedings of SIGIR, pages 25-32, 2004.
[3] B. Carterette and J. Allan. Incremental Test Collections. In Proceedings of CIKM, pages 680-687, 2005.
[4] G. V. Cormack, C. R. Palmer, and C. L. Clarke. Efficient Construction of Large Test Collections. In Proceedings of SIGIR, pages 282-289, 1998.
[5] R. Gentleman and R. Ihaka. The R Language. In Proceedings of the 28th Symposium on the Interface, 1997.
[6] M. Kendall. Rank Correlation Methods. Griffin, London, UK, fourth edition, 1970.
[7] M. Sanderson and H. Joho. Forming Test Collections with No System Pooling. In Proceedings of SIGIR, pages 33-40, 2004.
[8] M. Sanderson and J. Zobel. Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. In Proceedings of SIGIR, pages 186-193, 2005.
[9] A. Singhal. Challenges in Running a Commercial Search Engine. In Proceedings of SIGIR, page 432, 2005. Keynote talk.
[10] I. Soboroff, C. Nicholas, and P. Cahan. Ranking Retrieval Systems without Relevance Judgments. In Proceedings of SIGIR, pages 66-73, 2001.
[11] K. Sparck Jones and C. J. van Rijsbergen. Information Retrieval Test Collections. Journal of Documentation, 32(1):59-75, 1976.
[12] M. A. Stephens. EDF Statistics for Goodness of Fit and Some Comparisons. Journal of the American Statistical Association, 69:730-737, 1974.
[13] E. M. Voorhees. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. In Proceedings of SIGIR, pages 315-323, 1998.
[14] E. M. Voorhees. Overview of the TREC 2005 Robust Retrieval Track. In TREC 2005 Notebook, 2005.
[15] E. M. Voorhees. The Philosophy of Information Retrieval Evaluation. In CLEF '01: Revised Papers from the Second Workshop of CLEF, pages 355-370, London, UK, 2002. Springer-Verlag.
[16] J. Zobel. How Reliable are the Results of Large-Scale Information Retrieval Experiments? In Proceedings of SIGIR, pages 307-314, 1998.
APPENDIX
In Section 3 we stated a sufficient stopping condition (Eq. 3) that would allow us to conclude ΔAP > 0. Here we present an algorithm for maximizing the expected value of part of the stopping condition and prove its optimality. This is only part of the algorithm we actually use in the paper; a complete algorithm and proof are forthcoming.

\sum_{i,j \in S} c_{ij} x_i x_j \;>\; \sum_{\substack{i \notin S \text{ or } j \notin S\\ c_{ij} < 0}} |c_{ij}| \;\Rightarrow\; \Delta AP > 0   (5)

This is a restatement of the stopping condition Eq. 3: S is a set of documents that have been judged. Intuitively, we want to show that our algorithm will select documents that maximize the expected value of the left-hand side and minimize the expected value of the right-hand side. We claim Algorithm 1 maximizes the expected value of the LHS when p_i = p_j = p for all i, j.

Algorithm 1 Select k documents to be judged.
1: while |S| < k do
2:   for all i ∉ S do
3:     w_i ← c_{ii} p + Σ_{j∈S} c_{ij} p^2
4:   end for
5:   j ← argmax_i w_i
6:   S ← S + {j}
7: end while
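For concreteness, the following is a direct Python transcription of Algorithm 1 (our own sketch; variable names are ours, and we assume k does not exceed the number of documents).

import numpy as np

def select_k_documents(c, p, k):
    """Greedily select k documents to judge, following Algorithm 1.

    c -- n x n array of coefficients c_ij
    p -- the single probability of relevance shared by all documents
    k -- number of documents to select (assumed <= n)
    """
    n = c.shape[0]
    S = []
    while len(S) < k:                       # line 1
        best, best_w = None, -np.inf
        for i in range(n):                  # lines 2-4
            if i in S:
                continue
            # w_i <- c_ii * p + sum_{j in S} c_ij * p^2
            w = c[i, i] * p + sum(c[i, j] * p * p for j in S)
            if w > best_w:
                best, best_w = i, w
        S.append(best)                      # lines 5-6
    return S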
Theorem 1. If p_i = p for all i, the set S maximizes E[\sum_{i,j \in S} c_{ij} x_i x_j].

To prove this, we will need to show that the weight of S is greater than the weight of any other set of the same size. We begin with two lemmas.

Lemma 1. For two documents i and j, either c_{ii} ≥ c_{ij} ≥ c_{jj} or c_{jj} ≥ c_{ij} ≥ c_{ii}.

Proof. The proof is by a simple analysis of cases. There are four possible relative orderings of i and j in two systems, and in every case one of the above inequalities is true.

Lemma 2. At iteration i of the algorithm, c_{ij} p^2 ≥ c_{jj} p^2 for j ≥ i.

Proof. By induction on i. The base case, c_{1j} ≥ c_{jj}, follows because c_{11} ≥ c_{jj} by lines 3 & 5 of Alg. 1, and c_{11} ≥ c_{1j} ≥ c_{jj} by Lemma 1. Assume the induction hypothesis: c_{i-1,j} ≥ c_{jj} for arbitrary i-1. The proof continues by contradiction. Suppose c_{jj} > c_{ij}. If this is true, then anywhere we place an arbitrary document k we have that c_{jk} ≥ c_{ik} (this can be shown by enumeration of nine cases, some of which are impossible because they violate the induction hypothesis). This means that c_{jj} p + c_{jk} p^2 > c_{ii} p + c_{ik} p^2. But this is a contradiction: we selected document i because c_{ii} p + \sum_k c_{ik} p^2 ≥ c_{jj} p + \sum_k c_{jk} p^2 for all j > i (lines 3 & 5 in Alg. 1). Therefore we must conclude that c_{ij} ≥ c_{jj}.

Now we are ready to prove the theorem. The proof is fairly technical, but the intuition is that, given an arbitrary set U and the set S produced by Algorithm 1, we show that the expected weight of S is greater than the expected weight of a set S' that contains a particular document in U instead of the kth document in S, and that the expected weight of S' is greater than the expected weight of U.

Proof. By induction on k. The base case is trivial: c_{11} p ≥ c_{ii} p by construction. Let S_k be the set of the first k documents selected by the algorithm and let U_k be an arbitrary set of k documents such that U_k ≠ S_k. We shall number documents in the order they would be added to S, so U_k will contain documents with indices greater than k, but S_k will not. For conciseness, the expected weight of a set S, E[\sum_{i,j \in S} c_{ij} x_i x_j], shall be denoted W(S). Let us make the induction hypothesis that W(S_{k-1}) ≥ W(U_{k-1}). We wish to show that W(S_k) ≥ W(U_k).

Note that S_{k-1} ⊂ S_k. Therefore W(S_k) = W(S_{k-1}) + c_{kk} p + \sum_{j \in S_{k-1}} c_{jk} p^2. Let ℓ be the document in U_k such that ℓ ≥ k and ℓ has a greater difference in reciprocal ranks than any other document u ≥ k, i.e. c_{ℓℓ} ≥ c_{uu} for all u ≥ k. Lemma 1 ensures ℓ exists by imposing an ordering on the differences in reciprocal ranks c_{ii} for all i. Lemma 2 states that c_{jℓ} p^2 ≥ c_{ℓℓ} p^2 for j < ℓ. Then it follows that c_{jℓ} p^2 ≥ c_{ℓℓ} p^2 ≥ c_{uℓ} p^2 ≥ c_{uu} p^2 ≥ c_{uv} p^2 ≥ c_{vv} p^2 ≥ ⋯.

Let S'_k be the set obtained by replacing document k with the document ℓ defined above, i.e. S'_k = S_k - k + ℓ. W(S_k) ≥ W(S'_k) by construction. If we remove ℓ from both S'_k and U_k, we have W(S'_k - ℓ) = W(S_{k-1}) ≥ W(U_k - ℓ) by the induction hypothesis. It remains to be shown that \sum_{j \in S_{k-1}} c_{jℓ} p^2 ≥ \sum_{u \in U_k - ℓ} c_{uℓ} p^2. Let j be an arbitrary document in S_{k-1}. If there is a u ∈ U_k - ℓ such that j = u, then c_{jℓ} p^2 = c_{uℓ} p^2. Otherwise, c_{jℓ} p^2 ≥ c_{ℓℓ} p^2 ≥ c_{uℓ} p^2. Then

W(S_k) ≥ W(S'_k) = W(S_{k-1}) + c_{ℓℓ} p + \sum_{j \in S_{k-1}} c_{jℓ} p^2 ≥ W(U_k - ℓ) + c_{ℓℓ} p + \sum_{u \in U_k - ℓ} c_{uℓ} p^2 = W(U_k)

and the proof is complete.

A similar algorithm is used to pick nonrelevant documents to decrease the right-hand side of Eq. 5. Instead of w_i = c_{ii} p + \sum_{j \in S} c_{ij} p^2, it sets w_i = c_{ii}(1-p) + \sum_{j \in S} c_{ij}(1-p)^2. Similar reasoning proves that the set selected has minimal expected weight.