SIGIR 2007 Proceedings Poster

Power and Bias of Subset Pooling Strategies

Gordon V. Cormack and Thomas R. Lynam
David R. Cheriton School of Computer Science, University of Waterloo
Waterloo, Ontario N2L 3G1, Canada
gvcormac@uwaterloo.ca, trlynam@uwaterloo.ca

ABSTRACT
We define a method to estimate the random and systematic errors resulting from incomplete relevance assessments. Mean Average Precision (MAP) computed over a large number of topics with a shallow assessment pool substantially outperforms, for the same adjudication effort, MAP computed over fewer topics with deeper pools, and P@k computed with pools of the same depth. Move-to-front pooling, previously reported to yield substantially better rank correlation, yields similar power, and lower bias, compared to fixed-depth pooling.

Categories and Subject Descriptors
H.3.4 [Information Search and Retrieval]: Systems and Software - performance evaluation

General Terms
Experimentation, Measurement

Keywords
significance test, validity, statistical power, pooling methods

1. INTRODUCTION
A number of strategies have been devised to minimize the amount of human adjudication involved in IR system evaluation. The TREC pooling method selects for adjudication only the top-ranked k documents from each system under test. The typical value of k = 100 appears to work well for test collections with 50 topics and 500,000 documents. However, the effort in conducting an evaluation is substantial for a collection of this size and prohibitive for larger collections. It is not obvious whether a smaller value of k, or some other subset of the pool, might suffice; and it is not obvious that even k = 100 is sufficient for larger collections.

Validation of the pooling method, and optimizations of it, have typically been ad hoc and uncalibrated. The commonest approach has been to assume as a gold standard the mean average precision (MAP) for k = 100 and to measure the correlation in system rankings (Kendall τ) achieved by the gold standard and the proposed method. Without formal justification, τ > .9 has been taken to be good agreement. Furthermore, the qualitative term bias has been ascribed to some methods in assessing their validity.

We present a method to estimate the power and bias of pooling methods, and use our method to evaluate the effectiveness of several pooling alternatives as a function of adjudication effort. The alternatives we investigate are: different values of k; different numbers of topics; move-to-front sampling; and using precision at cutoff k (P@k) as an alternative to MAP.

2. METHOD
Kendall τ simply counts inversions in rank; as such it conflates random error (error due to chance) and systematic error, or bias (error due to measuring the wrong quantity). More specifically, it measures errors in the sign of the difference between the MAP scores of pairs of systems. It does not account for the magnitude or significance of the difference. We treat random error and bias separately, using a paired t-test(1) to estimate statistical power, and counting the number of significant inversions between the alternative method and the gold standard. Overall, an alternative pooling method is good if it has high power, and if its observed bias is insubstantial relative to random error.

We applied this method to various subsets of the topics and judging pool from the TREC 2004 Robust Retrieval Track [7]. In all cases we computed, for all pairs of systems, the sign of the difference in MAP (or P@k) and also the t-test p-value. We compared the sign of the difference to that yielded by the gold standard, and computed the number of inversions when p < α.(2) We compute power as the overall proportion of differences for which p < α, and bias as the proportion of differences with p < α whose difference has the opposite sign from the gold standard. If this proportion is substantially less than α, bias is a negligible factor (compared to random error) in the validity of the estimate, and may be discounted.

(1) Although the applicability of the assumptions has been called into question [6], we found the t-test to be very accurate, for all methods and sample sizes presented here, in predicting inversions in rank using the same method on different topics, thereby establishing its validity.
(2) For α = .05 and several other values not reported here.
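To make the pooling strategies concrete, the following is a minimal Python sketch of fixed depth-k pooling and of a simplified move-to-front pool. Function names are hypothetical, and the move-to-front routine is an illustrative simplification of the algorithm of Cormack et al. [4], not a faithful reimplementation.

```python
from collections import deque

def depth_k_pool(runs, k):
    """Standard TREC pool: the union of the top-k documents from each run.

    `runs` maps a run name to its ranked list of document ids, best first."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

def move_to_front_pool(runs, judge, budget):
    """Simplified move-to-front pooling: a run whose latest pooled document
    is judged relevant keeps priority; otherwise it moves to the back of the
    queue.  `judge(doc)` returns True/False; `budget` caps the judgements."""
    queue = deque((name, iter(ranking)) for name, ranking in runs.items())
    judged = {}
    while queue and len(judged) < budget:
        name, docs = queue.popleft()
        for doc in docs:
            if doc in judged:
                continue                        # already adjudicated; skip
            judged[doc] = judge(doc)
            if judged[doc]:
                queue.appendleft((name, docs))  # run stays at the front
            else:
                queue.append((name, docs))      # demote the run
            break
        # a run whose documents are exhausted is simply dropped
    return judged
```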
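The two effectiveness measures under comparison, MAP and P@k, can be computed against whatever relevance judgements a given pool produces; unjudged documents are treated as non-relevant, following the usual trec_eval convention. The sketch below uses illustrative names and is not tied to any particular toolkit.

```python
def average_precision(ranking, relevant):
    """AP of one ranked list against the set of documents judged relevant."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant)

def precision_at_k(ranking, relevant, k):
    """P@k: fraction of the top k retrieved documents judged relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k

def mean_over_topics(run, qrels, measure, **kw):
    """Average a per-topic measure over all topics of a run.

    `run` maps topic id to a ranked list; `qrels` maps topic id to the set of
    documents judged relevant.  With measure=average_precision this is MAP."""
    scores = [measure(ranking, qrels.get(topic, set()), **kw)
              for topic, ranking in run.items()]
    return sum(scores) / len(scores)

# Usage with hypothetical data structures:
#   map_score = mean_over_topics(run, qrels, average_precision)
#   p_at_10   = mean_over_topics(run, qrels, precision_at_k, k=10)
```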
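The power and bias estimates of Section 2 can then be computed over all pairs of systems. The sketch below reflects one reading of the method, not the authors' implementation: per-topic AP under the gold-standard and alternative qrels is assumed to be available in nested dictionaries (hypothetical names gold_ap and alt_ap), SciPy's paired t-test stands in for whatever test implementation was actually used, and Kendall τ over the induced MAP rankings is included for comparison with the conventional statistic.

```python
from itertools import combinations
from scipy.stats import ttest_rel, kendalltau

def power_and_bias(gold_ap, alt_ap, topics, alpha=0.05):
    """gold_ap[s][t], alt_ap[s][t]: per-topic AP of system s on topic t."""
    significant = inverted = pairs = 0
    for a, b in combinations(sorted(gold_ap), 2):
        pairs += 1
        # paired t-test on per-topic scores under the alternative qrels
        _, p = ttest_rel([alt_ap[a][t] for t in topics],
                         [alt_ap[b][t] for t in topics])
        if p < alpha:
            significant += 1
            sign_alt = sum(alt_ap[a][t] - alt_ap[b][t] for t in topics) > 0
            sign_gold = sum(gold_ap[a][t] - gold_ap[b][t] for t in topics) > 0
            if sign_alt != sign_gold:
                inverted += 1                 # significant inversion vs. gold
    power = significant / pairs
    # bias is taken here relative to the significant differences; the
    # definition in the text could also be read relative to all pairs
    bias = inverted / significant if significant else 0.0
    return power, bias

def kendall_tau_of_map(gold_ap, alt_ap, topics):
    """Conventional comparison: Kendall tau between system orderings by MAP."""
    systems = sorted(gold_ap)
    gold_map = [sum(gold_ap[s][t] for t in topics) / len(topics) for s in systems]
    alt_map = [sum(alt_ap[s][t] for t in topics) / len(topics) for s in systems]
    tau, _ = kendalltau(gold_map, alt_map)
    return tau
```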
3. RESULTS
Figure 1 shows the effect of varying k (pool depth) and n (number of topics) on adjudication effort and power (α = .05) for the standard TREC pooling method. The y-axis is power and the x-axis is the number of relevance assessments necessary to achieve that power for a given n. Each point represents a different value of k. Figure 2 shows that move-to-front pooling [4] yields insubstantially different results. Figure 3, on the other hand, shows that substituting P@k for MAP yields inferior power. Figure 4 shows the observed bias for each of the methods, as a function of judging effort. We observe that move-to-front exhibits substantially less bias than methods commonly taken to be more fair.

[Figure 1: Power vs effort (depth-k pool). Power against number of qrels, one curve per topic-sample size (249, 124, 75, 50, 25 topics).]
[Figure 2: Power vs effort (move-to-front). Same axes and topic-sample sizes as Figure 1.]
[Figure 3: Power vs effort (P@k). Same axes and topic-sample sizes as Figure 1.]
[Figure 4: Bias vs effort (249 topics). Bias against number of qrels for the depth-k, move-to-front, and P@k methods.]

Our results support the suggestion that an experimental design using more topics and fewer judgements is more efficient [5], but not the assertion that more regimented selection techniques yield lower bias. We advance power and bias analysis as a methodology to supplant rank correlation in assessing new pooling strategies and evaluation measures (e.g. [2, 3, 1]).
4. REFERENCES
[1] Aslam, J. A., Pavlu, V., and Yilmaz, E. A statistical method for system evaluation using incomplete judgments. In SIGIR '06 (2006), pp. 541-548.
[2] Buckley, C., and Voorhees, E. M. Retrieval evaluation with incomplete information. In SIGIR '04 (2004), pp. 25-32.
[3] Carterette, B., Allan, J., and Sitaraman, R. Minimal test collections for retrieval evaluation. In SIGIR '06 (2006), pp. 268-275.
[4] Cormack, G. V., Palmer, C. R., and Clarke, C. L. A. Efficient construction of large test collections. In SIGIR '98 (Melbourne, Australia, 1998).
[5] Sanderson, M., and Zobel, J. Information retrieval evaluation: effort, sensitivity, and reliability. In SIGIR '05 (Salvador, Brazil, 2005).
[6] Van Rijsbergen, C. J. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979.
[7] Voorhees, E. M. Overview of the TREC-2004 Robust Track. In 13th Text REtrieval Conference (Gaithersburg, MD, 2004).