Evaluating Evaluation Metrics based on the Bootstrap

Tetsuya Sakai
Knowledge Media Laboratory, Toshiba Corporate R&D Center
tetsuya.sakai@toshiba.co.jp

ABSTRACT
This paper describes how the Bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. First, we argue that Bootstrap Hypothesis Tests deserve more attention from the IR community, as they are based on fewer assumptions than traditional statistical significance tests. We then describe straightforward methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Unlike the heuristics-based "swap" method proposed by Voorhees and Buckley, our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error. We demonstrate the usefulness of our methods using test collections and runs from the NTCIR CLIR track for comparing seven IR metrics, including those that can handle graded relevance and those based on the Geometric Mean.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Experimentation

Keywords
Test Collection, Evaluation, Bootstrap, Graded Relevance

1. INTRODUCTION
A typical IR paper claims that System X is better than System Y in terms of an effectiveness metric M computed based on a test collection C. How reliable is this paper? More specifically:
(a) What happens if C is replaced with another set of data C'?
(b) How good is M?
Question (a) posed above is usually dealt with as follows:
1. Use two or more test collections of "respectable" size, and observe trends that are consistent across the different data;
2. Make sure that the performance difference between X and Y is "relatively large";
3. Conduct statistical significance tests to claim that the difference was not observed due to chance.
All of the above are arguably necessary conditions for making a good IR paper that involves comparative experiments, although surprisingly many IR papers do not satisfy them [13].
Unfortunately, there has been a controversy as to which statistical significance tests should be used for IR evaluation, as well as whether such tests should be used at all [4, 6, 13, 14]. It is known that a typical IR evaluation environment often violates the underlying assumptions of significance tests, but it is also known that some significance tests work well even when some of the assumptions are violated. Parametric tests rely on the normality assumption and generally have higher power than nonparametric ones. (That is, it is easier to detect significant differences with parametric tests.) But even nonparametric tests are not assumption-free: the Paired Wilcoxon Test depends on both the symmetry and the continuity assumptions [4]. An IR researcher who wants to be conservative (i.e., who wants to minimise the risk of jumping to wrong conclusions) might, for example, choose the two-tailed Sign Test, which generally has little power.
However, as Savoy [14] points out, there is a very attractive alternative called the Bootstrap [3]. Invented in 1979, the Bootstrap is the approach to statistics for the computer age, and has strong theoretical foundations. While classical statistics rely on mathematical derivations that often require several assumptions on the underlying distributions of data, the Bootstrap tries to achieve the same goal by directly estimating the distributions through resampling from the observed data. Bootstrap Hypothesis Tests are free from the normality and symmetry assumptions, and it is known that they often show power comparable to that of traditional parametric significance tests. Moreover, the Unpaired Bootstrap Hypothesis Test is directly applicable even to unconventional summary statistics that are not Arithmetic Means over a topic set (e.g., the "area" measure based on the worst N topics for each system [16] and Geometric Means [11, 16]). We therefore believe that Bootstrap Hypothesis Tests deserve more attention from the IR community.
This paper concerns Question (b) posed above: How "good" is IR metric M? More specifically, we use Bootstrap Hypothesis Tests to assess and compare the sensitivity of different IR metrics. This is related to the "swap" method proposed by Voorhees and Buckley [12, 13, 15] (and the "stability" method proposed by Buckley and Voorhees [2, 12]): the swap method derives the performance difference required for declaring that a system is better than another such that the chance of obtaining a contradictory result with another topic set (the swap rate) is below a given threshold. However, while the swap method is not directly related to statistical significance tests (see Section 3.5), our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error.
To demonstrate the usefulness of our Bootstrap-based methods, we use two test collections with submitted runs from NTCIR [9] and compare seven IR metrics, including those based on graded relevance and those based on the Geometric Mean [11, 16]. The swap, stability and our Bootstrap-based methods agree that, for these data sets, the most sensitive IR metrics are Q-measure [12], normalised Discounted Cumulative Gain at cut-off 1000 [5, 7] and Average Precision (AveP), while the least sensitive one is Precision at cut-off 1000. In the middle lie normalised Cumulative Gain at cut-off 1000 and Geometric Mean AveP / Q-measure.
Section 2 describes the IR metrics and the NTCIR data we use. Section 3 discusses our new methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Section 4 describes a method for examining the accuracy of rank correlation between two IR metrics based on the Bootstrap Estimate of Standard Error. Section 5 discusses previous work and Section 6 concludes this paper.
2. METRICS AND DATA

2.1 IR Effectiveness Metrics
The IR metrics we consider in this paper are Average Precision (AveP), Precision at cut-off 1000 (PDoc1000), Q-measure [12, 17], and normalised (Discounted) Cumulative Gain at cut-off 1000 (n(D)CG1000) [5, 7]. AveP represents a very sensitive IR metric based on binary relevance, while PDoc1000 represents a very insensitive one [12]. (PDoc1000 rewards a system with 10 relevant documents at Ranks 1-10 and one with 10 relevant documents at Ranks 991-1000 equally. Note also that it does not average well.)
Let R denote the number of relevant documents for a topic, and let L (<= 1000) denote the size of a ranked output. Let count(r) denote the number of relevant documents within the top r (<= L) documents, and let isrel(r) be 1 if the document at Rank r is relevant and 0 otherwise. Clearly, Precision at Rank r is given by P(r) = count(r)/r. Hence:

    AveP = \frac{1}{R} \sum_{1 \le r \le L} isrel(r) P(r) .    (1)

    PDoc_l = P(l) .    (2)

We can also use IR metrics based on graded relevance, since the NTCIR data contain S-, A- and B-relevant (highly relevant, relevant and partially relevant) documents. Let R(L) denote the number of L-relevant documents, so that \sum_L R(L) = R, and let gain(L) denote the gain value for retrieving an L-relevant document [5]. (Throughout this paper, we use gain(S) = 3, gain(A) = 2, gain(B) = 1.)
Let cg(r) = \sum_{1 \le i \le r} g(i) denote the cumulative gain [5] at Rank r of the system's output, where g(i) = gain(L) if the document at Rank i is L-relevant and g(i) = 0 otherwise. In particular, consider an ideal ranked output, such that isrel(r) = 1 for 1 <= r <= R and g(r) <= g(r-1) for r > 1, and let cgI(r) denote the ideal cumulative gain at Rank r. Similarly, by using dg(i) = g(i)/log_a(i) instead of g(i) for i > a, we can obtain the (ideal) discounted cumulative gain dcg(r) and dcgI(r) [5]. (We use a = 2 throughout this paper.) Then we have:

    nCG_l = cg(l)/cgI(l) .    (3)

    nDCG_l = dcg(l)/dcgI(l) .    (4)

    Q-measure = \frac{1}{R} \sum_{1 \le r \le L} isrel(r) BR(r) ,    (5)

where BR(r) is the blended ratio given by:

    BR(r) = (cg(r) + count(r)) / (cgI(r) + r) .    (6)

It is known that nCG_l has a problem: it cannot penalise late arrival of relevant documents after Rank R (as cgI(r) = cgI(R) holds for r >= R) [12]. nDCG_l solves this problem by discounting the gains, while Q-measure solves it by including r in the denominator of BR(r). nDCG_l is a stable and sensitive metric provided that l is large [12]. Microsoft reportedly uses a variant of this metric for improving their Web search engine [1]. Q-measure is also stable and sensitive [12], and it has been applied to XML retrieval [17] as well as Question Answering. It is more highly correlated with AveP than nDCG is: in a binary relevance environment, Q-measure = AveP holds for any ranked output if there is no relevant document below Rank R, and Q-measure > AveP holds otherwise.
By default, we use the Arithmetic Mean over a given topic set with any IR metric. However, this paper also considers the Geometric Mean versions of AveP and Q-measure, which we denote by G AveP and G Q-measure. Let x_i denote the value of a metric for the i-th topic (down to four significant figures). Then the actual method we use for obtaining the Geometric Mean (GM) is as follows [16]:

    GM = \exp\left( \frac{\sum_{1 \le i \le n} \log(x_i + 0.00001)}{n} \right) - 0.00001 .    (7)

The 0.00001's are necessary because limiting the ranked output size to L <= 1000 implies that x_i may be zero. Geometric Mean AveP was used at the TREC Robust Track in order to focus on the "hardest" topics. Sakai et al. [11] used Geometric Mean Q-measure for analysing their results at NTCIR-5. Geometric Means are arguably more practical than Arithmetic Means for building robust IR systems, which can produce a decent output whatever the query is.
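To make the definitions above concrete, here is a small Python sketch of our own (not the author's code; the function names and the list-based input format are assumptions) that computes AveP, nCG, nDCG and Q-measure for a single topic from a ranked list of relevance levels, plus the Geometric Mean of Eq. (7):

    import math

    GAIN = {'S': 3, 'A': 2, 'B': 1}  # gain(S)=3, gain(A)=2, gain(B)=1, as in this paper

    def per_topic_metrics(ranked, rel_levels, l=1000, a=2):
        """ranked: relevance levels of the ranked output, e.g. ['A', None, 'B', ...];
        rel_levels: relevance levels of all R relevant documents for the topic."""
        R = len(rel_levels)
        gains = [GAIN.get(x, 0) for x in ranked[:l]] + [0] * max(0, l - len(ranked))
        ideal = sorted((GAIN[x] for x in rel_levels), reverse=True)
        ideal = (ideal + [0] * l)[:l]

        def cumulate(gs, discount=False):
            total, out = 0.0, []
            for i, g in enumerate(gs, start=1):
                total += g / (math.log(i, a) if (discount and i > a) else 1.0)
                out.append(total)
            return out

        cg, cg_i = cumulate(gains), cumulate(ideal)
        dcg, dcg_i = cumulate(gains, True), cumulate(ideal, True)

        avep = q = 0.0
        count = 0
        for r, level in enumerate(ranked[:l], start=1):
            if level in GAIN:                 # isrel(r) = 1
                count += 1
                avep += count / r             # precision P(r), used in Eq. (1)
                q += (cg[r - 1] + count) / (cg_i[r - 1] + r)   # blended ratio BR(r), Eq. (6)
        return {'AveP': avep / R, 'Q-measure': q / R,
                'nCG@l': cg[l - 1] / cg_i[l - 1], 'nDCG@l': dcg[l - 1] / dcg_i[l - 1]}

    def geometric_mean(values):               # Eq. (7); values are per-topic scores
        n = len(values)
        return math.exp(sum(math.log(x + 0.00001) for x in values) / n) - 0.00001

This is only a sketch under the assumptions stated in Section 2.1 (gain values, a = 2, cut-off l = 1000); it is not meant as a reference implementation of the NTCIR evaluation tools.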
2.2 NTCIR CLIR Chinese and Japanese Data
Our experiments use two sets of data (test collections and submitted runs) from the NTCIR-3 CLIR track [9], provided by the National Institute of Informatics, Japan. Table 1 provides some statistics of the data. From each data set, only the top 30 runs as measured by relaxed AveP (i.e., AveP that treats S-, A- and B-relevant documents just as "relevant") were used in our experiments, since "near-zero" runs are unlikely to be useful for discussing the sensitivity of metrics. Thus, for each data set, we have a set of 30 x 29/2 = 435 system combinations, which we shall denote by C.

Table 1: Statistics of the NTCIR-3 CLIR Chinese and Japanese data (R, R(S), R(A) and R(B) are per-topic averages).

                |Q|    R      R(S)   R(A)   R(B)   runs (used)
    Chinese      42    78.2   21.0   24.9   32.3   45 (30)
    Japanese     42    60.4    7.9   31.5   21.0   33 (30)

3. BOOTSTRAP HYPOTHESIS TESTS AND IR METRIC SENSITIVITY
Sections 3.1 and 3.2 describe how Paired and Unpaired Bootstrap Hypothesis Tests can be applied to IR evaluation. Section 3.3 proposes methods for comparing the sensitivity of IR metrics and for estimating the performance difference required for achieving a given significance level. Section 3.4 presents our experimental results, and Section 3.5 compares them with those based on the stability and swap methods.

3.1 Paired Test: One Sample Problem
We first describe the Paired Bootstrap Hypothesis Test, which can be used for comparing two IR strategies run against a common test collection. This is similar to the one described earlier by Savoy [14], except that we use a Studentised test statistic to enhance accuracy. We encourage other IR researchers to try it, as it is based on fewer assumptions than standard significance tests such as the paired t-test and the Wilcoxon test, is easy to apply, and yet has high power.
Let Q be the set of topics provided in the test collection, and let |Q| = n. Let x = (x_1, ..., x_n) and y = (y_1, ..., y_n) denote the per-topic performance values of Systems X and Y as measured by some performance metric M. A standard method for comparing X and Y is to measure the difference between the sample means \bar{x} = \sum_i x_i / n and \bar{y} = \sum_i y_i / n, such as Mean AveP values. But what we really want to know is whether the population means for X and Y (\mu_X and \mu_Y), computed based on the population P of topics, are any different. Since we can regard x and y as paired data, we let z = (z_1, ..., z_n) where z_i = x_i - y_i, let \delta = \mu_X - \mu_Y, and set up the following hypotheses for a two-tailed test:

    H_0: \delta = 0   vs   H_1: \delta \ne 0 .

Thus the problem has been reduced to a one-sample problem [3].
As with standard significance tests, we assume that z is an independent and identically distributed sample drawn from an unknown distribution. In order to conduct a Hypothesis Test, we need a test statistic t and a null hypothesis distribution. Consider:

    t(z) = \bar{z} / (\bar{\sigma} / \sqrt{n}) ,

where \bar{\sigma} is the standard deviation of z, given by

    \bar{\sigma} = \left( \sum_i (z_i - \bar{z})^2 / (n - 1) \right)^{1/2} .

Moreover, let w = (w_1, ..., w_n) where w_i = z_i - \bar{z}, in order to create bootstrap samples w*b of per-topic performance differences that obey H_0. Figure 1 shows the algorithm for obtaining B bootstrap samples of topics (Q*b) and the corresponding values for w*b. (We let B = 1000 throughout this paper.)

    for b = 1 to B
        create topic set Q*b of size n = |Q| by randomly sampling with replacement from Q;
        for i = 1 to n
            q = i-th topic from Q*b;
            w*b_i = observed value in w for topic q;

Figure 1: Algorithm for creating Bootstrap samples Q*b and w*b = (w*b_1, ..., w*b_n) for the Paired Test.

For example, let us assume that we only have five topics Q = (001, 002, 003, 004, 005) and that w = (0.2, 0.0, 0.1, 0.4, 0.0). Suppose that, for trial b, sampling with replacement from Q yields Q*b = (001, 003, 001, 002, 005). Then, w*b = (0.2, 0.1, 0.2, 0.0, 0.0).
For each b, let \bar{w}*b and \bar{\sigma}*b denote the mean and the standard deviation of w*b. Figure 2 shows how to compute the Achieved Significance Level (ASL) using w*b. In essence, we examine how rare the observed difference would be under H_0.

    count = 0;
    for b = 1 to B
        t(w*b) = \bar{w}*b / (\bar{\sigma}*b / \sqrt{n});
        if( |t(w*b)| >= |t(z)| ) then count++;
    ASL = count / B;

Figure 2: Algorithm for estimating the Achieved Significance Level based on the Paired Test.

If ASL < \alpha, where typically \alpha = 0.01 (very strong evidence against H_0) or \alpha = 0.05 (reasonably strong evidence against H_0), then we reject H_0. That is, we have enough evidence to state that X and Y are probably different.
The above test relies on the fact that the difference between two Arithmetic Means equals the Arithmetic Mean of individual differences. But then how should we discuss statistical significance in terms of Geometric Mean AveP / Q-measure? There are at least two ways to handle the problem. One is to use the Unpaired Bootstrap Hypothesis Test described in Section 3.2 instead, which is directly applicable to virtually any metric, such as the "area" measure based on the worst N topics for each system [16]. The other is to stick to the Paired Test: instead of examining z_i = x_i - y_i as mentioned earlier, we could examine log(x_i + 0.00001) - log(y_i + 0.00001). This is because testing the significance in terms of the Arithmetic Mean inside Eq. (7) should be equivalent to testing that in terms of the entire Geometric Mean formula. For convenience, "Arithmetic Mean inside the Geometric Mean" will be denoted by the prefix "AG": we shall test AG AveP and AG Q-measure to discuss the sensitivity of G AveP and G Q-measure in Section 3.4.
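The following Python sketch is our own illustration of the Paired Test of Figures 1 and 2 (not the author's code); it assumes x and y are equal-length lists of per-topic scores for Systems X and Y on the same topics, and uses B = 1000 as in the paper:

    import math
    import random

    def paired_bootstrap_asl(x, y, B=1000, seed=0):
        rng = random.Random(seed)
        n = len(x)
        z = [xi - yi for xi, yi in zip(x, y)]                 # per-topic differences
        z_bar = sum(z) / n
        sigma = math.sqrt(sum((zi - z_bar) ** 2 for zi in z) / (n - 1))
        t_obs = z_bar / (sigma / math.sqrt(n))                # Studentised statistic t(z)
        w = [zi - z_bar for zi in z]                          # centred values, so that H0 holds

        count = 0
        for _ in range(B):
            wb = [rng.choice(w) for _ in range(n)]            # Figure 1: resample topics with replacement
            wb_bar = sum(wb) / n
            sb = math.sqrt(sum((wi - wb_bar) ** 2 for wi in wb) / (n - 1))
            if sb == 0.0:                                     # degenerate resample; treat as extreme (a conservative choice)
                count += 1
                continue
            if abs(wb_bar / (sb / math.sqrt(n))) >= abs(t_obs):   # Figure 2: compare |t(w*b)| with |t(z)|
                count += 1
        return count / B                                      # Achieved Significance Level (ASL)

H_0 would then be rejected at \alpha = 0.05 whenever the returned ASL is below 0.05.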
3.2 Unpaired Test: Two Sample Problem
As mentioned above, the Unpaired Bootstrap Hypothesis Test is more widely applicable than the Paired one, and it can handle Geometric Means directly. The downside is that the Unpaired Test has much less power than the Paired one, since it uses less information. For this reason, the Paired Test should be preferred wherever it is applicable.
The Unpaired Test treats x and y as unpaired data, naturally. (In this paper, x and y are data obtained by running two IR strategies against a common test collection, and therefore |x| = |y| = n. But more generally, the two sets of observations may come from different test collections, so |y| will be denoted by m rather than n hereafter.) As with standard significance tests, we assume that x and y are independently and identically distributed samples from unknown distributions F and G, respectively. The test statistic we consider in this case is

    \hat{d} = M(x) - M(y) ,    (8)

where, for example, M(x) is the value of metric M computed based on x. (Note that M does not have to be an Arithmetic Mean metric.) But what we really want to know is d, which represents the "true" absolute performance difference between Systems X and Y when the whole population of topics is taken into account. Thus the hypotheses we can set up for a two-tailed test are:

    H_0: d = 0   vs   H_1: d \ne 0 .

We now need a null distribution for the data under H_0. A natural choice would be to assume that F = G, i.e., that the observed values x_i and y_i actually come from an identical distribution. (In fact, F = G itself is commonly used as the null hypothesis.) First, let v denote a vector of size n + m obtained by concatenating the two per-topic performance vectors x and y. Figure 3 shows how to generate Bootstrap samples for the Unpaired Test.

    let v = (x_1, ..., x_n, y_1, ..., y_m);
    for b = 1 to B
        from the set of integers (1, ..., n + m), obtain a random sample of size n + m by sampling with replacement;
        for i = 1 to n
            j = i-th element of the sample of integers;
            x*b_i = j-th element of v;
        for i = n + 1 to n + m
            j = i-th element of the sample of integers;
            y*b_{i-n} = j-th element of v;

Figure 3: Algorithm for creating Bootstrap samples x*b = (x*b_1, ..., x*b_n) and y*b = (y*b_1, ..., y*b_m) for the Unpaired Test. (We let m = n throughout this paper.)

For simplicity, let us assume that x = (0.1, 0.3) and y = (0.2, 0.0), and therefore that v = (0.1, 0.3, 0.2, 0.0). Then we generate random integers that range between 1 and 4: suppose that we have obtained (1, 4, 1, 2) for b = 1. Then, by splitting this vector into (1, 4) and (1, 2), we obtain x*1 = (0.1, 0.0) and y*1 = (0.1, 0.3). In this way, Figure 3 shuffles the observed values without looking at whether they come from x or y.
Figure 4 shows how to compute the two-tailed ASL based on the Unpaired Test, in a way similar to Figure 2.

    count = 0;
    for b = 1 to B
        d*b = M(x*b) - M(y*b);
        if( |d*b| >= |\hat{d}| ) then count++;
    ASL = count / B;

Figure 4: Algorithm for estimating the Achieved Significance Level based on the Unpaired Test.
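Again as an illustration only (our sketch, with assumed function names), the Unpaired Test of Figures 3 and 4 can be written as follows; M is any summary statistic over a list of per-topic values, for example the geometric_mean function sketched in Section 2.1:

    import random

    def unpaired_bootstrap_asl(x, y, M, B=1000, seed=0):
        rng = random.Random(seed)
        n, m = len(x), len(y)
        d_obs = M(x) - M(y)                      # observed difference, Eq. (8)
        v = list(x) + list(y)                    # pooled observations (null hypothesis: F = G)
        count = 0
        for _ in range(B):
            sample = [rng.choice(v) for _ in range(n + m)]          # Figure 3: shuffle without looking at labels
            if abs(M(sample[:n]) - M(sample[n:])) >= abs(d_obs):    # Figure 4
                count += 1
        return count / B                         # two-tailed ASL

For instance, unpaired_bootstrap_asl(x, y, geometric_mean) would test a difference in Geometric Mean scores directly.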
3.3 Sensitivity Comparison Methods
We now propose straightforward methods for assessing and comparing the sensitivity of IR effectiveness metrics. The idea is simple: perform a Bootstrap Hypothesis Test for every system pair, and count how many of the pairs satisfy ASL < \alpha. Moreover, since each bootstrap replicate of the difference between two summary statistics (i.e., \bar{w}*b for the Paired Test and d*b for the Unpaired Test) is derived from n = |Q| topics, we can obtain a natural estimate of the performance difference required for guaranteeing ASL < \alpha, given n. This may be useful for informally guessing whether two systems are significantly different by just looking at the difference between two summary statistics.
Let \bar{w}*b_{X,Y} and d*b_{X,Y} explicitly denote the above bootstrap replicates for a particular system pair (X, Y). Figures 5 and 6 show our algorithms for estimating the performance difference required for achieving ASL < \alpha given n, based on the Paired Test and the Unpaired Test, respectively.

    DIFF = {};
    for each system pair (X, Y) in C
        sort |t(w*1_{X,Y})|, ..., |t(w*B_{X,Y})|;
        if |t(w*b_{X,Y})| is the (B\alpha)-th largest value
            then add |\bar{w}*b_{X,Y}| to DIFF;
    estimated diff = max{diff in DIFF} (rounded to two significant figures);

Figure 5: Algorithm for estimating the performance difference required for achieving a given significance level with the Paired Test.

    DIFF = {};
    for each system pair (X, Y) in C
        sort |d*1_{X,Y}|, ..., |d*B_{X,Y}| and add the (B\alpha)-th largest value to DIFF;
    estimated diff = max{diff in DIFF} (rounded to two significant figures);

Figure 6: Algorithm for estimating the performance difference required for achieving a given significance level with the Unpaired Test.

For example, if \alpha = 0.05 is chosen for Figure 6, the algorithm obtains the B\alpha = 1000 x 0.05 = 50-th largest value among |d*b_{X,Y}| for each (X, Y). Among the |C| = 435 values thus obtained, the algorithm takes the maximum value just to be conservative. Figure 5 is almost identical to Figure 6, although it looks slightly more complicated: since we used Studentisation with the Paired Test, the bootstrap replicate \bar{w}*b_{X,Y} is not equal to the test statistic t(w*b_{X,Y}) (see Figure 2), which we are using as the sort key.
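A minimal sketch of Figure 6 (our illustration; the input format is an assumption): given, for every system pair, the B bootstrap replicates d*b already produced by the Unpaired Test, it returns the estimated difference required for ASL < \alpha:

    def required_difference(replicates_per_pair, alpha=0.05):
        """replicates_per_pair: one list of B replicates d*b per system pair (X, Y)."""
        thresholds = []
        for d_star in replicates_per_pair:
            k = int(len(d_star) * alpha)                              # e.g. 1000 * 0.05 = 50
            ordered = sorted((abs(d) for d in d_star), reverse=True)
            thresholds.append(ordered[k - 1])                         # the (B*alpha)-th largest value
        return max(thresholds)                                        # conservative: the maximum over C

Rounding to two significant figures, as in Figure 6, is left to the caller in this sketch.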
3.4 Experimental Results
Figures 7 and 8 plot, for each IR metric, the Paired and Unpaired Bootstrap ASLs of system pairs from the NTCIR-3 Chinese data. The horizontal axis represents the 435 system pairs sorted by the ASL. Since the figures may be too small for the reader, here we summarise the results in words: according to Figure 7, Q-measure, AveP and nDCG1000 are the most sensitive metrics, while PDoc1000 is clearly the least sensitive; nCG1000, AG AveP and AG Q-measure (which represent G AveP and G Q-measure, respectively) lie in the middle. The trends are similar in Figure 8, but it can be observed that the Unpaired Test has considerably less power for any metric. (The "outlier" curve represents PDoc1000: it fails to satisfy ASL < \alpha = 0.05 for all system pairs.) By comparing the two figures, it can be observed that our AG AveP / AG Q-measure results based on the Paired Test (i.e., our "indirect" assessments of G AveP and G Q-measure) are consistent with our G AveP / G Q-measure results based on the Unpaired Test. We have similar graphs for the NTCIR-3 Japanese data, which we omit due to lack of space.

[Figure 7: ASL curves based on Paired Bootstrap Hypothesis Tests (NTCIR-3 CLIR Chinese data); x-axis: system pair sorted by ASL, y-axis: achieved significance level (ASL); one curve per metric (Q-measure, nDCG1000, nCG1000, AveP, PDoc1000, AG Q-measure, AG AveP).]

[Figure 8: ASL curves based on Unpaired Bootstrap Hypothesis Tests (NTCIR-3 CLIR Chinese data); x-axis: system pair sorted by ASL, y-axis: achieved significance level (ASL); one curve per metric (Q-measure, nDCG1000, nCG1000, AveP, PDoc1000, G Q-measure, G AveP).]

Table 2(a) cuts Figure 7 in half to show, for each IR metric, how many pairs of Chinese systems satisfied ASL < \alpha = 0.05 with the Paired Test. The metrics have been sorted by this measure of sensitivity (Column 2). Table 2(b) shows similar results for the Japanese data, based on a Japanese version of Figure 7 not shown in this paper.

Table 2: Sensitivity and the estimated differences based on Paired Bootstrap Tests (\alpha = 0.05; NTCIR-3 CLIR data).

    (a) Chinese        ASL < \alpha       estimated diff.
    Q-measure          242/435 = 56%      0.10
    AveP               240/435 = 55%      0.11
    nDCG1000           235/435 = 54%      0.13
    nCG1000            195/435 = 45%      0.15
    AG AveP            163/435 = 37%      1.42
    AG Q-measure       150/435 = 34%      1.76
    PDoc1000           116/435 = 27%      0.02

    (b) Japanese       ASL < \alpha       estimated diff.
    nDCG1000           316/435 = 73%      0.14
    Q-measure          305/435 = 70%      0.13
    AveP               296/435 = 68%      0.11
    nCG1000            268/435 = 62%      0.18
    AG AveP            258/435 = 59%      1.62
    AG Q-measure       251/435 = 58%      1.74
    PDoc1000           160/435 = 37%      0.04

It can be observed that Q-measure, nDCG1000 and AveP are the most sensitive while PDoc1000 is the least sensitive for both data sets. Moreover, the sensitivity values are substantially higher for the Japanese data: this is because the performance variance for the Japanese run set is greater than that for the Chinese run set. For example, the maximum/minimum observed AveP values for the Japanese run set are .4932 and .0617, while those for the Chinese run set are .4166 and .2224.
Column 3 of Table 2 shows, for each metric, the estimated difference required for satisfying ASL < \alpha = 0.05, given the topic set size n = 42. The algorithm we described in Figure 5 was used to obtain these estimates. For AG AveP and AG Q-measure, the estimates represent differences between two Arithmetic Means of logs rather than differences between two Geometric Means. Hence we tried to translate these estimates back into the latter by looking up the system pair (X, Y) and the bootstrap replicate number b that actually yielded the log-space differences, but our results did not always look accurate: for the purpose of estimating the required difference between Geometric Means, the Unpaired Test may be more reliable (see below).
Our table based on the Unpaired Test, which we omit due to lack of space, is generally similar to Table 2: the sensitivity values (Column 2) are naturally much lower, but the estimated differences (Column 3) are comparable to the Paired case. According to the Unpaired Test, the estimated differences for G AveP and G Q-measure based on the Chinese data are 0.16 and 0.17, respectively. However, as Figure 8 suggests, the Unpaired Test table appears to be less useful for discussing which metrics are more sensitive than others.
Note that Column 3 is not for comparing different IR metrics: some metrics tend to take small values while others tend to take large values, so such comparisons are not necessarily valid. Relative performance differences [13] could be used for comparison across metrics, but this is beyond the scope of our work. We prefer to regard, for example, M(x) = 0.0001, M(y) = 0.0002 and M(x) = 0.1000, M(y) = 0.2000 as quite different situations.

3.5 Comparison with Stability/Swap Methods
Readers familiar with the stability method [2] and the swap method [15] will note that our Bootstrap-based methods for comparing the sensitivity of metrics are related to them. The crucial difference is that our methods are based on the Bootstrap, which has time-honoured theoretical foundations. At the implementation level, the main difference is that the stability and swap methods use sampling without replacement, whereas the Bootstrap samples are obtained by sampling with replacement. Sanderson and Zobel [13] and Sakai [10] explored variations of the swap method, and the latter showed that with- and without-replacement sampling yield similar results for the purpose of ranking metrics according to sensitivity (see Section 5).
In light of this, this section uses the with-replacement versions of the stability and swap methods, and shows that they yield results that are in agreement with our Bootstrap-based results. More specifically, we reuse our paired-test Bootstrap samples Q*b with these two methods.
The essence of the stability method is to compare systems X and Y in terms of metric M using B different topic sets and count how often X outperforms Y, how often Y outperforms X, and how often the two are regarded as equivalent. Our version works as follows: let M(X, Q*b) denote the value of metric M for System X computed based on Q*b. Given a fuzziness value f [2, 12], we count GT(X, Y), GT(Y, X) and EQ(X, Y) as shown in Figure 9, where GT(X, Y) + GT(Y, X) + EQ(X, Y) = B.

    for each system pair (X, Y) in C
        for b = 1 to B
            margin = f * max(M(X, Q*b), M(Y, Q*b));
            if( |M(X, Q*b) - M(Y, Q*b)| < margin ) EQ(X, Y)++
            else if( M(X, Q*b) > M(Y, Q*b) ) GT(X, Y)++
            else GT(Y, X)++;

Figure 9: Algorithm for computing EQ(X, Y), GT(X, Y) and GT(Y, X).

Then, the minority rate (MR) and the proportion of ties (PT) for M are computed as:

    MR = \frac{\sum_{(X,Y) \in C} \min(GT(X,Y), GT(Y,X))}{B |C|} .    (9)

    PT = \frac{\sum_{(X,Y) \in C} EQ(X,Y)}{B |C|} .    (10)

MR estimates the chance of reaching a wrong conclusion about a system pair, while PT reflects lack of discriminative power. Thus, for a good performance metric, both of these values should be small. As a fixed fuzziness value implies different trade-offs for different metrics, we vary f (= 0.01, 0.02, ..., 0.20) for comparing the stability.
Figure 10 shows our MR-PT curves based on the NTCIR-3 CLIR Chinese data. (A similar graph for the Japanese data is omitted due to lack of space.) The results are generally in agreement with those based on Bootstrap Hypothesis Tests: nDCG1000, Q-measure and AveP are the most stable, while PDoc1000 is the least stable. In the middle lie nCG1000, G AveP and G Q-measure.

[Figure 10: MR-PT curves based on the stability method (NTCIR-3 CLIR Chinese data); x-axis: proportion of ties, y-axis: minority rate; one curve per metric (Q-measure, nDCG1000, nCG1000, AveP, PDoc1000, G Q-measure, G AveP).]

The essence of the swap method is to estimate the swap rate, which represents the probability of the event that two experiments are contradictory given an overall performance difference. Our version works as follows: first, in addition to the set of B Bootstrap samples {Q*b}, we create another set of B Bootstrap samples {Q'*b} by sampling with replacement from Q. Let D denote the performance difference between two systems as measured by M based on a topic set; we prepare 21 performance difference bins [12, 15], where the first bin represents performance differences such that 0 <= D < 0.01, the second bin represents those such that 0.01 <= D < 0.02, and so on, and the last bin represents those such that 0.20 <= D. Let BIN(D) denote the mapping from a difference D to the one of the 21 bins where it belongs. The algorithm shown in Figure 11 calculates a swap rate for each bin.

    for each system pair (X, Y) in C
        for b = 1 to B
            D_b = M(X, Q*b) - M(Y, Q*b);
            D'_b = M(X, Q'*b) - M(Y, Q'*b);
            counter(BIN(D_b))++;
            if( D_b * D'_b > 0 ) then continue
            else swap_counter(BIN(D_b))++;
    for each bin i
        swap_rate(i) = swap_counter(i)/counter(i);

Figure 11: Algorithm for computing the swap rates.

Note that D_b is not the same as our d*b from Figure 4: D_b is the performance difference between X and Y as measured using the Bootstrap topic sample Q*b, whereas d*b is the Bootstrap replicate of the observed performance difference under the assumption that the per-topic values of X and Y come from an identical distribution.
We can thus plot swap rates against performance difference bins. By looking for bins whose swap rates do not exceed (say) 5%, we can estimate how much absolute difference is required in order to conclude that System X is better than Y with 95% "confidence". But, as mentioned earlier, the swap method is not directly related to statistical significance tests.
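The following is a rough Python rendering of Figure 11 (ours, not the author's code); score(X, topics) is an assumed helper that returns the value of M for one system over one resampled topic set, and binning by the absolute difference is our reading of the 21 bins:

    from collections import defaultdict

    def swap_rates(system_pairs, samples, samples2, score, bin_width=0.01, n_bins=21):
        counter, swap_counter = defaultdict(int), defaultdict(int)

        def bin_of(d):                                     # BIN(D): 0-0.01, 0.01-0.02, ..., 0.20 and above
            return min(int(abs(d) / bin_width), n_bins - 1)

        for X, Y in system_pairs:
            for qb, qb2 in zip(samples, samples2):         # the two bootstrap sample sets {Q*b} and {Q'*b}
                d1 = score(X, qb) - score(Y, qb)
                d2 = score(X, qb2) - score(Y, qb2)
                counter[bin_of(d1)] += 1
                if d1 * d2 <= 0:                           # the two trials contradict each other
                    swap_counter[bin_of(d1)] += 1
        return {i: swap_counter[i] / counter[i] for i in sorted(counter)}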
Table 3 summarises the results of our "swap" experiments using the NTCIR-3 CLIR Chinese and Japanese data. The "abs" column shows the absolute difference required for guaranteeing that the swap rate does not exceed 5%. The "rel" column translates these values into relative values using the maximum performance recorded among all trials (the "max" column). The "sensitivity" column, by which the IR metrics have been sorted, shows the percentage of comparisons (among the total of 435 x 1000 comparisons) that actually satisfied the difference threshold shown in the "abs" column. Again, the results are generally consistent with the Bootstrap ones: for both data sets, nDCG1000, Q-measure and AveP are the most sensitive metrics, while PDoc1000 is the least sensitive. In the middle lie nCG1000, G AveP and G Q-measure. The results generalise those by Voorhees [16], who compared AveP and G AveP. Note also that the estimated performance differences for guaranteeing a 5% swap rate or less are lower than those required for achieving ASL < \alpha = 0.05 with the Bootstrap Hypothesis Tests.

Table 3: Differences and sensitivity based on the swap method (swap rate <= 5%; NTCIR-3 CLIR Chinese and Japanese data).

    Chinese          abs    max     rel   sensitivity
    nDCG1000         0.07   .7414    9%   47%
    Q-measure        0.07   .5374   13%   43%
    AveP             0.08   .5295   15%   40%
    nCG1000          0.08   .9514    8%   35%
    G AveP           0.09   .4739   18%   33%
    G Q-measure      0.10   .4967   20%   33%
    PDoc1000         0.01   .0983   10%   20%

    Japanese         abs    max     rel   sensitivity
    nDCG1000         0.07   .7994    9%   69%
    Q-measure        0.07   .6433   11%   67%
    AveP             0.07   .6449   11%   66%
    nCG1000          0.10   .9913   10%   56%
    G AveP           0.10   .5699   18%   54%
    G Q-measure      0.12   .5981   20%   53%
    PDoc1000         0.01   .0982   10%   41%

4. QUANTIFYING THE ACCURACY OF RANK CORRELATION BETWEEN TWO IR METRICS
The previous section discussed how Bootstrap Hypothesis Tests can be used to assess the sensitivity of individual IR metrics. We now show a different application of the Bootstrap to the evaluation of IR metrics.
Kendall's rank correlation is often used for comparing the system rankings according to two different IR metrics [7, 12, 17], but the statistical significance of rank correlation is rarely discussed in the IR community. In fact, there is a traditional significance test available for Kendall's rank correlation: let the number of systems be n_s, and let \tau be the value of Kendall's rank correlation based on two system rankings. Then, it is known that

    Z_0 = \frac{|\tau|}{\left( (4 n_s + 10) / (9 n_s (n_s - 1)) \right)^{1/2}}    (11)

obeys a normal distribution. Thus, a normal test can easily be applied. (In the case of Spearman's rank correlation \rho_s, the test statistic would be t_0 = |\rho_s| (n_s - 2)^{1/2} / (1 - \rho_s^2)^{1/2}, which obeys a t-distribution with n_s - 2 degrees of freedom.) Note that the test statistic Z_0 is proportional to |\tau| given n_s: in terms of a two-tailed test with n_s = 30 runs, the rank correlation is significant at \alpha = 0.01 if it is over 0.34.
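For concreteness, here is a small sketch of our own (not the author's code) of the normal test in Eq. (11); tau is Kendall's rank correlation between two system rankings and ns is the number of systems:

    import math

    def kendall_normal_test(tau, ns):
        z0 = abs(tau) / math.sqrt((4 * ns + 10) / (9 * ns * (ns - 1)))
        p_two_tailed = 2 * (1 - 0.5 * (1 + math.erf(z0 / math.sqrt(2))))  # standard normal approximation
        # e.g. with ns = 30, tau around 0.34 gives a two-tailed p of roughly 0.01
        return z0, p_two_tailed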
However, there is another way to critically assess the rank correlation values: we could use the Bootstrap Estimate of Standard Error, which reflects the accuracy of an observed statistic. Let \tau*b be the Bootstrap replicate of \tau computed using the Bootstrap sample Q*b instead of Q, and let m_\tau = \sum_{b=1}^{B} \tau*b / B. Then the Bootstrap estimate of standard error for Kendall's rank correlation is given by:

    \hat{se}_{boot} = \left( \sum_{b=1}^{B} (\tau*b - m_\tau)^2 / (B - 1) \right)^{1/2} .    (12)

We can thus quantify the accuracy of an observed rank correlation using \hat{se}_{boot}.

[Figure 12: Kendall's rank correlations with NTCIR-3 CLIR Chinese data.]

Figure 12 shows Kendall's rank correlations with the NTCIR-3 CLIR Chinese data for all pairs of IR metrics considered in this study. All of the correlation values exceed 0.6, and are therefore statistically highly significant. In particular, the most highly correlated pairs are (Q-measure, AveP) and (G Q-measure, G AveP); also, (Q-measure, nDCG1000), (nDCG1000, AveP) and (nCG1000, PDoc1000) are highly correlated.

[Figure 13: Bootstrap estimates of standard error for the rank correlations shown in Figure 12.]

Figure 13 shows the corresponding Bootstrap Estimates of Standard Error. These values are approximately ten times smaller than the observed rank correlation values, suggesting that the observed values are quite accurate. The results are generally consistent with the traditional significance tests based on Z_0: for example, the graph shows that the correlations between (Q-measure, AveP) and (G Q-measure, G AveP) are highly accurate, while those between (AveP, G Q-measure), (AveP, G AveP), (Q-measure, G Q-measure) and (Q-measure, G AveP) are less accurate. These results probably reflect the fact that, unlike Arithmetic Means, Geometric Means produce quite different system rankings when the topic set changes: such a change yields a "new" set of worst-performing topics that will be emphasised, and these new worst-performing topics differ across systems.
The Bootstrap Estimate of Standard Error is widely applicable and useful for assessing the accuracy of any computable statistic. For example, it is applicable to the observed performance difference \hat{d} between two IR systems (see Eq. (8)). In this case, one can compute

    \hat{se}_{boot} = \left( \sum_{b=1}^{B} (D_b - m_D)^2 / (B - 1) \right)^{1/2} ,    (13)

using D_b from Figure 11 and m_D = \sum_{b=1}^{B} D_b / B. If, for example, \hat{d} is not much larger than \hat{se}_{boot}, then the difference is probably not substantial. However, Bootstrap Hypothesis Tests address such questions more formally.
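A corresponding sketch of ours for Eqs. (12) and (13): given the B bootstrap replicates of any statistic (Kendall's \tau computed on each Q*b, or the difference D_b of Figure 11), it returns the Bootstrap Estimate of Standard Error:

    import math

    def bootstrap_standard_error(replicates):
        B = len(replicates)
        mean = sum(replicates) / B
        return math.sqrt(sum((r - mean) ** 2 for r in replicates) / (B - 1))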
5. RELATED WORK
Savoy [14] used the paired Bootstrap Hypothesis Test and Confidence Intervals, along with traditional significance tests, for comparing two IR strategies. Vu and Gallinari [17] compared, for an XML retrieval task, the Bootstrap Confidence Interval of the top run with those of the rest in order to compare the sensitivity of metrics such as Q-measure. In contrast, our methods perform a Bootstrap Hypothesis Test for every system pair and are less dependent on a single run. Furthermore, our methods can estimate the performance difference required for achieving a given significance level, and we examined a wider variety of metrics. Unlike our new methods, the swap method [8, 15] is not directly related to significance tests: Sanderson and Zobel [13] used significance tests for filtering out some system pairs before applying the swap method; Sakai [12] reported that the system pair ranking according to significance tests and that according to the swap method are not very highly correlated.
The original swap method used sampling without replacement; Sanderson and Zobel [13] also used sampling without replacement (although they called their method "selection with replacement" in their paper) but ensured that the two topic sets Q_i and Q'_i were independent; Sakai [10] showed that sampling with and without replacement yield similar results for comparing different IR metrics. While all of these studies used resampled topic sets that are half the size of Q, we used Bootstrap samples with the swap method, so that |Q*b| = |Q|. Hence extrapolation was not required.
As for rank correlation: previous work that used rank correlation (e.g., [7, 16, 17]) often did not question statistical significance, possibly because the correlation values are generally very high. This paper showed that the Bootstrap Estimate of Standard Error provides a simple way of examining the accuracy of rank correlation.

6. CONCLUSIONS
This paper showed that Bootstrap Hypothesis Tests are very useful not only for comparing IR strategies, but also for comparing the sensitivity of IR metrics. The paired Bootstrap Test is directly applicable to any Arithmetic Mean metric. The unpaired Bootstrap Test has less power, but is directly applicable even to unconventional metrics. This paper also showed that the Bootstrap Estimate of Standard Error provides a simple way of quantifying the accuracy of rank correlation between two IR metrics. Through experiments with the NTCIR-3 data, we showed that Q-measure, nDCG1000 and AveP are all very sensitive metrics, while PDoc1000 is very insensitive. nCG1000 and Geometric Mean AveP / Q-measure lie in the middle. (But, as mentioned in Section 2.1, nCG_l has a defect.) Moreover, these Bootstrap-based results are in agreement with those based on the heuristics-based stability and swap methods.
Finally, it should be noted that the Bootstrap is not assumption-free: the most basic assumption that it relies on is that the original topics of the test collection are independent and identically distributed samples from the population P. Moreover, the Bootstrap is known to fail when the empirical distribution based on the observed data is a poor approximation of the true distribution. Clarifying the limitations of our approach will be one of the subjects of our future work.

Acknowledgements
I thank Steve Robertson for his insightful comments and advice on an early version of this paper, Ian Soboroff for guiding me to the Bootstrap Book [3], and the anonymous reviewers for their suggestions.

7. REFERENCES
[1] Asakawa, S. and Selberg, E.: The New MSN Search Engine Developed by Microsoft (in Japanese), Information Processing Society of Japan Magazine, Vol. 46, No. 9, pp. 1008-1015, 2005.
[2] Buckley, C. and Voorhees, E. M.: Evaluating Evaluation Measure Stability, ACM SIGIR 2000 Proceedings, pp. 33-40, 2000.
[3] Efron, B. and Tibshirani, R.: An Introduction to the Bootstrap, Chapman & Hall/CRC, 1993.
[4] Hull, D.: Using Statistical Testing in the Evaluation of Retrieval Experiments, ACM SIGIR '93 Proceedings, pp. 329-338, 1993.
[5] Järvelin, K. and Kekäläinen, J.: Cumulated Gain-Based Evaluation of IR Techniques, ACM Transactions on Information Systems, Vol. 20, No. 4, pp. 422-446, 2002.
[6] Johnson, D. H.: The Insignificance of Statistical Significance Testing, Journal of Wildlife Management, Vol. 63, No. 3, pp. 763-772, 1999.
[7] Kekäläinen, J.: Binary and Graded Relevance in IR Evaluations - Comparison of the Effects on Ranking of IR Systems, Information Processing and Management, Vol. 41, pp. 1019-1033, 2005.
[8] Lin, W.-H. and Hauptmann, A.: Revisiting the Effect of Topic Set Size on Retrieval Error, ACM SIGIR 2005 Proceedings, pp. 637-638, 2005.
[9] NTCIR: http://research.nii.ac.jp/ntcir/
[10] Sakai, T.: The Effect of Topic Sampling on Sensitivity Comparisons of Information Retrieval Metrics, NTCIR-5 Proceedings, pp. 505-512, 2005.
[11] Sakai, T. et al.: Toshiba BRIDJE at NTCIR-5 CLIR: Evaluation using Geometric Means, NTCIR-5 Proceedings, pp. 56-63, 2005.
[12] Sakai, T.: On the Reliability of Information Retrieval Metrics based on Graded Relevance, Information Processing and Management, to appear, 2006.
[13] Sanderson, M. and Zobel, J.: Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability, ACM SIGIR 2005 Proceedings, pp. 162-169, 2005.
[14] Savoy, J.: Statistical Inference in Retrieval Effectiveness Evaluation, Information Processing and Management, Vol. 33, No. 4, pp. 495-512, 1997.
[15] Voorhees, E. M. and Buckley, C.: The Effect of Topic Set Size on Retrieval Experiment Error, ACM SIGIR 2002 Proceedings, pp. 316-323, 2002.
[16] Voorhees, E. M.: Overview of the TREC 2004 Robust Retrieval Track, TREC 2004 Proceedings, 2005.
[17] Vu, H.-T. and Gallinari, P.: On Effectiveness Measures and Relevance Functions in Ranking INEX Systems, AIRS 2005 Proceedings, LNCS 3689, pp. 312-327, 2005.