Alternatives to Bpref

Tetsuya Sakai
NewsWatch, Inc. (current affiliation) / Toshiba Corporate R&D Center
sakai@newswatch.co.jp

ABSTRACT
Recently, a number of TREC tracks have adopted a retrieval effectiveness metric called bpref, which has been designed for evaluation environments with incomplete relevance data. A graded-relevance version of this metric called rpref has also been proposed. However, we show that the application of Q-measure, normalised Discounted Cumulative Gain (nDCG) or Average Precision (AveP) to condensed lists, obtained by filtering out all unjudged documents from the original ranked lists, is actually a better solution to the incompleteness problem than bpref. Furthermore, we show that the use of graded relevance boosts the robustness of IR evaluation to incompleteness and therefore that Q-measure and nDCG based on condensed lists are the best choices. To this end, we use four graded-relevance test collections from NTCIR to compare ten different IR metrics in terms of system ranking stability and pairwise discriminative power.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Experimentation

Keywords
Test Collection, Evaluation Metrics, Graded Relevance

1. INTRODUCTION
Information Retrieval (IR) evaluation using incomplete relevance data is beginning to receive attention. Large-scale test collections constructed through pooling [4, 5, 16, 17], such as the TREC, CLEF and NTCIR collections, are all incomplete to some degree, in that only a small sample of the document collection has been judged for relevance for each topic. While the collection sizes tend to grow monotonically in order to mimic real-world data such as the Web, the available manpower for relevance assessments often remains more or less constant, and therefore IR researchers are expected to live with the incompleteness issue as long as they adhere to the Cranfield paradigm [3, 13].

At SIGIR '04, Buckley and Voorhees [3] proposed an IR evaluation metric called bpref (binary preference), which is highly correlated with AveP (Average Precision) when full relevance assessments are available and yet is more robust when the relevance assessments are reduced. Recent TREC tracks have started using this metric along with AveP. Bpref penalises a system if it ranks a judged nonrelevant document above a judged relevant one, and is independent of how the unjudged documents are retrieved. At the SIGIR '06 poster session, De Beer and Moens [6] proposed rpref, which is a graded-relevance extension of bpref. More recently at CIKM '06, Yilmaz and Aslam [16] proposed three new methods that may replace bpref, which we shall discuss in Section 2.

This paper shows that the application of Q-measure [9, 10], normalised Discounted Cumulative Gain (nDCG) [7] or AveP to condensed lists, obtained by filtering out all unjudged documents from the original ranked lists, is actually a better solution to the incompleteness problem than bpref. Furthermore, we show that the use of graded relevance boosts the robustness of IR evaluation to incompleteness and therefore that Q-measure and nDCG based on condensed lists are the best choices. To this end, we use four graded-relevance test collections from NTCIR to compare ten different IR metrics in terms of system ranking stability and pairwise discriminative power [9].

Section 2 discusses some work that is related to this study.
Section 3 provides the original definitions of bpref and rpref. Section 4 redefines these metrics based on a condensed list of documents, which leads to some simple alternatives to bpref. Section 5 describes our NTCIR data. Section 6 compares the metrics in terms of the entire system ranking using Kendall's rank correlation [3, 10, 16], and Section 7 compares them in terms of discriminative power using Sakai's bootstrap sensitivity method [9]. Finally, Section 8 concludes this paper.

2. RELATED WORK
Several researchers have tackled the problem of reducing the human effort required for relevance assessments: SIGIR '98 saw new methods for creating judgment pools efficiently [5, 17]; at SIGIR '01, Soboroff [12] proposed a method for ranking systems without any relevance assessments, but the method tends to rank them by "popularity" rather than performance [1]. More recently at SIGIR '06, Carterette, Allan and Sitaraman [4] analyzed the distribution of AveP over all possible assignments of relevance to all unjudged documents and proposed a method to construct a test collection with minimal relevance assessments; coincidentally, Aslam, Pavlu and Yilmaz [2] proposed a method for obtaining unbiased estimates of standard metrics such as AveP based on random sampling, while Soboroff [13] used bpref to examine how "decay" of Web documents affects system evaluation.

Among existing studies, the CIKM '06 paper by Yilmaz and Aslam [16] is probably the most relevant to the present work, since they also proposed their "alternatives to bpref", namely, Induced, Subcollection and Inferred versions of AveP. Induced AveP is exactly what we call "AveP′", which we derived independently as shown in Section 4. The other two metrics are more complex and less widely applicable in that they require knowledge of all pooled documents, including pooled but unjudged ones. (Subcollection AveP requires even more knowledge [16].) While the aim of Yilmaz and Aslam was to estimate the "actual" AveP values accurately, the present study considers a wider range of widely applicable IR metrics, including those based on graded relevance, for the purpose of ranking systems reliably in an incomplete relevance environment. Our experiments include both AveP′ (i.e., Induced AveP) and the actual AveP. Another contribution of this paper is that we evaluate the metrics in terms of discriminative power [9] given incomplete relevance data.

3. BPREF AND RPREF

3.1 Bpref
Let J denote the set of judged documents for a topic, and let R and N denote the number of judged relevant and judged nonrelevant documents, respectively. Bpref [3] is defined as:

\[ \textit{bpref} = \frac{1}{R} \sum_{D_k \in J,\ \mathrm{relevant,\ retrieved}} \Bigl( 1 - \frac{\min(R,\ n_k)}{\min(R, N)} \Bigr) \tag{1} \]

where n_k is the number of judged nonrelevant documents ranked above D_k in the system output.

3.2 Rpref
De Beer and Moens [6] proposed rpref as a graded-relevance extension of bpref, using the set J of judged documents that we used earlier. For each D_k ∈ J, let λ_k ∈ [0, 1] denote its judged relevance with respect to the topic in question, where 0 represents judged nonrelevance and 1 represents the highest possible relevance. Then, rpref N is defined as:

\[ \textit{rpref\_N} = \frac{1}{\tilde{R}} \sum_{D_k \in J,\ \lambda_k > 0} \lambda_k \Bigl( 1 - \frac{penalty_k}{\tilde{N}} \Bigr) \tag{2} \]

where

\[ \tilde{R} = \sum_{D_k \in J} \lambda_k \tag{3} \]

\[ \tilde{N} = \sum_{D_k \in J} (1 - \lambda_k) \tag{4} \]

and

\[ penalty_k = \sum_{D_l \in J,\ r_l < r_k,\ \lambda_l < \lambda_k} \frac{\lambda_k - \lambda_l}{\lambda_k} \tag{5} \]

where r_k denotes the rank of D_k in the system output. Note also that

\[ \tilde{R} + \tilde{N} = R + N \tag{6} \]

since λ_k ∈ [0, 1] for every judged document. De Beer and Moens [6] further proposed rpref relative, which replaces the absolute normaliser Ñ with a normaliser that depends on the rank of each relevant document:

\[ \textit{rpref\_relative} = \frac{1}{\tilde{R}} \sum_{D_k \in J,\ \lambda_k > 0,\ r_k \ne 1} \lambda_k \Bigl( 1 - \frac{penalty_k}{N_k} \Bigr) \tag{7} \]

That is, N_k is the number of judged documents ranked above D_k. Note that N_k = 0 if r_k = 1, hence the "r_k ≠ 1" condition is necessary.
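To make the preference-based definitions concrete, here is a minimal Python sketch of bpref (Eq. 1) and rpref N (Eqs. 2-5). It assumes a run is simply a ranked list of document IDs and that judgments are given as a dictionary from judged document IDs to graded relevance values λ in [0, 1]; the function and variable names are ours, not those of trec_eval or any other released tool.

```python
# Sketch of bpref (Eq. 1) and rpref_N (Eqs. 2-5) computed directly from the
# preference-based definitions. Only judged documents matter; unjudged
# documents in the run are simply skipped.

def bpref(run, judged):
    """run: ranked list of doc IDs; judged: doc ID -> lambda in [0, 1]."""
    R = sum(1 for lam in judged.values() if lam > 0)
    N = len(judged) - R
    if R == 0 or N == 0:
        return 0.0
    score, nonrel_above = 0.0, 0
    for d in run:
        lam = judged.get(d)
        if lam is None:
            continue                      # unjudged: ignored by bpref
        if lam == 0:
            nonrel_above += 1             # judged nonrelevant
        else:
            score += 1.0 - min(R, nonrel_above) / min(R, N)
    return score / R                      # unretrieved relevant docs add nothing

def rpref_n(run, judged):
    """Graded-relevance version (Eq. 2), with Eq. 5 as the penalty."""
    R_tilde = sum(judged.values())                       # Eq. 3
    N_tilde = sum(1.0 - lam for lam in judged.values())  # Eq. 4
    if R_tilde == 0 or N_tilde == 0:
        return 0.0
    score, above = 0.0, []                # lambdas of judged docs ranked above
    for d in run:
        lam = judged.get(d)
        if lam is None:
            continue
        if lam > 0:
            penalty = sum((lam - l) / lam for l in above if l < lam)  # Eq. 5
            score += lam * (1.0 - penalty / N_tilde)                  # Eq. 2
        above.append(lam)
    return score / R_tilde

if __name__ == "__main__":
    run = ["d3", "d7", "d1", "d9", "d2"]     # d7, d9 are unjudged (hypothetical IDs)
    graded = {"d1": 1.0, "d2": 1 / 3, "d3": 0.0, "d4": 2 / 3, "d5": 0.0}
    binary = {d: (1.0 if lam > 0 else 0.0) for d, lam in graded.items()}
    print(bpref(run, binary), rpref_n(run, graded))
```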
4. ALTERNATIVE DEFINITIONS AND ALTERNATIVE METRICS
Section 3 showed that bpref, rpref N and rpref relative rely on judged documents only, and that they do not explicitly depend on the absolute document ranks. This section begins by redefining these preference-based metrics using a condensed list, obtained by removing all unjudged documents from the original system output. This suggests that "standard" metrics based explicitly on ranks may in fact be applied successfully to IR evaluation given incomplete relevance data.

4.1 Alternative Definitions of Bpref and Rpref
Suppose we need to evaluate a ranked list that contains (say) 1000 documents. The document at Rank r may or may not be a judged one. Since bpref relies on judged documents only, we can safely remove all unjudged documents from the list before computing bpref. As a result, we obtain a condensed list that contains judged documents only. Let r′ (≤ r) denote the new rank of a document in the condensed list. Then, based on these ranks, bpref can be rewritten as:

\[ \textit{bpref} = \frac{1}{R} \sum_{r'} isrel(r') \Bigl( 1 - \frac{\min(R,\ r' - count(r'))}{\min(R, N)} \Bigr) \tag{9} \]

where isrel(r′) is one if the document at Rank r′ is relevant and zero otherwise, and count(r′) is the number of relevant documents in the top r′ of the condensed list. Hence r′ - count(r′) is the number of judged nonrelevant documents ranked above the document at Rank r′. It is easy to see that the two definitions (Eqs. 1 and 9) are equivalent. (The "buggy" bpref, or old bpref, uses min(R, Nret) instead of min(R, N) in Eq. 9, where Nret is the number of retrieved judged nonrelevant documents. See the file bpref_bug in trec_eval version 8.1.)

Eq. 9 implies that bpref is in fact two different metrics. For any topic such that R ≤ N, bpref reduces to:

\[ \textit{bpref\_R} = \frac{1}{R} \sum_{r'} isrel(r') \Bigl( 1 - \frac{\min(R,\ r' - count(r'))}{R} \Bigr) \tag{10} \]

Whereas, for any topic such that R ≥ N, bpref reduces to:

\[ \textit{bpref\_N} = \frac{1}{R} \sum_{r'} isrel(r') \Bigl( 1 - \frac{r' - count(r')}{N} \Bigr) \tag{11} \]

The latter is also known as bpref_allnonrel. Note that R ≥ N ≥ r′ - count(r′) holds in the latter case and therefore the minimum operator is not necessary.
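As a small illustration of the condensed-list reading of Eq. 9, the sketch below (binary judgments, illustrative names) removes unjudged documents and then applies Eq. 9 directly; on any run it should give the same value as bpref computed from the original list, since the judged documents keep their relative order.

```python
# Sketch of bpref via the condensed list (Eq. 9). `judged` maps judged doc IDs
# to True (relevant) / False (judged nonrelevant); unjudged docs are absent.

def condense(run, judged):
    """Remove all unjudged documents, preserving the original order."""
    return [d for d in run if d in judged]

def bpref_condensed(run, judged):
    R = sum(1 for rel in judged.values() if rel)
    N = len(judged) - R
    if R == 0 or N == 0:
        return 0.0
    score, count = 0.0, 0
    for r_prime, d in enumerate(condense(run, judged), start=1):
        if judged[d]:
            count += 1                       # count(r') in Eq. 9
            nonrel_above = r_prime - count   # judged nonrelevant docs above Rank r'
            score += 1.0 - min(R, nonrel_above) / min(R, N)
    return score / R

if __name__ == "__main__":
    run = ["d3", "d7", "d1", "d9", "d2"]     # d7, d9 unjudged (hypothetical IDs)
    judged = {"d1": True, "d2": True, "d3": False, "d4": True, "d5": False}
    print(bpref_condensed(run, judged))      # equals Eq. 1 on the original list
```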
Similarly, let us now rewrite rpref N (Eq. 2), which is in fact a graded-relevance version of bpref N shown above. We begin by adopting the concept of cumulative gain as proposed by Järvelin and Kekäläinen [7]. Let L denote a relevance level, and let gain(L) denote the gain value for retrieving an L-relevant document. Without loss of generality, this paper assumes that we have S-relevant (highly relevant), A-relevant (relevant) and B-relevant (partially relevant) documents, as in NTCIR [8], in addition to judged nonrelevant documents. Also, let R(L) denote the number of L-relevant documents. Let cg(r) = Σ_{1≤i≤r} g(i) denote the cumulative gain at Rank r of the system output, where g(i) = gain(L) if the document at Rank i is L-relevant and g(i) = 0 otherwise (i.e., if the document at Rank i is either judged nonrelevant or unjudged). It is easy to see that rpref's judged relevance values λ_k ∈ [0, 1] (see Section 3.2) can be obtained by dividing the raw gain values by gain(H), where H is the highest relevance level across all topics of the test collection, e.g., H = S.

Although bpref and rpref do not require the notion of ideal ranked output [7] for their definitions, we introduce it here in order to clarify how the preference-based metrics are related to cumulative-gain-based ones: an ideal ranked output for a given topic with R = Σ_L R(L) relevant documents is one such that g(r) > 0 for 1 ≤ r ≤ R and g(r) ≤ g(r - 1) for r > 1. Thus, for NTCIR, listing up all S-relevant documents, followed by all A-relevant documents, followed by all B-relevant documents, produces an ideal ranked output. Note that an ideal ranked output is unaffected by the removal of unjudged documents. Let gI(r) and cgI(r) denote the gain and the cumulative gain of an ideal ranked output. Then:

\[ cgI(R) = \sum_{1 \le r \le R} gI(r) = \sum_{L} R(L)\, gain(L) \tag{12} \]

That is, the ideal cumulative gain at Rank R is the sum of all available gain values for the topic. Hence, from Eqs. 3 and 12,

\[ \tilde{R} = cgI(R)/gain(H) \tag{13} \]

Moreover, from Eqs. 6 and 13,

\[ \tilde{N} = R + N - cgI(R)/gain(H) \tag{14} \]

We can also rewrite rpref's penalty (Eq. 5) based on the notion of gain for a condensed list. The penalty based on a relevant document at Rank r′ can be expressed as:

\[ penalty(r') = \sum_{i' < r',\ g(i') < g(r')} \frac{g(r') - g(i')}{g(r')} \tag{15} \]

so that rpref N becomes:

\[ \textit{rpref\_N} = \frac{1}{cgI(R)} \sum_{r':\ g(r') > 0} g(r') \Bigl( 1 - \frac{penalty(r')}{R + N - cgI(R)/gain(H)} \Bigr) \tag{16} \]

Similarly, since the number of judged documents ranked above a document at Rank r′ in a condensed list is exactly r′ - 1, rpref relative (Eq. 7) can be rewritten as:

\[ \textit{rpref\_relative} = \frac{1}{cgI(R)} \sum_{r' \ne 1,\ g(r') > 0} g(r') \Bigl( 1 - \frac{penalty(r')}{r' - 1} \Bigr) \tag{17} \]

De Beer and Moens [6] also mention its binary-relevance variant, which we call bpref relative and formalise as:

\[ \textit{bpref\_relative} = \frac{1}{R} \sum_{r' \ne 1} isrel(r') \Bigl( 1 - \frac{r' - count(r')}{r' - 1} \Bigr) \tag{18} \]

Note that, in a binary relevance environment, the penalty of Eq. 15 reduces to the number of judged nonrelevant documents ranked above Rank r′, namely r′ - count(r′). These relative metrics have a flaw, however: because of the r′ ≠ 1 condition, a relevant document at Rank 1 is never examined, and consequently they cannot equal one even for an ideal ranked output. We therefore divide the penalty by r′ instead of r′ - 1:

\[ \textit{rpref\_relative2} = \frac{1}{cgI(R)} \sum_{r':\ g(r') > 0} g(r') \Bigl( 1 - \frac{penalty(r')}{r'} \Bigr) \tag{19} \]

Note that now we do examine a relevant document at Rank 1. It is also easy to show that rpref relative2 equals one for an ideal ranked output for any topic. Now, let us modify bpref relative similarly:

\[ \textit{bpref\_relative2} = \frac{1}{R} \sum_{r'} isrel(r') \Bigl( 1 - \frac{r' - count(r')}{r'} \Bigr) = \frac{1}{R} \sum_{r'} isrel(r') \frac{count(r')}{r'} \tag{20} \]

But this is exactly the well-known noninterpolated Average Precision (AveP) based on a condensed list in place of a raw list containing both judged and unjudged documents.

We can summarise our discussions so far as follows:

• bpref is actually two metrics: bpref R (if R ≤ N) and bpref N (if R ≥ N).
• bpref R, bpref N and rpref N use absolute normalisation [6], so the misplacement penalties based on highly ranked relevant documents are small;
• bpref relative and rpref relative use relative normalisation to emphasise misplacement penalties based on highly ranked relevant documents, but have flaws;
• bpref relative2 and rpref relative2 are free from all of these flaws, and more importantly, bpref relative2 is exactly AveP based on a condensed list. (We shall therefore refer to bpref relative2 as AveP′ hereafter; it is also known as Induced AveP [16].)

In short, the only essential difference between bpref and AveP′ is that while the former uses absolute normalisation, the latter uses relative normalisation, as noted also in [16].

4.2 Alternative Metrics
The above fact poses this question: for evaluating IR systems with incomplete relevance information, is it really necessary to introduce bpref or its variants? Why can't we just use existing metrics based explicitly on ranks instead, after filtering out all unjudged documents from the original ranked list? Thus, let us consider, in addition to AveP, Q-measure [9, 10] and nDCG [7], both of which explicitly rely on document ranks and can handle graded relevance.

Q-measure is defined as:

\[ \textit{Q-measure} = \frac{1}{R} \sum_{r} isrel(r)\, \frac{\beta\, cg(r) + count(r)}{\beta\, cgI(r) + r} \tag{21} \]

where β is a parameter for controlling the penalty on late arrival of relevant documents. Q-measure is very highly correlated with AveP (note that β = 0 reduces Q-measure to AveP) and its discriminative power (see Section 7) is known to be at least as high as that of AveP. Since Sakai [11] has shown Q-measure's robustness to the choice of β, we let β = 1.

Instead of the raw gains g(r), nDCG uses discounted gains dg(r) = g(r)/log_a(r) for r > a and dg(r) = g(r) for r ≤ a. Let dgI(r) denote a discounted gain for an ideal ranked list. Then, nDCG at document cut-off l is defined as:

\[ \mathit{nDCG}_l = \sum_{1 \le r \le l} dg(r) \Big/ \sum_{1 \le r \le l} dgI(r) \tag{22} \]

Throughout this paper, we let l = 1000, as it is known that small document cut-offs hurt nDCG's stability [10]. Moreover, we let a = 2, because it is known that using a large logarithm base makes nDCG counterintuitive and insensitive [11]. nDCG is more forgiving for low-recall topics than AveP and Q-measure, as it does not depend on R directly. Since Q-measure and nDCG are robust to the choice of gain values [10], we let gain(S) = 3, gain(A) = 2, gain(B) = 1 throughout this paper.

Just like AveP′ (i.e., bpref relative2), we can apply Q-measure and nDCG to condensed lists, i.e., after removal of all unjudged documents. We shall refer to these metrics as Q′ and nDCG′. The remainder of this paper studies how these metrics compare to the bpref and rpref variants.
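As an illustration of these condensed-list metrics, the following sketch computes AveP′, Q′ (with β = 1) and nDCG′ (log base a = 2, with the cut-off taken as the condensed-list length, assuming l is at least R) from a run and a dictionary of gain values. All names are illustrative, and the gain setting gain(S) = 3, gain(A) = 2, gain(B) = 1 is the one used in this paper.

```python
import math

# Sketch of AveP', Q' (beta = 1) and nDCG' computed on a condensed list.
# `judged` maps judged doc IDs to gain values: 3 (S), 2 (A), 1 (B), 0 (nonrelevant);
# unjudged documents are simply dropped. Names are illustrative only.

def discounted(gain, r, a=2):
    """dg(r) = g(r) for r <= a, g(r)/log_a(r) otherwise."""
    return gain if r <= a else gain / math.log(r, a)

def condensed_metrics(run, judged, beta=1.0, a=2):
    gains = [judged[d] for d in run if d in judged]      # condensed list
    ideal = sorted(judged.values(), reverse=True)        # ideal ranked output
    R = sum(1 for g in ideal if g > 0)
    if R == 0:
        return 0.0, 0.0, 0.0
    avep = q = cg = cg_ideal = dcg = 0.0
    count = 0
    for r, g in enumerate(gains, start=1):
        cg += g
        cg_ideal += ideal[r - 1] if r <= len(ideal) else 0.0
        dcg += discounted(g, r, a)
        if g > 0:                                        # relevant at Rank r'
            count += 1
            avep += count / r                            # precision at r'
            q += (beta * cg + count) / (beta * cg_ideal + r)
    dcg_ideal = sum(discounted(ideal[r - 1], r, a) for r in range(1, R + 1))
    ndcg = dcg / dcg_ideal if dcg_ideal > 0 else 0.0
    return avep / R, q / R, ndcg

if __name__ == "__main__":
    run = ["d3", "d7", "d1", "d9", "d2"]                 # d7, d9 unjudged (hypothetical)
    judged = {"d1": 3, "d2": 1, "d3": 0, "d4": 2, "d5": 0}
    print(condensed_metrics(run, judged))
```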
5. FULL AND REDUCED DATA
We use four data sets (i.e., test collections and submitted runs) from the NTCIR CLIR task [8]: Table 1 shows some statistics. We use the top 30 runs from each data set as measured by (actual) Mean AveP for our analyses. Under this condition, Kendall's rank correlation between two system rankings, representing two IR metrics, is statistically significant at α = 0.01 if it exceeds 0.34 (two-sided normal test) [9].

Table 1: The NTCIR Data: N > R holds for all topics in all four data sets.

Data set                      | topics | range of N   | range of R | avg. N | avg. R | avg. R(S) | avg. R(A) | avg. R(B) | runs used
NTCIR-3 Chinese ("NTCIR-3C")  | 42     | [911, 2564]  | [4, 264]   | 1432.6 | 78.2   | 21.0      | 24.9      | 32.3      | 30
NTCIR-3 Japanese ("NTCIR-3J") | 42     | [707, 2862]  | [7, 354]   | 1676.6 | 60.4   | 7.9       | 31.5      | 21.0      | 30
NTCIR-5 Chinese ("NTCIR-5C")  | 50     | [788, 2114]  | [6, 249]   | 1836.3 | 61.0   | 7.0       | 30.7      | 23.3      | 30
NTCIR-5 Japanese ("NTCIR-5J") | 47     | [930, 2280]  | [6, 345]   | 1768.9 | 89.1   | 3.2       | 41.8      | 44.2      | 30

The table also shows that N is always larger than R for the NTCIR collections: for example, for NTCIR-3C, the smallest N across all topics is 911, while the largest R across all topics is only 264. Thus, since R < N always holds, bpref is equivalent to bpref R for our data. Hence we shall refer to bpref as bpref R in our experiments.

To examine the effect of incompleteness on the entire system ranking and on discriminative power, we created reduced relevance data from the full relevance data, following the original Buckley/Voorhees methodology [3]: first, for each topic, we created a randomised list of judged relevant documents of size R, and a separate randomised list of judged nonrelevant documents of size N. Then, for each reduction rate j ∈ {90, 70, 50, 30, 10}, we created a reduced set of relevance data by taking the first Rj and Nj documents from the two lists, respectively, where Rj = max(1, truncate(R * j/100)) and Nj = max(10, truncate(N * j/100)). The constants 1 and 10 have been copied from [3], representing the minimum number of judged (non)relevant documents required for a topic. (In practice, the constant 10 was seldom used since N was generally very large.) This stratified sampling is essentially equivalent to random sampling from the entire set of judged documents [16].
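A minimal sketch of this reduction procedure for one topic is given below, assuming the full relevance data is available as two lists of judged relevant and judged nonrelevant document IDs; the function name and the use of Python's random module are ours.

```python
import random

# Sketch of the Buckley/Voorhees-style relevance data reduction used here:
# for a reduction rate j (in percent), keep the first Rj = max(1, floor(R*j/100))
# relevant and Nj = max(10, floor(N*j/100)) nonrelevant documents of a
# randomised ordering of the topic's judged documents.

def reduce_qrels(relevant_docs, nonrelevant_docs, j, seed=0):
    rng = random.Random(seed)
    rel = list(relevant_docs)
    nonrel = list(nonrelevant_docs)
    rng.shuffle(rel)
    rng.shuffle(nonrel)
    R, N = len(rel), len(nonrel)
    Rj = max(1, (R * j) // 100)
    Nj = max(10, (N * j) // 100)
    return rel[:Rj], nonrel[:Nj]

if __name__ == "__main__":
    rel = [f"R{i}" for i in range(80)]        # hypothetical 80 judged relevant docs
    nonrel = [f"N{i}" for i in range(1500)]   # hypothetical 1500 judged nonrelevant docs
    for j in (90, 70, 50, 30, 10):
        r, n = reduce_qrels(rel, nonrel, j)
        print(j, len(r), len(n))
```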
Figure 1: Reduction rate (x axis) vs. performance averaged over topics and runs (y axis).

Figure 1 shows the effect of relevance data reduction on the absolute overall performances (e.g., Mean AveP) averaged across all 30 runs for each data set. The horizontal axis represents the reduction rate j. It is clear that the "actual" AveP, Q-measure and nDCG values quickly diminish as the relevance data becomes more and more incomplete (as represented by the dotted lines), while the bpref R (i.e., bpref) curve is relatively flat. This much supports a finding in [3]. However, it is also clear that the other metrics (Q′, nDCG′, rpref relative2 and AveP′) do just as well as bpref in terms of absolute value stability. This further encouraged us to test whether applying a standard metric to condensed lists works at least as well as bpref.

6. EFFECT OF INCOMPLETENESS ON RANK CORRELATION
This section uses Kendall's rank correlation [3, 10, 16] to study the effect of incompleteness on the entire system ranking. Table 2 shows the correlation values between system rankings by two different metrics for all four data sets, given full relevance data. Among the ten metrics we consider, recall that AveP, Q-measure and nDCG are based on the original ranked lists, while AveP′, Q′ and nDCG′ are based on judged documents only, just like the bpref variants.
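For reference, Kendall's rank correlation between two system rankings can be computed as in the sketch below: a plain tau over the mean scores of the runs, without tie handling, using illustrative names; it is only meant to show how two metrics' rankings are compared.

```python
from itertools import combinations

# Sketch of Kendall's rank correlation between two metrics' system rankings.
# Each argument maps a run ID to its mean score under one metric; tau is
# (concordant - discordant) / (number of run pairs). Ties are not handled.

def kendall_tau(scores_a, scores_b):
    runs = sorted(scores_a)                    # common run IDs, fixed order
    concordant = discordant = 0
    for x, y in combinations(runs, 2):
        da = scores_a[x] - scores_a[y]
        db = scores_b[x] - scores_b[y]
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    pairs = len(runs) * (len(runs) - 1) / 2
    return (concordant - discordant) / pairs

if __name__ == "__main__":
    # Hypothetical mean scores for three runs under two metrics.
    avep = {"run1": 0.31, "run2": 0.28, "run3": 0.25}
    bpref = {"run1": 0.30, "run2": 0.26, "run3": 0.27}
    print(kendall_tau(avep, bpref))            # 1/3 here: one run pair is swapped
```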
Table 2: Kendall's rank correlation using full relevance data from NTCIR.

NTCIR-3C        | bpref N | rpref N | rpref relative2 | AveP′ | Q′   | nDCG′ | AveP | Q-measure | nDCG
bpref R         | .756    | .747    | .936            | .926  | .931 | .839  | .931 | .945      | .839
bpref N         | -       | .963    | .738            | .720  | .779 | .798  | .733 | .756      | .789
rpref N         | -       | -       | .729            | .710  | .770 | .816  | .724 | .747      | .807
rpref relative2 | -       | -       | -               | .945  | .959 | .867  | .959 | .972      | .867
AveP′           | -       | -       | -               | -     | .931 | .821  | .986 | .954      | .821
Q′              | -       | -       | -               | -     | -    | .871  | .945 | .968      | .862
nDCG′           | -       | -       | -               | -     | -    | -     | .834 | .857      | .991
AveP            | -       | -       | -               | -     | -    | -     | -    | .968      | .834
Q-measure       | -       | -       | -               | -     | -    | -     | -    | -         | .857

NTCIR-3J        | bpref N | rpref N | rpref relative2 | AveP′ | Q′   | nDCG′ | AveP | Q-measure | nDCG
bpref R         | .867    | .862    | .949            | .977  | .936 | .885  | .986 | .940      | .890
bpref N         | -       | .986    | .834            | .853  | .876 | .880  | .862 | .862      | .885
rpref N         | -       | -       | .839            | .857  | .880 | .894  | .867 | .867      | .899
rpref relative2 | -       | -       | -               | .963  | .949 | .890  | .954 | .963      | .894
AveP′           | -       | -       | -               | -     | .940 | .871  | .991 | .945      | .876
Q′              | -       | -       | -               | -     | -    | .922  | .949 | .986      | .926
nDCG′           | -       | -       | -               | -     | -    | -     | .880 | .908      | .995
AveP            | -       | -       | -               | -     | -    | -     | -    | .954      | .885
Q-measure       | -       | -       | -               | -     | -    | -     | -    | -         | .913

NTCIR-5C        | bpref N | rpref N | rpref relative2 | AveP′ | Q′   | nDCG′ | AveP | Q-measure | nDCG
bpref R         | .531    | .508    | .839            | .913  | .811 | .770  | .871 | .821      | .775
bpref N         | -       | .949    | .508            | .526  | .572 | .586  | .531 | .563      | .591
rpref N         | -       | -       | .494            | .513  | .559 | .572  | .517 | .549      | .577
rpref relative2 | -       | -       | -               | .899  | .899 | .867  | .876 | .908      | .871
AveP′           | -       | -       | -               | -     | .890 | .802  | .949 | .899      | .807
Q′              | -       | -       | -               | -     | -    | .857  | .903 | .972      | .862
nDCG′           | -       | -       | -               | -     | -    | -     | .798 | .857      | .995
AveP            | -       | -       | -               | -     | -    | -     | -    | .922      | .802
Q-measure       | -       | -       | -               | -     | -    | -     | -    | -         | .862

NTCIR-5J        | bpref N | rpref N | rpref relative2 | AveP′ | Q′   | nDCG′ | AveP | Q-measure | nDCG
bpref R         | .476    | .499    | .876            | .903  | .811 | .733  | .885 | .821      | .738
bpref N         | -       | .949    | .416            | .490  | .536 | .595  | .480 | .517      | .591
rpref N         | -       | -       | .448            | .503  | .559 | .628  | .494 | .540      | .623
rpref relative2 | -       | -       | -               | .899  | .853 | .766  | .890 | .871      | .770
AveP′           | -       | -       | -               | -     | .862 | .766  | .963 | .880      | .770
Q′              | -       | -       | -               | -     | -    | .821  | .862 | .972      | .825
nDCG′           | -       | -       | -               | -     | -    | -     | .756 | .811      | .995
AveP            | -       | -       | -               | -     | -    | -     | -    | .890      | .761
Q-measure       | -       | -       | -               | -     | -    | -     | -    | -         | .816

While all of the correlation values are statistically highly significant (see Section 5), we regard values higher than 0.8 as high correlations just for convenience. Let τ(M1, M2) denote a correlation value between two rankings by metrics M1 and M2. We can observe that bpref N and rpref N are not highly correlated with "standard" metrics such as AveP: for example, τ(bpref N, AveP) = .480 for NTCIR-5J. (They are not highly correlated with bpref R either.) This reflects the fact that bpref N and rpref N give small penalties for misplacements above highly ranked relevant documents: since N is generally very large for the NTCIR data, absolute normalisation virtually ignores the penalties for small values of r. As we shall see in Section 7, bpref N and rpref N are also poor in terms of discriminative power. Hence we shall omit these two metrics in our reduced relevance data experiments.

As also noted in [16], the correlation between bpref R (i.e., bpref) and the actual AveP is consistently lower than that between AveP′ and the actual AveP. For example, for NTCIR-5J, τ(bpref R, AveP) = .885, while τ(AveP′, AveP) = .963. That is, given the full relevance data, AveP′ is a better approximation of the actual AveP than bpref is. Similarly, the correlations τ(Q′, Q-measure) and τ(nDCG′, nDCG) are also very high.

Figure 2: Reduction rate (x axis) vs. Kendall's rank correlation (y axis).

Figure 2 shows the effect of relevance data reduction on the system ranking for each metric. For example, the AveP curve represents the values of τ(AveP, AveP(j)), where AveP(j) denotes AveP computed based on the j% relevance data (j ∈ {90, 70, 50, 30, 10}). It can be observed that only the dotted line of the actual AveP stands apart from the others. More specifically, our observations are as follows:

(i) The rankings by Q′, nDCG′, rpref relative2 and AveP′ are at least as robust to incompleteness as the ranking by bpref R.

(ii) The rankings by the actual Q-measure and nDCG are more robust to incompleteness than the ranking by the actual AveP; in fact, they do as well as bpref R. For example, for NTCIR-5J, τ(Q′, Q′(10)) = 0.66, τ(bpref R, bpref R(10)) = 0.42 and τ(Q-measure, Q-measure(10)) = 0.41, whereas τ(AveP, AveP(10)) = 0.21, which is not even statistically significant.

Observation (i) supports our hypothesis that applying a standard metric to condensed lists works at least as well as bpref. Observation (ii) suggests that the best graded-relevance metrics are more robust to incompleteness than the binary AveP. Our explanation for this is as follows: due to the distribution of R(X) shown in Table 1, randomised relevance data reduction is expected to remove more partially relevant documents than highly relevant documents. Whereas, reflecting the same original distribution, most runs actually contain many more partially relevant documents than highly relevant ones. Hence, highly relevant documents contained within the runs are relatively unaffected by the reduction, while partially relevant documents contained within the runs may become "nonrelevant" due to the reduction. Therefore, graded-relevance metrics, which depend more on highly relevant documents than on partially relevant ones, demonstrate a higher robustness to incompleteness.

7. EFFECT OF INCOMPLETENESS ON DISCRIMINATIVE POWER
This section compares the IR metrics in terms of discriminative power using the paired-test version of Sakai's bootstrap sensitivity method [9], which is known to agree very well with the more ad hoc Voorhees/Buckley swap method [15]. The inputs to this method are a test collection, a set of runs, an IR metric, and the significance level for the bootstrap hypothesis tests. Using resampled topic sets, the method conducts a bootstrap hypothesis test for every system pair, and computes the discriminative power, i.e., for how many system pairs the IR metric was able to detect a significant difference, and the overall performance difference required to achieve that significance.
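The sketch below shows the general shape of such a paired bootstrap significance test over a topic set, together with a count of significantly different run pairs. It is a simplified stand-in for Sakai's method, not a faithful reimplementation (the resampling scheme, the test statistic and the estimation of the required performance difference follow [9] only loosely), and all names are ours.

```python
import random

# Sketch of a paired bootstrap hypothesis test on per-topic score differences
# between two runs, and of counting "discriminative power" as the fraction of
# run pairs for which the test is significant. This is a simplified stand-in
# for the bootstrap sensitivity method of [9]; names are illustrative.

def paired_bootstrap_pvalue(scores_x, scores_y, b=1000, seed=0):
    """scores_x, scores_y: per-topic scores of two runs (same topic order)."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    observed = sum(diffs) / n
    centred = [d - observed for d in diffs]          # impose the null hypothesis
    count = 0
    for _ in range(b):
        sample = [rng.choice(centred) for _ in range(n)]
        if abs(sum(sample) / n) >= abs(observed):
            count += 1
    return count / b                                 # two-sided p-value

def discriminative_power(per_topic_scores, alpha=0.05):
    """per_topic_scores: run ID -> list of per-topic scores under one metric."""
    runs = sorted(per_topic_scores)
    pairs = [(x, y) for i, x in enumerate(runs) for y in runs[i + 1:]]
    significant = sum(
        1 for x, y in pairs
        if paired_bootstrap_pvalue(per_topic_scores[x], per_topic_scores[y]) < alpha
    )
    return significant, len(pairs)

if __name__ == "__main__":
    rng = random.Random(42)
    scores = {f"run{i}": [min(1.0, max(0.0, rng.gauss(0.3 + 0.02 * i, 0.15)))
                          for _ in range(50)] for i in range(5)}   # toy data
    print(discriminative_power(scores))
```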
Table 3 ranks the IR metrics according to discriminative power at α = 0.05 given full relevance data. For example, for NTCIR-5C, the actual Q-measure detects a significant difference at α = 0.05 for 174 out of 435 (30*29/2) system pairs (40%), and the estimated overall performance difference required to detect one given 50 topics is 0.11 [9]. (In Table 3, the metrics based on judged documents only are the bpref and rpref variants together with AveP′, Q′ and nDCG′.)

Table 3: Discriminative power at α = 0.05 using full relevance data from NTCIR.

NTCIR-3C        | disc. power     | diff. required
Q-measure       | 242/435 = 55.6% | 0.10
AveP            | 240/435 = 55.2% | 0.11
Q′              | 238/435 = 54.7% | 0.10
nDCG′           | 236/435 = 54.3% | 0.11
rpref relative2 | 236/435 = 54.3% | 0.10
nDCG            | 235/435 = 54.0% | 0.13
AveP′           | 234/435 = 53.8% | 0.12
bpref R         | 232/435 = 51.5% | 0.12
rpref N         | 203/435 = 46.7% | 0.13
bpref N         | 198/435 = 45.6% | 0.14

NTCIR-3J        | disc. power     | diff. required
nDCG′           | 317/435 = 72.9% | 0.12
nDCG            | 316/435 = 72.6% | 0.14
Q′              | 308/435 = 70.8% | 0.11
Q-measure       | 305/435 = 70.1% | 0.13
AveP            | 298/435 = 68.5% | 0.11
rpref relative2 | 298/435 = 68.5% | 0.11
AveP′           | 296/435 = 68.0% | 0.11
bpref R         | 283/435 = 65.1% | 0.11
bpref N         | 272/435 = 62.5% | 0.16
rpref N         | 272/435 = 62.5% | 0.16

NTCIR-5C        | disc. power     | diff. required
Q-measure       | 174/435 = 40.0% | 0.11
Q′              | 167/435 = 38.4% | 0.11
nDCG            | 163/435 = 37.5% | 0.10
rpref relative2 | 163/435 = 37.5% | 0.10
AveP            | 159/435 = 36.6% | 0.11
AveP′           | 158/435 = 36.3% | 0.11
nDCG′           | 156/435 = 35.9% | 0.10
bpref R         | 132/435 = 30.3% | 0.10
bpref N         | 109/435 = 25.1% | 0.11
rpref N         | 105/435 = 24.1% | 0.12

NTCIR-5J        | disc. power     | diff. required
Q-measure       | 136/435 = 31.3% | 0.09
Q′              | 131/435 = 30.1% | 0.09
nDCG            | 120/435 = 27.6% | 0.13
nDCG′           | 119/435 = 27.4% | 0.12
rpref relative2 | 115/435 = 26.4% | 0.10
AveP            | 113/435 = 26.0% | 0.10
AveP′           | 113/435 = 26.0% | 0.10
bpref R         | 100/435 = 23.0% | 0.10
bpref N         | 68/435 = 15.6%  | 0.11
rpref N         | 56/435 = 12.9%  | 0.12

It is clear that, given full relevance data, bpref R is substantially more discriminative than bpref N and rpref N, but all other metrics are more discriminative than bpref. In particular, note that Q′, nDCG′, rpref relative2 and AveP′ all outperform bpref consistently, using judged documents only. In addition, the superiority of Q′ over bpref R, AveP′ and rpref relative2 is consistent across all data sets.

Figure 3: Reduction rate (x axis) vs. discriminative power at α = 0.05 (y axis).

Figure 3 shows the effect of relevance data reduction (j ∈ {70, 50, 30, 10}) on discriminative power at α = 0.05. For example, for NTCIR-5C with only 10% relevance data, the actual AveP detects a significant difference for only about 2.5% of all run pairs. (Note that its discriminative power is 36.6% with 100% relevance data, as shown in Table 3.) Whereas, under the same condition, the actual Q-measure and nDCG are comparable to bpref (a little over 10% in discriminative power), and Q′, nDCG′, rpref relative2 and AveP′ are the most discriminative (over 20%). Although the actual Q-measure does not do well with NTCIR-3J at j = 10, and nDCG(′) appears to do worse at j = 50 than at j = 30 with NTCIR-5C and 5J, the trends consistent across all four data sets are as follows:

(i) Q′, nDCG′, rpref relative2 and AveP′ are more discriminative than bpref R in a very incomplete relevance environment; Q′ and nDCG′ are probably the most discriminative.

(ii) Given incomplete relevance data, bpref R is more discriminative than the actual AveP; but so are the actual Q-measure and nDCG.

Thus, the benefit of introducing bpref is not clear in terms of discriminative power either.
8. CONCLUSIONS
This paper showed that the application of Q-measure, nDCG or AveP to condensed lists, obtained by filtering out all unjudged documents from the original ranked lists, is actually a better solution to the incompleteness problem than bpref. Furthermore, we showed that the use of graded relevance boosts the robustness of IR evaluation to incompleteness and therefore that Q-measure and nDCG based on condensed lists are the best choices.

In terms of the entire system ranking, Q-measure, nDCG and AveP based on condensed lists (Q′, nDCG′ and AveP′) are more robust to incompleteness than bpref R (i.e., bpref); even the actual Q-measure and nDCG do as well as bpref R. In terms of pairwise discriminative power, Q′, nDCG′ and AveP′ are more discriminative than bpref R whether full or reduced relevance data is used; Q′ is slightly but consistently more discriminative than AveP′; even the actual Q-measure and nDCG can often be as discriminative as bpref R in a very incomplete relevance environment. A variant of bpref/rpref (rpref relative2) does well also, but the benefit of introducing this relatively complex metric is not clear either.

Finally, it should be noted that we have not examined IR metrics from the viewpoint of user satisfaction: it remains to be seen, for example, how recent criticisms of AveP [14] extend to different experimental environments and different metrics. Properties such as rank correlation with an established metric and discriminative power provide necessary conditions for a good IR metric, not sufficient conditions.

9. REFERENCES
[1] Aslam, J. A. and Savell, R.: On the Effectiveness of Evaluating Retrieval Systems in the Absence of Relevance Judgments, ACM SIGIR 2003 Proceedings, pp. 361-362, 2003.
[2] Aslam, J. A., Pavlu, V. and Yilmaz, E.: A Statistical Method for System Evaluation Using Incomplete Judgments, ACM SIGIR 2006 Proceedings, pp. 541-548, 2006.
[3] Buckley, C. and Voorhees, E. M.: Retrieval Evaluation with Incomplete Information, ACM SIGIR 2004 Proceedings, pp. 25-32, 2004.
[4] Carterette, B., Allan, J. and Sitaraman, R.: Minimal Test Collections for Retrieval Evaluation, ACM SIGIR 2006 Proceedings, pp. 268-275, 2006.
[5] Cormack, G. V., Palmer, C. R. and Clarke, C. L. A.: Efficient Construction of Large Test Collections, ACM SIGIR '98 Proceedings, pp. 282-289, 1998.
[6] De Beer, J. and Moens, M.-F.: Rpref - A Generalization of Bpref towards Graded Relevance Judgments, ACM SIGIR 2006 Proceedings, pp. 637-638, 2006.
[7] Järvelin, K. and Kekäläinen, J.: Cumulated Gain-Based Evaluation of IR Techniques, ACM Transactions on Information Systems, Vol. 20, No. 4, pp. 422-446, 2002.
[8] Kando, N.: Overview of the Fifth NTCIR Workshop, NTCIR-5 Proceedings, 2005.
[9] Sakai, T.: Evaluating Evaluation Metrics based on the Bootstrap, ACM SIGIR 2006 Proceedings, pp. 525-532, 2006.
[10] Sakai, T.: On the Reliability of Information Retrieval Metrics based on Graded Relevance, Information Processing and Management, 43(2), pp. 531-548, 2007.
[11] Sakai, T.: On Penalising Late Arrival of Relevant Documents in Information Retrieval Evaluation with Graded Relevance, Proceedings of the First International Workshop on Evaluating Information Access (EVIA 2007), to appear, 2007.
[12] Soboroff, I., Nicholas, C. and Cahan, P.: Ranking Retrieval Systems without Relevance Judgments, ACM SIGIR 2001 Proceedings, pp. 66-73, 2001.
[13] Soboroff, I.: Dynamic Test Collections: Measuring Search Effectiveness on the Live Web, ACM SIGIR 2006 Proceedings, pp. 276-283, 2006.
[14] Turpin, A. and Scholer, F.: User Performance versus Precision Measures for Simple Search Tasks, ACM SIGIR 2006 Proceedings, pp. 11-18, 2006.
[15] Voorhees, E. M. and Buckley, C.: The Effect of Topic Set Size on Retrieval Experiment Error, ACM SIGIR 2002 Proceedings, pp. 316-323, 2002.
[16] Yilmaz, E. and Aslam, J. A.: Estimating Average Precision with Incomplete and Imperfect Judgments, CIKM 2006 Proceedings, 2006.
[17] Zobel, J.: How Reliable are the Results of Large-Scale Information Retrieval Experiments? ACM SIGIR '98 Proceedings, pp. 307-314, 1998.