Give Me Just One Highly Relevant Document: P-measure
Tetsuya Sakai
Knowledge Media Laboratory, Toshiba Corporate R&D Center
tetsuya.sakai@toshiba.co.jp

ABSTRACT
We introduce an evaluation metric called P-measure for the task of retrieving one highly relevant document. It models user behaviour in practical tasks such as known-item search, and is more stable and sensitive than Reciprocal Rank, which cannot handle graded relevance.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Experimentation
Keywords: Test Collection, Evaluation Metrics

1. INTRODUCTION
Different IR tasks require different evaluation metrics: a patent survey task may require a recall-oriented metric, while a known-item search task may require a precision-oriented one. When we search the Web, we often stop going through the ranked list after finding one good Web page, even though the list may contain some more relevant pages, either knowing or assuming that the rest of the retrieved pages lack novelty. Thus, finding exactly one relevant document with high precision is an important IR task.

Reciprocal Rank (RR) is commonly used for the above task: RR = 0 if the ranked output does not contain a relevant document; otherwise RR = 1/r_1, where r_1 is the rank of the retrieved relevant document that is nearest to the top of the list. However, RR cannot distinguish between a retrieved highly relevant document and a retrieved partially relevant document. In light of this, Sakai [3] proposed a metric called O-measure for the task of finding one highly relevant document. O-measure is a generalisation of RR, and is a variant of Q-measure, which is very highly correlated with Average Precision (AveP) but can handle graded relevance [5]. Using well-known methods [1, 6], Sakai [3] showed that O-measure is generally more stable than (and at least as sensitive as) RR, and that finding one highly relevant document and finding any one relevant document may require different search strategies.

Both RR and O-measure assume that the user stops examining the ranked list as soon as he finds one relevant document, even if it is only partially relevant. However, this assumption may not be valid in some IR situations such as known-item search. We therefore propose an evaluation metric called P-measure, which assumes that the user keeps examining the ranked list until he finds one document with a satisfactory relevance level.

2. METRICS
Let R denote the number of relevant documents for a topic, and count(r) denote the number of relevant documents within the top r of a system output of size L. Let isrel(r) be 1 if the document at Rank r is relevant and 0 otherwise. Let gain(L) denote the gain value for retrieving an L-relevant document, where, in the case of NTCIR, L = S (highly relevant), L = A (relevant) or L = B (partially relevant) [3, 5]. We use gain(S) = 3, gain(A) = 2, gain(B) = 1 by default. Let cg(r) = \sum_{1 \le i \le r} g(i) denote the cumulative gain at Rank r for a system output, where g(i) = gain(L) if the document at Rank i is L-relevant and g(i) = 0 otherwise. Similarly, let cgI(r) denote the cumulative gain at Rank r for an ideal ranked output: for NTCIR, an ideal ranked output lists all S-, A- and B-relevant documents in this order.

Q-measure [5] is defined as:

    Q\text{-}measure = \frac{1}{R} \sum_{1 \le r \le L} isrel(r) \, BR(r), \qquad BR(r) = \frac{cg(r) + count(r)}{cgI(r) + r}.

Note that replacing BR(r) with P(r) = count(r)/r gives AveP, and also that BR(r) can penalise "late arrival" of relevant documents [3, 5]. In a binary relevance environment, Q-measure = AveP holds iff there is no relevant document below Rank R, and Q-measure > AveP holds otherwise.
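As a concrete illustration, the following Python sketch shows one possible way to compute cg(r), cgI(r), BR(r) and Q-measure for a single topic, assuming the default gain values above and a minimal data representation (a ranked list given as a list of relevance labels 'S', 'A', 'B', or None for a nonrelevant document); the function and variable names are illustrative only.

    GAIN = {"S": 3, "A": 2, "B": 1}   # default NTCIR gain values g(L)

    def q_measure(run_labels, qrel_counts):
        """Return Q-measure for one topic.
        run_labels: relevance label ('S'/'A'/'B') or None for each rank 1..L.
        qrel_counts: number of S-, A- and B-relevant documents for the topic."""
        # Ideal ranked output: all S-relevant documents, then A-, then B-relevant ones.
        ideal = ["S"] * qrel_counts["S"] + ["A"] * qrel_counts["A"] + ["B"] * qrel_counts["B"]
        R = len(ideal)                             # number of relevant documents
        cg = count = cgi = 0
        total = 0.0
        for r, label in enumerate(run_labels, start=1):
            cgi += GAIN[ideal[r - 1]] if r <= R else 0   # cgI(r) stops growing after Rank R
            if label in GAIN:                      # isrel(r) = 1
                cg += GAIN[label]                  # cg(r)
                count += 1                         # count(r)
                total += (cg + count) / (cgi + r)  # BR(r)
        return total / R if R else 0.0

    # Example: for a topic with one relevant document per level, a run whose top two
    # documents are B- and S-relevant gives (1/2 + 6/7) / 3, roughly 0.45:
    # q_measure(["B", "S"], {"S": 1, "A": 1, "B": 1})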
O-measure [3], designed for the task of finding one highly relevant document, is defined as:

    O\text{-}measure = BR(r_1) = \frac{cg(r_1) + count(r_1)}{cgI(r_1) + r_1} = \frac{g(r_1) + 1}{cgI(r_1) + r_1}.

In a binary relevance environment, O-measure = RR holds iff r_1 \le R, and O-measure > RR holds otherwise.

One could argue that O-measure is counterintuitive in some cases. Consider Figure 1, in which a topic has one relevant document for each relevance level. System X has the B-relevant one at Rank 1 and the S-relevant one at Rank 2, while System Y has only the S-relevant one at Rank 2. Now, r_1 = 1 for X while r_1 = 2 for Y. Thus, O-measure evaluates X by looking at Rank 1 only, even though there is a better document at Rank 2. With default gain values, Y outperforms X in terms of O-measure (although smaller gain values could be used to favour X [3]).

[Figure 1: System Y outperforms System X in terms of O-measure with default gain values.]

This is not necessarily a flaw. O-measure models a user who stops examining the ranked list as soon as he finds any relevant document. For example, he may be looking at a plain list of document IDs, or a list of poor-quality text snippets of retrieved documents, from which it is difficult for him to realise that the second document returned by System X is in fact highly relevant. However, if the user is looking for a known (or suspected) item, or if it is easy for him to spot a highly relevant document in the ranked list, then we probably should assess X by considering the fact that it has an S-relevant document at Rank 2.

We now define P-measure. If the system output does not contain a relevant document, P-measure = 0. Otherwise, let the preferred rank r_p be the rank of the first record obtained by sorting the system output, using the relevance level as the first key and the rank as the second key. Then:

    P\text{-}measure = BR(r_p) = \frac{cg(r_p) + count(r_p)}{cgI(r_p) + r_p}.

For System X, P-measure = (4 + 2)/(5 + 2) = 6/7. For System Y, P-measure = O-measure = (3 + 1)/(5 + 2) = 4/7.
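Continuing the sketch above under the same assumptions, one possible computation of r_1, the preferred rank r_p, O-measure and P-measure is as follows; with the Figure 1 systems and default gain values it reproduces the values just given (for System X, O-measure = 1/2 and P-measure = 6/7; for System Y, 4/7 for both). The helper names are again illustrative only.

    GAIN = {"S": 3, "A": 2, "B": 1}   # default NTCIR gain values, as above

    def o_and_p_measure(run_labels, qrel_counts):
        """Return (O-measure, P-measure) for one topic.
        run_labels: relevance label ('S'/'A'/'B') or None for each rank 1..L.
        qrel_counts: number of S-, A- and B-relevant documents for the topic."""
        # Ideal output: all S-relevant documents, then A-, then B-relevant ones.
        ideal = ["S"] * qrel_counts["S"] + ["A"] * qrel_counts["A"] + ["B"] * qrel_counts["B"]
        R = len(ideal)
        cg_at, count_at, cgi_at = [], [], []       # cg(r), count(r), cgI(r) for r = 1..L
        cg = count = cgi = 0
        for r, label in enumerate(run_labels, start=1):
            cgi += GAIN[ideal[r - 1]] if r <= R else 0
            if label in GAIN:                      # isrel(r) = 1
                cg += GAIN[label]
                count += 1
            cg_at.append(cg); count_at.append(count); cgi_at.append(cgi)
        rel_ranks = [r for r, label in enumerate(run_labels, start=1) if label in GAIN]
        if not rel_ranks:                          # no relevant document retrieved
            return 0.0, 0.0
        def br(r):                                 # blended ratio BR(r)
            return (cg_at[r - 1] + count_at[r - 1]) / (cgi_at[r - 1] + r)
        r1 = rel_ranks[0]                          # rank of the first relevant document
        # Preferred rank r_p: relevance level as the first sort key, rank as the second.
        rp = min(rel_ranks, key=lambda r: (-GAIN[run_labels[r - 1]], r))
        return br(r1), br(rp)

    # Figure 1 systems, default gains:
    # o_and_p_measure(["B", "S"], {"S": 1, "A": 1, "B": 1})    -> (0.5, 6/7)   (System X)
    # o_and_p_measure([None, "S"], {"S": 1, "A": 1, "B": 1})   -> (4/7, 4/7)   (System Y)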
3. EVALUATION OF P-MEASURE
To evaluate P-measure, we used the "stability" method, which examines the stability of system comparisons with respect to changes in the topic set, and the "swap" method, which estimates the performance difference required to be sure that one system outperforms another [1, 6]. We used two data sets: the NTCIR-5 CLIR Chinese and Japanese runs and test collections, with 50 and 47 topics respectively. For each data set, we used the top 50 runs as measured by Mean AveP, and 1000 bootstrap samples of topics [4]. Figure 2 shows that P-measure is more stable than RR (and possibly O-measure); Table 1 shows that P-measure is more sensitive than RR (and possibly O-measure). For example, for the Chinese data, P-measure discriminates 39.2% of the 1,225,000 observations (50*49/2 run pairs times 1,000 resampled topic sets) with 95% confidence, while RR discriminates only 28.7%. But even P-measure is not as stable and sensitive as Q-measure and AveP, since it does not examine all relevant documents.

In practice, a document cut-off may be used with P-measure, since P-measure assumes that the user is willing to examine an "unlimited" number of documents. However, a small cut-off makes IR evaluation unstable, so a large topic set should be used in such a case [1, 5].

[Figure 2: Buckley/Voorhees stability (NTCIR-5 CLIR Chinese data). The plot shows the minority rate against the proportion of ties for Q-measure, AveP, P-measure, O-measure and RR.]

Table 1: Voorhees/Buckley sensitivity (95% confidence; NTCIR-5 CLIR Chinese and Japanese).

Chinese      absolute diff. required   sensitivity
Q-measure    0.06                      51.6%
AveP         0.07                      45.5%
P-measure    0.12                      39.2%
O-measure    0.13                      34.2%
RR           0.15                      28.7%

Japanese     absolute diff. required   sensitivity
AveP         0.07                      28.8%
Q-measure    0.07                      27.3%
P-measure    0.14                      15.7%
O-measure    0.15                      11.8%
RR           0.17                      10.8%

4. CONCLUSIONS
We proposed P-measure for the task of retrieving one highly relevant document, which arguably models user behaviour for practical tasks such as known-item search better than existing metrics. It is more stable and sensitive than Reciprocal Rank (and possibly O-measure). One possible weakness of P-measure is that, like R-measure described in [5], it may reach one even for a suboptimal ranked output. For example, P-measure would be one for a perfect inverse of the ideal output shown in Figure 1, since r_p = 3, although such a system output may be rare in practice. If need be, this problem may be avoided by using P+ ("pee-plus")-measure:

    P^{+}\text{-}measure = \frac{1}{count(r_p)} \sum_{1 \le r \le r_p} isrel(r) \, BR(r).

In our future work, we plan to use Sakai's Bootstrap-based method [4] for comparing P(+)-measure, O-measure and RR, as well as the "Weighted RR" [2] proposed at the NTCIR Web track (although the track actually uses binary RR).

5. REFERENCES
[1] Buckley, C. and Voorhees, E. M.: Evaluating Evaluation Measure Stability, ACM SIGIR 2000 Proceedings, pp. 33-40, 2000.
[2] Eguchi, K. et al.: Overview of the Web Retrieval Task at the Third NTCIR Workshop, National Institute of Informatics Technical Report NII-2003-002E, 2003.
[3] Sakai, T.: On the Task of Finding One Highly Relevant Document with High Precision, Information Processing Society of Japan Transactions on Databases, Vol. 47, No. SIG 4 (TOD 29), 2006.
[4] Sakai, T.: Evaluating Evaluation Metrics based on the Bootstrap, ACM SIGIR 2006 Proceedings, 2006.
[5] Sakai, T.: On the Reliability of Information Retrieval Metrics based on Graded Relevance, Information Processing and Management, to appear, 2006.
[6] Voorhees, E. M. and Buckley, C.: The Effect of Topic Set Size on Retrieval Experiment Error, ACM SIGIR 2002 Proceedings, pp. 316-323, 2002.