Rpref - A Generalization of Bpref towards Graded Relevance Judgments

Jan De Beer
Legal Informatics & Information Retrieval group
Interdisciplinary Centre for Law and ICT
Katholieke Universiteit Leuven, Belgium
jan.debeer@law.kuleuven.be

Marie-Francine Moens
Legal Informatics & Information Retrieval group
Interdisciplinary Centre for Law and ICT
Katholieke Universiteit Leuven, Belgium
marie-france.moens@law.kuleuven.be

ABSTRACT
We present rpref, our generalization of the bpref evaluation metric for assessing the quality of search engine results, given graded rather than binary user relevance judgments.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance evaluation (efficiency and effectiveness)

General Terms
Theory, Measurement

Copyright is held by the author/owner(s). SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.

1. INTRODUCTION
User relevance judgments are an essential ingredient of most evaluation metrics for information retrieval (IR), as the quality of results is inherently subjective. In practice, however, this dependency seriously limits the scope of evaluation, as judging is a time-consuming and painstaking activity. For this and related reasons (most notably reliability), one aims at metrics that are robust in the face of incomplete and imperfect judgments. With incomplete judgments, not all documents are judged. Imperfect judgments include documents that are no longer part of the collection. Both especially apply to dynamic document collections.

For a motivation of using graded rather than binary relevance judgments, we refer to the work of Kekäläinen and Järvelin ([5]) and Gordon and Pathak ([4]), amongst others. As stated by the former authors: "In modern large database environments, the number of topically relevant documents to a request may easily exceed the number of documents a user is willing to examine. It would therefore be desirable from the user viewpoint to rank highly relevant documents highest in the retrieval results and to develop and evaluate IR methods accordingly." The authors continue by presenting graded adaptations of precision/recall metrics. We pursue their line of work by proposing rpref as a generalization of the robust, ranking-based evaluation metric bpref introduced by Buckley et al. [1].

2. PREFERENCE METRICS

2.1 Rationale
Both bpref and rpref can be categorized as preference-based metrics ([3]). The key idea behind them is to evaluate the preference ordering derived by the search engine by comparing it to an assessor ordering, which serves as a gold standard. The comparison is based on a count of 'misplaced' documents: judged irrelevant documents that are ranked higher than judged relevant documents (as for bpref), generalized for rpref to weighted counts of judged less relevant documents that are ranked higher than judged more relevant documents.

Completely in line with Kekäläinen and Järvelin's argument, one can motivate these metrics from a practical point of view. Users aim to satisfy their information needs as efficiently and effectively as possible. As the engine's ranking tells them that the first documents in the list are presumed to be the most relevant, they typically consult the documents in the order indicated. Any misplaced documents thereby distract them from finding what they are looking for.

2.2 Rpref
In view of a submitted request, let k ∈ N_0 be the absolute rank number of a document D_k contained in a document collection C. The ranking defines the preference ordering as determined by the search engine: D_k is preferred over D_l whenever k < l. Whenever a ranking cutoff point c is imposed, one lets k = ∞ for each document D_k beyond c. This makes the ordering totally defined over C, rendering pooling techniques effective (cf. TREC, the Text REtrieval Conference). In practice, the cutoff is typically introduced because of the paged presentation of search results (e.g., we restricted ourselves to the first page only), or as a mandatory parameter of some search engines.

Backed by the robustness towards incomplete and imperfect judgments ([1]), there is some freedom in the selection and the number of documents to be judged by assessors in the formation of a gold standard. We refer to this selection as the sample set S ⊆ C. However, as stated in [1], a necessary yet sufficient condition on the composition of S is that all documents in S are selected independently of their judged relevance ρ_k ∈ [0, 1], with 0 being completely irrelevant and 1 perfectly relevant. (On the issue of assessor disagreement, see Voorhees ([6]).)

In the computation of rpref (1), a weighted sum is taken over all judged documents (all D_k ∈ S). The weight equals the document's relevance ρ_k, such that misplacements with regard to highly relevant documents introduce a correspondingly larger penalty. The inner sum computes for each D_k an accumulated misplacement penalty, considering the relative deviation from every less relevant document that is ranked higher by the search engine. Notice the inner sum's normalization by the constant N, which avoids counting absolute numbers. The final division by R normalizes the score to the unit interval. The higher the score, the better the preference ordering produced by the search engine.

    rpref = \frac{1}{R} \sum_{D_k \in S} \rho_k \left( 1 - \frac{1}{N} \sum_{\substack{D_l \in S \\ l < k,\ \rho_l < \rho_k}} \frac{\rho_k - \rho_l}{\rho_k} \right)    (1)

where R = \sum_{D_k \in S} \rho_k and N = \sum_{D_k \in S} (1 - \rho_k).
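To make (1) concrete, the following is a minimal Python sketch of the absolute-normalization variant. The function name and the representation of the input (a list of graded judgments in the order ranked by the engine) are our own illustrative choices, not part of the original formulation.

```python
def rpref(rhos):
    """rpref with absolute normalization, following equation (1).

    rhos: graded relevance judgments (values in [0, 1]) of the judged
          documents D_k in S, listed in the order the engine ranked them.
    """
    R = sum(rhos)                      # R = sum of rho_k over S
    N = sum(1.0 - r for r in rhos)     # N = sum of (1 - rho_k) over S
    if R == 0.0 or N == 0.0:
        return 0.0                     # degenerate sample set (a convention)

    total = 0.0
    for k, rho_k in enumerate(rhos):
        if rho_k == 0.0:
            continue                   # weight rho_k = 0: contributes nothing
        # accumulated misplacement penalty: judged less relevant documents
        # that the engine ranked higher than D_k
        penalty = sum((rho_k - rho_l) / rho_k
                      for rho_l in rhos[:k] if rho_l < rho_k)
        total += rho_k * (1.0 - penalty / N)
    return total / R
```

Only the judged documents and their relative order enter the computation, consistent with the discussion in Section 2.5 below.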
2.3 Bpref
In the special case of binary judgments ρ_k ∈ {0, 1}, (1) reduces to one of the variants of bpref (2). First, R and N then simply denote the total number of judged relevant, respectively irrelevant, documents. Second, all misplacement penalties of the form (ρ_k - ρ_l)/ρ_k equal 1, such that the inner sum reduces to a simple count of the misplaced documents for D_k. Lastly, the weights ρ_k of the outer sum can safely be dropped when iterating over the relevant documents only.

    bpref = \frac{1}{R} \sum_{\substack{D_k \in S \\ \rho_k = 1}} \left( 1 - \frac{|\{ D_l \in S : l < k,\ \rho_l = 0 \}|}{N} \right)    (2)

2.4 Normalization Variants
One can think of different variants of both (1) and (2) regarding the normalization of terms. The absolute normalization variant depicted above mitigates the smaller penalties accumulated by front documents through the constant denominator N, which accommodates the largest possible accumulated penalty of a tail document D_k. Therefore, a more appropriate, relative denominator might be the number of preference pairs for D_k, given by |{D_l ∈ S : l < k}|.
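As an illustration of this relative variant, the sketch below (again our own, under the same input assumptions as before) replaces the constant denominator N by the per-document number of preference pairs. How to treat the top-ranked document, which has no such pairs, is handled here by a simple convention and is taken up in the discussion below.

```python
def rpref_relative(rhos):
    """rpref with the relative normalization of Section 2.4.

    Per judged document D_k, the denominator N of equation (1) is
    replaced by |{D_l in S : l < k}|, the number of judged documents
    the engine ranked above D_k.
    """
    R = sum(rhos)
    if R == 0.0:
        return 0.0                     # degenerate sample set (a convention)

    total = 0.0
    for k, rho_k in enumerate(rhos):
        if rho_k == 0.0:
            continue                   # zero weight in the outer sum
        if k == 0:
            total += rho_k             # convention: no preference pairs, no penalty
            continue
        pairs = k                      # |{D_l in S : l < k}|
        penalty = sum((rho_k - rho_l) / rho_k
                      for rho_l in rhos[:k] if rho_l < rho_k)
        total += rho_k * (1.0 - penalty / pairs)
    return total / R
```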
2.5 Discussion
Both bpref and rpref depend on the relative ordering of the assessed documents only. No other information can be gleaned from these metrics. In particular, the resulting scores give no indication of effectiveness in answering the search request. Combination with other metrics therefore remains valuable.

There is a case for omitting the highest ranked document from the outer sum in both (1) and (2), as comparison to higher ranked documents is futile for it. Moreover, relative normalization consistently produces the undefined term 1 - 0/0 for this document.

Adding the assumption of representative (unbiased) sample sets, the relative normalization variant becomes safe for averaging over requests. Scores are then no longer directly related to the somewhat arbitrary number of non- (or less-) relevant documents judged for the request. Instead, all preference pairs contribute to the score in relative terms.

Even though it is expected to hold (by analogy), it remains to be verified experimentally whether the acclaimed robustness of bpref ([1]) gracefully extends to rpref.

3. DEMONSTRATION
We now demonstrate the stepwise generalization of bpref to rpref on real-life data. The example involves three market-leading search tools, blinded as T1, T2 and T3, evaluated for a particular request against a document collection C comprising half a million police case reports ([2]). Figure 1 shows our sample set S with the indicated relevance judgments ρ ∈ [0, 1], and the order of these documents in the tools' result lists. Additional columns show the documents grouped around chosen relevance threshold values into two partitions (A and B, split around 0.5), respectively four partitions (A1, A2, B1 and B2, split around 0.25, 0.5, and 0.75). Continuing this process iteratively, the binary preference ordering as used for bpref (two partitions) is recursively refined until one recovers the original ordering as used for rpref (at most |S| non-empty partitions). The figure concludes with a graph plotting the n-pref scores thus obtained (for n partitions), using the relative normalization variant. Notice the convergence towards the rpref score with increasing n; a sketch of this refinement follows below.

[Figure 1: Demonstration orderings and scores - the sample set S = {D1, ..., D16} with its graded relevance judgments, the result orderings of the tools T1, T2 and T3, their partition labels (A/B and A1/A2/B1/B2), and a plot of the resulting n-pref scores.]

The example demonstrates that although the set of misplaced document pairs grows monotonically with each refinement step, the score itself does not necessarily converge monotonically. This is because in each step the misplacement penalties are computed more precisely, both within and between the partitions of the previous step.
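The following rough sketch mimics that refinement by coarsening the graded judgments into n partitions before scoring. The equal-width thresholding and the choice of a representative grade per partition are our own reading of the procedure, not a rule prescribed by the figure; rpref_relative refers to the earlier sketch.

```python
def n_pref(rhos, n, metric=rpref_relative):
    """Score a ranking after coarsening the graded judgments into n
    partitions (n = 2 mimics the binary bpref setting; as n grows, the
    score approaches the full rpref score, though not necessarily
    monotonically).

    Assumptions (ours): partitions are formed by n equal-width thresholds
    on [0, 1], and partition index i is represented by the grade i / (n - 1).
    """
    assert n >= 2, "at least two partitions are needed"
    coarse = []
    for rho in rhos:
        i = min(int(rho * n), n - 1)   # partition index of rho
        coarse.append(i / (n - 1))     # representative grade of that partition
    return metric(coarse)

# Example: computing n_pref(rhos, n) for n = 2, 4, 8, ... and comparing it
# with metric(rhos) illustrates the convergence discussed above.
```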
4. ACKNOWLEDGEMENTS
The authors thank the Belgian police for their interest and active collaboration, in particular Kris D'Hoore, Martine Pattyn and Paul Wouters. They also thank their research partners Jan Vanthienen and Nishant Kumar. This work was supported by the Belgian Science Policy Office, project AG/01/101 (http://www.belspo.be/agora).

5. REFERENCES
[1] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of ACM SIGIR, volume 27, July 2004.
[2] J. De Beer, N. Kumar, M.-F. Moens, and J. Vanthienen. Assessing the state of the art of commercial tools for unstructured information exploitation. In Proceedings of the International Conf. on Business Information Systems, 2006.
[3] M. B. Eisenberg. Measuring relevance judgments. Information Processing and Management, 24(4):373-389, 1988.
[4] M. Gordon and P. Pathak. Finding information on the world wide web: The retrieval effectiveness of search engines. Information Processing and Management, 35(2):141-180, 1999.
[5] J. Kekäläinen and K. Järvelin. Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science and Technology, 53(13):1120-1129, 2002.
[6] E. M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management, 36(5):697-716, 2000.