Inferring Document Relevance via Average Precision

Javed A. Aslam, Emine Yilmaz
College of Computer and Information Science
Northeastern University
360 Huntington Ave, #202 WVH
Boston, MA 02115
{jaa,emine}@ccs.neu.edu

We gratefully acknowledge the support provided by NSF grant CCF-0418390.

ABSTRACT

We consider the problem of evaluating retrieval systems using a limited number of relevance judgments. Recent work has demonstrated that one can accurately estimate average precision via a judged pool corresponding to a relatively small random sample of documents. In this work, we demonstrate that given values or estimates of average precision, one can accurately infer the relevances of unjudged documents. Combined, we thus show how one can efficiently and accurately infer a large judged pool from a relatively small number of judged documents, thus permitting accurate and efficient retrieval evaluation on a large scale.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance evaluation

General Terms
Theory, Measurement, Experimentation

Keywords
Relevance Judgments, Average Precision

1. INTRODUCTION

We consider the problem of efficiently evaluating the performance of retrieval systems on a large scale. The large-scale evaluation of retrieval systems requires significant human effort due to the relevance judgments needed. In TREC-style evaluations of retrieval systems, for each topic, a pool consisting of the union of the top 100 documents retrieved by each system is formed, and these documents are assessed to determine their relevance to the associated topic. This method is referred to as depth 100 pooling (sketched in code at the end of this section), and the relevance judgments formed are stored in a file called the qrel. The costs of such an assessment effort can be quite high; in TREC 8, for example, 86,830 relevance judgments were used to assess the quality of the retrieved lists corresponding to 129 system runs in response to 50 topics.

In recent work, we have shown that sampling techniques can be used to efficiently estimate standard measures of retrieval performance and that these random samples generalize well to previously unseen runs [1]. Our results demonstrate that accurate estimates of average precision, the number of documents relevant to a topic (R), and often other measures of retrieval performance can be obtained using small subsets of the complete judgment set. One disadvantage of the aforementioned technique is that it cannot be used to assess the performance of systems in any standard way: in order to estimate the measures, a special procedure requiring access to information on the sampling process is needed, so standard tools such as trec_eval or other software implementations which calculate average precision and other performance measures cannot be used.

In this work, we demonstrate that given a set of ranked lists of documents submitted in response to a given topic, together with the average precisions associated with these lists and R, the number of documents relevant to the topic, one can accurately infer which underlying documents are relevant and which are not. When combined with the aforementioned techniques for accurately estimating average precision from a small sample of documents, one can effectively infer a large judged pool from a relatively small number of relevance assessments, thus permitting simple, standard, accurate, and efficient retrieval evaluation on a large scale.
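For concreteness, the depth pooling process described above can be sketched in a few lines. This is an illustrative reconstruction only; the function name `depth_pool` and the data layout (each run given as a ranked list of document ids) are our own, not part of any TREC tooling.

```python
def depth_pool(runs, depth=100):
    """Form a TREC-style judging pool for one topic: the union, over
    all submitted runs, of each run's top-`depth` ranked documents."""
    pool = set()
    for ranked_docs in runs:   # each run is a ranked list of doc ids
        pool.update(ranked_docs[:depth])
    return pool                # every pooled document gets judged
```

Every document in the returned pool is then judged by an assessor, and the resulting (document, judgment) pairs constitute the qrel for the topic.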
2. METHODOLOGY

The methodology for inferring relevance assessments is conceptually simple: given the ranked lists of documents submitted in response to a given topic, together with the average precisions associated with these lists and R, the number of documents relevant to the topic, find the binary relevance judgments for the underlying documents which minimize the "difference" between the given average precisions and those incurred by the inferred relevance assessments. This is a constrained integer optimization problem:

Constraints: (1) The total number of relevant documents is R. (2) Any document contained in multiple lists must have the same relevance assessment.

Integrality: The inferred assessments must be binary.

Optimization: The average precisions incurred must be "close" to the given average precisions.

Such a characterization of the problem gives rise to a number of issues. First, this constrained integer optimization problem is intractable, for much the same reason that integer programming is intractable. To alleviate this problem, we relax the condition that the inferred relevance assessments be binary: we instead allow the inferred relevance assessments to correspond to probabilities of relevance, and we deduce an expected value for average precision from these probabilistic relevance assessments as follows [2]:

\[
E[AP] = \frac{1}{R} \sum_{i=1}^{Z} \frac{p_i}{i} \left( 1 + \sum_{j=1}^{i-1} p_j \right)
\]

where $p_i$ is the probability of relevance associated with the document at rank $i$ in a list of length $Z$.

Second, we ensure that the inferred relevance judgments incur average precision values "close" to those given by minimizing the sum squared error between the given and the inferred expected average precision values. Thus, our optimization criterion is

\[
\min \sum_i \left( E[AP_i] - ap_i \right)^2
\]

where $ap_i$ is the given average precision associated with list $i$. The problem as formulated above can be solved using any number of constrained optimization routines, available, for instance, in MATLAB.

Finally, we convert these probabilistic relevance assessments to binary relevance assessments by assigning a relevance score of 1 with probability $p$ and a score of 0 with probability $1 - p$, in the same spirit that randomized rounding is used to solve integer programs via their linear programming relaxations.

When the given average precision (and R) values are actually estimates derived from sampling, we can also make use of the judged documents in the sample: once the inferred qrels are obtained as above, we assign the known relevance assessments to the documents in the sample.
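To make the relaxation concrete, here is a minimal sketch of the inference step, using Python with NumPy and SciPy's SLSQP solver in place of the MATLAB routines mentioned above. The function names (`expected_ap`, `infer_relevance`) and the data layout (each run given as a ranked list of indices into one pooled document array) are our own illustration, not the authors' code. Sharing a single probability per pooled document across all runs enforces constraint (2) by construction, and the equality constraint enforces (1).

```python
import numpy as np
from scipy.optimize import minimize

def expected_ap(p, R):
    """Expected average precision of one ranked list under probabilistic
    relevance assessments p, per the E[AP] formula above [2]."""
    total, prefix = 0.0, 0.0   # prefix = sum of p_j over ranks j < i
    for i, p_i in enumerate(p, start=1):
        total += (p_i / i) * (1.0 + prefix)
        prefix += p_i
    return total / R

def infer_relevance(runs, aps, R, n_docs, seed=0):
    """runs: one ranked list of pooled-document indices per system;
    aps: the given (or estimated) average precision of each run."""
    def sse(x):
        # Sum squared error between incurred E[AP] and given ap values.
        return sum((expected_ap(x[run], R) - ap) ** 2
                   for run, ap in zip(runs, aps))

    x0 = np.full(n_docs, min(1.0, R / n_docs))   # feasible starting point
    res = minimize(sse, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n_docs,
                   constraints=[{"type": "eq",
                                 "fun": lambda x: x.sum() - R}])
    # Randomized rounding: relevant with probability p, as described above.
    rng = np.random.default_rng(seed)
    return (rng.random(n_docs) < res.x).astype(int)
```

When the ap values are sampling-based estimates, one would then overwrite the rounded assessments for the sampled documents with their known judgments, as the last paragraph of this section describes.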
3. EXPERIMENTAL RESULTS

Table 1 shows the quality of the inferred qrels formed using the estimates¹ obtained with k = 40, 95, 144, 260, 379, and 494 judgments on average per topic for TREC 8. The complete judgment set contains 1,737 judgments on average per topic; hence, these judgment budgets correspond to 2.3%, 5.5%, 8.3%, 15%, 22%, and 28% of the complete judgment set, respectively. Each row corresponds to an estimate obtained using a different sample of the given size, and the columns report the precision, recall, and F1 values, averaged over all queries, obtained when the inferred qrel file is treated as a labeled set and compared to the actual qrel file. Note that with estimates corresponding to 494 judgments on average per topic, an average qrel precision of 81% and recall of 77% are achieved. Even with estimates obtained from as few as 40 judgments on average per topic, the inferred qrels achieve non-trivial precision and recall.

docs judged    prec     recall   F1
40  (2.3%)     0.5562   0.3833   0.4171
95  (5.5%)     0.5919   0.5495   0.5332
144 (8.3%)     0.6243   0.6004   0.5880
260 (15%)      0.7068   0.6887   0.6906
379 (22%)      0.7720   0.7361   0.7465
494 (28%)      0.8101   0.7694   0.7835

Table 1: Precision, recall, and F1 values of the qrels inferred using 40, 95, 144, 260, 379, and 494 judgments for TREC 8.
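The precision, recall, and F1 numbers in Table 1 come from treating the inferred qrel as a labeled set and comparing it against the actual qrel. A minimal sketch of that comparison follows, assuming qrels are held as dicts mapping document ids to binary judgments; this representation and the function name `qrel_prf` are our own illustration, not trec_eval's format.

```python
def qrel_prf(inferred, actual):
    """Score an inferred qrel against the actual qrel for one topic,
    treating 'relevant' as the positive class."""
    inferred_rel = {doc for doc, rel in inferred.items() if rel}
    actual_rel = {doc for doc, rel in actual.items() if rel}
    tp = len(inferred_rel & actual_rel)            # true positives
    prec = tp / len(inferred_rel) if inferred_rel else 0.0
    rec = tp / len(actual_rel) if actual_rel else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Averaging these per-topic values over all 50 topics yields the rows of Table 1.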
Another way of evaluating the quality of these inferred qrels is to evaluate systems using them and to compare those assessments with the results obtained using the actual qrel; in particular, we are interested in how well the inferred qrels generalize to evaluating unseen runs. We separate the runs submitted to TREC 8 into training and testing sets: the 70 runs which contributed to the original TREC depth 100 pool (and thus the actual qrel) form the training set, while the 59 runs which did not contribute to the pool (or qrel) form the testing set. For each topic, we infer relevance judgments using the described method with estimates of average precision for the training runs obtained from k judgments on average per topic. We then evaluate the mean average precisions of the training and testing runs using these inferred judgments and compare these values with the actual mean average precisions of these runs.

Figure 1 shows the results of this experiment using the actual average precision values (and R) for TREC 8, as well as the results obtained from estimates using 95 judgments on average per topic. The y-axis is the mean average precision of a run computed from the inferred qrels formed by our method, while the x-axis is the actual mean average precision. The plot also reports the root mean squared (RMS) error and the Kendall τ values comparing the training and testing MAP values with the actual MAP values. Note that the estimated MAP values for both training and testing systems are very close to the actual MAP values, both in absolute value and in the rankings they induce, demonstrating that the inferred qrels are accurate and generalize well.

[Figure 1: MAP estimates for training and test systems vs. actual MAP values, obtained via inferred qrels from (left) actual ap values and (right) estimates of ap from 95 judgments on average per topic, for TREC 8. Both panels plot the MAP estimate (y-axis) against the actual MAP (x-axis). Left panel, "qrels from actual ap": Train RMS = 0.0022, Kendall τ = 0.9859; Test RMS = 0.0023, τ = 0.9825. Right panel, "qrels from 95 judgments": Train RMS = 0.0156, τ = 0.9437; Test RMS = 0.0078, τ = 0.9427.]

In conclusion, we propose a method that can infer relevance judgments from estimates of average precision and R, and we demonstrate that these inferred relevance judgments can accurately assess the performance of retrieval systems, even when they were effectively obtained from very few real relevance judgments.

¹ Details on how these estimates were obtained may be found in our companion paper [1].

4. REFERENCES

[1] J. A. Aslam, V. Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 2006. To appear.

[2] J. A. Aslam, E. Yilmaz, and V. Pavlu. The maximum entropy method for analyzing retrieval measures. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 27-34. ACM Press, August 2005.