Robust Test Collections for Retrieval Evaluation

Ben Carterette
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts Amherst
Amherst, MA 01003
carteret@cs.umass.edu

ABSTRACT
Low-cost methods for acquiring relevance judgments can be a boon to researchers who need to evaluate new retrieval tasks or topics but do not have the resources to make thousands of judgments. While these judgments are very useful for a one-time evaluation, it is not clear that they can be trusted when re-used to evaluate new systems. In this work, we formally define what it means for judgments to be reusable: the confidence in an evaluation of new systems can be accurately assessed from an existing set of relevance judgments. We then present a method for augmenting a set of relevance judgments with relevance estimates that require no additional assessor effort. Using this method practically guarantees reusability: with as few as five judgments per topic taken from only two systems, we can reliably evaluate a larger set of ten systems. Even the smallest sets of judgments can be useful for evaluation of new systems.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]; H.3.4 [Systems and Software]: Performance Evaluation
General Terms: Experimentation, Measurement, Reliability
Keywords: information retrieval, evaluation, test collections, reusability

1. INTRODUCTION

Consider an information retrieval researcher who has invented a new retrieval task. She has built a system to perform the task and wants to evaluate it. Since the task is new, it is unlikely that there are any extant relevance judgments. She does not have the time or resources to judge every document, or even every retrieved document. She can only judge the documents that seem to be the most informative and stop when she has a reasonable degree of confidence in her conclusions. But what happens when she develops a new system and needs to evaluate it? Or another research group decides to implement a system to perform the task? Can they reliably reuse the original judgments? Can they evaluate without more relevance judgments?

Evaluation is an important aspect of information retrieval research, but it is only a semi-solved problem: for most retrieval tasks, it is impossible to judge the relevance of every document; there are simply too many of them. The solution used by NIST at TREC (Text REtrieval Conference) is the pooling method [19, 20]: all competing systems contribute N documents to a pool, and every document in that pool is judged. This method creates large sets of judgments that are reusable for training or evaluating new systems that did not contribute to the pool [21].

This solution is not adequate for our hypothetical researcher. The pooling method gives thousands of relevance judgments, but it requires many hours of (paid) annotator time. As a result, there has been a slew of recent papers on reducing annotator effort in producing test collections: Cormack et al. [11], Zobel [21], Sanderson and Joho [17], Carterette et al. [8], and Aslam et al. [4], among others. As we will see, the judgments these methods produce can significantly bias the evaluation of a new set of systems.

Returning to our hypothetical researcher, can she reuse her relevance judgments? First we must formally define what it means to be "reusable". In previous work, reusability has been tested by simply assessing the accuracy of a set of relevance judgments at evaluating unseen systems. While we can say that it was right 75% of the time, or that it had a rank correlation of 0.8, these numbers do not have any predictive power: they do not tell us which systems are likely to be wrong or how confident we should be in any one. We need a more careful definition of reusability.

Specifically, the question of reusability is not how accurately we can evaluate new systems. A "malicious adversary" can always produce a new ranked list that has not retrieved any of the judged documents.
The real question is how much confidence we have in our evaluations, and, more importantly, whether we can trust our estimates of confidence. Even if confidence is not high, as long as we can trust it, we can identify which systems need more judgments in order to increase confidence. Any set of judgments, no matter how small, becomes reusable to some degree.

Small, reusable test collections could have a huge impact on information retrieval research. Research groups would be able to share the relevance judgments they have done "in-house" for pilot studies, new tasks, or new topics. The amount of data available to researchers would grow exponentially over time.

2. ROBUST EVALUATION

Above we gave an intuitive definition of reusability: a collection is reusable if we can "trust" our estimates of confidence in an evaluation. By that we mean that if we have made some relevance judgments and have, for example, 75% confidence that system A is better than system B, we would like there to be no more than 25% chance that our assessment of the relative quality of the systems will change as we continue to judge documents. Our evaluation should be robust to missing judgments.

In our previous work, we defined confidence as the probability that the difference in an evaluation measure calculated for two systems is less than zero [8]. This notion of confidence is defined in the context of a particular evaluation task that we call comparative evaluation: determining the sign of the difference in an evaluation measure. Other evaluation tasks could be defined; estimating the magnitude of the difference or the values of the measures themselves are examples that entail different notions of confidence. We therefore see confidence as a probability estimate.

One of the questions we must ask about a probability estimate is what it means. What does it mean to have 75% confidence that system A is better than system B? As described above, we want it to mean that if we continue to judge documents, there will only be a 25% chance that our assessment will change. If this is what it means, we can trust the confidence estimates. But do we know it has that meaning?
Our calculation of confidence rested on an assumption about the probability of relevance of unjudged documents, specifically that each unjudged document was equally likely to be relevant or nonrelevant. This assumption is almost certainly not realistic in most IR applications. As it turns out, it is this assumption that determines whether the confidence estimates can be trusted. Before elaborating on this, we formally define confidence.

2.1 Estimating Confidence

Average precision (AP) is a standard evaluation metric that captures both the ability of a system to rank relevant documents highly (precision) as well as its ability to retrieve relevant documents (recall). It is typically written as the mean precision at the ranks of relevant documents:

AP = \frac{1}{|R|} \sum_{i \in R} \mathrm{prec}@r(i)

where R is the set of relevant documents and r(i) is the rank of document i. Let X_i be a random variable indicating the relevance of document i. If documents are ordered by rank, we can express precision as prec@i = (1/i) Σ_{j=1}^{i} X_j. Average precision then becomes the quadratic equation

AP = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \frac{X_i}{i} \sum_{j=1}^{i} X_j = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \sum_{j \ge i} a_{ij} X_i X_j

where a_{ij} = 1/\max\{r(i), r(j)\}. Using a_ij instead of 1/i allows us to number the documents arbitrarily. To see why this is true, consider a toy example: a list of 3 documents with relevant documents B, C at ranks 1 and 3 and nonrelevant document A at rank 2. Average precision will be

\frac{1}{2}\left(1 \cdot x_B^2 + \frac{1}{2} x_B x_A + \frac{1}{3} x_B x_C + \frac{1}{2} x_A^2 + \frac{1}{3} x_A x_C + \frac{1}{3} x_C^2\right) = \frac{1}{2}\left(1 + \frac{2}{3}\right)

because x_A = 0, x_B = 1, x_C = 1. Though the ordering B, A, C is different from the labeling A, B, C, it does not affect the computation.

We can now see that average precision itself is a random variable with a distribution over all possible assignments of relevance to all documents. This random variable has an expectation, a variance, confidence intervals, and a certain probability of being less than or equal to a given value. All of these are dependent on the probability that document i is relevant: p_i = p(X_i = 1). Suppose in our previous example we do not know the relevance judgments, but we believe p_A = 0.4, p_B = 0.8, p_C = 0.7. We can then compute e.g. P(AP = 0) = 0.2 · 0.6 · 0.3 = 0.036, or P(AP = 1/2) = 0.2 · 0.4 · 0.7 = 0.056. Summing over all possibilities, we can compute expectation and variance:

E[AP] \approx \frac{1}{\sum_i p_i} \left( \sum_i a_{ii} p_i + \sum_{i \ne j} a_{ij} p_i p_j \right)

\mathrm{Var}[AP] \approx \frac{1}{\left(\sum_i p_i\right)^2} \Big( \sum_i a_{ii}^2 p_i q_i + \sum_{j>i} a_{ij}^2 p_i p_j (1 - p_i p_j) + \sum_i \sum_{j \ne i} 2 a_{ii} a_{ij} p_i p_j (1 - p_i) + \sum_i \sum_{j>i} \sum_{k>j} 2 a_{ij} a_{ik} p_i p_j p_k (1 - p_i) \Big)

AP asymptotically converges to a normal distribution with expectation and variance as defined above.¹

¹These are actually approximations to the true expectation and variance, but the error is a negligible O(n^2 2^{-n}).

For our comparative evaluation task we are interested in the sign of the difference in two average precisions: ΔAP = AP_1 - AP_2. As we showed in our previous work, ΔAP has a closed form when documents are ordered arbitrarily:

\Delta AP = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \sum_{j \ge i} c_{ij} X_i X_j, \qquad c_{ij} = a_{ij} - b_{ij}

where b_ij is defined analogously to a_ij for the second ranking. Since AP is normal, ΔAP is normal as well, meaning we can use the normal cumulative density function to determine the confidence that a difference in AP is less than zero.

Since topics are independent, we can easily extend this to mean average precision (MAP). MAP is also normally distributed; its expectation and variance are:

E[MAP] = \frac{1}{T} \sum_{t \in T} E[AP_t] \qquad (1)

\mathrm{Var}[MAP] = \frac{1}{T^2} \sum_{t \in T} \mathrm{Var}[AP_t]

\Delta MAP = MAP_1 - MAP_2

Confidence can then be estimated by calculating the expectation and variance and using the normal density function to find P(ΔMAP < 0).
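To make the estimation above concrete, the following is a minimal Python sketch (not the paper's code) that reproduces the toy example by brute force: it enumerates every assignment of relevance to the three documents, weights AP by the probability of each assignment to obtain E[AP] and Var[AP], and then uses the normal approximation to turn a hypothetical E[ΔMAP] and Var[ΔMAP] into a confidence P(ΔMAP < 0). The helper names and the numbers passed to confidence_sign_negative are illustrative only.

```python
import itertools
import math
from statistics import NormalDist

# Toy example from the text: B at rank 1 (p_B = 0.8), A at rank 2 (p_A = 0.4),
# C at rank 3 (p_C = 0.7).
ranks = {"B": 1, "A": 2, "C": 3}
probs = {"A": 0.4, "B": 0.8, "C": 0.7}
docs = sorted(ranks, key=ranks.get)  # documents in rank order

def average_precision(rel):
    """AP of the fixed ranking under a 0/1 relevance assignment (0 if nothing is relevant)."""
    num_rel = sum(rel[d] for d in docs)
    if num_rel == 0:
        return 0.0
    hits, total = 0, 0.0
    for i, d in enumerate(docs, start=1):
        if rel[d]:
            hits += 1
            total += hits / i
    return total / num_rel

# "Summing over all possibilities": enumerate every relevance assignment and
# weight AP by its probability under the independent Bernoulli model.
e_ap = e_ap2 = 0.0
for bits in itertools.product([0, 1], repeat=len(docs)):
    rel = dict(zip(docs, bits))
    p = math.prod(probs[d] if rel[d] else 1 - probs[d] for d in docs)
    ap = average_precision(rel)
    e_ap += p * ap
    e_ap2 += p * ap * ap
print(f"E[AP] = {e_ap:.4f}, Var[AP] = {e_ap2 - e_ap**2:.4f}")

# With Delta MAP treated as normal, the confidence that system 1 is worse than
# system 2 is the normal CDF evaluated at zero.
def confidence_sign_negative(e_dmap, var_dmap):
    return NormalDist(e_dmap, math.sqrt(var_dmap)).cdf(0.0)

print(confidence_sign_negative(-0.05, 0.02))  # illustrative numbers only
```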
2.2 Confidence and Robustness

Having defined confidence, we turn back to the issue of trust in confidence estimates, and show how it ties into the robustness of the collection to missing judgments.

Let Z be the set of all pairs of ranked results for a common set of topics. Suppose we have a set of m relevance judgments x^m = {x_1, x_2, ..., x_m} (using small x rather than capital X to distinguish between judged and unjudged documents); these are the judgments against which we compute confidence. Let Z_α be the subset of pairs in Z for which we predict that sign(ΔMAP) = -1 with confidence α given the judgments x^m. For the confidence estimates to be accurate, we need at least α · |Z_α| of these pairs to actually have sign(ΔMAP) = -1 after we have judged every document. If they do, we can trust the confidence estimates; our evaluation will be robust to missing judgments.

If our confidence estimates are based on unrealistic assumptions, we cannot expect them to be accurate. The assumptions they are based on are the probabilities of relevance p_i. We need these to be "realistic". We argue that the best possible distribution of relevance p(X_i) is the one that explains all of the data (all of the observations made about the retrieval systems) while at the same time making no unwarranted assumptions. This is known as the principle of maximum entropy [13].

The entropy of a random variable X with distribution p(X) is defined as H(p) = -Σ_i p(X = i) log p(X = i). This has found a wide array of uses in computer science and information retrieval. The maximum entropy distribution is the one that maximizes H. This distribution is unique and has an exponential form. The following theorem shows the utility of a maximum entropy distribution for relevance when estimating confidence.

Theorem 1. If p(X^n | I, x^m) = argmax_p H(p), confidence estimates will be accurate.

where x^m is the set of relevance judgments defined above, X^n is the full set of documents that we wish to estimate the relevance of, and I is some information about the documents (unspecified as of now). We forgo the proof for the time being, but it is quite simple. This says that the better the estimates of relevance, the more accurate the evaluation. The task of creating a reusable test collection thus becomes the task of estimating the relevance of unjudged documents.

The theorem and its proof say nothing whatsoever about the evaluation metric. The probability estimates are entirely independent of the measure we are interested in. This means the same probability estimates can tell us about average precision as well as precision, recall, bpref, etc. Furthermore, we could assume that the relevance of documents i and j is independent and achieve the same result, which we state as a corollary:

Corollary 1. If p(X_i | I, x^m) = argmax_p H(p), confidence estimates will be accurate.

The task therefore becomes the imputation of the missing values of relevance. The theorem implies that the closer we get to the maximum entropy distribution of relevance, the closer we get to robustness.
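The accuracy criterion defined at the start of this subsection is easy to state operationally. The sketch below is a hypothetical helper, not part of the paper: it checks, for one confidence level α, that among the pairs predicted at confidence at least α, at least an α fraction turned out to have the predicted sign once all judgments were made.

```python
from typing import List, Tuple

def confidence_is_accurate(predictions: List[Tuple[float, bool]], alpha: float) -> bool:
    """predictions: (confidence that sign(dMAP) = -1, whether that sign held
    after every document was judged).  Returns True if the estimates meet the
    accuracy criterion at level alpha."""
    z_alpha = [correct for conf, correct in predictions if conf >= alpha]
    if not z_alpha:
        return True  # no pairs predicted at this level; vacuously accurate
    return sum(z_alpha) / len(z_alpha) >= alpha

# Hypothetical outcomes for four system pairs:
pairs = [(0.95, True), (0.80, True), (0.78, False), (0.60, True)]
print(confidence_is_accurate(pairs, 0.75))
```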
3. PREDICTING RELEVANCE

In our statement of Theorem 1, we left the nature of the information I unspecified. One of the advantages of our confidence estimates is that they admit information from a wide variety of sources; essentially anything that can be modeled can be used as information for predicting relevance.

A natural source of information is the retrieval systems themselves: how they ranked the judged documents, how often they failed to rank relevant documents, how they perform across topics, and so on. If we treat each system as an information retrieval "expert" providing an opinion about the relevance of each document, the problem becomes one of expert opinion aggregation. This is similar to the metasearch or data fusion problem in which the task is to take k input systems and merge them into a single ranking. Aslam et al. [3] previously identified a connection between evaluation and metasearch. Our problem has two key differences:

1. We explicitly need probabilities of relevance that we can plug into Eq. 1; metasearch algorithms have no such requirement.

2. We are accumulating relevance judgments as we proceed with the evaluation and are able to re-estimate relevance given each new judgment.

In light of (1) above, we introduce a probabilistic model for expert combination.

3.1 A Model for Expert Opinion Aggregation

Suppose that each expert j provides a probability of relevance q_ij = p_j(X_i = 1). The information about the relevance of document i will then be the set of k expert opinions I = q_i = (q_{i1}, q_{i2}, ..., q_{ik}). The probability distribution we wish to find is the one that maximizes the entropy of p_i = p(X_i = 1 | q_i). As it turns out, finding the maximum entropy model is equivalent to finding the parameters that maximize the likelihood [5]. Blower [6] explicitly shows that finding the maximum entropy model for a binary variable is equivalent to solving a logistic regression. Then

p_i = p(X_i = 1 \mid \mathbf{q}_i) = \frac{\exp\left(\sum_{j=1}^{k} \lambda_j q_{ij}\right)}{1 + \exp\left(\sum_{j=1}^{k} \lambda_j q_{ij}\right)} \qquad (2)

where λ_1, ..., λ_k are the regression parameters. We include a beta prior for p(λ_j) with parameters α, β. This can be seen as a type of smoothing to account for the fact that the training data is highly biased.

This model has the advantage of including the statistical dependence between the experts. A model of the same form was shown by Clemen & Winkler to be the best for aggregating expert probabilities [10]. A similar maximum-entropy-motivated approach has been used for expert aggregation [15]. Aslam & Montague [1] used a similar model for metasearch, but assumed independence among experts.

Where do the q_ij's come from? Using raw, uncalibrated scores as predictors will not work because score distributions vary too much between topics. A language modeling ranker, for instance, will typically give a much higher score to the top retrieved document for a short query than to the top retrieved document for a long query. We could train a separate predicting model for each topic, but that does not take advantage of all of the information we have: we may only have a handful of judgments for a topic, not enough to train a model to any confidence. Furthermore, it seems reasonable to assume that if an expert makes good predictions for one topic, it will make good predictions for other topics as well. We could use a hierarchical model [12], but that will not generalize to unseen topics. Instead, we will calibrate the scores of each expert individually so that scores can be compared both within topic and between topics. Thus our model takes into account not only the dependence between experts, but also the dependence between experts' performances on different tasks (topics).
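A minimal sketch of the aggregation step, under assumptions not in the text: scikit-learn's LogisticRegression stands in for fitting the λs of Eq. 2 (its L2 penalty plays the role of a prior, though it is Gaussian rather than the beta prior described above, and it fits an intercept), and the expert opinions and judgments are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# q[i, j]: expert j's calibrated probability that document i is relevant.
k_experts, n_judged = 5, 50
q_judged = rng.uniform(size=(n_judged, k_experts))  # synthetic expert opinions
y_judged = (q_judged.mean(axis=1) + rng.normal(0, 0.2, n_judged) > 0.5).astype(int)  # synthetic judgments

# Maximum-entropy / logistic model: p(X_i = 1 | q_i) = sigmoid(sum_j lambda_j * q_ij).
model = LogisticRegression(C=1.0)
model.fit(q_judged, y_judged)

# Estimated relevance probabilities for unjudged documents; these are the p_i
# that would be plugged back into the expectation and variance of AP.
q_unjudged = rng.uniform(size=(3, k_experts))
print(model.predict_proba(q_unjudged)[:, 1])
```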
3.2 Calibrating Experts

Each expert gives us a score and a rank for each document. We need to convert these to probabilities. A method such as the one used by Manmatha et al. [14] could be used to convert scores into probabilities of relevance. The "pairwise preference" method of Carterette & Petkova [9] could also be used, interpreting the ranking of one document over another as an expression of preference.

Let q_ij be expert j's self-reported probability that document i is relevant. Intuitively it seems clear that q_ij should decrease with rank, and it should be zero if document i is unranked (the expert did not believe it to be relevant). The pairwise preference model can handle these two requirements easily, so we will use it. Let θ_{r_j(i)} be the "relevance coefficient" of the document at rank r_j(i). We want to find the θs that maximize the likelihood function:

L_{jt}(\theta) = \prod_{r_j(i) < r_j(k)} \frac{\exp(\theta_{r_j(i)} - \theta_{r_j(k)})}{1 + \exp(\theta_{r_j(i)} - \theta_{r_j(k)})}
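The extracted text ends here, before the calibration procedure is fully specified. As a rough illustration of the likelihood above only, the sketch below fits relevance coefficients for a single ranked list by maximizing the pairwise-preference log-likelihood. The small L2 penalty is added solely to keep the unconstrained maximum finite and is not part of the model in the text, and how the fitted θs become the q_ij is not covered by the text.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: fit relevance coefficients theta_r for one expert's ranking of n
# documents by maximizing the pairwise-preference likelihood, where every pair
# with rank i above rank k contributes sigma(theta_i - theta_k).
n = 5
pairs = [(i, k) for i in range(n) for k in range(i + 1, n)]  # rank i beats rank k

def neg_log_likelihood(theta, reg=0.1):
    diffs = np.array([theta[i] - theta[k] for i, k in pairs])
    # -log sigma(diff) = log(1 + exp(-diff)); the quadratic term keeps the
    # optimum finite and is not part of the model described in the text.
    return np.sum(np.logaddexp(0.0, -diffs)) + reg * np.sum(theta ** 2)

theta = minimize(neg_log_likelihood, x0=np.zeros(n)).x
print(theta)  # decreases with rank, as the text says it should
```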