SIGIR 2007 Proceedings, Session 3: Evaluation

Robust Test Collections for Retrieval Evaluation

Ben Carterette
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts Amherst
Amherst, MA 01003
carteret@cs.umass.edu

ABSTRACT

Low-cost methods for acquiring relevance judgments can be a boon to researchers who need to evaluate new retrieval tasks or topics but do not have the resources to make thousands of judgments. While these judgments are very useful for a one-time evaluation, it is not clear that they can be trusted when re-used to evaluate new systems. In this work, we formally define what it means for judgments to be reusable: the confidence in an evaluation of new systems can be accurately assessed from an existing set of relevance judgments. We then present a method for augmenting a set of relevance judgments with relevance estimates that require no additional assessor effort. Using this method practically guarantees reusability: with as few as five judgments per topic taken from only two systems, we can reliably evaluate a larger set of ten systems. Even the smallest sets of judgments can be useful for evaluation of new systems.

Categories and Subject Descriptors: H.3 Information Storage and Retrieval; H.3.4 Systems and Software: Performance Evaluation

General Terms: Experimentation, Measurement, Reliability

Keywords: information retrieval, evaluation, test collections, reusability

1. INTRODUCTION

Consider an information retrieval researcher who has invented a new retrieval task. She has built a system to perform the task and wants to evaluate it. Since the task is new, it is unlikely that there are any extant relevance judgments. She does not have the time or resources to judge every document, or even every retrieved document. She can only judge the documents that seem to be the most informative and stop when she has a reasonable degree of confidence in her conclusions. But what happens when she develops a new system and needs to evaluate it? Or another research group decides to implement a system to perform the task? Can they reliably reuse the original judgments? Can they evaluate without more relevance judgments?

Evaluation is an important aspect of information retrieval research, but it is only a semi-solved problem: for most retrieval tasks, it is impossible to judge the relevance of every document; there are simply too many of them. The solution used by NIST at TREC (Text REtrieval Conference) is the pooling method [19, 20]: all competing systems contribute N documents to a pool, and every document in that pool is judged. This method creates large sets of judgments that are reusable for training or evaluating new systems that did not contribute to the pool [21].

This solution is not adequate for our hypothetical researcher. The pooling method gives thousands of relevance judgments, but it requires many hours of (paid) annotator time. As a result, there have been a slew of recent papers on reducing annotator effort in producing test collections: Cormack et al. [11], Zobel [21], Sanderson and Joho [17], Carterette et al. [8], and Aslam et al. [4], among others. As we will see, the judgments these methods produce can significantly bias the evaluation of a new set of systems.

Returning to our hypothetical researcher, can she reuse her relevance judgments? First we must formally define what it means to be "reusable". In previous work, reusability has been tested by simply assessing the accuracy of a set of relevance judgments at evaluating unseen systems. While we can say that it was right 75% of the time, or that it had a rank correlation of 0.8, these numbers do not have any predictive power: they do not tell us which systems are likely to be wrong or how confident we should be in any one. We need a more careful definition of reusability.
Specifically, the question of reusability is not how accurately we can evaluate new systems. A "malicious adversary" can always produce a new ranked list that has not retrieved any of the judged documents. The real question is how much confidence we have in our evaluations, and, more importantly, whether we can trust our estimates of confidence. Even if confidence is not high, as long as we can trust it, we can identify which systems need more judgments in order to increase confidence. Any set of judgments, no matter how small, becomes reusable to some degree.

Small, reusable test collections could have a huge impact on information retrieval research. Research groups would be able to share the relevance judgments they have done "in-house" for pilot studies, new tasks, or new topics. The amount of data available to researchers would grow exponentially over time.

2. ROBUST EVALUATION

Above we gave an intuitive definition of reusability: a collection is reusable if we can "trust" our estimates of confidence in an evaluation. By that we mean that if we have made some relevance judgments and have, for example, 75% confidence that system A is better than system B, we would like there to be no more than 25% chance that our assessment of the relative quality of the systems will change as we continue to judge documents. Our evaluation should be robust to missing judgments.

In our previous work, we defined confidence as the probability that the difference in an evaluation measure calculated for two systems is less than zero [8]. This notion of confidence is defined in the context of a particular evaluation task that we call comparative evaluation: determining the sign of the difference in an evaluation measure. Other evaluation tasks could be defined; estimating the magnitude of the difference or the values of the measures themselves are examples that entail different notions of confidence.

We therefore see confidence as a probability estimate. One of the questions we must ask about a probability estimate is what it means. What does it mean to have 75% confidence that system A is better than system B? As described above, we want it to mean that if we continue to judge documents, there will only be a 25% chance that our assessment will change. If this is what it means, we can trust the confidence estimates. But do we know it has that meaning?
Our calculation of confidence rested on an assumption about the probability of relevance of unjudged documents, specifically that each unjudged document was equally likely to be relevant or nonrelevant. This assumption is almost certainly not realistic in most IR applications. As it turns out, it is this assumption that determines whether the confidence estimates can be trusted. Before elaborating on this, we formally define confidence.

2.1 Estimating Confidence

Average precision (AP) is a standard evaluation metric that captures both the ability of a system to rank relevant documents highly (precision) as well as its ability to retrieve relevant documents (recall). It is typically written as the mean precision at the ranks of relevant documents:

\[ AP = \frac{1}{|R|} \sum_{i \in R} \mathrm{prec}@r(i) \]

where R is the set of relevant documents and r(i) is the rank of document i.

Let X_i be a random variable indicating the relevance of document i. If documents are ordered by rank, we can express precision as prec@i = (1/i) \sum_{j=1}^{i} X_j. Average precision then becomes the quadratic equation

\[ AP = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \frac{X_i}{i} \sum_{j=1}^{i} X_j = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \sum_{j \le i} a_{ij} X_i X_j \]

where a_{ij} = 1/\max\{r(i), r(j)\}. Using a_{ij} instead of 1/i allows us to number the documents arbitrarily. To see why this is true, consider a toy example: a list of 3 documents with relevant documents B, C at ranks 1 and 3 and nonrelevant document A at rank 2. Average precision will be

\[ AP = \frac{1}{2}\left(\tfrac{1}{1} x_B^2 + \tfrac{1}{2} x_B x_A + \tfrac{1}{3} x_B x_C + \tfrac{1}{2} x_A^2 + \tfrac{1}{3} x_A x_C + \tfrac{1}{3} x_C^2\right) = \frac{1}{2}\left(1 + \tfrac{1}{3} + \tfrac{1}{3}\right) \]

because x_A = 0, x_B = 1, x_C = 1. Though the ordering B, A, C is different from the labeling A, B, C, it does not affect the computation.

We can now see average precision itself is a random variable with a distribution over all possible assignments of relevance to all documents. This random variable has an expectation, a variance, confidence intervals, and a certain probability of being less than or equal to a given value. All of these are dependent on the probability that document i is relevant: p_i = p(X_i = 1). Suppose in our previous example we do not know the relevance judgments, but we believe p_A = 0.4, p_B = 0.8, p_C = 0.7. We can then compute e.g. P(AP = 0) = 0.2 · 0.6 · 0.3 = 0.036, or P(AP = 1/2) = 0.2 · 0.4 · 0.7 = 0.056. Summing over all possibilities, we can compute expectation and variance (writing q_i = 1 - p_i):

\[ E[AP] \approx \frac{1}{\sum_i p_i} \left( \sum_i a_{ii} p_i + \sum_{i \neq j} a_{ij} p_i p_j \right) \]

\[ Var[AP] \approx \frac{1}{\left(\sum_i p_i\right)^2} \left( \sum_i a_{ii}^2 p_i q_i + \sum_{j > i} a_{ij}^2 p_i p_j (1 - p_i p_j) + \sum_{j \neq i} 2 a_{ii} a_{ij} p_i p_j (1 - p_i) + \sum_{k > j > i} 2 a_{ij} a_{ik} p_i p_j p_k (1 - p_i) \right) \]

AP asymptotically converges to a normal distribution with expectation and variance as defined above. (These are actually approximations to the true expectation and variance, but the error is a negligible O(n 2^{-n}).)

For our comparative evaluation task we are interested in the sign of the difference in two average precisions: ΔAP = AP_1 - AP_2. As we showed in our previous work, ΔAP has a closed form when documents are ordered arbitrarily:

\[ \Delta AP = \frac{1}{\sum_i X_i} \sum_{i=1}^{n} \sum_{j \le i} c_{ij} X_i X_j \]

where c_{ij} = a_{ij} - b_{ij} and b_{ij} is defined analogously to a_{ij} for the second ranking. Since AP is normal, ΔAP is normal as well, meaning we can use the normal cumulative density function to determine the confidence that ΔAP is less than zero.

Since topics are independent, we can easily extend this to mean average precision (MAP). MAP is also normally distributed; its expectation and variance are:

\[ E[MAP] = \frac{1}{T} \sum_{t \in T} E[AP_t], \qquad Var[MAP] = \frac{1}{T^2} \sum_{t \in T} Var[AP_t] \tag{1} \]

\[ \Delta MAP = MAP_1 - MAP_2 \]

Confidence can then be estimated by calculating the expectation and variance and using the normal density function to find P(ΔMAP < 0).
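To make these quantities concrete, the short Python sketch below brute-forces the toy example above: it enumerates all 2^3 relevance assignments for the ranking B, A, C with p_B = 0.8, p_A = 0.4, p_C = 0.7, recovers P(AP = 0) = 0.036, and computes the exact E[AP] and Var[AP]. It then shows the normal-CDF step used to turn an expectation and variance of a difference into a confidence. The enumeration and the helper names are our own illustration; the paper itself uses the closed-form approximations above rather than brute force.

```python
from itertools import product
from math import erf, sqrt

# Toy example from Section 2.1: ranks are B=1, A=2, C=3, and p_i is the
# believed probability that document i is relevant.
ranked = ["B", "A", "C"]
p = {"B": 0.8, "A": 0.4, "C": 0.7}

def average_precision(rel):
    """AP of the fixed ranking given a 0/1 relevance assignment."""
    num_rel = sum(rel.values())
    if num_rel == 0:
        return 0.0
    total, rel_so_far = 0.0, 0
    for rank, doc in enumerate(ranked, start=1):
        if rel[doc]:
            rel_so_far += 1
            total += rel_so_far / rank   # precision at the rank of a relevant doc
    return total / num_rel

# Enumerate all 2^3 assignments to get the full distribution of AP.
e_ap = e_ap2 = p_zero = 0.0
for bits in product([0, 1], repeat=len(ranked)):
    rel = dict(zip(ranked, bits))
    prob = 1.0
    for doc in ranked:
        prob *= p[doc] if rel[doc] else 1.0 - p[doc]
    ap = average_precision(rel)
    e_ap += prob * ap
    e_ap2 += prob * ap * ap
    if ap == 0.0:
        p_zero += prob

print(f"P(AP = 0) = {p_zero:.3f}")                      # 0.6 * 0.2 * 0.3 = 0.036
print(f"E[AP] = {e_ap:.3f}, Var[AP] = {e_ap2 - e_ap**2:.3f}")

def confidence(e_dmap, var_dmap):
    """P(dMAP < 0) for a normally distributed dMAP, via the normal CDF."""
    return 0.5 * (1.0 + erf((0.0 - e_dmap) / sqrt(2.0 * var_dmap)))
```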
2.2 Confidence and Robustness

Having defined confidence, we turn back to the issue of trust in confidence estimates, and show how it ties into the robustness of the collection to missing judgments.

Let Z be the set of all pairs of ranked results for a common set of topics. Suppose we have a set of m relevance judgments x^m = {x_1, x_2, ..., x_m} (using small x rather than capital X to distinguish between judged and unjudged documents); these are the judgments against which we compute confidence. Let Z_α be the subset of pairs in Z for which we predict that ΔMAP < 0 with confidence α given the judgments x^m. For the confidence estimates to be accurate, we need at least α · |Z_α| of these pairs to actually have ΔMAP < 0 after we have judged every document. If they do, we can trust the confidence estimates; our evaluation will be robust to missing judgments.

If our confidence estimates are based on unrealistic assumptions, we cannot expect them to be accurate. The assumptions they are based on are the probabilities of relevance p_i. We need these to be "realistic". We argue that the best possible distribution of relevance p(X_i) is the one that explains all of the data (all of the observations made about the retrieval systems) while at the same time making no unwarranted assumptions. This is known as the principle of maximum entropy [13]. The entropy of a random variable X with distribution p(X) is defined as H(p) = -\sum_i p(X = i) \log p(X = i). This has found a wide array of uses in computer science and information retrieval. The maximum entropy distribution is the one that maximizes H. This distribution is unique and has an exponential form. The following theorem shows the utility of a maximum entropy distribution for relevance when estimating confidence.

Theorem 1. If p(X^n | I, x^m) = argmax_p H(p), confidence estimates will be accurate.

where x^m is the set of relevance judgments defined above, X^n is the full set of documents that we wish to estimate the relevance of, and I is some information about the documents (unspecified as of now). We forgo the proof for the time being, but it is quite simple. This says that the better the estimates of relevance, the more accurate the evaluation. The task of creating a reusable test collection thus becomes the task of estimating the relevance of unjudged documents.

The theorem and its proof say nothing whatsoever about the evaluation metric. The probability estimates are entirely independent of the measure we are interested in. This means the same probability estimates can tell us about average precision as well as precision, recall, bpref, etc. Furthermore, we could assume that the relevance of documents i and j is independent and achieve the same result, which we state as a corollary:

Corollary 1. If p(X_i | I, x^m) = argmax_p H(p), confidence estimates will be accurate.

The task therefore becomes the imputation of the missing values of relevance. The theorem implies that the closer we get to the maximum entropy distribution of relevance, the closer we get to robustness.
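As a small worked instance of the principle (our illustration, not part of the paper's argument): with no information I at all, the entropy of a single binary relevance variable is maximized by the uniform choice, which is exactly the p_i = 0.5 assumption criticized at the start of this section. The expert opinions introduced next are the information that justifies moving the maximum entropy estimates away from 0.5.

```latex
H(p_i) = -\,p_i \log p_i - (1 - p_i)\log(1 - p_i),
\qquad
\frac{dH}{dp_i} = \log\frac{1 - p_i}{p_i} = 0
\;\Longrightarrow\; p_i = \tfrac{1}{2}.
```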
3. PREDICTING RELEVANCE

In our statement of Theorem 1, we left the nature of the information I unspecified. One of the advantages of our confidence estimates is that they admit information from a wide variety of sources; essentially anything that can be modeled can be used as information for predicting relevance.

A natural source of information is the retrieval systems themselves: how they ranked the judged documents, how often they failed to rank relevant documents, how they perform across topics, and so on. If we treat each system as an information retrieval "expert" providing an opinion about the relevance of each document, the problem becomes one of expert opinion aggregation. This is similar to the metasearch or data fusion problem in which the task is to take k input systems and merge them into a single ranking. Aslam et al. [3] previously identified a connection between evaluation and metasearch. Our problem has two key differences:

1. We explicitly need probabilities of relevance that we can plug into Eq. 1; metasearch algorithms have no such requirement.
2. We are accumulating relevance judgments as we proceed with the evaluation and are able to re-estimate relevance given each new judgment.

In light of (1) above, we introduce a probabilistic model for expert combination.

3.1 A Model for Expert Opinion Aggregation

Suppose that each expert j provides a probability of relevance q_{ij} = p_j(X_i = 1). The information about the relevance of document i will then be the set of k expert opinions I = q_i = (q_{i1}, q_{i2}, ..., q_{ik}). The probability distribution we wish to find is the one that maximizes the entropy of p_i = p(X_i = 1 | q_i). As it turns out, finding the maximum entropy model is equivalent to finding the parameters that maximize the likelihood [5]. Blower [6] explicitly shows that finding the maximum entropy model for a binary variable is equivalent to solving a logistic regression. Then

\[ p_i = p(X_i = 1 \mid q_i) = \frac{\exp\left(\sum_{j=1}^{k} \lambda_j q_{ij}\right)}{1 + \exp\left(\sum_{j=1}^{k} \lambda_j q_{ij}\right)} \tag{2} \]

where λ_1, ..., λ_k are the regression parameters. We include a beta prior for p(λ_j) with parameters α, β. This can be seen as a type of smoothing to account for the fact that the training data is highly biased.

This model has the advantage of including the statistical dependence between the experts. A model of the same form was shown by Clemen & Winkler to be the best for aggregating expert probabilities [10]. A similar maximum-entropy-motivated approach has been used for expert aggregation [15]. Aslam & Montague [1] used a similar model for metasearch, but assumed independence among experts.
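A minimal Python sketch of Eq. (2) follows (our illustration; the paper's own implementation is in R and places a beta prior on the coefficients, for which ordinary L2 regularization stands in here). Given a matrix Q of calibrated expert probabilities and judgments for a subset of documents, a logistic regression over the expert opinions yields p_i for the unjudged documents. All data below are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Q = rng.uniform(size=(1000, 5))       # Q[i, j] = calibrated opinion q'_ij of expert j
judged = np.arange(50)                # pretend the first 50 documents are judged
x = (Q[judged].mean(axis=1) > 0.5).astype(int)   # placeholder 0/1 judgments

# Logistic regression on the expert opinions, as in Eq. (2); fit_intercept=False
# matches the form of Eq. (2), and the L2 penalty (C) stands in for the beta prior.
model = LogisticRegression(C=1.0, fit_intercept=False).fit(Q[judged], x)

unjudged = np.setdiff1d(np.arange(Q.shape[0]), judged)
p_unjudged = model.predict_proba(Q[unjudged])[:, 1]   # p_i = p(X_i = 1 | q_i)
```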
Where do the q_{ij}'s come from? Using raw, uncalibrated scores as predictors will not work because score distributions vary too much between topics. A language modeling ranker, for instance, will typically give a much higher score to the top retrieved document for a short query than to the top retrieved document for a long query. We could train a separate predicting model for each topic, but that does not take advantage of all of the information we have: we may only have a handful of judgments for a topic, not enough to train a model to any confidence. Furthermore, it seems reasonable to assume that if an expert makes good predictions for one topic, it will make good predictions for other topics as well. We could use a hierarchical model [12], but that will not generalize to unseen topics. Instead, we will calibrate the scores of each expert individually so that scores can be compared both within topic and between topic. Thus our model takes into account not only the dependence between experts, but also the dependence between experts' performances on different tasks (topics).

3.2 Calibrating Experts

Each expert gives us a score and a rank for each document. We need to convert these to probabilities. A method such as the one used by Manmatha et al. [14] could be used to convert scores into probabilities of relevance. The "pairwise preference" method of Carterette & Petkova [9] could also be used, interpreting the ranking of one document over another as an expression of preference.

Let q_{ij} be expert j's self-reported probability that document i is relevant. Intuitively it seems clear that q_{ij} should decrease with rank, and it should be zero if document i is unranked (the expert did not believe it to be relevant). The pairwise preference model can handle these two requirements easily, so we will use it. Let γ_{r_j(i)} be the "relevance coefficient" of the document at rank r_j(i). We want to find the γs that maximize the likelihood function:

\[ L_{jt}(\gamma) = \prod_{r_j(i) < r_j(k)} \frac{\exp\left(\gamma_{r_j(i)} - \gamma_{r_j(k)}\right)}{1 + \exp\left(\gamma_{r_j(i)} - \gamma_{r_j(k)}\right)} \]

We again include a beta prior on p(γ_{r_j(i)}), with parameters |R_t| + 1 and |N_t| + 1, the sizes of the sets of judged relevant and nonrelevant documents respectively. Using these as prior parameters ensures that the resulting probabilities will be concentrated around the ratio of relevant documents that have been discovered for topic t. This means that the probability estimates decrease by rank and are higher for topics that have more relevant documents.

After finding the γ that maximizes the likelihood, we have q_{ij} = \exp(\gamma_{r_j(i)}) / (1 + \exp(\gamma_{r_j(i)})). We define the coefficient for unranked documents to be -∞, so that the probability that an unranked document is relevant is 0. Since q_{ij} is based on the rank at which a document is retrieved rather than the identity of the document itself, the probabilities are identical from expert to expert, e.g. if expert E put document A at rank 1, and expert D put document B at rank 1, we will have q_{AE} = q_{BD}. Therefore we only have to solve this once for each topic.

The above model gives topic-independent probabilities for each document. But suppose an expert who reports 90% probability is only right 50% of the time. Its opinion should be discounted based on its observed performance. Specifically, we want to learn a calibration function q'_{ij} = C_j(q_{ij}) that will ensure that the predicted probabilities are tuned to the expert's ability to retrieve relevant documents given the judgments that have been made to this point. Platt's SVM calibration method [16] fits a sigmoid function between q_{ij} and the relevance judgments to obtain

\[ q'_{ij} = C_j(q_{ij}) = \frac{\exp(A_j + B_j q_{ij})}{1 + \exp(A_j + B_j q_{ij})} \]

Since q_{ij} is topic-independent, we only need to learn one calibration function for each expert. Once we have the calibration function, it is applied to adjust the experts' predictions to their actual performance. The calibrated probabilities are plugged into model (2) to find the document probabilities.
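The calibration step can be sketched in a few lines: fit the sigmoid C_j by an ordinary single-feature logistic regression between an expert's rank-based probabilities and the judgments made so far (a stand-in for Platt's exact procedure, which additionally regularizes the 0/1 targets). The data below are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One expert: q[i] is its rank-based probability for a judged document,
# x[i] is the corresponding 0/1 relevance judgment (placeholder values).
q = np.array([0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20])
x = np.array([1,    1,    0,    1,    0,    0,    1,    0])

# Fit q'_ij = C_j(q_ij) = exp(A_j + B_j q_ij) / (1 + exp(A_j + B_j q_ij)).
calib = LogisticRegression().fit(q.reshape(-1, 1), x)

def calibrate(q_new):
    """Apply the learned calibration function C_j to new self-reported probabilities."""
    return calib.predict_proba(np.asarray(q_new, dtype=float).reshape(-1, 1))[:, 1]

print(calibrate([0.85, 0.35]))   # calibrated probabilities for two unjudged ranks
```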
3.3 Model Summary

Our model has three components that differ by the data they take as input and what they produce as output. A conceptual diagram is shown in Figure 1.

1. Ranks → probabilities (per system, per topic). This gives us q_{ij}, expert j's self-reported probability of the relevance of document i. This is unsupervised; it requires no labeled data (though if we have some, we use it to set prior parameters).
2. Probabilities → calibrated probabilities (per system). This gives us q'_{ij} = C_j(q_{ij}), expert j's calibrated probability of the relevance of document i. This is semi-supervised; we have relevance judgments at some ranks which we use to impute the probability of relevance at other ranks.
3. Calibrated probabilities → document probabilities. This gives us p_i = p(X_i = 1 | q'_i), the probability of relevance of document i given calibrated expert probabilities q'_{ij}. This is supervised; we learn coefficients from a set of judged documents and use those to estimate the relevance of the unjudged documents.

Figure 1: Conceptual diagram of our aggregation model. Experts E1, E2 have ranked documents A, B, C for topic T1 and documents D, E, F for topic T2. The first step is to obtain q_{ij}. Next is calibration to true performance to find q'_{ij}. Finally we obtain p_i = p(X_i = 1 | q'_{i1}, q'_{i2}), etc.

Although the model appears rather complex, it is really just three successive applications of logistic regression. As such, it can be implemented in a statistical programming language such as R in a few lines of code. The use of beta (conjugate) priors ensures that no expensive computational methods such as MCMC are necessary [12], so the model is trained and applied fast enough to be used on-line. Our code is available at http://ciir.cs.umass.edu/~carteret/.

4. EXPERIMENTS

Three hypotheses are under consideration. The first, and most important, is that using our expert aggregation model to predict relevance produces test collections that are robust enough to be reusable; that is, we can trust the estimates of confidence when we evaluate systems that did not contribute any judgments to the pool. The other two hypotheses relate to the improvement we see by using better estimates of relevance than we did in our previous work [8]. These are that (a) it takes fewer relevance judgments to reach 95% confidence and (b) the accuracy of the predictions is higher than if we were to simply assume p_i = 0.5 for all unjudged documents.

4.1 Data

We obtained full ad hoc runs submitted to TRECs 3 through 8. Each run ranks at most 1000 documents for 50 topics (49 topics for TREC 4). Additionally, we obtained all runs from the Web track of TREC 13, the Robust track of TREC 14 ("robust" here means robust retrieval; this is different from our goal of robust evaluation), and the Terabyte (ad hoc) track of TREC 14. These are the tracks that have "replaced" the ad hoc track since its end in 1999. Statistics are shown in Table 1.

    track         no. topics   no. runs   no. judged   no. rel
    ad hoc 94         50           40       97,319      9,805
    ad hoc 95         49           33       87,069      6,503
    ad hoc 96         50           61      133,681      5,524
    ad hoc 97         50           74       72,270      4,611
    ad hoc 98         50          103       80,345      4,674
    ad hoc 99         50          129       86,830      4,728
    web 04           225           74       88,566      1,763
    robust 05         50           74       37,798      6,561
    terabyte 05       50           58       45,291     10,407

Table 1: Number of topics, number of runs, number of documents judged, and number found relevant for each of our data sets.

We set aside the TREC 4 (ad hoc 95) set for training, TRECs 3 and 5–8 (ad hoc 94 and 96–99) for primary testing, and the remaining sets for additional testing. We use the qrels files assembled by NIST as "truth". The number of relevance judgments made and relevant documents found for each track are listed in Table 1.

For computational reasons, we truncate ranked lists at 100 documents. There is no reason that we could not go deeper, but calculating variance is O(n^3) and thus very time-consuming. Because of the reciprocal rank nature of AP, we do not lose much information by truncating at rank 100.

4.2 Algorithms

We will compare three algorithms for acquiring relevance judgments. The baseline is a variation of TREC pooling that we will call incremental pooling (IP). This algorithm takes a number k as input and presents the first k documents in rank order (without regard to topic) to be judged. It does not estimate the relevance of unjudged documents; it simply assumes any unjudged document is nonrelevant.

The second algorithm is that presented in Carterette et al. [8] (Algorithm 1). Documents are selected based on how "interesting" they are in determining whether a difference in mean average precision exists. For this approach p_i = 0.5 for all i; there is no estimation of probabilities. We will call this MTC for minimal test collection.

Algorithm 1 (MTC). Given two ranked lists and a confidence level α, predict the sign of ΔMAP.
1: p_i ← 0.5 for all documents i
2: while P(ΔMAP < 0) < α do
3:   calculate weight w_i for all unjudged documents i (see Carterette et al. [8] for details)
4:   j ← argmax_i w_i
5:   x_j ← 1 if document j is relevant, 0 otherwise
6:   p_j ← x_j
7: end while

The third algorithm augments MTC with updated estimates of probabilities of relevance. We will call this RTC for robust test collection. It is identical to Algorithm 1, except that every 10th iteration we estimate p_i for all unjudged documents i using the expert aggregation model of Section 3. RTC has smoothing (prior distribution) parameters that must be set. We trained using the ad hoc 95 set. We limited the search to uniform priors with relatively high variance. For expert aggregation, the prior parameters are α = β = 1.
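A rough Python sketch of the selection loop shared by MTC and RTC follows, under our own simplifying interface: weight_fn stands in for the document weights of Carterette et al. [8], confidence_fn for the P(ΔMAP < 0) computation of Section 2.1, judge_fn for the human assessor, and reestimate_fn for the expert aggregation model of Section 3 (pass None to get plain MTC). This illustrates the control flow only; it is not the authors' implementation.

```python
import numpy as np

def mtc_rtc_loop(weight_fn, judge_fn, confidence_fn, n_docs,
                 alpha=0.95, reestimate_fn=None):
    """Sketch of Algorithm 1 (MTC); passing reestimate_fn gives the RTC variant."""
    p = np.full(n_docs, 0.5)       # step 1: uniform probabilities of relevance
    judged = {}                    # document id -> 0/1 judgment (the x_j)
    iteration = 0
    while confidence_fn(p, judged) < alpha:          # step 2: P(dMAP < 0) < alpha
        w = np.asarray(weight_fn(p, judged), float)  # step 3: document weights
        if judged:
            w[list(judged)] = -np.inf                # never re-judge a document
        j = int(np.argmax(w))                        # step 4: most informative doc
        judged[j] = judge_fn(j)                      # step 5: obtain the judgment
        p[j] = judged[j]                             # step 6
        iteration += 1
        if reestimate_fn is not None and iteration % 10 == 0:
            p = reestimate_fn(p, judged)             # RTC: re-estimate relevance
    return p, judged
```

A real implementation would presumably also stop once P(ΔMAP > 0) reaches α, i.e., once either sign of the difference is established with enough confidence; the sketch follows the paper's pseudocode as written.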
4.3 Experimental Design

First, we want to know whether we can augment a set of relevance judgments with a set of relevance probabilities in order to reuse the judgments to evaluate a new set of systems. For each experimental trial:

1. Pick a random subset of k runs.
2. From those k, pick an initial c < k to evaluate.
3. Run RTC to 95% confidence on the initial c.
4. Using the model from Section 3, estimate the probabilities of relevance for all documents retrieved by all k runs.
5. Calculate E[MAP] for all k runs, and P(ΔMAP < 0) for all pairs of runs.

We do the same for MTC, but omit step 4. Note that after evaluating the first c systems, we make no additional relevance judgments. To put our method to the test, we selected c = 2: we will build a set of judgments from evaluating only two initial systems. We will then generalize to a set of k = 10 (of which those two are a subset). As we run more trials, we obtain the data we need to test all three of our hypotheses.

4.4 Experimental Evaluation

Recall that a set of judgments is robust if the accuracy of the predictions it makes is at least its estimated confidence. One way to evaluate robustness is to bin pairs by their confidence, then calculate the accuracy over all the pairs in each bin. We would like the accuracy to be no less than the lowest confidence score in the bin, but preferably higher.
Since summary statistics are useful, we devised the following metric. Suppose we are a bookmaker taking bets on whether ΔMAP < 0. We use RTC or MTC to set the odds O = P(ΔMAP < 0) / (1 - P(ΔMAP < 0)). Suppose a bettor wagers $1 on ΔMAP ≥ 0. If it turns out that ΔMAP < 0, we win the dollar. Otherwise, we pay out O. If our confidence estimates are perfectly accurate, we break even. If confidence is greater than accuracy, we lose money; we win if accuracy is greater than confidence. Counterintuitively, the most desirable outcome is breaking even: if we lose money, we cannot trust the confidence estimates, but if we win money, we have either underestimated confidence or judged more documents than necessary. However, the cost of not being able to trust the confidence estimates is higher than the cost of extra relevance judgments, so we will treat positive outcomes as "good".

The amount we win on each pairwise comparison i is:

\[ W_i = y_i - (1 - y_i)\frac{P_i}{1 - P_i} = \frac{y_i - P_i}{1 - P_i} \]

where y_i = 1 if ΔMAP < 0 and 0 otherwise, and P_i = P(ΔMAP < 0). The summary statistic is W, the mean of W_i. Note that as P_i increases, we lose more for being wrong. This is as it should be: the penalty should be great for missing the high probability predictions. However, since our losses grow without bound as predictions approach certainty, we cap -W_i at 100.

For our hypothesis that RTC requires fewer judgments than MTC, we are interested in the number of judgments needed to reach 95% confidence on the first pair of systems. The median is more interesting than the mean: most pairs require a few hundred judgments, but a few pairs require several thousand. The distribution is therefore highly skewed, and the mean strongly affected by those outliers.

Finally, for our hypothesis that RTC is more accurate than MTC, we will look at Kendall's τ correlation between a ranking of k systems by a small set of judgments and the true ranking using the full set of judgments. Kendall's τ, a nonparametric statistic based on pairwise swaps between two lists, is a standard evaluation for this type of study. It ranges from -1 (perfectly anti-correlated) to 1 (rankings identical), with 0 meaning that half of the pairs are swapped. As we touched on in the introduction, though, an accuracy measure like rank correlation is not a good evaluation of reusability. We include it for completeness.

                          MTC                      RTC
    confidence      % in bin   accuracy     % in bin   accuracy
    0.5–0.6           33.7%      61.7%        28.6%      61.9%
    0.6–0.7           18.1%      73.1%        20.1%      76.3%
    0.7–0.8           10.4%      70.1%        15.5%      78.0%
    0.8–0.9            9.4%      69.0%        12.1%      84.9%
    0.9–0.95           7.3%      78.0%         6.6%      93.1%
    0.95–0.99         17.9%      70.4%        12.4%      93.4%
    1.0                3.3%      68.3%         4.7%      98.9%
    W                     -5.34                   -0.39
    median judged           251                     235
    mean τ                0.393                   0.555

Table 2: Confidence that P(ΔMAP < 0) and accuracy of prediction when generalizing a set of relevance judgments acquired using MTC and RTC. Each bin contains over 1,000 trials from the ad hoc 3, 5–8 sets. RTC is much more robust than MTC. W is defined in Section 4.4; closer to 0 is better. Median judged is the number of judgments to reach 95% confidence on the first two systems. Mean τ is the average rank correlation for all 10 systems.

4.4.1 Hypothesis Testing

Running multiple trials allows the use of statistical hypothesis testing to compare algorithms. Using the same sets of systems allows the use of paired tests. As we stated above, we are more interested in the median number of judgments than the mean. A test for difference in median is the Wilcoxon signed rank test. We can also use a paired t-test to test for a difference in mean. For rank correlation, we can use a paired t-test to test for a difference in τ.
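The evaluation machinery of Sections 4.4 and 4.4.1 is easy to express in Python; the sketch below computes the capped mean W and per-bin accuracies from arrays of predicted probabilities and outcomes, and notes the corresponding SciPy significance tests. The array names are placeholders for data produced by the experimental trials, and the bin edges and the use of max(P, 1-P) as the confidence in the predicted sign are our interpretation.

```python
import numpy as np
from scipy import stats

def mean_W(P, y, cap=100.0):
    """Mean bookmaker winnings (Section 4.4). P[i] = predicted P(dMAP < 0) for
    pair i; y[i] = 1 if dMAP < 0 under the full judgments, else 0. Losses are
    capped at `cap` as in the paper."""
    P = np.clip(np.asarray(P, float), 1e-9, 1 - 1e-9)   # avoid division by zero
    W = (np.asarray(y, float) - P) / (1.0 - P)
    return float(np.mean(np.maximum(W, -cap)))

def binned_accuracy(P, y, edges=(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0)):
    """Accuracy of the sign predictions within each confidence bin."""
    P, y = np.asarray(P, float), np.asarray(y, int)
    conf = np.maximum(P, 1.0 - P)          # confidence in the predicted sign
    correct = (P >= 0.5) == (y == 1)       # predicted sign agrees with the truth
    acc = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        top = (conf <= hi) if hi == edges[-1] else (conf < hi)
        mask = (conf >= lo) & top
        if mask.any():
            acc[(lo, hi)] = float(correct[mask].mean())
    return acc

# Paired tests from Section 4.4.1, given per-trial arrays for the two methods:
# stats.wilcoxon(judgments_mtc, judgments_rtc)   # difference in medians
# stats.ttest_rel(judgments_mtc, judgments_rtc)  # difference in means
# stats.ttest_rel(tau_mtc, tau_rtc)              # difference in mean Kendall tau
# stats.kendalltau(true_map, estimated_map)      # computes a tau value itself
```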
5. RESULTS AND ANALYSIS

The comparison between MTC and RTC is shown in Table 2. With MTC and uniform probabilities of relevance, the results are far from robust. We cannot reuse the relevance judgments with much confidence. But with RTC, the results are very robust. There is a slight dip in accuracy when confidence gets above 0.95; nonetheless, the confidence predictions are trustworthy. Mean W_i shows that RTC is much closer to 0 than MTC.

The distribution of confidence scores shows that at least 80% confidence is achieved more than 35% of the time, indicating that neither algorithm is being too conservative in its confidence estimates. The confidence estimates are rather low overall; that is because we have built a test collection from only two initial systems. Recall from Section 1 that we cannot require (or even expect) a minimum level of confidence when we generalize to new systems.

More detailed results for both algorithms are shown in Figure 2. The solid line is the ideal result that would give W = 0. RTC is on or above this line at all points until confidence reaches about 0.97. After that there is a slight dip in accuracy which we discuss below. Note that both algorithms are well above the line up to around confidence 0.7. This is because the baseline performance on these data sets is high; it is quite easy to achieve 75% accuracy doing very little work [7].

Figure 2: Confidence vs. accuracy of MTC and RTC. The solid line is the perfect result that would give W = 0; performance should be on or above this line. Each point represents at least 500 pairwise comparisons.

Number of Judgments: The median number of judgments required by MTC to reach 95% confidence on the first two systems is 251, an average of 5 per topic. The median required by RTC is 235, about 4.7 per topic. Although the numbers are close, RTC's median is significantly lower by a paired Wilcoxon test (p < 0.0001). For comparison, a pool of depth 100 would result in a minimum of 5,000 judgments for each pair.

The difference in means is much greater. MTC required a mean of 823 judgments, 16 per topic, while RTC required a mean of 502, 10 per topic. (Recall that means are strongly skewed by a few pairs that take thousands of judgments.) This difference is significant by a paired t-test (p < 0.0001).

Ten percent of the sets resulted in 100 or fewer judgments (less than two per topic). Performance on these is very high: W = 0.41, and 99.7% accuracy when confidence is at least 0.9. This shows that even tiny collections can be reusable. For the 50% of sets with more than 235 judgments, accuracy is 93% when confidence is at least 0.9.

Rank Correlation: MTC and RTC both rank the 10 systems by E[MAP] (Eq. (1)) calculated using their respective probability estimates. The mean rank correlation between true MAP and E[MAP] is 0.393 for MTC and 0.555 for RTC. This difference is significant by a paired t-test (p < 0.0001). Note that we do not expect the correlations to be high, since we are ranking the systems with so few relevance judgments. It is more important that we estimate confidence in each pairwise comparison correctly.
We ran IP for the same number of judgments that MTC took for each pair, then ranked the systems by MAP using only those judgments (all unjudged documents assumed nonrelevant). We calculated the correlation to the true ranking. The mean correlation is 0.398, which is not significantly different from MTC, but is significantly lower than RTC. Using uniform estimates of probability is indistinguishable from the baseline, whereas estimating relevance by expert aggregation boosts performance a great deal: nearly 40% over both MTC and IP.

Overfitting: It is possible to "overfit": if too many judgments come from the first two systems, the variance in ΔMAP is reduced and the confidence estimates become unreliable. We saw this in Table 2 and Figure 2, where RTC exhibits a dip in accuracy when confidence is around 97%. In fact, the number of judgments made prior to a wrong prediction is over 50% greater than the number made prior to a correct prediction. Overfitting is difficult to quantify exactly, because making more relevance judgments does not always cause it: at higher confidence levels, more relevance judgments are made, and as Table 2 shows, accuracy is greater at those higher confidences. Obviously having more relevance judgments should increase both confidence and accuracy; the difference seems to be when one system has a great deal more judgments than the other.

Pairwise Comparisons: Our pairwise comparisons fall into one of three groups: 1. the two original runs from which relevance judgments are acquired; 2. one of the original runs vs. one of the new runs; 3. two new runs. Table 3 shows confidence vs. accuracy results for each of these three groups. Interestingly, performance is worst when comparing one of the original runs to one of the additional runs. This is most likely due to a large difference in the number of judgments affecting the variance of ΔMAP. Nevertheless, performance is quite good on all three subsets.

                               accuracy
    confidence     two original   one original   no original
    0.5–0.6             –             48.1%          62.8%
    0.6–0.7             –             57.1%          79.2%
    0.7–0.8             –             67.9%          81.7%
    0.8–0.9             –             82.2%          86.3%
    0.9–0.95          95.9%           93.7%          92.6%
    0.95–0.99         96.2%           92.5%          93.1%
    1.0                100%           98.0%          99.1%
    W                 -1.11           -0.87          -0.27

Table 3: Confidence vs. accuracy of RTC when comparing the two original runs, one original run and one new run, and two new runs. RTC is robust in all three cases.

Worst Case: The case intuitively most likely to produce an error is when the two systems being compared have retrieved very few documents in common. If we want the judgments to be reusable, we should be able to generalize even to runs that are very different from the ones used to acquire the relevance judgments.

A simple measure of similarity of two runs is the average percentage of documents they retrieved in common for each topic [2]. We calculated this for all pairs, then looked at performance on pairs with low similarity. Results are shown in Table 4. Performance is in fact very robust even when similarity is low. When the two runs share very few documents in common, W is actually positive. MTC and IP both performed quite poorly in these cases. When the similarity was between 0 and 10%, both MTC and IP correctly predicted ΔMAP only 60% of the time, compared to an 87.6% success rate for RTC.

                    accuracy when similarity is
    confidence       0–0.1      0.1–0.2     0.2–0.3
    0.5–0.6          68.4%       63.1%       61.4%
    0.6–0.7          84.2%       78.6%       76.6%
    0.7–0.8          82.0%       79.8%       78.9%
    0.8–0.9          93.6%       83.3%       82.1%
    0.9–0.95         99.3%       92.7%       92.4%
    0.95–0.99        98.7%       93.4%       93.3%
    1.0              99.9%       97.9%       98.1%
    W                 0.44       -0.45       -0.49

Table 4: Confidence vs. accuracy of RTC when a pair of systems retrieved 0–30% documents in common (broken out into 0%–10%, 10%–20%, and 20%–30%). RTC is robust in all three cases.
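The similarity measure used in the "Worst Case" analysis above is simple to compute; the sketch below averages per-topic overlap between two runs. The paper does not spell out the normalizing denominator, so dividing by the larger retrieved set is our assumption, and the data structures and example values are our own.

```python
def run_similarity(run_a, run_b):
    """Average per-topic fraction of documents two runs retrieved in common.
    run_a and run_b map a topic id to the list of retrieved document ids."""
    overlaps = []
    for topic in run_a.keys() & run_b.keys():
        a, b = set(run_a[topic]), set(run_b[topic])
        if a and b:
            overlaps.append(len(a & b) / max(len(a), len(b)))
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

# Example with two tiny, made-up runs over two topics:
run_a = {"t1": ["d1", "d2", "d3"], "t2": ["d4", "d5"]}
run_b = {"t1": ["d2", "d3", "d9"], "t2": ["d6", "d7"]}
print(run_similarity(run_a, run_b))   # (2/3 + 0/2) / 2 = 0.333...
```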
By Data Set: All the previous results have only been on the ad hoc collections. We did the same experiments on our additional data sets, and broke out the results by data set to see how performance varies. The results in Table 5 show everything about each set, including binned accuracy, W, mean τ, and median number of judgments to reach 95% confidence on the first two systems. The results are highly consistent from collection to collection, suggesting that our method is not overfitting to any particular data set.

                                                   accuracy
    confidence     ad hoc 94  ad hoc 96  ad hoc 97  ad hoc 98  ad hoc 99   web 04  robust 05  terabyte 05
    0.5–0.6           64.1%      61.8%      62.2%      62.0%      59.4%     64.3%     61.5%       61.6%
    0.6–0.7           76.1%      77.8%      74.5%      78.2%      74.3%     78.1%     75.9%       75.9%
    0.7–0.8           75.2%      78.9%      77.6%      80.0%      78.6%     82.6%     77.5%       80.4%
    0.8–0.9           83.2%      85.5%      84.6%      84.9%      86.8%     84.5%     86.7%       87.7%
    0.9–0.95          93.0%      93.6%      92.8%      93.7%      92.6%     94.2%     93.9%       94.2%
    0.95–0.99         93.1%      94.3%      93.1%      93.7%      92.8%     95.0%     93.9%       91.6%
    1.0               99.2%      96.8%      98.7%      99.5%      99.6%      100%     99.2%       98.3%
    W                 -0.34      -0.34      -0.48      -0.35      -0.44     -0.07     -0.41       -0.67
    median judged       235        276        243        213        179       448       310         320
    mean τ            0.538      0.573      0.556      0.579      0.532     0.596     0.565       0.574

Table 5: Accuracy, W, mean τ, and median number of judgments for all 8 testing sets. The results are highly consistent across data sets.

6. CONCLUSIONS AND FUTURE WORK

In this work we have offered the first formal definition of the common idea of "reusability" of a test collection and presented a model that is able to achieve reusability with very small sets of relevance judgments. Table 2 and Figure 2 together show how biased a small set of judgments can be: MTC is dramatically overestimating confidence and is much less accurate than RTC, which is able to remove the bias to give a robust evaluation. The confidence estimates of RTC, in addition to being accurate, provide a guide for obtaining additional judgments: focus on judging documents from the lowest-confidence comparisons.

In the long run, we see small sets of relevance judgments being shared by researchers, each group contributing a few more judgments to gain more confidence about their particular systems. As time goes on, the number of judgments grows until there is 100% confidence in every evaluation, and there is a full test collection for the task.

We see further use for this method in scenarios such as web retrieval in which the corpus is frequently changing. It could be applied to evaluation on a dynamic test collection as defined by Soboroff [18].

The model we presented in Section 3 is by no means the only possibility for creating a robust test collection. A simpler expert aggregation model might perform as well or better (though all our efforts to simplify failed). In addition to expert aggregation, we could estimate probabilities by looking at similarities between documents. This is an obvious area for future exploration. Additionally, it will be worthwhile to investigate the issue of overfitting: the circumstances it occurs under and what can be done to prevent it. In the meantime, capping confidence estimates at 95% is a "hack" that solves the problem.

We have many more experimental results that we unfortunately did not have space for but that reinforce the notion that RTC is highly robust: with just a few judgments per topic, we can accurately assess the confidence in any pairwise comparison of systems.
Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect those of the sponsor.

7. REFERENCES

[1] J. Aslam and M. Montague. Models for metasearch. In Proceedings of SIGIR, pages 275–285, 2001.
[2] J. Aslam and R. Savell. On the effectiveness of evaluating retrieval systems in the absence of relevance judgments. In Proceedings of SIGIR, pages 361–362, 2003.
[3] J. A. Aslam, V. Pavlu, and R. Savell. A unified model for metasearch, pooling, and system evaluation. In Proceedings of CIKM, pages 484–491, 2003.
[4] J. A. Aslam, V. Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In Proceedings of SIGIR, pages 541–548, 2006.
[5] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[6] D. J. Blower. An easy derivation of logistic regression from the Bayesian and maximum entropy perspective. In Proceedings of the 23rd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, pages 30–43, 2004.
[7] B. Carterette and J. Allan. Research methodology in studies of assessor effort for retrieval evaluation. In Proceedings of RIAO, 2007.
[8] B. Carterette, J. Allan, and R. K. Sitaraman. Minimal test collections for retrieval evaluation. In Proceedings of SIGIR, pages 268–275, 2006.
[9] B. Carterette and D. I. Petkova. Learning a ranking from pairwise preferences. In Proceedings of SIGIR, 2006.
[10] R. T. Clemen and R. L. Winkler. Unanimity and compromise among probability forecasters. Management Science, 36(7):767–779, July 1990.
[11] G. V. Cormack, C. R. Palmer, and C. L. Clarke. Efficient construction of large test collections. In Proceedings of SIGIR, pages 282–289, 1998.
[12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2004.
[13] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[14] R. Manmatha and H. Sever. A formal approach to score normalization for metasearch. In Proceedings of HLT, pages 88–93, 2002.
[15] I. J. Myung, S. Ramamoorti, and A. D. Bailey, Jr. Maximum entropy aggregation of expert predictions. Management Science, 42(10):1420–1436, October 1996.
[16] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 2000.
[17] M. Sanderson and H. Joho. Forming test collections with no system pooling. In Proceedings of SIGIR, pages 33–40, 2004.
[18] I. Soboroff. Dynamic test collections: measuring search effectiveness on the live web. In Proceedings of SIGIR, pages 276–283, 2006.
[19] K. Sparck Jones and C. J. van Rijsbergen. Information retrieval test collections. Journal of Documentation, 32(1):59–75, 1976.
[20] E. M. Voorhees and D. K. Harman, editors. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
[21] J. Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings of SIGIR, pages 307–314, 1998.