A Statistical Method for System Evaluation Using Incomplete Judgments

Javed A. Aslam, Virgil Pavlu, Emine Yilmaz
College of Computer and Information Science, Northeastern University
360 Huntington Ave, #202 WVH, Boston, MA 02115
{jaa,vip,emine}@ccs.neu.edu

ABSTRACT

We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves. Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance is dependent on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical TREC-style judgment pool.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software -- Performance evaluation

General Terms: Theory, Measurement, Experimentation

Keywords: Evaluation, Sampling, Average Precision

We gratefully acknowledge the support provided by NSF grant CCF-0418390.

1. INTRODUCTION

We consider the problem of large-scale retrieval evaluation and propose a statistical technique for efficiently and effectively estimating standard measures of retrieval performance via random sampling.

Standard methods of retrieval evaluation can be quite expensive when conducted on a large scale. For example, in many years of the annual Text REtrieval Conference (TREC), upwards of 100 runs, each consisting of a ranked list of up to 1,000 documents, were submitted with respect to each of 50 or more topics. In principle, each document in the collection would need to be assessed for relevance with respect to each query in order to evaluate many standard retrieval measures such as average precision and R-precision; in practice, this is prohibitively expensive. TREC instead employs depth pooling, wherein the union of the top k documents retrieved in each run corresponding to a given query is formed, and the documents in this depth k pool are judged for relevance with respect to this query. In TREC, k = 100 has been shown to be an effective cutoff in evaluating the relative performance of retrieval systems [8, 12], and while the depth 100 pool is considerably smaller than the document collection, it still engenders a large assessment effort: in TREC8, for example, 86,830 relevance judgments were used to assess the quality of the retrieved lists corresponding to 129 system runs in response to 50 topics [11]. The table below shows the relationship between pool depth and the number of judgments required per topic, on average, for various TRECs.

  Pool Depth   TREC 7   TREC 8   TREC 10
       1          27       29        25
       3          65       71        64
       5          98      110       100
      10         179      200       184
      20         337      379       339
     100        1596     1712      1414
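To make the pooling procedure concrete, the following is a minimal sketch of depth-k pooling for a single topic, assuming each run is simply a ranked list of document identifiers; the function and data names are illustrative and not taken from any TREC tooling.

```python
# A minimal sketch of TREC-style depth-k pooling, assuming each run is a
# ranked list of document IDs for a single topic; names are illustrative.

def depth_pool(runs, k):
    """Union of the top-k documents over all runs for one topic."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:k])
    return pool

# Example: three toy runs; the depth-2 pool is the union of their top 2.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d1", "d2"]]
print(sorted(depth_pool(runs, 2)))   # ['d1', 'd2', 'd4', 'd5']
```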
Shallower pools [12] and greedily chosen dynamic pools [6, 2] have also been studied in an attempt to alleviate the assessment effort; however, such techniques tend to produce biased estimates of standard retrieval measures, especially when relatively few relevance judgments are used. Recently, Buckley and Voorhees proposed a new measure, bpref, and they show that when bpref is employed with a judged sample drawn uniformly from the depth 100 pool, the retrieval systems can be effectively ranked [5]. However, bpref is not designed to approximate any standard measure of retrieval performance, and while bpref is correlated with average precision, especially at high sampling rates, these correlations and the system rankings produced degenerate at low sampling rates.

Our goal in this work is to efficiently and accurately estimate standard measures of retrieval performance. Unlike previously proposed methodologies, which tend to produce biased estimates of standard measures using few relevance judgments, or methodologies based on estimating measures other than the most widely reported standard measures, our methodology, by statistical design, produces unbiased estimates of the standard measures of retrieval performance themselves.

The core of our methodology is the derivation, for each measure, of a distribution over documents such that the value of the measure is proportional to the expectation of observing a relevant document drawn according to that distribution. (In the case of average precision, the distribution is over pairs of documents, and the observation is the product of the relevances of the pair drawn.) Given such distributions, one can estimate the expectations (and hence the measurement values) using random sampling. By statistical design, such estimates will be unbiased. Furthermore, through the use of the statistical estimation technique of importance sampling [1], we show how low variance estimates of multiple retrieval measures can be simultaneously obtained for multiple runs from a single sample. In sum, we show how both efficient and effective estimates of standard retrieval measures can be inferred from a random sample, thus providing an alternative to large-scale TREC-style evaluations.

We tested our methodology using the benchmark TREC data collections, and results are reported for TRECs 7, 8, and 10. We compare the performance of TREC-style depth pools to sampling pools of an equivalent size. Generally speaking, as the number of relevance assessments is decreased (either by considering a lower depth pool, in the case of TREC, or by sampling at a lower rate, in the case of the proposed technique), both the bias and the variance of the estimates inferred from TREC-style depth pools increase, while only the variance of the estimates inferred from sampling pools increases. At equivalent levels of judgment effort, the estimates produced by sampling are consistently better than those produced by depth pooling, often dramatically so.
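The reweighting that lets a single judged sample serve multiple measures and multiple runs is the standard importance-sampling identity: an expectation under a target distribution p can be estimated from draws taken under a different known distribution q by weighting each draw by p/q. The following generic sketch (toy distributions of our own choosing, not those used in our experiments) illustrates the identity.

```python
import random

# Generic importance-sampling sketch: E_p[f(x)] estimated from draws x ~ q
# by averaging f(x) * p(x) / q(x).  Toy distributions, for illustration only.

random.seed(0)
xs = [0, 1, 2, 3]
p = [0.1, 0.2, 0.3, 0.4]          # target distribution
q = [0.25, 0.25, 0.25, 0.25]      # sampling distribution actually used
f = lambda x: x * x

draws = random.choices(xs, weights=q, k=100000)
estimate = sum(f(x) * p[x] / q[x] for x in draws) / len(draws)
exact = sum(pi * f(x) for x, pi in zip(xs, p))
print(round(estimate, 3), exact)   # both near 5.0
```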
While TREC-style depth pools are used to evaluate the systems from which they were generated, they are also often used to evaluate new runs which did not originally contribute to the pool. Depth 100 TREC-style pools generalize well in the sense that they can be used to effectively evaluate new runs, and this has been a tremendous boon to researchers who use TREC data. We also show that random sampling pools generalize well, achieving estimation errors on unseen runs comparable to the estimation errors on runs from which the sample was drawn.

In the sections that follow, we describe our methodology and the results from experiments using the TREC collections. We conclude with a summary and directions for future research.

2. METHODOLOGY

In this section, we sketch our proposed statistical technique for efficiently and effectively estimating standard retrieval measures from random samples. (A somewhat different treatment of this material may be found in our recent workshop work-in-progress report [4].) While many of the details are somewhat complex and/or necessarily omitted for space considerations, the basic ideas can be summarized as follows:

1. For each measure, we derive a random variable and associated probability distribution such that the value of the measure in question is proportional to the expectation of the random variable with respect to the probability distribution. For example, to estimate precision-at-cutoff 500, one could simply uniformly sample documents from the top 500 in a given list and output the fraction of relevant documents seen. Thus, the underlying random variable for precision-at-cutoff c is dictated by the binary relevance assessments, and the associated distribution is uniform over the top c documents (a minimal sketch appears after this summary). Since R-precision is precision-at-cutoff R, an identical strategy holds, given the value or an estimate of R. For average precision, the situation is somewhat more complex: we show that the required sampling distribution is over pairs of documents and the underlying random variable is the product of the binary relevance judgments for that pair.

2. Given that the value of a measure can be viewed as the expectation of a random variable, one can apply standard sampling techniques to estimate this expectation and hence the value of the measure. To implement this methodology efficiently, one would like to estimate all retrieval measures for all runs simultaneously using a single judged sample. As such, one is confronted with the task of estimating the expectation of a random variable with respect to a known distribution by using a sample drawn according to a different (but known) distribution. We employ scaling factors used in the statistical estimation field of importance sampling [1] in order to facilitate this goal.

3. Finally, to implement this methodology effectively, one desires low variance unbiased estimators so that the computed empirical means will converge to their true expectations quickly. While we show that any known sampling distribution can be used to yield unbiased estimators for retrieval measures, we also describe a heuristic for generating a specific sampling distribution which is likely to yield low variance estimators.

In the sections that follow, we describe each of these ideas in more detail, and we then present extensive experiments using TREC data.
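As a minimal illustration of ideas (1) and (2) for precision-at-cutoff c: PC(c) is the expected relevance of a document drawn uniformly from the top c of a ranked list, so a small uniform sample of judged documents estimates it. The sketch below uses toy data; the names (ranked_docs, qrels) are illustrative only, not part of our implementation.

```python
import random

# Estimate PC(c) as the mean relevance of documents drawn uniformly from
# the top c ranks of a single toy ranked list.

def estimate_pc(ranked_docs, qrels, c, num_samples, rng=random.Random(0)):
    draws = [rng.randrange(c) for _ in range(num_samples)]     # ranks 0..c-1
    return sum(qrels.get(ranked_docs[i], 0) for i in draws) / num_samples

ranked_docs = [f"d{i}" for i in range(100)]
qrels = {f"d{i}": 1 for i in range(0, 100, 4)}   # every 4th doc relevant
print(estimate_pc(ranked_docs, qrels, c=100, num_samples=50))  # approx 0.25
```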
2.1 Sum precision as an expected value

While our goal is to simultaneously estimate multiple measures of performance over multiple lists, we begin by considering the problem of estimating average precision from a random sample. Unlike R-precision or precision at standard cutoffs, deriving a sampling distribution for average precision is non-trivial, and it yields a distribution which is empirically quite useful for estimating the other measures of interest.

By definition, average precision (AP) is the average of the precisions at all relevant documents, or equivalently, the sum of the precisions at all relevant documents divided by R, the number of documents relevant to the query. Let SP be this sum of precisions at all relevant documents. In what follows, we first discuss estimating SP, and later we discuss estimating R; our estimate of AP will be the ratio of the estimates of SP and R.

One can compute sum precision as follows, where Z is the length of the retrieved list, rel(i) is the binary relevance of the document at rank i, and PC(i) = (1/i) · Σ_{j=1}^{i} rel(j) is the precision at rank i:

  SP = Σ_{i: rel(i)=1} PC(i)
     = Σ_{i=1}^{Z} rel(i) · PC(i)
     = Σ_{1 ≤ j ≤ i ≤ Z} rel(i) · rel(j) / i
     = Σ_{1 ≤ j ≤ i ≤ Z} (1/i) · rel(i) · rel(j).

Thus, in order to evaluate SP, one must compute the weighted product of relevances of documents at pairs of ranks, where for any pair j ≤ i, the associated weight is 1/i. (See Table 1, left.) In order to view this sum as an expectation, we define an event space corresponding to pairs of ranks (i, j), a random variable X corresponding to the product of the binary relevances rel(i)·rel(j), and an appropriate probability distribution over the event space. One such distribution corresponds to the (appropriately normalized) weights given in Table 1 (left); for convenience, we shall instead define a symmetrized version of these weights (see Table 1, right) and the corresponding joint distribution JD (appropriately normalized by 2Z). It is not difficult to see that SP = Z · E[X], where the expectation is computed with respect to either distribution. Thus, if U is a multiset of pairs drawn according to JD, we obtain the following estimate for SP:

  ŜP = Z · (1/|U|) · Σ_{(i,j)∈U} rel(i) · rel(j).

Table 1: (Left) Weights associated with pairs of ranks; normalizing by Z yields an asymmetric joint distribution. (Right) Symmetric weights; normalizing by 2Z yields the symmetric joint distribution JD.

        1     2     3    ...    Z             1     2     3    ...    Z
  1     1    1/2   1/3   ...   1/Z      1     2    1/2   1/3   ...   1/Z
  2          1/2   1/3   ...   1/Z      2    1/2    1    1/3   ...   1/Z
  3                1/3   ...   1/Z      3    1/3   1/3   2/3   ...   1/Z
  ...                    ...            ...                    ...
  Z                            1/Z      Z    1/Z   1/Z   1/Z   ...   2/Z

Figure 1: AP evaluation diagram
Figure 2: PC evaluation diagram
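The identity SP = Z · E[X] can be checked numerically on a toy relevance vector by building the symmetric weights of Table 1 directly; the following is an illustrative sketch, not the implementation used in our experiments.

```python
# Check SP = Z * E[rel(i)*rel(j)] under the symmetric pair distribution JD
# of Table 1 (weight 1/max(i,j) off the diagonal, 2/i on the diagonal,
# normalized by 2Z).  Toy relevance vector; ranks are 1-based.

rel = [1, 0, 1, 1, 0, 0, 1, 0]          # rel[i-1] = relevance at rank i
Z = len(rel)

# Direct computation: SP = sum over relevant ranks i of precision-at-i.
sp_direct = sum(sum(rel[:i]) / i for i in range(1, Z + 1) if rel[i - 1])

# Expectation under JD.
sp_expect = 0.0
for i in range(1, Z + 1):
    for j in range(1, Z + 1):
        w = (2.0 / i) if i == j else 1.0 / max(i, j)
        jd = w / (2 * Z)                 # JD(i, j)
        sp_expect += jd * rel[i - 1] * rel[j - 1]
sp_expect *= Z                           # SP = Z * E[X]

print(sp_direct, sp_expect)              # the two values agree
```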
2.2 Simultaneous estimation for multiple runs

One is often faced with the task of evaluating the average precisions of many retrieval systems with respect to a given query (as in TREC), and in a naive implementation of the technique described, the documents judged for one system will not necessarily be reused in judging another system. In contrast, TREC creates a single pool of documents from the collection of runs to be evaluated, judges that pool, and evaluates all of the systems with respect to this single judged pool. In order to combat this potential inefficiency, we shall construct a single distribution over pairs of documents derived from the joint distributions JD_s associated with every system s. We shall effectively be sampling from a distribution different from the one necessary to estimate the expectations desired. To combat this, we introduce scaling factors as follows.

Let D(i, j) be the joint distribution over documents from which we effectively sample. Note that i and j now denote documents, not ranks. Similarly abusing notation, let JD_s(i, j) denote the joint distribution over documents (not ranks) required to estimate the proper expectation for system s. We define scaling factors SF_s(i, j) which correspond to the ratio between the desired and sampling distributions,

  SF_s(i, j) = JD_s(i, j) / D_s(i, j),

where D_s is the distribution induced by D over documents retrieved by s. We then have

  ŜP = Z_s · (1/|U_s|) · Σ_{(i,j)∈U_s} rel(i) · rel(j) · SF_s(i, j),

where Z_s is the length of the list returned by system s and U_s ⊆ U is the subset of samples corresponding to documents retrieved by s. Note that the above formulation holds for any sampling distribution D. In what follows, we describe a heuristic for determining a good sampling distribution -- one which corresponds to a distribution over documents (for efficiency) and which explicitly attempts to minimize the variance in the estimates produced (for accuracy).

2.3 Deriving the sampling distribution

The technique described above can be used to estimate the average precision of one or more retrieval runs with respect to any given query. However, it is relatively inefficient: on a per run basis, independent and identically distributed (i.i.d.) pairs of documents are drawn and judged, but the induced pairs of judged documents across i.i.d. samples are not used. In order to combat this potential inefficiency, we shall instead draw a sample from a distribution over documents and consider all induced pairs of judgments.

In determining a sampling distribution D, we consider two factors. First, we impose the condition that D be a symmetric product distribution, i.e., D(i, j) = M(i) · M(j) for some (marginal) distribution over documents M. The purpose of this is efficiency: we will sample documents according to M and consider all induced pairs of documents, which will be distributed (approximately) according to D. Second, we seek a D which explicitly attempts to minimize the variance in our estimator, for accuracy. We begin by considering the latter factor.

Variance minimization. For a sampling distribution D and a given system s, let D_s be the distribution induced by D over pairs of documents contained in the list returned by system s. Furthermore, let Y be the random variable rel(i)·rel(j)·SF_s(i, j), so that SP = Z_s · E_{D_s}[Y]. Since Z_s and R are fixed, in order to minimize the variance of AP, we must minimize the variance of Y:

  Var[Y] = E[Y²] − E²[Y]
         = Σ_{i,j} D_s(i, j) · rel(i)² · rel(j)² · SF_s(i, j)² − (SP/Z_s)²
         = Σ_{i,j: rel(i)=rel(j)=1} D_s(i, j) · JD_s(i, j)² / D_s(i, j)² − (SP/Z_s)²
         = Σ_{i,j: rel(i)=rel(j)=1} JD_s(i, j)² / D_s(i, j) − (SP/Z_s)².

To minimize this variance, it is enough to minimize the first term, since SP/Z_s is fixed. Employing importance sampling techniques for minimizing the variance of Monte Carlo estimators, one can derive that the best sampling distribution D is the distribution induced by JD_s over relevant documents. (See Anderson [1] for an example of such a derivation.) Of course, we do not have the complete relevance judgments necessary to calculate this ideal sampling distribution. However, the marginal distribution MD_s(i) = Σ_j JD_s(i, j) associated with the average precision sampling distribution JD_s(i, j) has been shown to be a reasonable prior for relevant documents [3], and using such a prior one can argue that a sampling distribution D_s(i, j) proportional to (MD_s(i) · MD_s(j))^{3/2} (appropriately normalized to form a distribution) is likely to result in low variance. (Details omitted for space considerations.) D_s(i, j) is a product distribution having identical marginals with respect to i and j; let M_{D_s}(i) be the marginal associated with D_s(i, j).

If our task were to estimate the performance of only one retrieval system, we could sample documents according to M_{D_s}(i), consider all induced pairs of documents, and estimate AP using appropriate scaling factors. However, in general our task is to simultaneously estimate AP for N systems from a single sample. We obtain a final sampling marginal M(i) by averaging the marginals associated with each system s:

  M(i) = (1/N) · Σ_s M_{D_s}(i).

We finally note that in a typical TREC setting, one averages AP over 50 queries to obtain a final estimate of the performance of a system, and this averaging results in a further significant variance reduction.
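The following sketch constructs this heuristic sampling distribution for a toy set of runs, under one reading of the recipe above: per run, the marginal MD_s of the pair distribution JD_s is raised to the 3/2 power and renormalized (which is the marginal of D_s), and the per-run marginals are then averaged into a single document distribution M. The run contents are toy data; the names are illustrative only.

```python
import numpy as np

def ap_marginal(z):
    """Marginal over ranks 1..z of the symmetric AP pair weights (Table 1)."""
    w = np.zeros((z, z))
    for i in range(1, z + 1):
        for j in range(1, z + 1):
            w[i - 1, j - 1] = 2.0 / i if i == j else 1.0 / max(i, j)
    m = w.sum(axis=1)
    return m / m.sum()

def sampling_distribution(runs, all_docs):
    m_final = np.zeros(len(all_docs))
    index = {d: n for n, d in enumerate(all_docs)}
    for ranked_docs in runs:
        md = ap_marginal(len(ranked_docs)) ** 1.5     # (MD_s)^(3/2), then
        md /= md.sum()                                #   renormalize
        for rank, doc in enumerate(ranked_docs):
            m_final[index[doc]] += md[rank]
    return m_final / len(runs)                        # average over runs

runs = [["d1", "d2", "d3", "d4"], ["d2", "d1", "d4", "d5"]]
all_docs = ["d1", "d2", "d3", "d4", "d5"]
print(sampling_distribution(runs, all_docs))          # sums to 1.0
```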
Exact computation of scaling factors. M(i) is the distribution we use for sampling documents, and given a sample of K such documents, we consider all K² induced pairs and estimate the required expectations from these induced pairs and appropriate scaling factors. For sufficiently large K, the distribution over induced pairs will approximate the associated product distribution D(i, j) = M(i) · M(j); however, the actual distribution is multinomial. As a consequence, if K_s is the size of the subset of the K sampled documents which are retrieved by system s, one obtains the following final scaling factors:

  SF_s(i, j) = JD_s(i, j) / (I(i, j) · K²/K_s²).

Finally, we derive the exact form of the multinomial sampling distribution over induced pairs. Sampling K documents (with replacement) from a distribution M and forming all K² pairs of documents yields a multinomial distribution I over the possible outcomes. Let k = (k_1, k_2, ..., k_W) correspond to counts for the sampled documents, where k_1 + k_2 + ... + k_W = K and k_i is the count associated with document i. Then

  Pr(k) = C(K; k_1, k_2, ..., k_W) · Π_d M(d)^{k_d},

where C(K; ...) denotes the multinomial coefficient. For i ≠ j and K > 2, the induced pairs distribution is derived as follows:

  I(i, j) = (1/K²) · Σ_{k: k_i ≥ 1, k_j ≥ 1} k_i · k_j · Pr(k)
          = (1/K²) · Σ_{k_i+k_j ≤ K} k_i · k_j · C(K; k_i, k_j, ...) · M(i)^{k_i} M(j)^{k_j} (1 − M(i) − M(j))^{K−k_i−k_j}
          = (1/K²) · Σ_{k_i+k_j ≤ K} [K! / ((k_i−1)!(k_j−1)!(K−k_i−k_j)!)] · M(i)·M(j) · M(i)^{k_i−1} M(j)^{k_j−1} (1 − M(i) − M(j))^{K−k_i−k_j}
          = [K(K−1)·M(i)·M(j) / K²] · Σ_{k_i+k_j ≤ K} C(K−2; k_i−1, k_j−1, ...) · M(i)^{k_i−1} M(j)^{k_j−1} (1 − M(i) − M(j))^{K−k_i−k_j}
          = ((K−1)/K) · M(i) · M(j).

When i = j, a similar derivation yields

  I(i, i) = (M(i)/K) · (1 + (K−1)·M(i) + (1 − M(i))^{K−1}).
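A small simulation can be used to sanity-check the off-diagonal result above. The sketch below (toy distribution M, not one derived from real runs) draws K documents with replacement from M, forms all K² ordered pairs of draws, and compares the empirical frequency of one pair with ((K−1)/K)·M(i)·M(j).

```python
import random
from collections import Counter

# Empirically verify I(i, j) = ((K - 1) / K) * M(i) * M(j) for i != j.

random.seed(0)
docs = ["a", "b", "c"]
M = [0.5, 0.3, 0.2]
K, trials = 10, 20000

counts = Counter()
for _ in range(trials):
    sample = random.choices(docs, weights=M, k=K)
    for x in sample:                 # all K^2 ordered pairs of draws
        for y in sample:
            counts[(x, y)] += 1

empirical = counts[("a", "b")] / (trials * K * K)
predicted = (K - 1) / K * M[0] * M[1]
print(empirical, predicted)          # the two should be close
```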
2.4 Estimating R and AP

To obtain an estimate for AP, we must know or obtain estimates for SP (the expectation described above), R, and Z_s. We have described in detail how to estimate SP, and Z_s is a known quantity (the length of the system's returned list). However, R, the total number of documents relevant to the given query, is not typically known and must also be estimated. Sophisticated approaches for estimating R exist [9]; however, in this preliminary study we employ techniques similar to those described above. In order to estimate R (as calculated by TREC), one could simply uniformly sample documents from the depth 100 pool. Given that our sample is drawn according to M(i) instead, one can employ appropriate scaling factors to obtain the correct estimate.

2.5 Estimating PC and RP

To estimate precision-at-cutoff c, one could simply uniformly sample documents from the top c in any given list. Given that we sample documents according to M(i), we again employ appropriate scaling factors to obtain correct estimates for PC(c). R-precision is simply the precision-at-cutoff R. We do not know R; however, we can obtain an estimate R̂ for R as described above. Given this estimate, we simply estimate PC(R̂).

2.6 Practical summary

Below we give a summary for implementing the sampling method. Figure 3 can serve as a guide; a simplified code sketch follows the steps below.

Figure 3: Sampling diagram

(1) For each run s, use the joint distribution over pairs of documents, JD_s(i, j), dictated by SP, such that sampling pairs of documents according to JD_s would yield SP in expectation.
(2) To minimize variance, we introduce a prior over relevant documents (Figure 3, step 2). Let MD_s(i) be the marginal distribution of JD_s(i, j), and let D_s(i, j) be the (appropriately normalized) joint distribution corresponding to (MD_s(i) · MD_s(j))^{3/2}. Let M_{D_s}(i) be the marginal distribution of D_s. This is the sampling distribution over documents that would be used for run s.
(3) Over all runs, compute M(i), the average of these sampling distributions over documents. M(i) is our single, final sampling distribution for the query (all runs).
(4) Sample K documents with replacement according to M(i) (Figure 3, black box) until T unique documents are drawn; judge these documents. (T is the desired a priori judgment effort.) Generate all K² pairs of judged documents.
(5) Compute the multinomial induced pairs distribution I(i, j) (Figure 1, step 1); this is the "effective" distribution over pairs of documents from which we sampled. From I(i, j) and JD_s(i, j), compute the required scaling factors.
(6) From the induced pairs and the scaling factors, compute the estimates of SP for each run (Figure 1, black box).
(7) Estimate R using the sampled documents drawn according to M(i) and appropriate scaling factors (Figure 3).
(8) Estimate AP by the ratio of the estimates for SP and R.

Note that the estimate of a ratio is not necessarily the ratio of estimates. More accurate ratio estimates derived via a second order Taylor series approximation [10] were tested, and they were generally found to be of little benefit for the computational effort required.
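The following sketch illustrates the scaling-factor idea of steps (5)-(8) in a simplified, single-document form: documents are drawn from a known non-uniform M, and each judged document is reweighted by the ratio of the target distribution to M, yielding unbiased estimates of R and PC(c) on toy data. The full SP estimator additionally uses the pair distribution I(i, j) derived in Section 2.3; all data and names here are illustrative only, not our actual implementation.

```python
import random

random.seed(1)
pool = [f"d{i}" for i in range(200)]                  # toy depth pool
qrels = {d: 1 if i % 5 == 0 else 0 for i, d in enumerate(pool)}   # R = 40
ranked = pool[:]                                      # one run's ranking (toy)

# A known, non-uniform sampling distribution M over the pool.
weights = [1.0 / (i + 1) for i in range(len(pool))]
total = sum(weights)
M = {d: w / total for d, w in zip(pool, weights)}

n = 5000
sample = random.choices(pool, weights=[M[d] for d in pool], k=n)

# R = sum of rel(d) over the pool:  E_M[rel(d) / M(d)] = R.
R_hat = sum(qrels[d] / M[d] for d in sample) / n

# PC(c) = mean rel over top c:  target weight is 1/c for d in the top c.
c = 100
topc = set(ranked[:c])
PC_hat = sum((qrels[d] / c) / M[d] for d in sample if d in topc) / n

print(round(R_hat, 1), round(PC_hat, 3))   # close to 40 and 0.2
```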
2.7 New retrieval runs

A question of particular importance is how we can use the samples generated by our method to evaluate a new run, i.e. a run that did not contribute to the sampling distribution. In order for sampling to work correctly, there should be sufficient sampled documents in the new run so that the evaluation using sampling is meaningful. The evaluation (estimation) methodology is independent of the fact that the run participated in the sampling process; therefore it can be applied to new runs in the same way as to the runs used in producing the sample. In the scaling factor computation, the numerator is a function of the ranks of the sampled documents in the new list, and the denominator is computed based on the sampling distribution conditioned on the new run.

In TREC data, it is already the case that the actual pools are created from only a subset of the submitted runs. In all our experiments, to compute the average sampling distribution, we only use the runs that contributed to the pool (training systems) and use the runs that did not contribute to the pool as testing systems. In the plots that follow, the dots (·) correspond to the training systems, and the pluses (+) correspond to the testing systems.

3. EXPERIMENTAL RESULTS

We tested the proposed sampling method as a mechanism for estimating the performance of retrieval systems using data from TRECs 7, 8 and 10. We used mean average precision (MAP), mean R-precision (MRP), and mean precision at cutoffs 5, 10, 15, 20, 30, 100, 200, 500, and 1000 (MPC(c)) as evaluation measures. We compared the estimates obtained by the sampling method with the "actual" evaluations, i.e. evaluations obtained by depth 100 TREC-style pooling. The estimates are found to be consistently good even when the total number of documents judged is far less than the number of judgments used to calculate the actual evaluations.

To evaluate the quality of our estimates, we calculated three different statistics: the root mean squared (RMS) error (how different the estimated values are from the actual values, i.e. RMS = √((1/N) · Σ_{i=1}^{N} (a_i − e_i)²), where the a_i are the actual and the e_i are the estimated values), the linear correlation coefficient ρ (how well the actual and estimated values fit a straight line), and Kendall's τ (how well the estimated measures rank the systems compared to the actual rankings).
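For concreteness, the three statistics can be computed as in the following sketch (toy values; ties in Kendall's τ are treated as discordant for brevity).

```python
import math

def rms_error(a, e):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, e)) / len(a))

def pearson_r(a, e):
    n = len(a)
    ma, me = sum(a) / n, sum(e) / n
    cov = sum((x - ma) * (y - me) for x, y in zip(a, e))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    ve = math.sqrt(sum((y - me) ** 2 for y in e))
    return cov / (va * ve)

def kendall_tau(a, e):
    n, s = len(a), 0
    for i in range(n):
        for j in range(i + 1, n):
            s += 1 if (a[i] - a[j]) * (e[i] - e[j]) > 0 else -1
    return 2 * s / (n * (n - 1))

actual = [0.30, 0.25, 0.22, 0.18, 0.10]        # toy per-system values
estimated = [0.27, 0.28, 0.20, 0.17, 0.12]
print(rms_error(actual, estimated), pearson_r(actual, estimated),
      kendall_tau(actual, estimated))          # tau = 0.8 for this toy data
```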
Note that, in contrast to the RMS error, Kendall's τ and ρ do not measure how much the estimated values differ from the actual values. Therefore, even if they indicate perfectly correlated estimated and actual values, the estimates may still not be accurate. Hence, it is much harder to achieve small RMS errors than to achieve high τ or ρ values. Because of this, we mainly focus on the RMS error values when evaluating the performance of the sampling method. Since the performance of the sampling method varies depending on the actual sample, we sampled 10 times and picked a representative sample that exhibited typical performance based on the three evaluation statistics used.

We report the results of the experiments for MAP, MRP, and MPC(100) on TREC8 in Figure 4, Figure 5, and Figure 6, respectively. As can be seen, on TREC8, for both depth=1 (on average 29 judgments/query) and depth=10 (on average 200 judgments/query), there is a significant improvement in all three statistics when sampling is used versus TREC-style pooling, for all the measures. The sampling estimates have reduced variance and little or no bias compared to depth pooling estimates. This can be seen from the great reduction in the RMS error when the estimates are obtained via sampling. Furthermore, the bottom-right plots of all three figures show that with as few as 200 relevance judgments on average per query, the sampling method can very accurately estimate the actual measure values, which were obtained using 1,737 relevance judgments on average per query.

Figure 4: Sampling vs. depth pooling mean average precision estimates at depths 1 and 10 in TREC8. Each dot (·) corresponds to a distribution-contributor run and each plus (+) to a distribution-non-contributor run (there are 129 runs in TREC8).

Figure 5: Sampling vs. depth pooling mean R-precision estimates at depths 1 and 10 in TREC8.

Figure 6: Sampling vs. depth pooling mean precision at cutoff 100 estimates at depths 1 and 10 in TREC8.

Figure 7 illustrates how MAP estimates using TREC-style depth pooling compare, in terms of ρ and Kendall's τ, with those obtained using sampling as the depth of the pool changes. For depths 1 to 10, we first calculated the number of documents required to be judged using TREC-style depth pooling. Then, for each depth, we formed 10 different samples of the same size as the required judgment set for each corresponding depth and calculated the average ρ (left column) and τ (right column). As can be seen in the figure, for all TRECs the sampling method significantly outperforms the TREC-style depth pooling method at all depths. For comparison purposes, we also include in the plots of the second column the average Kendall's τ value of bpref obtained using random samples [5] of the given size. The Kendall's τ values for bpref are the average values computed over 10 different random sampling distributions.

Figure 7: Linear correlation coefficient ρ and Kendall's τ comparisons for mean average precision, in TRECs 7, 8 and 10.
3.1 Per query and per run results

While the MAP (and MRP and MPC) estimates show improved performance over depth-style pooling in both mean and bias, there are certain situations when one needs the results of a single query, hence not taking advantage of the massive variance reduction achieved by averaging over 50 queries. It is certainly not expected to see the same kind of performance on a per query basis; however, our results show definite usable per-query estimates (Figure 8). The method described in this paper is self-contained for a query, i.e. estimates for a query are not dependent on any data from other queries. In a different setup, one may want to analyze only one run over all queries (Figure 9). Note that this setup is not self-contained, as the sampling method requires a set of runs (not only the one plotted) and associated judgments.

Figure 8: Sampling estimates for a query with mixed system performance. Dots (·) represent training runs; pluses (+) represent testing runs.

Figure 9: Sampling estimates for a fixed typical run (Sab8A1) with MAP = 0.25, over all queries. Each dot (·) is an AP estimate for a query (50 in total); the MAP estimate is plotted as "×".
3.2 Generalization on new runs

It is important that the performance of sampling over the testing runs is virtually as good as the performance over the training runs. Note that the testing systems do not directly contribute to the sampling pool; their documents are sampled only because they happen to appear in training runs. The trend of the RMS error, as the sample size increases from the depth 1 to the depth 10 equivalent, for training and testing systems, is shown in Figure 10. On the x-axis, the units are the depth-pool-equivalent numbers of judgments converted into percentages of the depth 100 pool.

Figure 10: RMS error train/test cross-validation comparisons for MAP, RP, and PC(100) in TRECs 7, 8 and 10. Equivalent depths are indicated on the plot.

4. CONCLUSIONS AND FUTURE WORK

We propose a statistical technique for efficiently and effectively estimating standard measures of retrieval performance from random samples, and we demonstrate that highly accurate estimates of standard retrieval measures can be obtained from judged subsamples as small as 4% of the standard TREC-style depth 100 pool.

This work leaves open a number of questions for further research: (1) In standard TREC settings, all documents in the depth 100 pool are judged and no documents outside the pool are judged. Our work indicates that more judging effort should be placed on documents near the top of ranked lists (they have high sampling probabilities) and less judging effort should be placed on documents near the bottom of ranked lists (they have low sampling probabilities). What is the optimal sampling distribution, and how does it change as a function of the collection or the systems to be evaluated? Consider evaluating retrieval systems on the web or with respect to the TREC Terabyte track, for example. (2) Given that our technique is based on random sampling, one could in principle derive high probability confidence intervals for the estimates obtained, and such confidence intervals would be quite useful in practice. (3) Finally, we have preliminary results which show that, given accurate estimates of retrieval measures obtained from a judged subsample, one can accurately infer relevance assessments for the remaining unjudged documents. Such a technique would have obvious benefits in the production of large test collections.

5. REFERENCES

[1] E. C. Anderson. Monte Carlo methods and importance sampling. Lecture Notes for Statistical Genetics, October 1999.

[2] J. A. Aslam, V. Pavlu, and R. Savell. A unified model for metasearch, pooling, and system evaluation. In O. Frieder, J. Hammer, S. Quershi, and L. Seligman, editors, Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 484-491. ACM Press, November 2003.
[3] J. A. Aslam, V. Pavlu, and E. Yilmaz. Measure-based metasearch. In G. Marchionini, A. Moffat, J. Tait, R. Baeza-Yates, and N. Ziviani, editors, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 571-572. ACM Press, August 2005.

[4] J. A. Aslam, V. Pavlu, and E. Yilmaz. A sampling technique for efficiently estimating measures of query retrieval performance using incomplete judgments. In Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, August 2005.

[5] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 25-32, New York, NY, USA, 2004. ACM Press.

[6] G. V. Cormack, C. R. Palmer, and C. L. A. Clarke. Efficient construction of large test collections. In Croft et al. [7], pages 282-289.

[7] W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, Aug. 1998. ACM Press, New York.
[8] D. Harman. Overview of the third Text REtrieval Conference (TREC-3). In D. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 1-19, Gaithersburg, MD, USA, Apr. 1995. U.S. Government Printing Office, Washington D.C.

[9] P. Kantor, M.-H. Kim, U. Ibraev, and K. Atasoy. Estimating the number of relevant documents in enormous collections. In Proceedings of the 62nd Annual Meeting of the American Society for Information Science, volume 36, pages 507-514, 1999.

[10] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth and Brooks/Cole, 1988.

[11] E. M. Voorhees and D. Harman. Overview of the seventh Text REtrieval Conference (TREC-7). In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 1-24, 1999.

[12] J. Zobel. How reliable are the results of large-scale retrieval experiments? In Croft et al. [7], pages 307-314.