Statistical Precision of Information Retrieval Evaluation

Gordon V. Cormack and Thomas R. Lynam
David R. Cheriton School of Computer Science
University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
gvcormac@uwaterloo.ca, trlynam@uwaterloo.ca

ABSTRACT

We introduce and validate bootstrap techniques to compute confidence intervals that quantify the effect of test-collection variability on the average precision (AP) and mean average precision (MAP) IR effectiveness measures. We consider the test collection in IR evaluation to be representative of a population of materially similar collections, whose documents are drawn from an infinite pool with similar characteristics. Our model accurately predicts the degree of concordance between system results on randomly selected halves of the TREC-6 ad hoc corpus. We advance a framework for statistical evaluation that uses the same general approach to model other sources of chance variation, as a source of input for meta-analysis techniques.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Systems and Software - performance evaluation

General Terms: Experimentation, Measurement

Keywords: bootstrap, confidence interval, precision

1. INTRODUCTION

The purpose of IR evaluation is to measure the effectiveness, or relative effectiveness, of information retrieval systems. Statistical precision (known simply as precision in the statistics literature; we write statistical precision to distinguish it from the IR effectiveness measure of the same name) is the degree to which the measurement is free from random error; validity is the degree to which the measurement truly reflects retrieval effectiveness. Validity may be further qualified as internal validity, the aptness of the measure under test conditions, or external validity, the generalizability of the results to other situations ([8], pp. 115-134).

Our primary concern is the statistical precision of information retrieval experiments. If an experiment were to be repeated with different but materially similar data, how similar would the results be? Is it possible, when the test is conducted, to predict accurately this degree of similarity? While these questions are largely amenable to statistical inference, they may be understood only in the context of a general investigative framework that includes questions of validity which are not necessarily statistical. Instead they are addressed by the tools of scientific inquiry: observation, induction, deduction and experiment.

We argue here that one source of random error, that associated with the test corpus, should be considered in assessing IR evaluation methods. We develop and evaluate bootstrap methods that estimate this source of statistical imprecision. We further argue that another source of random error, that associated with the topics, is poorly modelled by regarding the set of test topics as a random sample of some "true" population; experiments assuming such a model are uncompelling as evidence of either statistical precision or validity. Instead, we propose, each topic should be regarded as a separate test, and the results of these tests should be combined using meta-analysis techniques ([8], pp. 643-673).
2. TREC AD HOC RETRIEVAL

The object of our study is the TREC ad hoc retrieval evaluation technique [19]. Given a topic T and a set of documents D, each tested IR system returns an ordered subset S = s_1 s_2 ... s_n of D, ranked by the system's estimate of the likelihood that each document is relevant to T. Several effectiveness measures are computed, including average precision (AP), precision at k returned documents (P@k), and R-precision (P@R), defined as:

    AP = (1/R) * sum_{k=1}^{|S|} rel(s_k) * P@k

    P@k = (1/k) * sum_{i=1}^{k} rel(s_i)

    R = sum_{d in D} rel(d)

    rel(d) = 1 if document d is relevant to T, and 0 otherwise.

The test is repeated for several topics; the effectiveness measures from these tests are reported separately and also averaged across topics. In particular, mean average precision (MAP) is typically used and accepted as a valid effectiveness measure.

The evaluation measures depend on the truth value of "document d is relevant to T", which must be adjudicated. TREC uses the pooling method, in which the top-ranked t documents from a set of systems (perhaps the systems under test) are combined, eliminating duplicates, and presented in random order to a human assessor. The assessor records a judgement of relevant or not relevant for each document in the pool. Documents not in the pool are assumed to be not relevant. TREC evaluations have typically tested about 50 systems using 50 topics, with n = 1000, |D| = 500 000, t = 100, and a pool size of 40 000 (for all 50 topics) [19].
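As a concrete illustration of these definitions, the following sketch computes AP, P@k and R-precision for a single topic from a ranked list of document identifiers and a set of judged-relevant documents. It is our own illustration rather than the TREC evaluation software; the function and variable names are hypothetical.

    # AP, P@k and R-precision for one topic.
    # "ranking" is the system's ranked list of document ids (s_1 ... s_n);
    # "relevant" is the set of documents judged relevant, so R = len(relevant).

    def precision_at_k(ranking, relevant, k):
        # Fraction of the top k retrieved documents that are relevant.
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def average_precision(ranking, relevant):
        # AP = (1/R) * sum of P@k over the ranks k at which a relevant
        # document is retrieved.
        R = len(relevant)
        total, hits = 0.0, 0
        for k, d in enumerate(ranking, start=1):
            if d in relevant:
                hits += 1
                total += hits / k        # P@k at this rank
        return total / R

    def r_precision(ranking, relevant):
        return precision_at_k(ranking, relevant, len(relevant))

    # Example: relevant documents retrieved at ranks 1 and 4, with R = 3.
    print(average_precision(["d3", "d7", "d1", "d9", "d4"], {"d3", "d9", "d5"}))
    # (1/1 + 2/4) / 3 = 0.5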
3. PHILOSOPHICAL FRAMEWORK

The notion of population has been the subject of historical and current philosophical debate [7]. We adopt Fisher's view [4] of an infinite hypothetical population:

    If, in a Mendelian experiment, we say that the probability is one half that a mouse born of a certain mating shall be white, we must conceive of our mouse as one of an infinite population of mice which might have been produced by that mating. The population must be infinite for in sampling from a finite population the fact of one mouse being white would affect the probability of others being white, and this is not the hypothesis which we wish to consider; moreover, the probability may not always be a rational number. Being infinite the population is clearly hypothetical, for not only must the actual number produced by any parents be finite, but we might wish to consider the possibility that the probability should depend on the age of the parents, or their nutritional conditions. We can, however, imagine an unlimited number of mice produced upon the conditions of our experiment, that is, by similar parents, in the same age, in the same environment. The proportion of white mice in this imaginary population appears to be the actual meaning to be assigned to our statement of probability. Briefly, the hypothetical population is the conceptual resultant of the conditions which we are studying. The probability, like other statistical parameters, is a numerical characteristic of that population.

A model is a characterization with a few parameters that abstracts the hypothetical population, removing irrelevant information while preserving that which reflects measures of interest; in this instance, measures of retrieval effectiveness. As such, a model is a scientific theory that is posed to explain known facts; its worth is judged by its simplicity, its ability to explain existing observations, and its ability to predict new ones. A theory is never "proved", as it is simply a model, but our confidence in it builds as these criteria (degree of abstraction, explanatory ability, predictive ability) are demonstrated.

With respect to IR evaluation, it is possible to identify several hypothetical target populations: the topics that might be presented to a system, the corpora from which the system may be expected to retrieve documents relevant to the topic, the set of relevance assessments for the topics; even the set of systems that might be subject to test may be considered a hypothetical population of interest. However, these populations are exceedingly difficult to specify, let alone model with a small number of parameters. And sampling them would be a hopeless task, as many members of the population exist only in the future. Instead, we select readily available data and observations, and treat them as representing the hypothetical population of all data like that which we collected: the source population. The meaning of the word "like" must be considered carefully in modelling such a population; a narrow definition may improve statistical precision while a broad one may improve external validity. External validity, the applicability of the model to the target population, is established not by statistical inference but by scientific inquiry in which (a) predictions about other data are made and tested by experiment, and (b) sources of possible systematic or random error are identified and tested by experiment.
4. STATISTICS IN IR EVALUATION

Tague-Sutcliffe [12] argues that validity, reliability and efficiency should be considered in a qualitative assessment of various design issues. Validity is used to mean internal validity; reliability subsumes (statistical) precision and external validity; efficiency relates to the resources that are expended in achieving validity and reliability. (In testing, reliability is the degree to which the same test, administered to the same subject, will yield a consistent score ([8], p. 507). Assuming one interprets "the same" literally, IR tests are 100% reliable; figurative interpretations are captured by Fisher's hypothetical population.)

Tague-Sutcliffe [13] performed a statistical analysis of the TREC-3 results, under the assumption that the set of topics was a random sample of "all possible queries that might be asked of the database." Paired testing was rejected so as to avoid the fallacy of multiple hypothesis testing (cherry-picking); analysis of variance (ANOVA) was used to compute significant differences among systems according to a number of performance measures. Very large differences in performance, spanning approximately three-quarters of the tested systems, were necessary to distinguish systems with 95% confidence (i.e. p < .05). Neither the choice of measure nor an arcsine transformation had substantial impact on the results.

Savoy [11], under the assumption that topics are a random sample, examines the use of classical and bootstrap methods [3] to test the relative performance of pairs of systems. The bootstrap builds a concrete model for a hypothetical population in which each element of the sample is replicated an equal and infinite number of times; this population may, in effect, be sampled any number of times by drawing elements from the original sample, with replacement. Savoy performs significance tests to support the proposition that the bootstrap yields higher statistical precision than parametric approaches, and that the median, as opposed to the mean, is a better summary statistic.

Voorhees and Buckley [17] explore the effect of topic set size, also assuming the topics to be a random sample. Rather than building a statistical model, they measure the proportion of discordant results between evaluations performed using disjoint sample subsets. Results are stratified by the difference in evaluation measure between each pair of systems. For each stratum, an exponential curve on two parameters is used to estimate the proportion of discordant pairs. Sanderson and Zobel [10] measure discordance proportion stratified by the p-value of a significance test and, at the same time, by the magnitude of the difference between MAP scores, and observe that a large difference coupled with a small p-value predicts low discordance.

Several studies [15, 2, 20, 9, 18, 10] have considered the effect of variations in relevance judgments and judging pools on retrieval evaluation. Buckley and Voorhees [1] consider the effect of using differently formed queries to represent the same information need.

Reports on IR evaluations often include standard tests such as paired t-tests, Wilcoxon signed-rank tests, sign tests, or analysis of variance, notwithstanding questions as to their applicability [6]. (Not often enough, according to Sanderson and Zobel [10], who surveyed published SIGIR papers and found that 14 of 28 claiming retrieval results reported no statistical tests.) Reports typically include significance judgements based on a fixed threshold; p-values are less common and confidence intervals are rarer still. The vast majority, if not all, assume that topic variation is the only source of random error.

Statistical hypothesis testing in general, particularly that based on a fixed threshold, has come under criticism lately ([8], pp. 183-199). H0, the null hypothesis that two populations are the same, is a strawman that is too easy to refute. In the real world, no two distinct things are the same [5], and a large enough sample will show this. Such a hypothesis should be replaced by an estimate of the magnitude of the difference and an argument as to whether or not that difference is important.
5. COLLECTION VARIABILITY

To measure statistical imprecision due to collection variability, we use the hypothetical population of all collections that are materially similar to the test collection. We formulated a simple characterization of this population and conducted a pilot experiment in which we constructed confidence intervals for AP using a bootstrap estimation of model parameters, and predicted the number of AP values in a second test that should fall in this interval, according to the model. The results show good precision but for some outliers, which led us to examine the special cases that they represent, and to adapt the model to account for them.

Our initial model assumes that D, the set of documents in the test collection, is an independent and identically distributed (i.i.d.) sample of a population of similar documents. By similar, we mean having the same relevance value, and yielding a comparable score (or at least a comparable ranking relative to other documents) when retrieved by the IR system under test. The hypothetical population to which D belongs is the set of all such samples.

Let D' be some other collection of the same size from the same population as D, and AP' be the average precision from applying the same IR system to D'. We wish to compute a 95% confidence interval: a range of values that, with 95% probability, contains the expected value E(AP'). We are not aware of any direct parametric method of estimating this confidence interval; therefore we use the bootstrap to sample the population to which D and D' belong. By repeated sampling, we may estimate the variance of AP' and compute parametric confidence intervals assuming a normal (Gaussian) distribution. Or we may estimate the variance of a monotonic transform t(AP') which is better distributed, in effect computing confidence limits for E(t(AP')), which may be more accurate. Or we may compute confidence intervals nonparametrically by selecting the 2.5th through 97.5th percentile of the bootstrap samples. A bias-corrected variant of the percentile method is known as BCa [3]. We used three methods for our pilot: variance of AP' values; variance of logit(AP'), where logit(x) = log(x/(1-x)); and BCa.

The bootstrap constructs repeated examples of D' by resampling D. That is, the elements of each example of D' are selected from D, with replacement. For this application we assume that |D| is large compared to n, the size of the ranked list of documents to be retrieved (typically |D| > 100n). This assumption allows us to use the Poisson distribution to generate S', the list of documents retrieved from D', without considering the irrelevant elements of D. Specifically, each document of S (the retrieved set of documents) is assumed to be replicated k times in S' with probability 1/(e * k!). So to construct S' we take each s_i in S in rank order, generate a random k according to the Poisson distribution, and replicate the element k times. We assume that the replicated elements all receive comparable scores from the IR system and thus are consecutively ranked in S'. Using this construction, |S'| ≈ |S|. The difference in sizes is inconsequential to the AP calculation.

It is also necessary to compute R' for the bootstrap sample. To do this we partition R:

    R = Rret + Rnot
    Rret = |{d in S | rel(d)}|
    Rnot = |{d not in S | rel(d)}| = R - Rret

R'ret is determined directly from S'; R'not is computed post-hoc as

    R'not = sum_{i=1}^{Rnot} k_i

where each k_i is randomly generated according to the Poisson distribution.
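The following sketch shows one way this construction can be realized. It is our own illustration under the assumptions above (Poisson replication counts with mean 1, relevance flags given in rank order); the function and variable names are hypothetical.

    import numpy as np

    def bootstrap_ap(rel_flags, r_not, rng):
        # rel_flags: 0/1 relevance of each document of S, in rank order.
        # r_not: number of relevant documents of D that were not retrieved.
        counts = rng.poisson(1.0, size=len(rel_flags))        # replication count for each s_i
        boot_rel = np.repeat(np.asarray(rel_flags), counts)   # replicas stay consecutive in rank
        r_boot = int(boot_rel.sum()) + int(rng.poisson(1.0, size=r_not).sum())
        if r_boot == 0:
            return None                                       # AP is undefined when R' = 0
        ranks = np.arange(1, len(boot_rel) + 1)
        hits = np.cumsum(boot_rel)
        return float(np.sum((hits / ranks) * boot_rel) / r_boot)

    # Repeating this (e.g. 2000 times) yields AP' samples from which the variance,
    # the variance of logit(AP'), or percentile intervals can be computed:
    # rng = np.random.default_rng(0)
    # samples = [bootstrap_ap(rel_flags, r_not, rng) for _ in range(2000)]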
The net effect is that we may compute as many examples of AP' as necessary to compute model parameters for our hypothetical population. For our untransformed parametric confidence interval estimate, we compute the standard deviation σ of the AP' values; the 95% confidence interval is AP ± 1.96σ. For the logit-transformed parametric estimate, we first replace AP' values of 0 and 1 by ε and 1 - ε respectively, and compute the standard deviation σ_logit of logit(AP'); the 95% confidence interval is logit^{-1}(logit(AP) ± 1.96σ_logit). BCa confidence intervals were computed directly using the System R implementation of Efron's S-Plus code ([3], pp. 402-403).

6. PILOT EXPERIMENT

We used the raw results from the 74 IR system runs evaluated over 50 topics in the TREC-6 ad hoc task [14]: 3652 non-empty ranked-result lists in total. The corpus documents were split into two subsets, A and B, of roughly equal size using an MD5 hash on the document identifier. Similarly, each result list was split into two, one representing the documents retrieved from A, the other from B. These two sets of result lists were assumed to represent the retrieval results on two independent corpora drawn from a common source population. We used 2000 bootstrap samples of the A corpus and the three bootstrap techniques to compute 95% confidence intervals.

6.1 Prediction

Recall that the confidence interval is defined to be an interval which contains the true value E(AP') with 95% probability. If we knew the value of E(AP') we could simply count the proportion of times E(AP') fell within the computed confidence interval, expecting this proportion to be about 95% if the intervals were accurate. Similarly, if the intervals were unbiased, we would expect an equal proportion (about 2.5%) to fall above as below the interval. But we don't know E(AP'); therefore, we validate the model by using it to predict how many times APB (AP computed on the B corpus) should fall within the confidence interval created by bootstrapping the A corpus. Note that this frequency is not 95%, as commonly assumed, but considerably lower.

For the parametric models, recall that the interval is APA ± 1.96σ. We wish to predict how likely APB is to fall in this interval; more precisely,

    APA - 1.96σ ≤ APB ≤ APA + 1.96σ,   or   -1.96σ ≤ APB - APA ≤ 1.96σ.

Our model predicts that APA and APB both have standard deviation σ, so σ_AB, the standard deviation of APB - APA, is given by σ_AB = sqrt(2) * σ. Substituting, we have

    -1.39 σ_AB ≤ APB - APA ≤ 1.39 σ_AB.

This range bounds 83.5% of the area under the normal curve for APB - APA, and hence we expect the inequalities to be satisfied, i.e. APB to fall within the confidence interval, 83.5% of the time. Furthermore, APB should fall about equally to the left and to the right of the interval. The same argument holds for the logit-transformed parametric model. There is no similar mechanism for making direct predictions from the non-parametric confidence intervals. However, we may still generally compare their results to those of the parametric methods.
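The 83.5% figure can be checked directly from the standard normal distribution; the short computation below (our own check, not part of the paper's tooling) evaluates 2Φ(1.96/sqrt(2)) - 1 using the error function.

    import math

    z = 1.96 / math.sqrt(2)                        # approx 1.386 standard deviations of APB - APA
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal CDF at z
    print(2 * phi - 1)                             # approx 0.834, i.e. the 83.5% cited in the text
                                                   # (which rounds 1.386 to 1.39)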
6.2 Pilot Results

Confidence intervals were computed using the A subset of the results for each topic within each run (n = 3652). APB was computed for the corresponding B subset and compared to the confidence interval. Table 1 reports the proportion of APB values above, within, and below the intervals. Figure 1 shows (for the BCa method only) the distribution of APB - APA, normalized so that the confidence interval occupies the range -1 .. 1.

                All Runs (n=3652)            Only R > 5 (n=3068)
              below     in     above        below     in     above
    linear     3.7%   77.8%    11.1%         2.8%   82.0%    15.1%
    logit     11.1%   76.5%    12.5%         9.4%   81.4%     9.1%
    BCa        7.4%   77.1%    15.5%         6.7%   80.8%    12.5%
    model      8.25%  83.5%     8.25%        8.25%  83.5%     8.25%

    Table 1: Proportion of APB within interval

    [Figure 1: Distribution of Normalized APB - APA. Histogram of APB - APA, normalized so that the confidence interval spans -1 .. 1, with frequency on the vertical axis.]

Figure 1 makes it apparent that there is a large number of outliers at the high end of the distribution. Further investigation reveals that these extreme values are almost entirely accounted for by small-sample effects. Most of these arise when R ≤ 5 (R = 1 in particular). Selecting only those topics for which R > 5 gives the proportions listed in the second half of Table 1. The number of APB values within the predicted interval approaches, but does not quite reach, the predicted 83.5%. Furthermore, the linear and BCa estimates show evidence of bias.

Even when we restrict our attention to the situation in which R > 5, a handful of outliers remain. These consist mainly of cases with APA = 0 or APA = 1, where the bootstrap erroneously reports σ = 0. In this situation, APB falls within the interval only if it is exactly equal to APA, which occurs in substantially less than 95% of the cases. Further investigation revealed that the high error rate among the cases with R ≤ 5 was also largely due to cases with APA = 0 or APA = 1.

6.3 Pilot Conclusions

From the pilot we conclude that the model works well for the majority of situations, but special attention needs to be paid to situations in which R is small, or in which APA = 0 or APA = 1.

It is well known that many statistical methods are inapplicable to small samples. One way to deal with this problem is to design studies and experiments that yield sufficient numbers. In studying the prevalence of disease, for example, an epidemiologist would ensure that the sample population could be expected to contain a sufficient number of positive examples to be amenable to statistical inference. TREC's corpus design process pays careful attention to this consideration, rejecting topics expected to yield very low or very high values of R. The case of R = 0 is avoided in particular, because AP is undefined in this situation. The TREC-6 corpus that we used had 5 of its fifty topics with R ≤ 5 (3 with R = 5; 1 with R = 4; 1 with R = 3; none with R ≤ 2). Our split-sample technique created A and B corpora which effectively halved R, giving rise to substantially more topics (9 and 8 resp.) with R ≤ 5 and also to several (3 and 4 resp.) with R ≤ 2, smaller than any in the TREC-6 corpus. Therefore we would expect the estimation techniques used in our pilot to be considerably more accurate were they applied to the full corpus. Experimental design issues notwithstanding, the results of our pilot prompted us to investigate ways to augment our model to handle cases with small R.
7. SMALL-R CORRECTION

We modelled cases with small R (and some cases with larger R) by considering AP = 0 and AP = 1 as special cases. AP = 0 means that, among the R relevant documents in the collection, the retrieval system found none. But it may be that in the source population, the set of materially similar collections, there exist silver bullets: relevant documents that the system could retrieve but that did not happen to be among the R sampled for this particular test. Using exact binomial probabilities, we establish confidence limits for the proportion of such silver bullets that may exist in the source population. The lower confidence limit is 0 and the upper confidence limit is the smallest u such that the probability (1 - u)^R of having no silver bullets in our sample does not exceed the significance threshold, under the assumption that u is the true proportion. For example, if R = 4, the 95% confidence limit for the proportion of silver bullets is [0, 0.53]. That is, we cannot statistically reject the hypothesis that 53% of all relevant documents are silver bullets! We take 53% (or whatever the appropriate fraction is for a given R) to represent the extreme case that results in our upper confidence limit. But the confidence limit must be expressed as an AP value, not a proportion of silver bullets. To compute AP, we assume that the retrieved rank of a silver bullet is uniformly distributed between 1 and n (i.e. it could appear anywhere in the retrieved list) and, using dynamic programming, compute by enumeration the resulting E(AP). This value is chosen as the upper confidence limit in place of 0.

A similar model was used for the cases of AP = 1. We use binomial probabilities to establish bounds for the proportion of lead balloons: relevant documents that the system is unable to retrieve, but that are not represented in the R relevant documents sampled for this particular test. A similar dynamic approach converts the worst-case lead-balloon proportion to a lower confidence limit. The same corrections were applied to cases with AP ≈ 0 and AP ≈ 1; for such values, we use the larger of the original confidence interval and the interval under the assumption that AP = 0 (AP = 1, resp.).

Table 2 shows the result of applying small-R correction to the results of the two bootstrap methods. (We did not use the computationally intensive BCa method, as it showed poor results in the pilot and was not amenable to the further experiments detailed below.) The left half shows the fraction of APB values that fall within the confidence intervals computed from A; the right half shows the fraction of APA values that fall within intervals computed from B. In both cases, both methods yield in-interval fractions that are extremely close to those predicted by the model. However, the linear model exhibits considerable bias, as evidenced by the fact that the fraction above is roughly six times larger than the fraction below in both tests. The logit model demonstrates no apparent bias.

                APB w.r.t. A                 APA w.r.t. B
              below     in     above        below     in     above
    linear     2.1%   82.8%    15.0%         2.4%   83.9%    13.6%
    logit      8.5%   83.6%     7.9%         8.7%   82.7%     8.7%

    Table 2: Small-R corrected within interval

Figure 2 shows normalized APB - APA for the linear and logit models. Logit is clearly more symmetric, with fewer outliers. All of the (few) logit outliers are cases with APA = 0, for which a nonparametric model was used. Such a model predicts only the number in-interval, not the distribution of those outside. The outliers should therefore not be considered to contradict the model.

    [Figure 2: Normalized Small-R Corrected Distribution (c.l. = [-1, 1]). Two histograms of APB - APA, one with no transform and one with the logit transform; frequency on the vertical axis.]
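A sketch of the AP = 0 correction of Section 7 follows. The upper limit u comes directly from the binomial bound above (u = 1 - 0.05^(1/R), which gives roughly 0.53 for R = 4); converting u to an AP value is our own reading of the procedure, using a Monte Carlo estimate in place of the dynamic-programming enumeration, under the stated assumption that each silver bullet's rank is uniform on 1..n. Names and defaults are hypothetical.

    import random

    def ap_upper_limit_when_zero(R, n=1000, alpha=0.05, trials=20000, rng=random.Random(0)):
        # Smallest u with (1 - u)^R <= alpha: upper confidence limit on the
        # proportion of silver bullets when none was observed among R relevant docs.
        u = 1 - alpha ** (1.0 / R)                     # e.g. R = 4 gives u ~ 0.53
        # Extreme case: each of the R relevant documents is a silver bullet with
        # probability u, retrieved at a rank uniform on 1..n (rank collisions are
        # ignored; they are rare when n >> R).  Average AP over simulated draws.
        total = 0.0
        for _ in range(trials):
            ranks = sorted(rng.randint(1, n) for _ in range(R) if rng.random() < u)
            total += sum((i + 1) / r for i, r in enumerate(ranks)) / R
        return u, total / trials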
8. MEAN AVERAGE PRECISION

Mean Average Precision (MAP), the average of AP values over several topics, is commonly reported as an overall summary measure. We investigated methods to compute the sensitivity of MAP to corpus variation. We applied the same methods to a summary measure we call logistic MAP (L-MAP), which averages logit(AP) instead of AP. L-MAP is closely related (and nearly identical for small values) to geometric MAP (G-MAP), the average of log(AP) values, which has been proposed recently to increase the contribution of low AP values to the overall measure [16].

The first row of Table 3 shows the fraction of the 74 MAPB (MAPA resp.) values falling within the interval computed by bootstrap sampling A (B resp.). That is, MAP was computed for each of the 2000 bootstrap samples, and the variance of these values was used to estimate the standard error. The second row shows the same method applied to L-MAP. In the case of MAP, we see that the in-interval fraction falls considerably below that predicted by the model, suggesting that the model is inappropriate. The L-MAP in-interval fraction, on the other hand, suggests that in this case the model is appropriate. The imbalance between above and below fractions suggests random skew between A and B rather than a systematic bias in the model. Note that these fractions are determined from 74 data points, as opposed to 3652 for the AP computations. Therefore these observations should be taken as indications to be confirmed by a larger experiment.

                (L-)MAPB w.r.t. A            (L-)MAPA w.r.t. B
              below     in     above        below     in     above
    MAP       27.0%   68.9%     4.0%        21.6%   75.6%     2.7%
    L-MAP      1.3%   83.7%    14.8%        16.2%   82.4%     1.3%

    Table 3: Bootstrap 50 topic mean within interval

Table 4 shows the results of using a parametric approach to combine the 74 separate AP confidence intervals into a single MAP (L-MAP) confidence interval. To compute the MAP confidence interval, we averaged the variances derived from the logistic model, but weighted them according to their relative contribution to the MAP statistic. We used a multiplicative weight of AP - AP^2, the derivative of AP with respect to logit(AP). To compute the L-MAP confidence interval, we simply averaged the unweighted variances of the 74 separate estimates. We observe that the parametric estimates for MAP are considerably better than the bootstrap estimates, further validating the logit model. However, they appear to be slightly optimistic, yielding about 80% in-interval as opposed to the predicted 83.5%. On the other hand, the parametric estimate for L-MAP is much worse than the bootstrap estimate. We attribute this error to the fact that the estimate weights heavily the small-R-corrected estimates, which are themselves non-parametric and therefore not suitable variance estimates. This error is much less important for MAP because most of these estimates receive extremely low weight.

                (L-)MAPB w.r.t. A            (L-)MAPA w.r.t. B
              below     in     above        below     in     above
    MAP        1.3%   78.3%    20.2%        16.2%   81.0%     2.7%
    L-MAP      2.7%   74.3%    22.9%        22.9%   74.3%     2.7%

    Table 4: Parametric 50 topic mean within interval

Figure 3 shows the MAP parametric confidence intervals computed from A, along with the MAPA and MAPB values, marked x and o respectively. This graphic shows that the confidence intervals generally do a good job of predicting the range of possible MAPB values, and the out-of-interval values are near-misses rather than outliers. Figure 3 also shows the corresponding L-MAP bootstrap confidence intervals, and reflects the same general observation. Figure 4 shows the MAP and L-MAP values based on the full corpus (as opposed to the A subset). As expected, the confidence intervals are smaller, typically with a width of about 0.05.

    [Figure 3: MAPB and L-MAPB with respect to A confidence intervals. Two panels (MAP Parametric Intervals; L-MAP Bootstrap Intervals), each plotting per-run intervals against run number.]

    [Figure 4: Full corpus confidence intervals. Two panels (MAP Parametric Intervals; L-MAP Bootstrap Intervals), each plotting per-run intervals against run number.]
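For the bootstrap estimates of Table 3, the computation reduces to averaging across topics within each bootstrap replicate. A minimal sketch (our own illustration, with hypothetical names, assuming ap_boot is a topics-by-replicates array of bootstrap AP' values produced as in Section 5):

    import numpy as np

    def logit(p, eps=1e-6):
        p = np.clip(p, eps, 1 - eps)          # guard AP values of exactly 0 or 1
        return np.log(p / (1 - p))

    def mean_interval_halfwidths(ap_boot):
        # ap_boot: shape (topics, replicates).  Column j holds the AP' values of
        # all topics for bootstrap replicate j.
        map_se = ap_boot.mean(axis=0).std(ddof=1)          # std of per-replicate MAP
        lmap_se = logit(ap_boot).mean(axis=0).std(ddof=1)  # std of per-replicate L-MAP
        # Intervals are MAP ± 1.96*map_se and, on the logit scale, L-MAP ± 1.96*lmap_se.
        return 1.96 * map_se, 1.96 * lmap_se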
9. DISCUSSION

Experimental evidence suggests that our model for corpus variability aptly predicts confidence intervals for individual AP values. logit(AP) has better algebraic properties than AP and therefore yields a better model, from which AP confidence intervals can be derived. AP values close to 0 and 1 are problematic, and arise often when R, the number of relevant documents, is small. These anomalies may be addressed by using a non-parametric binomial model to predict silver bullets and lead balloons: relevant documents whose properties are not represented at all in the corpus.

MAP exhibits the same algebraic anomalies as AP; in this situation, a weighted average of logit(AP) variances is used to predict indirectly the effect of averaging non-logit-transformed AP values. L-MAP, on the other hand, may be estimated directly using the bootstrap. We expect that G-MAP would exhibit similar properties to L-MAP, as they differ substantially only for values close to 1, values which occur rarely in IR evaluation.

Our framework and validation technique apply equally to models for evaluating the relative effectiveness of a pair of IR systems. One simply has to model the difference d between the two systems according to some measure of interest, for example d = APx - APy. If we construct separate models for APx and APy with standard deviations σx and σy, we may estimate σd = sqrt(σx^2 + σy^2), and a 95% confidence interval of ±1.96σd. Our results indicate that defining d = logit(APx) - logit(APy) would yield a better estimate.

Confidence intervals for the difference contain strictly more information than fixed-threshold hypothesis tests; the difference between APx and APy is significant (two-tailed test, α = 0.05) exactly when 0 does not fall within the confidence interval for d. The same approach may be applied to the difference between MAP, L-MAP or G-MAP scores. A more powerful estimate for d, which takes into account correlations between the two systems' results, may be effected by bootstrapping d. For each bootstrap sample, directly compute AP'x and AP'y [logit(AP'x), logit(AP'y)] and hence d'; then estimate σd from the various d' values.
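A sketch of this paired estimate, assuming the per-replicate AP' pairs have already been produced by applying the same document resampling (the same replication counts) to both systems' result lists; the names are our own:

    import math

    def logit(p, eps=1e-6):
        p = min(max(p, eps), 1 - eps)
        return math.log(p / (1 - p))

    def difference_interval(paired_ap_boot):
        # paired_ap_boot: list of (ap_x, ap_y) pairs, one per bootstrap replicate,
        # both computed from the same resampled collection so that the correlation
        # between the two systems is preserved.
        d = [logit(x) - logit(y) for x, y in paired_ap_boot]
        mean = sum(d) / len(d)
        sd = (sum((v - mean) ** 2 for v in d) / (len(d) - 1)) ** 0.5
        # 0 outside this interval corresponds to significance at alpha = 0.05.
        return mean - 1.96 * sd, mean + 1.96 * sd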
Our analysis indicates that current methods (in particular, a t-test which tacitly models d = APx - APy over the population of all topics with a fixed corpus) poorly model random error due to topic variability. By the argument in section 6.1, if a test concluded from one sample that APx > APy (p = 0.05), we would expect that for some other sample APx > APy would occur with probability 0.835, not 0.95. Furthermore, this prediction should be insensitive to the sample size and the magnitude of the difference between APx and APy. Experimental results [10] are inconsistent with these predictions, contradicting the validity of the tests. We conjecture that using the logit transform would mitigate but not overcome the shortcomings of the underlying model. If the logit transform were to yield a reasonable model for topic variability (a proposition whose experimental investigation we leave to future work), it would be a simple matter to use the bootstrap method developed here to model it, or to model both topic and collection variability at once. One must simply compute L-MAP (or the difference between L-MAPs) using bootstrap resampling to select both the topic and the corpus for each sample. The underlying foundation is the same.

A more promising approach, we argue, is to regard the results from each topic as separate tests and to combine them using meta-analysis ([8], pp. 643-673; [5]). For a simple paired hypothesis test, one may combine the values of d_i and σ_i arising from k separate tests to compute an overall single-tailed p-value

    p = 1 - Φ( sum_{i=1}^{k} d_i / (sqrt(k) σ_i) )

where Φ is the cumulative normal distribution. More sophisticated meta-analysis involves identifying a quantitative "effect" and measuring it with confidence intervals. While it is difficult to argue that d = APx - APy is a meaningful quantity, the value d = logit(APx) - logit(APy) represents the logarithm of the ratio of the effectiveness of the two systems, which we advance as a worthwhile measure. Meta-analysis may also be used to estimate the performance of a single system over several tests. The most obvious measure is simply AP, but our results indicate that logit(AP) would be more appropriate. Even more appropriate would be a measure that compensated for topic difficulty; we suggest d = logit(AP) - logit(X), where X is the performance of some baseline system or some other estimate of "normal" system performance.

Fixed-effect model meta-analysis computes the effect d and standard error σ as follows:

    d = ( sum_{i=1}^{k} d_i σ_i^{-2} ) / ( sum_{i=1}^{k} σ_i^{-2} )

    σ = ( sum_{i=1}^{k} σ_i^{-2} )^{-1/2}

The overall effect estimate is the average of the individual estimates, weighted by their statistical precision. Random-effect model meta-analysis further compensates for heterogeneity of tests such as might occur when using diverse topics or corpora.
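A minimal sketch of the fixed-effect combination (our own illustration of the two formulas above; the per-topic effects d_i and standard errors σ_i are assumed to come from the per-topic models described earlier):

    def fixed_effect(d, sigma):
        # Inverse-variance weighted combination of per-topic effects d[i] with
        # standard errors sigma[i].
        w = [1.0 / s ** 2 for s in sigma]
        d_combined = sum(wi * di for wi, di in zip(w, d)) / sum(w)
        se_combined = sum(w) ** -0.5
        return d_combined, se_combined   # 95% CI: d_combined ± 1.96 * se_combined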
10. REFERENCES

[1] Buckley, C., and Voorhees, E. M. Evaluating evaluation measure stability. In SIGIR Conference 2000 (Athens, Greece, 2000).
[2] Cormack, G. V., Palmer, C. R., and Clarke, C. L. A. Efficient construction of large test collections. In SIGIR Conference 1998 (Melbourne, Australia, 1998).
[3] Efron, B., and Tibshirani, R. J. An Introduction to the Bootstrap. Chapman and Hall, New York, 1994.
[4] Fisher, R. A. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society 22 (1925), 700-725.
[5] Glass, G. V. Meta-analysis at 25. http://glass.ed.asu.edu/gene/papers/meta25.html, 2000.
[6] Hull, D. A. Using statistical testing in the evaluation of retrieval experiments. In Research and Development in Information Retrieval (1993), pp. 329-338.
[7] Lenhard, J. Models and statistical inference: The controversy between Fisher and Neyman-Pearson. British Journal for the Philosophy of Science (2006).
[8] Rothman, K. J., and Greenland, S. Modern Epidemiology. Lippincott Williams & Wilkins, 1998.
[9] Sanderson, M., and Joho, H. Test collections with no system pooling. In SIGIR Conference 2004 (Sheffield, UK, 2004).
[10] Sanderson, M., and Zobel, J. Information retrieval evaluation: Effort, sensitivity, and reliability. In SIGIR Conference 2005 (Salvador, Brazil, 2005).
[11] Savoy, J. Statistical inference in retrieval effectiveness evaluation. Information Processing and Management 33, 4 (1997), 495-512.
[12] Tague-Sutcliffe, J. The pragmatics of information retrieval experimentation, revisited. Information Processing and Management 28, 4 (1992), 467-490.
[13] Tague-Sutcliffe, J., and Blustein, J. A statistical analysis of the TREC-3 data. In Proceedings of TREC-3, The Third Text REtrieval Conference (1994), pp. 385-398.
[14] Voorhees, E., and Harman, D. Overview of the Sixth Text REtrieval Conference (TREC-6). In 6th Text REtrieval Conference (Gaithersburg, MD, 1997).
[15] Voorhees, E. M. Variations in relevance judgements and the measurement of retrieval effectiveness. In SIGIR Conference 1998 (Melbourne, Australia, 1998).
[16] Voorhees, E. M. Overview of the TREC-2004 robust track. In 13th Text REtrieval Conference (Gaithersburg, MD, 2004).
[17] Voorhees, E. M., and Buckley, C. The effect of topic set size on retrieval experiment error. In SIGIR Conference 2002 (Tampere, Finland, 2002).
[18] Voorhees, E. M., and Buckley, C. Retrieval evaluation with incomplete information. In SIGIR Conference 2004 (Sheffield, UK, 2004).
[19] Voorhees, E. M., and Harman, D. K., Eds. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
[20] Zobel, J. How reliable are the results of large-scale information retrieval experiments? In SIGIR Conference 1998 (Melbourne, Australia, 1998).