Validity and Power of t-Test for Comparing MAP and GMAP

Gordon V. Cormack and Thomas R. Lynam
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario N2L 3G1, Canada
gvcormac@uwaterloo.ca, trlynam@uwaterloo.ca

ABSTRACT

We examine the validity and power of the t-test, Wilcoxon test, and sign test in determining whether or not the difference in performance between two IR systems is significant. Empirical tests conducted on subsets of the TREC 2004 Robust Retrieval collection indicate that the p-values computed by these tests for the difference in mean average precision (MAP) between two systems are very accurate for a wide range of sample sizes and significance estimates. Similarly, these tests have good power, with the t-test proving superior overall. The t-test is also valid for comparing geometric mean average precision (GMAP), exhibiting slightly superior accuracy and slightly inferior power relative to MAP comparison.

Copyright is held by the author/owner(s). SIGIR '07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION

The most commonly reported measure for TREC experiments is Mean Average Precision (MAP), the mean of the Average Precision (AP) scores achieved by a particular system over a set of different information needs (topics). The aptness with which MAP characterizes the intended purpose of IR systems is debatable; in our experiments, however, we assume that MAP (or a similar measure, GMAP [1]) aptly reflects differences in system effectiveness, and consider only the validity and power of statistical tests for the significance of the difference in MAP (or GMAP) between two systems.

Although they are based on questionable assumptions [3], the commonest tests used in comparing MAP are the paired t-test, the Wilcoxon signed-rank test, and the sign test. The choice of test is typically based on an uncalibrated tradeoff between validity and power, the assumption being that the t-test is the least valid but the most powerful, and the sign test the most valid but the least powerful.
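As a concrete illustration (ours, not the paper's), the sketch below applies the three tests to paired per-topic AP scores using SciPy. The data arrays are hypothetical placeholders, the one-sided alternative (testing whether system B outperforms system A) is our choice, and a recent SciPy (1.7 or later) is assumed for binomtest and the alternative keyword.

```python
# A minimal sketch: three significance tests on paired per-topic AP scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ap_a = rng.beta(2, 5, size=50)                                # system A (hypothetical)
ap_b = np.clip(ap_a + rng.normal(0.02, 0.05, size=50), 0, 1)  # system B (hypothetical)

# Paired t-test: is B's mean AP (i.e., MAP) significantly greater than A's?
p_t = stats.ttest_rel(ap_b, ap_a, alternative="greater").pvalue

# Wilcoxon signed-rank test on the same paired differences.
p_w = stats.wilcoxon(ap_b, ap_a, alternative="greater").pvalue

# Sign test: binomial test on the number of topics where B beats A (ties dropped).
diff = ap_b - ap_a
wins, n = int((diff > 0).sum()), int((diff != 0).sum())
p_s = stats.binomtest(wins, n, 0.5, alternative="greater").pvalue

print(f"t-test p={p_t:.3f}  Wilcoxon p={p_w:.3f}  sign test p={p_s:.3f}")
```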
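GMAP [1] is the geometric mean of the per-topic AP scores, and log GMAP is the arithmetic mean of log AP. One natural reading of "t-test applied to the difference in GMAP" is therefore a paired t-test on log-transformed AP scores; the sketch below takes that reading, which is our interpretation rather than a detail stated in the paper, and the small floor eps for zero AP scores is likewise our assumption.

```python
# Sketch: GMAP, and a t-test on log AP differences (our interpretation above).
import numpy as np
from scipy import stats

def gmap(ap, eps=1e-5):
    """Geometric mean of AP scores; eps floors zeros (eps is an assumption)."""
    return float(np.exp(np.log(np.maximum(ap, eps)).mean()))

def gmap_ttest(ap_a, ap_b, eps=1e-5):
    """Paired t-test on log AP: since log(GMAP) is the mean of log(AP), this
    asks whether system B's GMAP significantly exceeds system A's."""
    la = np.log(np.maximum(ap_a, eps))
    lb = np.log(np.maximum(ap_b, eps))
    return stats.ttest_rel(lb, la, alternative="greater").pvalue
```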
2. VALIDITY AND POWER

Suppose we wish to determine whether system A outperforms system B, that is, whether M_A > M_B, where the true MAP value M_X is defined to be the mean average precision for system X over the universe of all topics. Any particular experiment yields an estimate M̂_A > M̂_B of the truth value M_A > M_B, where M̂_X denotes the MAP observed for system X on the experiment's topic sample. Classical statistical tests quantify the significance of this estimate as a p-value: p is the likelihood that a similar experiment might by chance produce the same estimate M̂_A > M̂_B even though the opposite (M_A ≤ M_B) were true. The validity of a statistical test may be characterized by the accuracy with which it estimates this likelihood. However, it is impossible to measure this likelihood directly by experiment, as we can never compute M_A or M_B in order to know the truth value of M_A > M_B. It is furthermore difficult to construct a large number of similar experiments, as would be necessary for direct empirical validation. We therefore validate the statistical tests by comparing the results of pairs of similar but independent experiments, derived by splitting the set of topics used in a larger experiment, the TREC 2004 Robust Retrieval Track.

We use each statistical test to predict d, the probability of a discordant result between the split samples. Note that d > p, as d accounts for the sampling error from both splits, while p accounts for only one. Over many predictions, the expected number of discordant results is simply the sum of the d values; if the test is valid, the observed number should be close to this expectation, and should remain so when the results are stratified by factors such as the value of p or the magnitude of M̂_A - M̂_B (contrary to the apparent results of some previous tests [2]).

It is common practice to deem significant an experimental result with p < α for some fixed threshold α (typically α = 0.05). The power of an experimental design is the probability that it will compute a true result with p < α. For a valid test, power may be estimated empirically by simulating several experiments and measuring the proportion that yield a correct significant result. The validity of p should be independent of sample size, of the magnitude of the difference between the results being compared, and so on. Power, on the other hand, depends directly on both: a larger sample will in general yield lower p-values, and hence greater power. Experimental design must therefore optimize the tradeoff between power and the cost of conducting larger experiments. A sketch of this split-and-compare protocol appears below.
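The following sketch (ours, with illustrative names and data; ap_a and ap_b are assumed arrays of per-topic AP scores over the full topic set) mimics the split-half protocol: topics are repeatedly split into two equal halves, a paired t-test, realized as a one-sample t-test on per-topic AP differences, is applied to one half, and a discordant result is recorded when the two halves disagree on the direction of the MAP difference. The paper additionally has each test predict the discordance probability d; the exact rule for combining the two splits' sampling error is not spelled out in this excerpt, so the sketch measures only observed discordance and empirical power.

```python
# Hedged sketch of the split-half validation protocol described above.
import numpy as np
from scipy import stats

def split_half_trial(ap_a, ap_b, rng):
    """One random equal split. Returns the one-sided p-value on half 1, the
    direction observed on half 1, and whether the two halves disagree."""
    diff = ap_b - ap_a
    idx = rng.permutation(len(diff))
    d1, d2 = diff[idx[:len(diff) // 2]], diff[idx[len(diff) // 2:]]
    side = "greater" if d1.mean() > 0 else "less"
    p = stats.ttest_1samp(d1, 0.0, alternative=side).pvalue
    return p, d1.mean() > 0, (d1.mean() > 0) != (d2.mean() > 0)

def discordance_and_power(ap_a, ap_b, trials=1000, alpha=0.05, seed=0):
    """Observed discordance rate over all trials, and power at threshold
    alpha, taking the full-set direction of the MAP difference as truth."""
    rng = np.random.default_rng(seed)
    truth = (ap_b - ap_a).mean() > 0
    out = [split_half_trial(ap_a, ap_b, rng) for _ in range(trials)]
    ps = np.array([p for p, _, _ in out])
    dirs = np.array([d for _, d, _ in out])
    disc = np.array([c for _, _, c in out])
    power = float(((ps < alpha) & (dirs == truth)).mean())
    return float(disc.mean()), power
```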
3. EXPERIMENTS

The TREC 2004 Robust Track evaluated 110 systems on 249 topics. For each pair of systems we constructed several random equal splits with 124 topics per split, and applied three statistical tests (the paired t-test, the Wilcoxon signed-rank test, and the sign test) to one of the splits. Using the test we computed both p and d. We summed the values of d, stratified by p, and also counted the number of discordant results between the two splits.

The results for the t-test are presented in Figure 1 and Table 1. Of the 47080 t-tests, 32275 (68.6%) yielded p < .01. Of these tests, predicted and actual discordances totalled 166 and 193, a difference of 14%. Other strata of p contained fewer tests and resulted in smaller errors. The RMS error over all strata was 8.5%. We take this low value to validate the t-test. Power depends on the chosen threshold α: for α = 0.01, power is 0.68; for α = .05, power is 0.778.

[Figure 1: number of pairs, expected discordance, and actual discordance, in total and in each stratum of p (logarithmic scale).]

                       p < .01        .01≤p<.02    .02≤p<.05    .05≤p<.10    .10≤p<.20    .20≤p<.50     RMS error
pairs                  32275 (68.6%)  1688 (3.6%)  2646 (5.6%)  2363 (5.0%)  2835 (6.0%)  5273 (11.2%)
predicted discordance  166            101          252          357          641          2029
actual discordance     193            98           223          338          617          1971
error                  +14.0%         -3.1%        -13.0%       -5.6%        -4.2%        -2.9%         8.6%
power                  0.686          0.721        0.778        ~            ~            ~

Table 1: t-test Discordance. The power row is cumulative: power at α equal to the stratum's upper bound (e.g., 0.778 at α = .05).

Table 2 shows RMS error and power for four statistical tests: the t-test (repeated from Table 1), the Wilcoxon test and the sign test applied to the difference in MAP, and the t-test applied to the difference in GMAP. The sign test has higher error and lower power than the t-test, while the Wilcoxon test has higher error and marginally higher power. The t-test applied to GMAP shows an error rate comparable to that of the t-test applied to MAP, and somewhat lower power.

               RMS error   power (α = .01)   power (α = .05)
sign test      16.4%       0.654             0.757
t-test         8.6%        0.686             0.778
Wilcoxon test  13.8%       0.713             0.801
t-test (GMAP)  8.3%        0.625             0.735

Table 2: Validity vs Power

A second set of experiments used unequal splits to measure the sensitivity of the t-test to the number of topics sampled. For this experiment, we assumed that the t-test for the larger sample was accurate (as evidenced by the first experiment) and combined it with the t-test for the smaller sample to estimate d. Any increased error could therefore be attributed to the smaller sample size. Table 3 shows error and power as a function of sample size. As expected, error rates were somewhat higher for smaller sample sizes, but overall predicted discordance agrees very well with actual discordance. Power increases with sample size, as expected.

topics           25      50      75      124     249
RMS error        14.7%   13.0%   7.0%    8.6%    ~
power (α = .01)  0.394   0.533   0.606   0.686   0.775
power (α = .05)  0.555   0.664   0.716   0.778   0.844

Table 3: t-test Validity vs Power

4. REFERENCES

[1] Robertson, S. On GMAP and other transformations. In CIKM '06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management (New York, NY, USA, 2006), ACM Press, pp. 78-83.

[2] Sanderson, M., and Zobel, J. Information retrieval system evaluation: Effort, sensitivity, and reliability. In SIGIR 2005 (Salvador, Brazil, 2005).

[3] Van Rijsbergen, C. J. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979.