SIGIR 2007 Proceedings Poster

A Comparison of Pooled and Sampled Relevance Judgments

Ian Soboroff
National Institute of Standards and Technology
Gaithersburg, Maryland, USA

Test collections are most useful when they are reusable, that is, when they can be reliably used to rank systems that did not contribute to the pools. Pooled relevance judgments for very large collections may not be reusable for two reasons: they will be very sparse and not sufficiently complete, and they may be biased in the sense that they will unfairly rank some class of systems. The TREC 2006 terabyte track judged both a pool and a deep random sample in order to measure the effects of sparseness and bias.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software--Performance evaluation
General Terms: Experimentation, Measurement.
Keywords: test collections, pooling, sampling.

1. RELEVANCE-BASED SAMPLING

The relevance assessments were divided into two phases. In the first phase, a subset of the runs submitted to the terabyte track were pooled to depth 50. This pool has an average of 640 documents per topic, 118 of which are relevant. Judgments from this pool are adequate to compute MAP to a reasonable precision for the participating runs.

In the second phase, a random sample was drawn from the pooled runs, starting from rank 1 and reaching to a topic-specific depth. The sampling strategy, relevance-based sampling, has two parameters: the maximum sample depth and the sampling rate. These are estimated using the judgments from the depth-50 pool with the goal of finding 20 new relevant documents out of 200 additional judgments. This goal is called the "relevant-percent-target", or rpt = 20/200.

The depth for a single topic is computed as follows. We first compute the probability of relevance in the pool at depth 50 as P(rel) = |R|/|J|, where |R| is the number of relevant documents and |J| is the number of documents judged. As we go deeper into the pool, this probability of relevance drops off exponentially, but for simplicity, and because we are focused on relatively early ranks, we approximate this trend with a linear fit with a fixed slope representing the change in P(rel) per 50 ranks, and compute a rank x where the relevant fraction y = rpt:

    x = (rpt - slope - P(rel)) · |J| / (-slope)

Following this fit, we estimate the size of a pool that contains 20 additional relevant documents per 200 judged as

    poolsize = |J| + (x - |J|) · 2

For topics with very few relevant documents in the depth-50 pool, this size can be small or even negative, so we additionally require that the estimated pool size include at least 200 more documents to judge. We then estimate the depth for a pool of that size to be 50 · poolsize/|J|. Once this depth is determined, we pool the runs to that depth, and compute the sampling rate as 200 divided by the number of unjudged documents in that pool.

The maximum depth of the sample for the 2006 terabyte topics varies from 57 to 1252, with an average of 314. Topics with very few relevant documents in the depth-50 pool have a shallow sample depth and a sampling rate close to 100%, because the best chance of finding relevant documents is in the next few ranks. Topics with many relevant documents in the pool have a deep sample depth and a lower sampling rate.

The sample has an average of 492 documents per topic, of which 36 are relevant. This includes, on average, 211 documents per topic that were not in the depth-50 pool, of which 14 are relevant; that is close to the target of 20 new relevant documents per 200 new judgments, showing that our estimation procedure was in fact quite accurate given the simplistic fit. We found at least one new relevant document for 46 out of 50 topics, and 20 or more for 10 topics.
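As a concrete illustration, the following Python sketch implements the per-topic depth and rate estimation described above. It is a minimal sketch under stated assumptions, not the track's actual code: the slope value is a placeholder (the paper fixes a slope but does not report its value), and pool_unjudged is a hypothetical helper that reports how many unjudged documents a pool taken to a given depth would contain.

    def estimate_sample_params(pool_size, num_rel, pool_unjudged,
                               rpt=20.0 / 200.0, slope=0.1, new_judgments=200):
        """Estimate one topic's maximum sample depth and sampling rate.

        pool_size     -- |J|, size of the topic's depth-50 pool
        num_rel       -- |R|, relevant documents found in that pool
        pool_unjudged -- hypothetical helper: number of unjudged documents
                         in a pool taken to a given rank depth
        slope         -- assumed drop in P(rel) per depth-50 pool's worth
                         of documents (value not reported in the paper)
        """
        p_rel = float(num_rel) / pool_size              # P(rel) = |R| / |J|
        # Point where the linearly extrapolated relevant fraction falls to rpt.
        x = (rpt - slope - p_rel) * pool_size / (-slope)
        # Pool size expected to yield 20 new relevant per 200 newly judged
        # documents, but always leave at least `new_judgments` more to judge.
        poolsize = max(pool_size + 2 * (x - pool_size), pool_size + new_judgments)
        depth = int(round(50.0 * poolsize / pool_size))  # pool size -> rank depth
        rate = min(1.0, new_judgments / float(pool_unjudged(depth)))
        return depth, rate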
2. RANKING DIFFERENCES

The systems were scored using mean average precision (MAP) with the pooled judgments, and inferred average precision (infAP) [3] with the sampled judgments. InfAP is an estimate of average precision. When judgments are complete, infAP and MAP are equal. In the presence of unjudged documents, if they were in the original pool but were not sampled for judging, infAP estimates the precision at those ranks using the precision at earlier ranks. Otherwise, they are treated as nonrelevant, as in MAP.

Figure 1 plots each run's MAP score based on the depth-50 pool against its infAP score based on the sampled judgments. For all but four runs, the infAP score is lower than the MAP score. The Kendall's τ between the two rankings is 0.8, implying that the rankings have notable differences despite being highly correlated. Since those topics with few relevant documents in the depth-50 pool are represented by a nearly 100% sample, any difference must come from those topics where we sampled deeply. The most likely reason is that very highly ranked documents were missed. If a relevant document is retrieved at rank 1, it has a very large effect on MAP, but if that document is not sampled, infAP will necessarily be less than MAP. This problem is particularly acute in topics with a very low sampling rate.

[Figure 1: MAP vs. infAP scores. Each point plots a run's MAP (x-axis, depth-50 pool) against its infAP (y-axis, sampled judgments).]

To measure any effect that random sampling itself might have on infAP scores, we drew 100 random subsamples of the depth-50 pool, and used these qrels subsets to score the runs using infAP. The samples were drawn at the same rates as were used above, but we sampled within the pool only, so that every sampled document would already have a judgment. An analysis of variance of the infAP scores as a function of topic and sample showed that the sample was not a significant effect (at α = 0.05), but that topic was significant for 45 out of 50 topics. Incorporating the runs into the model is complicated, simply because runs normally vary in effectiveness in a topic-dependent fashion. We computed a second ANOVA of infAP score by topic and sample within each run. For two runs, sample was moderately significant (p = 0.016 and 0.014); these were two runs from the same group, and their maximum average infAP scores in any sample were 0.1002 and 0.0866. From this, we conclude that the variance across samples such as those we are drawing should not be a worry.
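To make the subsampling procedure concrete, here is a minimal Python sketch of drawing one such random subset of the judged pool. The qrels and rates dictionaries are assumed data structures (topic-to-judgments and topic-to-rate mappings, with the per-topic rates from Section 1), not the track's actual code.

    import random

    def draw_qrels_subsample(qrels, rates, seed=None):
        """Draw one random subsample of an already-judged pool.

        qrels -- {topic: {docid: relevance}} judgments for the full depth-50 pool
        rates -- {topic: sampling rate in (0, 1]}, the per-topic rates from Sec. 1
        """
        rng = random.Random(seed)
        return {topic: {doc: rel for doc, rel in judgments.items()
                        if rng.random() < rates[topic]}
                for topic, judgments in qrels.items()}

    # For the variance study, 100 such subsamples would each be used to
    # score every run with infAP:
    # subsamples = [draw_qrels_subsample(qrels, rates, seed=i) for i in range(100)]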
3. REDUCED BIAS

Another goal of sampling deeply was to try to find relevant documents different from those high in the rankings. Buckley et al. observed that in very large collections, pools can be dominated by documents containing the title words of the topics; this bias in the judged documents could be unfair towards retrieval approaches that do not focus exclusively on the topic title. They proposed a measure, titlestat rel, which measures how often the average title word occurs in the judged relevant documents [2].

The titlestat rel of the depth-50 pool and the sample are 0.93 and 0.899 respectively. When we consider just the sampled documents below rank 50, the titlestat rel is 0.851. A topic-by-topic analysis reveals that the sampling scheme indeed found lower-titlestat documents when we sampled deeply, but not in every topic, and sometimes they were found without needing to search so deeply. This illustrates that title-word bias has a strong topic effect. For some topics, the title words really are the best indicators of document relevance. For others, there are other useful words not in the topic title.

4. REUSABILITY

We next looked to see whether the sampled judgments are any more or less reusable than the depth-50 pooled judgments. When a group does not contribute to the pool, documents retrieved only by their system are never judged, resulting in a less accurate score. We removed each group's runs from both the pool and the sample, and measured the score difference and movement in the ranking for those runs. The following table shows that a held-out system's MAP and infAP scores change by a similar amount, but the infAP ranking changes less, indicating that the sampled judgments are more reusable.

    Qrels      Measure   Maximum abs. diff.   Maximum rank change
    Depth-50   MAP       0.03                 -15 / +2
               bpref     0.13                 -0 / +29
    Sample     infAP     0.02                 -8 / +2

5. FUTURE WORK

We have a number of unanswered questions. Are the sampled judgments "more fair" to future runs than the pooled judgments? This is actually quite a difficult question to answer definitively. In smaller TREC collections, we have a sufficiently complete set of judgments that can be thought of as "truth", but no such set exists for the terabyte collections. Furthermore, it is hard to know for sure whether the sample is less biased than the pool, because of the large topic effect in titlestat. Whether we can sample deeply to overcome bias depends on whether the bias exists for the topic.

Is there a simple, optimal sampling strategy that balances meaningful measures of effectiveness with reusability and low bias? Uniform random sampling is not usable because it will not select enough relevant documents at any reasonable sampling rate. If the sampling rate is too sparse, it is likely we will miss judging documents from the first two or three ranks, which are critical to MAP. Our relevance-based sampling strategy suffers from this. On the other hand, sampling strategies such as [1] focus too strongly on these early ranks, and as such fall prey to title-word bias. One option would be to always judge the first one or two ranks, then sample. Our depth-50 pool, while good for estimating MAP, is almost certainly deeper than we needed to pick a sampling depth. Additionally, at the "deep end" of the sample, we cannot be certain that we have located enough low-titlestat documents to make the collection sufficiently more fair. Currently, we are investigating whether stratified sampling strategies can solve these problems.

6. REFERENCES

[1] J. A. Aslam, V. Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In Proceedings of SIGIR 2006, pages 541-548, Seattle, Washington, July 2006.
[2] C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. Bias and the limits of pooling. In Proceedings of SIGIR 2006, pages 619-620, Seattle, Washington, July 2006.
[3] E. Yilmaz and J. A. Aslam. Estimating average precision with incomplete and imperfect judgments. In Proceedings of CIKM 2006, pages 102-111, Arlington, Virginia, November 2006.