Modelling Epistemic Uncertainty in IR Evaluation

Murat Yakici, Mark Baillie, Ian Ruthven
Dept. of Computer and Information Sciences, University of Strathclyde, UK
{my,mb,ir}@cis.strath.ac.uk

Fabio Crestani
Faculty of Informatics, University of Lugano, Switzerland
fabio.crestani@unisi.ch

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance Evaluation
General Terms: Measurement, Reliability
Keywords: Performance Evaluation, Metrics, Uncertainty

1. MOTIVATION

Modern information retrieval (IR) test collections violate the completeness assumption of the Cranfield paradigm [2]. In order to make the most of the available resources, only a sample of documents (i.e. the pool) is judged for relevance by human assessors. The subsequent evaluation protocol makes no distinction between assessed and unassessed documents: documents that are not in the pool are assumed to be not relevant for the topic. This is beneficial from a practical point of view, as relative performance can be compared with confidence when the experimental conditions are fair for all systems. However, given the incompleteness of the relevance assessments, two forms of uncertainty emerge during evaluation. The first is aleatory uncertainty, which refers to the variation in system performance across the topic set and is commonly addressed through statistical significance tests. The second is epistemic uncertainty, which refers to the amount of knowledge (or ignorance) we have about the estimate of a system's performance [1]. Epistemic uncertainty is a consequence of incompleteness and is not addressed by the current evaluation protocol.

In this study, we present a first attempt at modelling both the aleatory and the epistemic uncertainty associated with IR evaluation. We aim to account for both the variability in system performance and the amount of knowledge we have about the performance estimate.

2. MODELLING UNCERTAINTY

IR evaluation can be thought of as a probabilistic reasoning and decision process. The conventional Bayesian decision process represents knowledge in the form of fully specified distributions, even when the source of information about this knowledge is weak. In this convention, P(A) + P(¬A) = 1, where P(A) is the probability in favour of a proposition and P(¬A) the probability of the opposing argument. However, the information used in this process can be imprecise and/or ambiguous. With respect to IR evaluation, the precision of a performance metric can be modelled specifically as the proportion of documents assessed for relevance in a system's ranked list, which we define as assessment [1]. By accounting for the epistemic uncertainty associated with incomplete collections, we model both performance and our knowledge of the experimental conditions when comparing systems. Rather than testing whether "system A is better than system B under certain conditions", we express these uncertainties with respect to the performance of both system A and system B. The Dempster-Shafer theory of evidence is a framework for representing this situation.
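To make the notion of assessment concrete, the following sketch computes the proportion of a ranked list that has been judged at a given measurement depth. It is only an illustration under our own assumptions: the function and variable names are ours, not the paper's, and real runs and judgements would be read from the standard TREC run and qrels formats.

def assessment(ranked_docs, judged_docs, depth):
    """Proportion of the top-`depth` retrieved documents that have been
    judged (relevant or non-relevant); the remainder is unassessed."""
    top = ranked_docs[:depth]
    if not top:
        return 0.0
    return sum(doc in judged_docs for doc in top) / len(top)

# Illustrative toy data, not taken from any actual TREC run.
run = ["d3", "d7", "d1", "d9", "d4"]      # ranked list for one topic
judged = {"d3", "d1", "d4"}               # documents pooled and judged for that topic
print(assessment(run, judged, depth=5))   # -> 0.6, i.e. 40% of the list is unassessed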
Dempster-Shafer theory (DS). DS provides a flexible way of representing and reasoning with imperfect information [3], and offers a foundation for a non-Bayesian theory of probability with which epistemic uncertainty can be represented. Instead of a Bayesian distribution, probabilities are represented by a belief function and are assigned to sets rather than to single events. Supporting evidence is therefore summarised by a belief interval that shows how close the evidence comes to determining the truth of a hypothesis. Owing to its set-theoretic nature, the framework can express varying levels of precision, and evidence from different sources can be combined without the need for any a priori distributions.

DS introduces the concept of a frame of discernment, the set of elements we are interested in, Θ = {θ1, θ2, ..., θn}. Its power set 2^Θ contains all 2^n subsets of Θ. The mass function m maps the power set to the interval [0, 1], and the masses of all subsets of the power set sum to 1 (Equation 1). Although several interpretations exist in the literature, the mass function does not refer to probability in the classical sense.

m : 2^Θ → [0, 1],  m(∅) = 0,  and  Σ_{A ⊆ Θ} m(A) = 1.    (1)

A probability assignment to every subset of Θ may not always be possible. In this case, an uncertainty measure can be modelled on the set Θ itself. This assignment can be interpreted as the amount of information that we are unable to allocate to any of the subsets or hypotheses. In the Bayesian viewpoint, such mass has only one interpretation: the opposing proposition.

DS defines two important representations of the uncertainty associated with a set of possible hypotheses: a Belief and a Plausibility function. Belief measures the extent to which all the evidence implies that the true value is contained in the set under consideration; in short, it is the smallest probability that is consistent with, and supported by, all available information. Conversely, Plausibility measures the extent to which the evidence implies that the true value might be in the set under consideration; it is the largest probability that is consistent with all available information and cannot be disproved. These functions may be viewed as lower and upper bounds on probability, respectively, with the actual probability contained in the interval they describe. The lower bound Bel(A) is the sum of the basic probability assignments of all subsets B of the set of interest A, and the upper bound Pl(A) is the sum of the basic probability assignments of all sets B that intersect A. Formally, for all A ⊆ Θ,

Bel(A) = Σ_{B ⊆ A} m(B),    (2)
Pl(A) = Σ_{B : A ∩ B ≠ ∅} m(B).    (3)

One can be obtained from the other via Pl(A) = 1 − Bel(¬A), and the amount of ignorance is Ign(A) = Pl(A) − Bel(A). We can therefore write Bel(A) + Bel(¬A) ≤ 1, which is one of the main differences between DS and classical probability.
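As a minimal illustration of Equations 1-3, the sketch below represents a mass function as a dictionary keyed by focal sets and computes Bel, Pl and the ignorance Ign directly from the definitions. The representation and the numbers are ours, chosen only for illustration; they are not part of the paper's model.

def bel(mass, a):
    """Belief (Eq. 2): total mass of the focal sets contained in `a`."""
    return sum(m for b, m in mass.items() if b <= a)

def pl(mass, a):
    """Plausibility (Eq. 3): total mass of the focal sets intersecting `a`."""
    return sum(m for b, m in mass.items() if b & a)

def ign(mass, a):
    """Ignorance: width of the belief interval, Pl(a) - Bel(a)."""
    return pl(mass, a) - bel(mass, a)

# A mass function over the frame of discernment {x, y, z}; keys are focal
# sets (frozensets), values are basic probability assignments.
mass = {frozenset({"x"}): 0.5,
        frozenset({"y", "z"}): 0.3,
        frozenset({"x", "y", "z"}): 0.2}
assert abs(sum(mass.values()) - 1.0) < 1e-9   # masses must sum to 1 (Eq. 1)

a = frozenset({"x", "y"})
print(bel(mass, a), pl(mass, a), ign(mass, a))   # -> 0.5 1.0 0.5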
The theory also allows two or more bodies of independent evidence defined within the same frame of discernment to be combined into one body of evidence. Dempster's rule of combination is defined as

m(A) = (m1 ⊕ m2)(A) = [ Σ_{B ∩ C = A} m1(B) m2(C) ] / [ 1 − Σ_{B ∩ C = ∅} m1(B) m2(C) ].    (4)

DS in IR Evaluation. In IR evaluation, incompleteness is caused by a lack of evidence and thus gives rise to uncommitted belief. This uncommitted mass is the probability that belongs to neither the relevant nor the non-relevant set. We regard it as the uncertainty over the experiments, which should be taken into account when comparing system performance. We therefore model the relevant, non-relevant and unassessed portions of the result sets produced by a system A run over a topic set T = {t1, ..., tn}. These result sets can be treated as different and independent sources of evidence during the evidence combination phase.

We now give a concrete example of how uncertainty can be modelled when evaluating the performance of a system. Let Θ = {R, N}, where R represents the relevant portion of the collection that system A returns for a topic t and N the non-relevant portion (see Table 1). The masses m({R}) and m({N}) together represent our committed belief, i.e. the amount of assessment (the percentages of R and N retrieved). Conversely, m({R, N}), assigned to the frame of discernment itself, represents the amount of unassessment (uncommitted belief).

Table 1: Description of the example: mass assignment to the subsets of Θ

  A         m(A)   Bel(A)   Pl(A)   Ign(A)
  {R}       0.1    0.1      0.4     0.3
  {N}       0.6    0.6      0.9     0.3
  {R, N}    0.3    1.0      1.0     0.0

The belief intervals [Bel(A), Pl(A)] for each argument are then:

[Bel({R}), Pl({R})] = [0.1, 0.4]
[Bel({N}), Pl({N})] = [0.6, 0.9]
[Bel({R, N}), Pl({R, N})] = [1, 1]
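To illustrate Equation 4 and the treatment of per-topic result sets as independent sources of evidence, the sketch below combines two hypothetical mass functions over Θ = {R, N} using Dempster's rule. The topic-level numbers are invented purely for illustration; how masses are actually derived from a run's assessed and unassessed portions is described in [1], not by this sketch.

def combine(m1, m2):
    """Dempster's rule of combination (Eq. 4) for two independent mass
    functions defined over the same frame of discernment."""
    combined, conflict = {}, 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc            # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict and cannot be combined")
    return {a: m / (1.0 - conflict) for a, m in combined.items()}

R, N, RN = frozenset({"R"}), frozenset({"N"}), frozenset({"R", "N"})

# Hypothetical per-topic evidence: mass on the assessed relevant, assessed
# non-relevant, and unassessed (uncommitted) portions, respectively.
m_topic1 = {R: 0.2, N: 0.5, RN: 0.3}
m_topic2 = {R: 0.4, N: 0.3, RN: 0.3}

m_12 = combine(m_topic1, m_topic2)
print(round(m_12[R], 3), round(m_12[N], 3), round(m_12[RN], 3))   # -> 0.351 0.527 0.122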
Experiments. We now demonstrate the utility of DS by comparing two systems on the Robust 2005 TREC collection. An analysis by Buckley et al. [2] highlighted a potential bias in this collection towards documents containing the topic title keywords, even though relevant documents exist that do not contain these terms. For example, the sab05ror1 run, which used existing relevance judgements to generate an optimal query, showed a large change in performance when the documents uniquely pooled by that system were removed from the relevance assessments. In this demonstration (see Figure 1), we compare this system against another run, pircRB05t2, which used external corpora to expand the title query. The aim is to analyse whether our evaluation model can expose potential problems with this comparison by identifying lower and upper boundaries on performance. The performance of sab05ror1 appears to be better than that of pircRB05t2. However, as we compare performance over an increasing measurement depth, the interval between the lower and upper bounds starts to spread, particularly beyond the pool depth. The amount of information about the systems' performances decreases, which is a potential indication of epistemic uncertainty.

Figure 1: Lower and upper boundaries for both systems across the measurement depth.

We are currently completing an in-depth study of this framework, comparing a large number of systems across various test collections and metrics, and extending the model to graded relevance assessments.

3. REFERENCES

[1] M. Baillie, L. Azzopardi, and I. Ruthven. Evaluating epistemic uncertainty under incomplete assessments. Information Processing and Management, to appear, 2007.
[2] C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. Bias and the limits of pooling. In SIGIR '06: Proceedings of the 29th ACM SIGIR, pages 619-620, Seattle, 2006.
[3] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.