SIGIR 2007 Proceedings, Session 15: Evaluation II

On the Robustness of Relevance Measures with Incomplete Judgments

Tanuja Bompada, Chi-Chao Chang, John Chen, Ravi Kumar, Rajesh Shenoy
Yahoo!, 701 First Avenue, Sunnyvale, CA 94089
{tanuja,chichao,jianchen,ravikumar,rshenoy}@yahoo-inc.com

ABSTRACT
We investigate the robustness of three widely-used IR relevance measures for large data collections with incomplete judgments. The relevance measures we consider are the bpref measure introduced by Buckley and Voorhees [7], the inferred average precision (infAP) introduced by Aslam and Yilmaz [4], and the normalized discounted cumulative gain (NDCG) measure introduced by Järvelin and Kekäläinen [8]. Our main results show that NDCG consistently performs better than both bpref and infAP. The experiments are performed on standard TREC datasets, under different levels of incompleteness of judgments, and using two different evaluation methods, namely, the Kendall correlation between system rankings and pairwise statistical significance testing; the latter may be of independent interest.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Experimentation, Measurement

Keywords
Relevance Measures, Incomplete Judgments, Cranfield Methodology, bpref, Inferred Average Precision, infAP, Normalized Discounted Cumulative Gain, NDCG

1. INTRODUCTION
Human relevance judgments have been an integral, necessary ingredient for evaluating information retrieval systems. There are two chief concerns when collecting human judgments that directly affect the robustness of a relevance measure: (i) the reliability and correctness of the judgments, and (ii) the cost of collecting them. Correctness is largely addressed through expert judges with proper guidelines, judge training, and some level of redundancy, all of which inevitably drive up the cost of collecting judgments. This has been a perennial issue for information retrieval researchers, who have relied on the basic Cranfield model for relevance evaluation [16]. Seeking cost-effective relevance evaluation, recent papers have focused on relaxing the completeness assumption by judgment pooling [13, 18], by introducing more robust measures [7], by introducing alternate testing paradigms [2, 1], and by using implicit feedback such as click data [9] and eye-movement tracking [10]. With the advent of domain-specific search engines on the Internet that rely heavily on the relevance of their results to gain and sustain user market share, this cost (predominantly human cognitive load and time) can be prohibitively expensive, since frequently changing, large data collections are needed for evaluation.

One relevance measure that has recently received a lot of attention is the discounted cumulative gain (DCG) and its normalized version (NDCG) [8]. A key aspect of DCG is that it accommodates "graded" judgments. Graded judgments are often preferred to binary ones as a way to address judge subjectiveness, cognitive stress, and oftentimes plain confusion. Another important aspect of DCG is that judgments of higher-ranked documents weigh more than those of lower-ranked ones. This aspect is also very intuitive: in most information retrieval systems, particularly in modern search engines, the likelihood of a user noticing the top-ranked results is substantially higher than for lower-ranked results, and hence it is natural that their relevance judgments are weighted more heavily. How do these two aspects contribute to the robustness of NDCG in the presence of incompleteness? This paper addresses these questions by building upon previous research on measures designed for incomplete judgments, namely bpref [7] and infAP [4].
The bpref measure is inversely related to the fraction of judged non-relevant documents that are retrieved before relevant documents. Buckley and Voorhees showed that the Kendall correlation between system rankings using complete and incomplete judgments was higher for the bpref measure than for precision-based measures. InfAP is the inferred average precision, defined to be the average precision as the outcome of a random experiment; it estimates the average precision directly from the pool subsample. Aslam and Yilmaz [4] showed that infAP is better correlated with mean average precision (MAP) than bpref. From this, they concluded infAP to be a robust measure for incomplete and imperfect relevance judgments, assuming MAP is the gold standard for evaluating retrieval effectiveness.

In this paper we show that NDCG using binary judgments consistently performs better than both bpref and infAP. We use the experimental setting of Buckley and Voorhees [7] and base our experiments on the TREC-8, TREC-10, TREC-12, and TREC-14 datasets. We use their methodology to simulate incompleteness and to measure the Kendall distance between the orders of evaluated systems. Note that neither bpref nor infAP accounts for either rank adjustments or graded judgments.

To corroborate these findings, we use pairwise statistical significance tests [11, 12] to compare the accuracy and robustness of the metrics. We once again demonstrate that NDCG is a very accurate measure compared to both bpref and infAP at most of the incompleteness levels. The underlying simulation framework is similar to the one used above; the difference is that instead of computing the distance between the two lists of tested systems/configurations, we pair up two systems at random and perform a non-parametric statistical test to accept or reject the hypothesis that the two systems are equally relevant. The accuracy values are derived from the confusion matrix at various levels of incompleteness.

The paper is organized as follows. Section 2 introduces the evaluation measures, Section 3 briefly describes the TREC datasets we used in our study, Section 4 explains the simulation framework and the robustness measures (the Kendall correlation and the significance tests), Section 5 contains the experimental results, and Section 6 provides concluding remarks.

2. EVALUATION MEASURES
In this section we discuss the various evaluation measures that are prevalent in IR systems. The choice of an evaluation measure is crucial since it directly impacts how the effectiveness of a retrieval system is assessed. The problem is made more complicated by the availability of different retrieval measures with different desirable properties, ranging from easy interpretability to robustness under incomplete judgments. For a detailed discussion of which measures are appropriate for IR systems, the reader is referred to the chapter on evaluation in the book by van Rijsbergen [14], the chapter on retrieval evaluation in the book by Baeza-Yates and Ribeiro-Neto [5], and the paper by Buckley and Voorhees [7].

Below, we focus on the three evaluation measures of interest in this paper, namely, the bpref measure introduced by Buckley and Voorhees [7], the inferred average precision (infAP) introduced by Aslam and Yilmaz [4], and the normalized discounted cumulative gain (NDCG) introduced by Järvelin and Kekäläinen [8]. These measures are all inspired by the classical notions of precision and recall: to recap, precision captures how many retrieved documents are actually relevant to the given topic and recall captures what fraction of relevant documents were actually retrieved.

To define these measures precisely, we first introduce some notation. Given a topic, let X = x_1, x_2, ... be the ranked list of retrieved documents. Each document x ∈ X has two attributes: rank(x) denotes the position of the document x in the retrieved list and rel(x) denotes the relevance of the document. When we work with binary judgments, we assume rel(x) ∈ {0, 1}, where rel(x) = 0 denotes that x is non-relevant to the topic and rel(x) = 1 denotes that x is relevant to the topic. When we work with graded judgments, we use the value of rel(x) > 0 to indicate the degree of relevance. When we work with incomplete judgments, we permit rel(x) = ∅, meaning that x has not been judged against the topic.
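To make this notation concrete for the sketches that follow, here is a minimal Python representation. It is our illustration, not the authors' code: a run is an ordered list of document ids, and the qrels map a document id to a graded judgment, with unjudged documents simply absent from the map (playing the role of rel(x) = ∅). The names Run, Qrels, example_run, and example_qrels are hypothetical.

```python
from typing import Dict, List, Optional

# A run for one topic: document ids in ranked order (rank 1 first).
Run = List[str]

# Qrels for one topic: doc id -> graded judgment (0 = non-relevant,
# 1 = relevant, 2 = highly relevant).  Unjudged documents are absent,
# which corresponds to rel(x) = "unjudged" in the text.
Qrels = Dict[str, int]

def rel(doc: str, qrels: Qrels) -> Optional[int]:
    """Return the judgment for doc, or None if it was never judged."""
    return qrels.get(doc)

# Toy data reused by the later sketches.
example_run: Run = ["d3", "d7", "d1", "d9", "d4"]
example_qrels: Qrels = {"d3": 1, "d7": 0, "d9": 1, "d4": 0}  # d1 is unjudged
```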
Bpref. Buckley and Voorhees introduced the bpref measure for evaluating systems under incomplete judgments [7]. The idea is to use binary relevance judgments and to measure the effectiveness of an IR system on the basis of judged documents only. Bpref is a function of the number of times judged non-relevant documents are retrieved before relevant documents; the name reflects the fact that it uses binary relevance judgments to define a preference relation. Using the notation set up earlier, the formal definition of bpref is as follows. Let R be the set of judged relevant documents for the topic and N the set of judged non-relevant documents for the topic. For a relevant document x and a threshold position p, let N(x, p) be the set of the first p judged non-relevant documents that are ranked above x:

  N(x, p) = \{\, y : \mathrm{rel}(y) = 0 \wedge \mathrm{rank}(y) \le p \wedge \mathrm{rank}(y) < \mathrm{rank}(x) \,\}.

Then,

  \mathrm{bpref}(X) = \frac{1}{|R|} \sum_{x:\, \mathrm{rel}(x) > 0} \left( 1 - \frac{|N(x, |R|)|}{\min(|R|, |N|)} \right).

Note that this definition ensures that bpref always lies between 0 and 1.
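The following Python sketch computes bpref for a single topic along the lines of the formula above, using the toy representation sketched earlier. It is our illustration, not the authors' implementation; the cap at |R| on the count of non-relevant documents mirrors N(x, |R|).

```python
def bpref(run, qrels):
    """bpref for one topic; a sketch following the formula in the text."""
    R = sum(1 for j in qrels.values() if j > 0)   # judged relevant
    N = sum(1 for j in qrels.values() if j == 0)  # judged non-relevant
    if R == 0 or N == 0:
        return 0.0
    nonrel_above = 0   # judged non-relevant docs seen so far in the ranking
    total = 0.0
    for doc in run:
        j = qrels.get(doc)
        if j is None:       # unjudged documents are ignored
            continue
        if j == 0:
            nonrel_above += 1
        else:               # relevant document: add its contribution
            total += 1.0 - min(nonrel_above, R) / min(R, N)
    return total / R

print(bpref(example_run, example_qrels))
```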
InfAP. Aslam and Yilmaz proposed the inferred average precision (infAP), a measure that collapses to average precision when the judgments are complete [4, 3]. The estimated precision at a particular position is the sum of the contribution of the document being relevant at that position and the expected precision of the documents ranked above that position. They show infAP to be highly correlated with average precision and also robust to incomplete and imperfect relevance judgments. InfAP is computed as follows. First, the expected precision for a relevant document x is computed as

  \mathrm{eprec}(x) = \frac{1}{\mathrm{rank}(x)} + \frac{|J(x)|}{\mathrm{rank}(x)} \cdot \frac{|R(x)| + \epsilon}{|R(x)| + |N(x)| + 2\epsilon},

where J(x) is the set of documents in the judged pool that are ranked above x, i.e., J(x) = \{ y : \mathrm{rel}(y) \ne \emptyset \wedge \mathrm{rank}(y) < \mathrm{rank}(x) \}, R(x) and N(x) are the sets of judged relevant and judged non-relevant documents in J(x), and ε is a small value added to avoid the 0/0 case. InfAP is then the average of these estimated precisions over the relevant documents:

  \mathrm{infAP}(X) = \frac{1}{|R|} \sum_{x:\, \mathrm{rel}(x) > 0} \mathrm{eprec}(x).
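A corresponding sketch for infAP, again our own hedged reading of the estimator above: it accumulates the judged, judged-relevant, and judged-non-relevant counts above each rank and averages the expected precisions over the retrieved judged relevant documents.

```python
def inf_ap(run, qrels, eps=1e-6):
    """Inferred AP for one topic; a sketch of the estimator described above."""
    R = sum(1 for j in qrels.values() if j > 0)  # judged relevant for the topic
    if R == 0:
        return 0.0
    judged_above = rel_above = nonrel_above = 0
    total = 0.0
    for k, doc in enumerate(run, start=1):       # k = rank(x)
        j = qrels.get(doc)
        if j is not None and j > 0:
            # expected precision at rank k, using counts strictly above rank k
            eprec = 1.0 / k + (judged_above / k) * (
                (rel_above + eps) / (rel_above + nonrel_above + 2 * eps))
            total += eprec
        if j is not None:                        # update the "above" counts
            judged_above += 1
            rel_above += 1 if j > 0 else 0
            nonrel_above += 1 if j == 0 else 0
    return total / R
```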
Normalized DCG. The discounted cumulative gain (DCG) measure was introduced by Järvelin and Kekäläinen to take into account the grades of relevance (highly relevant vs. marginally relevant) and the actual ranked position of a relevant document [8]. The first aspect is captured by using the grade of relevance directly in the measure, and the second aspect is captured by discounting the value of a document as its rank increases. Informally, DCG is a summation of the scaled inverse log rank over all relevant documents. While traditionally DCG is defined as a vector of values, we focus only on the ℓ-th entry of the vector by defining DCG as follows:

  \mathrm{dcg}(X, \ell) = \sum_{x:\, \mathrm{rel}(x) > 0,\ \mathrm{rank}(x) \le \ell} \frac{\mathrm{rel}(x)}{\log(\mathrm{rank}(x) + 1)}.

When we use a binary grade, i.e., the grade value for a relevant document is 1 and the grade value for a non-relevant document is 0, the DCG measure degenerates to

  \mathrm{binary\text{-}dcg}(X, \ell) = \sum_{x:\, \mathrm{rel}(x) > 0,\ \mathrm{rank}(x) \le \ell} \frac{1}{\log(\mathrm{rank}(x) + 1)}.

The DCG measure is normalized by the ideal ranking of the documents for a query. To do this, we compute the DCG up to position ℓ for all the relevant documents (across multiple systems) ranked ideally; this constitutes the idealized DCG (IDCG). Normalized DCG (NDCG) is then DCG divided by IDCG. Normalizing DCG eliminates large values for topics with many relevant documents and allows meaningful averaging across all the topics for a system; comparing multiple systems necessitates such a normalization. Unless otherwise stated, we use binary DCG for most of the paper. We also set ℓ = 1000, so that we perform our analysis on the top 1000 results for the TREC datasets.

For the infAP, bpref, and NDCG measures, we ignore documents x that are not judged, i.e., rel(x) = ∅, under the various levels of incompleteness; by default such documents are treated as non-relevant. For infAP and NDCG, when the relevance of every document retrieved by a system is unknown, the value of the measure is zero. For bpref, we use the same restriction as in [7], i.e., for each topic there has to be at least one judged relevant document and ten judged non-relevant documents.
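Before moving on to the datasets, here is a sketch of DCG and NDCG as used in this paper (gain equal to rel(x), log discount, cutoff ℓ). This is our illustration; the base-2 logarithm and the construction of the ideal list by sorting judged relevant documents by grade are assumptions, since the text does not pin either down.

```python
import math

def dcg(run, qrels, ell=1000):
    """Graded DCG at cutoff ell, following dcg(X, ell) above.
    Log base 2 is an assumption; the text does not fix the base."""
    score = 0.0
    for k, doc in enumerate(run[:ell], start=1):
        gain = qrels.get(doc) or 0           # unjudged or non-relevant -> 0
        if gain > 0:
            score += gain / math.log2(k + 1)
    return score

def ndcg(run, qrels, ell=1000):
    """NDCG = DCG / IDCG, where the ideal list ranks the judged relevant
    documents in decreasing order of grade (our assumed construction)."""
    ideal = sorted((d for d, j in qrels.items() if j > 0),
                   key=lambda d: qrels[d], reverse=True)
    idcg = dcg(ideal, qrels, ell)
    return dcg(run, qrels, ell) / idcg if idcg > 0 else 0.0
```

With binary judgments, the same functions compute binary-dcg and binary NDCG, since every relevant document then has gain 1.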
3. DATASETS
We use the three TREC test collections that were used in the Buckley and Voorhees paper [7], together with the TREC-14 collection. In our context, a test collection consists of a set of statements of information need called topics, a set of documents, and a set of relevance judgments for the documents. Runs are submitted to the corresponding TREC track, where each run corresponds to a retrieval method. Relevance measures are applied to the runs to determine which retrieval method has the best relevance.

We used these test collections for three reasons. First, they are the largest available test collections with pooled relevance judgments. Second, by using these test collections, we can quantitatively compare our methods to those in [7]. Last, these test collections exhibit a range of diverse properties, and it is important that an evaluation measure be tested against a wide spectrum of different data. TREC-8 is the ad hoc task, TREC-10 is the ad hoc task within TREC 2001 (the web track), TREC-12 is the robust track, and TREC-14 is the enterprise search track.

Table 1 shows the properties of the test collections. Besides the obvious statistics such as the number of documents and the size of the collection, the table also shows the number of topics and the average number of relevant documents per topic. The last two columns show the number of runs (retrieval methods) and the number of participating groups that submitted the runs. In selecting the runs, we performed the same preprocessing as [7]: we discarded runs that retrieved less than 95% of the maximum number of documents that could be retrieved. For details, see [7].

                documents          topics          runs
  TREC coll.  # (x10^6)   Gb     #   avg rels     #   groups
  8             0.528     1.9    50     94.6     124     38
  10            1.7      10.0    50     67.3      77     26
  12            0.528     1.9   100     60.7      73     16
  14            0.331     5.7    60     58.4      57     18

  Table 1: Properties of the datasets used in the experiments.

Note that the TREC-8 collection has the most runs associated with it as well as the most participating groups. The TREC-10 collection has the largest document set, and the TREC-12 collection has the largest topic set. We also include the TREC-14 collection because it has more highly relevant than (merely) relevant documents: specifically, it has 2004 highly relevant judgments, 1441 somewhat relevant judgments, and 27813 non-relevant judgments. It is used to study the effect of multiple grades of relevance on DCG.

4. EXPERIMENTAL FRAMEWORK
4.1 Constructing incomplete judgments
Recall that we are interested in evaluating the measures as more and more of the judgments become incomplete. To do this, we adopt the framework of [7], which we outline below. The judgment sets are called qrels, and those in the TREC collection are considered to be 100% complete, since they have no incomplete judgments; we call this the 100%-qrels. Let R (resp., N) be a random permutation of the relevant (resp., non-relevant) documents in the 100%-qrels, and let f be a specified fraction. For each topic in the 100%-qrels, we construct a qrels corresponding to f as follows: we select max(1, f·|R|) documents from the prefix of R and max(10, f·|N|) documents from the prefix of N; these documents constitute the qrels corresponding to f, called the f-qrels. By construction, if f > f′, then the f-qrels is a superset of the f′-qrels. We do this for each f in the set F = {0.01, 0.02, . . . , 0.05, 0.1, 0.2, . . . , 0.9}. This results in 17 qrels for each dataset. It is well known that all evaluation measures are non-robust when the number of relevant documents is very small [6]; therefore, the results are less meaningful for very small values of f.
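The construction of the f-qrels can be summarized by the following sketch (ours, under the stated reading of the scheme). The fixed seed stands in for the single random permutation per topic, which is what makes the f-qrels nested across values of f; the max(1, ·) and max(10, ·) floors mirror the restriction of at least one relevant and ten non-relevant judged documents.

```python
import random

def make_f_qrels(full_qrels, f, seed=0):
    """Build the f-qrels for one topic from the 100%-qrels: fixed random
    permutations of the relevant and non-relevant documents, truncated to
    prefixes whose lengths grow with f (so larger f gives a superset)."""
    rng = random.Random(seed)                 # one fixed permutation per topic
    rel_docs = [d for d, j in full_qrels.items() if j > 0]
    nonrel_docs = [d for d, j in full_qrels.items() if j == 0]
    rng.shuffle(rel_docs)
    rng.shuffle(nonrel_docs)
    n_rel = max(1, int(f * len(rel_docs)))
    n_nonrel = max(10, int(f * len(nonrel_docs)))
    kept = rel_docs[:n_rel] + nonrel_docs[:n_nonrel]
    return {d: full_qrels[d] for d in kept}
```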
4.2 System rankings
To make a fair comparison with bpref, we adopt the simulation framework and the system-ranking comparison used in [7], which we briefly outline here. A system ranking is defined as an ordered list of runs sorted by decreasing evaluation-measure score. The similarity of two rankings is measured by the Kendall distance between them. Recall that the Kendall distance is the number of pairwise disagreements between the two rankings: formally, if σ and τ are two rankings, then the Kendall distance between them is

  K(\sigma, \tau) = |\{\, (i, j) : \sigma(i) < \sigma(j) \wedge \tau(i) > \tau(j) \,\}|.

Equivalently, the Kendall distance is the minimum number of pairwise adjacent swaps needed to turn one ranking into the other. The Kendall distance can be normalized to a Kendall correlation between -1 and 1, which has the nice property that the expected correlation between two randomly chosen rankings is 0. Kendall correlations computed over the same set of runs can be meaningfully compared. In earlier work, Voorhees [15] considers two rankings to be equivalent if their normalized Kendall correlation is at least 0.9 and to have noticeable differences if the correlation is less than 0.8; we adopt a similar approach. It therefore makes sense to compare the system rankings for the f-qrels corresponding to increasing values of f; the plot of this curve shows the Kendall correlation between the rankings as a function of f, the fraction of complete judgments.

4.3 Significance tests
We also propose a new way of testing the robustness of measures under incomplete judgments, based on statistical hypothesis testing on pairs of systems. For a pair of runs, the null hypothesis H0 is that the relevance measures (bpref, infAP, NDCG) are equal for both runs. If the confidence in this hypothesis, reported as a p-value, is at most 0.05 (i.e., at the 95% confidence level), then the null hypothesis is rejected; that is, the two runs have a statistically significant difference according to the relevance measure. To test the hypothesis, we calculate the value of a relevance measure for each topic, so that for each run we have a vector of relevance-measure values over all topics. The vectors from two runs are compared using the standard paired Wilcoxon test to determine the p-value [17]. The framework for simulating incompleteness is kept similar to the one used for computing the Kendall correlation: a pair of systems is randomly chosen and the judgments are looked up in the f-qrels. This methodology maintains uniform recall across different systems and evaluates the systems based on a common pool of data.

Confusion matrix. We construct a confusion matrix based on the p-values from the hypothesis tests at the 100%-qrels and at the f-qrels for each value of f ∈ F. For each such f, there are four cases, depending on whether H0 is accepted at the 100%-qrels and at the f-qrels; this results in the 2 × 2 confusion matrix shown in Table 2. The false positive errors (H0 is rejected at the 100%-qrels but accepted at the f-qrels) are given by the entry C(2,1); the false negative errors (H0 is accepted at the 100%-qrels but rejected at the f-qrels) are given by C(1,2).

                               H0 accepted at f-qrels   H0 rejected at f-qrels
  H0 accepted at 100%-qrels           C(1,1)                   C(1,2)
  H0 rejected at 100%-qrels           C(2,1)                   C(2,2)

  Table 2: The confusion matrix.

The accuracy is the fraction of the total number of cases that were predicted correctly:

  \mathrm{accuracy} = \frac{C(1,1) + C(2,2)}{C(1,1) + C(1,2) + C(2,1) + C(2,2)}.

The accuracy may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases, so we also use the geometric mean (g-mean) as an alternative metric. The true positive rate (TP) is the fraction of the cases in which H0 was accepted at the 100%-qrels that were predicted correctly,

  TP = \frac{C(1,1)}{C(1,1) + C(1,2)},

the predicted positive rate (P) is defined as

  P = \frac{C(1,1)}{C(1,1) + C(2,1)},

and the geometric mean is

  \mathrm{g\text{-}mean} = \sqrt{TP \times P}.

In this study, we use the accuracy measure and the geometric mean to compare bpref, infAP, and NDCG. We randomly select around 300 pairs of runs from a test collection and use the judgment pool from both runs over all topics as the 100%-qrels. We then create the same 16 additional smaller qrels as described above. For each measure, a confusion matrix is constructed based on the comparisons at each qrels, and the accuracy values are then calculated from the confusion matrix.
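The per-pair test and the derived statistics can be sketched as follows. This is our illustration using SciPy's paired Wilcoxon test; `scores_a` and `scores_b` stand for the per-topic relevance-measure vectors of the two runs, and `confusion` for the 2x2 matrix of Table 2 (non-degenerate counts are assumed).

```python
from math import sqrt
from scipy.stats import wilcoxon   # paired, non-parametric signed-rank test

def same_system(scores_a, scores_b, alpha=0.05):
    """Accept H0 (the two runs are indistinguishable) unless the paired
    Wilcoxon test on the per-topic scores rejects it at level alpha."""
    if list(scores_a) == list(scores_b):   # identical vectors: trivially accept H0
        return True
    return wilcoxon(scores_a, scores_b).pvalue > alpha

def accuracy_and_gmean(confusion):
    """confusion[i][j]: i = H0 accepted(0)/rejected(1) at the 100%-qrels,
    j = H0 accepted(0)/rejected(1) at the f-qrels, as in Table 2.
    Assumes no row or column of interest sums to zero."""
    (c11, c12), (c21, c22) = confusion
    accuracy = (c11 + c22) / (c11 + c12 + c21 + c22)
    tp = c11 / (c11 + c12)            # true positive rate
    p = c11 / (c11 + c21)             # predicted positive rate
    return accuracy, sqrt(tp * p)     # g-mean = sqrt(TP * P)
```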
5. EXPERIMENTAL RESULTS
5.1 Stability of system rankings
The Kendall correlation between the system rankings at the 100%-qrels and at the f-qrels, for each f ∈ F, is plotted in a chart. Figure 1 shows the Kendall correlation values for the TREC-8, TREC-10, TREC-12, and TREC-14 datasets.

  Figure 1: Kendall correlation values for various TREC datasets.

These graphs show several interesting phenomena.
(1) InfAP consistently performs better than bpref for each of the f-qrels on the TREC-8, TREC-10, and TREC-12 datasets. This finding is in agreement with the findings of Aslam and Yilmaz [4]. The gap between infAP and bpref is smallest on the TREC-8 dataset and increases significantly on the TREC-12 and TREC-14 datasets.
(2) For all four datasets, the NDCG measure has the highest Kendall correlation for the majority of the f-qrels. Thus, the NDCG measure outperforms both the infAP and bpref measures, regardless of the dataset used and the value of f. NDCG has a correlation as high as 0.9 at low levels of completeness (the 40%-qrels). The gap is most pronounced for the noisy TREC-10 collection, suggesting that NDCG is more robust to the noise of the dataset. It is interesting to note that even at extremely low values of f (say, smaller than 0.05), and even for the noisiest dataset, TREC-10, the Kendall correlation value for NDCG is above 0.5.
(3) The curve for the NDCG measure is flatter than the curves for the other measures, for all the datasets. A flatter curve is preferable, since it means that the measure continues to rank different systems in the same relative order as under complete judgments even at higher levels of incompleteness, i.e., lower values of f. Flatness can be quantified by the lowest f value for which the Kendall correlation exceeds 0.9; we call this the knee of the curve. Table 3 shows the knee values for all the datasets and all the measures. It is easy to see that the knee for NDCG is consistently at least as small as those of bpref and infAP.

             bpref   infAP   NDCG
  TREC-8      0.4     0.3     0.3
  TREC-10     1.0     0.7     0.4
  TREC-12     0.7     0.5     0.3
  TREC-14     0.4     0.4     0.3

  Table 3: Knee values for the Kendall correlation curves for different measures on the datasets.

5.2 Accuracy of significance tests
As mentioned earlier, for each measure we construct a confusion matrix based on comparisons of 300 randomly selected pairs of runs at each f-qrels. The accuracy values are then calculated from the confusion matrix and plotted. Table 4 provides an example of a confusion matrix from TREC-12 and also illustrates a situation where C(2,2) is substantially larger than C(1,1). Figure 2 shows the accuracy values for the TREC-8, TREC-10, and TREC-12 datasets.

  NDCG                          H0 accepted at 30%-qrels   H0 rejected at 30%-qrels
  H0 accepted at 100%-qrels               62                          5
  H0 rejected at 100%-qrels               22                        176

  bpref                         H0 accepted at 50%-qrels   H0 rejected at 50%-qrels
  H0 accepted at 100%-qrels               66                          6
  H0 rejected at 100%-qrels               37                        157

  InfAP                         H0 accepted at 50%-qrels   H0 rejected at 50%-qrels
  H0 accepted at 100%-qrels               59                          3
  H0 rejected at 100%-qrels               30                        174

  Table 4: TREC-12: example confusion matrices (NDCG at the 0.3-qrels; bpref and infAP at the 0.5-qrels).

  Figure 2: Accuracy and geometric mean values for various TREC datasets.

The findings are stated below.
(1) InfAP has a higher accuracy than the bpref measure on TREC-10 and TREC-12, but bpref has a higher accuracy on the TREC-8 dataset, although the gap is small. This shows that, even by a statistical measure, infAP is more robust than bpref, lending further evidence to the results in [4, 3].
(2) Once again, consistent with the system-ranking findings, NDCG has accuracy at least as good as infAP across all f-qrels for all TREC datasets. For each of the datasets, the accuracy for NDCG is above 90% even at extremely low values of f (below 0.05). This demonstrates the ability of NDCG to predict with high accuracy whether two runs are statistically significantly different, even at low f-qrels.
(3) The accuracy curve of NDCG is flatter than those of bpref and infAP. Once again, we tabulate the knee of this curve to quantify the improvement; Table 5 shows the knee values.

             bpref   infAP   NDCG
  TREC-8      0.3     0.3     0.3
  TREC-10     0.5     0.5     0.1
  TREC-12     0.5     0.5     0.3

  Table 5: Knee values for the accuracy curves for different measures on the datasets.

(4) NDCG has a lower false positive rate than bpref and infAP. This is quite significant, since false positive errors are very undesirable. To keep the exposition brief, we do not show the plots corresponding to false positive errors.
(5) In addition to the accuracy measure, we also plot the g-mean values, to mitigate the effects of large values of C(2,2). Figure 2 shows the g-mean values for the datasets. The g-mean plots are consistent with the accuracy plots for TREC-8, TREC-10, and TREC-12 and show NDCG to be either as good as infAP or better at low f-qrels.
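For completeness, the knee of a correlation (or accuracy) curve as defined above is a one-liner; the sketch below, ours, assumes the curve is given as (f, value) pairs.

```python
def knee(curve, threshold=0.9):
    """curve: list of (f, value) pairs, one per f-qrels.
    Returns the smallest f whose value reaches the threshold, i.e. the
    'knee' defined in the text, or None if the threshold is never reached."""
    qualifying = [f for f, value in curve if value >= threshold]
    return min(qualifying) if qualifying else None
```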
5.3 Binary vs. tertiary DCG
In this experiment we study the effect of multiple grades of relevance on NDCG. Our goal is to understand whether three grades of relevance give the NDCG measure higher Kendall correlation values in preserving system rankings, and better accuracy values, than a binary relevance grade. For this experiment we use the TREC-14 dataset, because it has the largest number of tertiary judgments amongst our datasets. As mentioned earlier, TREC-14 uses three grades of relevance, namely, highly relevant, relevant, and non-relevant. A highly relevant document x has a weight of rel(x) = 2, a relevant document x has a weight of rel(x) = 1, and a non-relevant document x has a weight of rel(x) = 0. We use the general formula for NDCG, using the actual weight rel(x) of each document. We also try two additional weights, 10 and 100, for highly relevant documents to study the effect of the weights on DCG; this boosts the highly relevant documents to the point that the rest hardly matter.

  Figure 3: Kendall correlation values for binary vs. tertiary DCG.

Figure 3 shows the Kendall correlation values for binary and tertiary NDCG. From the plot it is easy to see that the difference between the two measures is negligible; in fact, the difference between the Kendall correlations is only ±1%. This suggests that tertiary NDCG provides little advantage over binary NDCG in preserving system rankings under incompleteness. This evidence is further strengthened by the accuracy values (not shown): once again, the difference between the binary and tertiary NDCG values, even with different weights for highly relevant documents, is insignificant. This is counterintuitive, but perhaps an artifact of the dataset we used.

6. CONCLUSIONS
In this paper we show that NDCG is a very robust measure under incomplete judgments. We use the TREC-8, TREC-10, TREC-12, and TREC-14 datasets in our study and examine the robustness of three relevance measures, namely, bpref, infAP, and NDCG. Robustness is measured by the Kendall correlation between the system ranking under complete judgments and the rankings under 16 different levels of incomplete judgments. We also use statistical hypothesis testing for randomly selected pairs of runs to check whether the same accept or reject conclusion reached under complete judgments can be reached under incomplete judgments; an accuracy measure is derived from a confusion matrix for experiments with 300 pairs of randomly selected runs.

Our results indicate that, compared to bpref and infAP, NDCG consistently has the highest Kendall correlation values across most of the different levels of incompleteness for all TREC datasets. NDCG also consistently has the highest accuracy values based on the statistical significance tests. Our experiments also suggest that tertiary NDCG is only as robust as binary NDCG under both evaluation methods. Future work involves extending these results to very large corpora such as the Web, where incomplete judgments are the rule rather than the exception. It will also be interesting to analyze the measures under some generative model and analytically show that NDCG is more robust than bpref and infAP.

Acknowledgments
We thank Ellen Voorhees for providing us the datasets and clarifications on the bpref measure and the simulation framework. We also thank Jan Pedersen for his insightful remarks.

7. REFERENCES
[1] K. Ali, C. Chang, and Y. F. Juan. Exploring cost-effective approaches to human evaluation of search engine relevance. In Proc. 27th European Conference on Information Retrieval, pages 360-374, 2005.
[2] E. Amitay, D. Carmel, R. Lempel, and A. Soffer. Scaling IR-system evaluation using term relevance sets. In Proc. 27th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 10-17, 2004.
[3] J. A. Aslam, V. Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In Proc. 29th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 541-548, 2006.
[4] J. A. Aslam and E. Yilmaz. Estimating average precision with incomplete and imperfect judgments. In Proc. 15th ACM CIKM International Conference on Information and Knowledge Management, pages 102-111, 2006.
[5] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman, Boston, 1999.
[6] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In Proc. 23rd ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33-40, 2000.
[7] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proc. 27th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 25-32, 2004.
[8] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.
[9] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proc. 28th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 154-161, 2005.
[10] K. Puolamäki, J. Salojärvi, E. Savia, J. Simola, and S. Kaski. Combining eye movements and collaborative filtering for proactive information retrieval. In Proc. 28th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 146-153, 2005.
[11] M. Sanderson and J. Zobel. Information retrieval system evaluation: effort, sensitivity, and reliability. In Proc. 28th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 162-169, 2005.
[12] J. Savoy. Statistical inference in retrieval effectiveness evaluation. Information Processing and Management, 33(4):495-512, 1997.
[13] K. Sparck-Jones and C. J. van Rijsbergen. Report on the need for and provision of an 'ideal' information retrieval test collection. Technical Report 5266, Computer Laboratory, Cambridge University, 1976.
[14] C. J. van Rijsbergen. Information Retrieval. Butterworth, 1979.
[15] E. M. Voorhees. Evaluation by highly relevant documents. In Proc. 24th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74-82, 2001.
[16] E. M. Voorhees. The philosophy of information retrieval evaluation. In Proc. 2nd Workshop of the Cross-Language Evaluation Forum, pages 355-370, 2001.
[17] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80-83, 1945.
[18] J. Zobel. How reliable are the results of large-scale information retrieval experiments? In Proc. 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307-314, 1998.