Test Theory for Assessing IR Test Collections

David Bodoff
Graduate School of Business, University of Haifa, Haifa, Israel
++972-4-8249193
dbodoff@gsb.haifa.ac.il

Pu Li
State University of New York at Buffalo, 325 Jacobs Management Center, SUNY, Buffalo, NY, USA
++1-716-645-3285
puli@buffalo.edu

ABSTRACT
How good is an IR test collection? A series of papers in recent years has addressed the question by empirically enumerating the consistency of performance comparisons using alternate subsets of the collection. In this paper we propose using Test Theory, which is based on analysis of variance and is specifically designed to assess test collections. Using the method, we not only can measure test reliability after the fact, but we can estimate the test collection's reliability before it is even built or used. We can also determine an optimal allocation of resources before the fact, e.g. whether to invest in more judges or queries. The method, which is in widespread use in the field of educational testing, complements data-driven approaches to assessing test collections. Whereas the data-driven method focuses on test results, test theory focuses on test designs. It offers unique practical results, as well as insights about the variety and implications of alternative test designs.

Categories and Subject Descriptors: H.3 Information Storage and Retrieval; H.3.4 Systems and Software: Performance Evaluation
General Terms: Measurement
Keywords: information retrieval, test collections, test theory

1. INTRODUCTION
Research in information retrieval has benefited significantly from the availability of standard test collections, which allow direct comparisons between algorithms on a single set of data. Two questions arise: How reliable is a given performance comparison? And how good is the test collection? Most research that has analyzed test collections as a whole has been based on ex-post, data-driven sensitivity analysis. In 1998, both Zobel [1] and Voorhees [2] independently reported data-driven methods for measuring the average consistency of the performance comparisons done on a given test collection. This approach implicitly connects the question of how good the test collection is with the question of how reliable the individual performance comparisons are. We present some details, and then the limitations we see in this data-driven approach, in order to motivate the alternative we propose. We take Voorhees' method as an example of the data-driven approach, because that method was used in subsequent studies [3, 4] and represents, in our view, the most thorough body of work in the data-driven approach to analyzing IR test collections. The method is based on finding the "swap rate" of individual performance comparisons. This is calculated as follows: Consider all the cases where an algorithm A out-performs another algorithm B by x% on some set of n queries from the collection.
Then consider other possible sets of n queries from within the collection. On how many of those other sets would B have out-performed A [by any amount]? That is the "swap rate". Since most researchers nowadays do not claim A's superiority over B without invoking a statistical test of some kind, Sanderson and Zobel [5] introduced a modified version of the swap rate calculation, which begins by considering only those cases where an algorithm A out-performed B by x% and with a certain p-value for the statistical difference.
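To make the calculation concrete, the following minimal sketch estimates a swap rate for one pair of systems by Monte Carlo sampling of disjoint query sets. It is illustrative only; the function name, the per-query score lists, and the sampling scheme are our assumptions, not the exact procedure of [2] or [5].

import random

def swap_rate(scores_a, scores_b, n, margin, trials=20000, seed=0):
    # scores_a, scores_b: hypothetical per-query scores for systems A and B
    # (same query order). n: query-set size. margin: e.g. 0.05 for "A beats B by 5%".
    # Draw pairs of disjoint n-query sets; whenever A beats B by the margin on the
    # first set, check whether the comparison swaps (B beats A) on the second set.
    rng = random.Random(seed)
    queries = list(range(len(scores_a)))
    cases = swaps = 0
    for _ in range(trials):
        sample = rng.sample(queries, 2 * n)
        first, second = sample[:n], sample[n:]
        mean = lambda s, subset: sum(s[q] for q in subset) / n
        if mean(scores_a, first) >= (1 + margin) * mean(scores_b, first):
            cases += 1
            if mean(scores_b, second) > mean(scores_a, second):
                swaps += 1
    return swaps / cases if cases else float("nan")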
This approach has two defining features, each of which has strengths but, in our view, also significant limitations. The first feature is that it is data driven. A data-driven method has the strength that it faithfully summarizes the available data and cannot be challenged in terms of the probability of the result. But the flip side is that in the absence of theory, the swap rate can only be interpreted literally. There is no basis for interpreting it as indicating that the superiority of A was "real" in any other sense, and there is no basis for selecting that particular measure of consistency over others (e.g. the probability that B would have out-performed A by x% on other queries). There is also no basis for extrapolating the reliability result to other queries, or to larger numbers of queries, although the authors of [3, 4] allow themselves this extrapolation. This is actually a significant limitation, because the split-halves approach can only be calculated up to half the actual collection size. As a result, the whole benefit of the data-driven approach -- that it is perfectly faithful to the data -- is lost, because extrapolation is required even to "extend" the analysis from half the collection's size to its full size.

But these limitations are not our primary motivation for suggesting our alternative, because in our view these concerns can be overcome by replacing the data-driven swap rate calculation with a statistical version based on p-values. Specifically, we can take the same set of cases in which A out-performed B by x%, and calculate the average p-value of a statistical test on those cases, instead of the average swap rate. This can be done using the p-value of a parametric or a non-parametric test. We have done this analysis in terms of the p-value of a parametric t-test, and depict the results in Figure 1. The results are very close to the swap rate results reported in [4], except that because our y-axis measures p-values instead of swap rates, the results have clear statistical meaning that justifies extrapolations beyond the full data set to the whole population of queries.

Figure 1: Swap rate results based on [1]

But there is a second prominent feature of the swap-rate approach, which motivates our proposal of a very different method to complement it. The swap rate approach, even in the p-value version of Figure 1, evaluates the test collection in terms of the robustness of (pairwise) comparisons on a set of performance results. Because the robustness of performance comparisons depends on how those comparisons were made, the result is not a single number that characterizes the collection per se, but a graph that shows robustness of performance comparisons as a function of how they were made, e.g. as a function of the magnitude of difference and the number of queries. Using this method to compare two test collections -- e.g. to see if we have improved the collection from a previous year -- would therefore entail comparing two 3-dimensional graphs. Unless one graph completely dominates the other, we could not say which test collection is more reliable. Moreover, separate graphs are needed for reporting robustness with respect to the choice of queries, relevance judges, performance measures, etc., so this adds even more dimensions; a comparison of two test collections entails comparing two sets of 3-D graphs. The practical difficulty of comparing two sets of graphs reflects the more important underlying fact that this approach does not really characterize the test collection, but instead characterizes a particular set of performance comparisons. Moreover, the swap rates with respect to topic set size, assessors, measures, etc. are measured and reported separately, and this begs the question: doesn't consistency of results depend on all of the above simultaneously? Rather than measuring inconsistency with respect to each factor separately, we really want to know the consistency rate with respect to all factors simultaneously. Finally, by measuring the goodness of a test collection in terms of the reliability of individual performance comparisons, we may risk over-emphasizing the horse-racing aspect of performance comparisons. Of course, we do want performance comparisons that announce a winner to hold up against swaps, but are there no other qualities that a test collection should strive for?

In summary, progress in IR has been immeasurably facilitated by the test collections and by the deep understanding that their developers and guardians have gained through rigorous data analysis. Nevertheless, we have presented our view of the limitations of the data-driven approach, because we believe that it can be profitably augmented by another method. The approach we recommend is new to IR, but is in widespread use in the field of educational testing. Its chief advantages are that (1) it is based on theory, and (2) the theory is specifically developed to assess a test collection as a whole. This allows us to address the question "how good is the test collection?" at an appropriately higher level of abstraction, distinct from individual performance comparisons. The key to working at this higher level of analysis is that we abandon the hypothesis-testing paradigm and adopt instead an analysis-of-variance paradigm. As we will see, the use of theory leads to numerous practical benefits, such as the ability to calculate how good a planned test collection is likely to be, even before it is constructed, and the ability to save on resources. In helping to save resources, our work is similar in motivation to the recent work by Carterette et al. (SIGIR 2006).

2. New Approach: Test Theory
The field of Test Theory has two streams: (1) Classical Test Theory (CTT) and Generalizability Theory (GT), and (2) Item Response Theory (IRT), sometimes known as Latent Trait Theory. This paper focuses on CTT and GT, which to our knowledge have not been previously applied to IR. There has been some work using IRT in IR [6, 7], though not to assess test collections' reliability.

2.1 Hypothesis Testing vs. ANOVA
A hypothesis-testing approach asks whether data supports a hypothesis such as "There is a difference in true means between two systems' performance". The data-driven method we have seen fits in this tradition, except that it uses swap rate in lieu of the statistical concept of a difference in true means.
Unlike the hypothesis testing paradigm, the field of Test Theory has no concept of true means. It does have the concept of a true score, but this is defined as the average over all observable situations; there is no such thing as a hidden or latent parameter such as a mean. Thus, the question of whether two algorithms have different true means is not a sensible question in these theories. And since there is no hypothesis, there is no hypothesis testing. Rather, CTT and GT provide tools to answer the question: to what extent does test score variance reflect variance of examinees' true observable scores? This is a ratio. Technically it is a kind of R-square, and it is instructive to ponder the different perspectives provided by p-values and R-square. Hypothesis testing is concerned with p-values, and this is a natural approach when applied to an individual performance comparison: Is A really better than B? Analysis of variance is concerned with R-square, and we believe this is a natural approach when applied to the test as a whole: To what extent does the test capture differences between the examinees, as opposed to other sources of variance? Unlike the hypothesis-testing paradigm, Test Theory makes no distributional assumptions, so this is not a reason to prefer a data-driven method. Test Theory is traditionally used to estimate the reliability (defined below) of tests, such as standardized college-entrance exams. In applying test theory to the case of IR, we are envisioning each algorithm as a student facing an exam, and each query as a question on the exam. CTT and GT ask: to what extent does a given test collection capture differences between algorithms, as opposed to other sources of variance?

2.2 Classical Test Theory
Classical test theory [8] adopts the view that each student's (algorithm's) observed test score X_j is the realization of a random variable with a true mean T_j: X_j = T_j + E_j. However, this "true" value is not a mysterious hidden parameter, but just an average over all possible test conditions. CTT addresses the question of the test's "reliability", defined as the proportion of variance in examinees' scores that is due to differences in true scores, as opposed to random error:

\rho_{XX'} = \sigma^2_T / \sigma^2_X

One analytical result has made CTT practically useful and widely used. It turns out that the reliability coefficient can be estimated from a single administration of a test, by analyzing the variance of individual test items and total test scores. Cronbach's alpha is the best-known measure of this kind. The calculation of Cronbach's alpha is:

\hat{\alpha} = \frac{k}{k-1} \left( 1 - \frac{\sum_i \hat{\sigma}^2_i}{\hat{\sigma}^2_X} \right)

where k is the number of items on the exam, \hat{\sigma}^2_i is the estimated variance for item i, and \hat{\sigma}^2_X is the estimated variance of the total scores. In this way, CTT analytically relates a measure of consistency to a measure of truth. This is one of the advantages of an analytical, as opposed to a completely data-driven, approach to the study of consistency in evaluation.

Figure 2: Example of Cronbach's alpha calculation

                 Query 1    Query 2    Query 3    algorithm totals
algorithm 1      0.7        0.5        0.6        1.8
algorithm 2      0.8        0.6        0.76       2.16
algorithm 3      0.94       0.82       0.89       2.65
algorithm 4      0.75       0.7        0.5        1.95
algorithm 5      0.75       0.8        0.75       2.3
item variances   0.00847    0.01828    0.02305

sum of item variances: .00847 + .01828 + .02305 = .0498
total-score variance: variance{1.8, 2.16, 2.65, 1.95, 2.3} = .10817
alpha: (3/2) * (1 - .0498/.10817) = .80942
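As a quick check, the Figure 2 calculation can be scripted in a few lines. The following sketch is ours (assuming numpy and a rows-are-algorithms, columns-are-queries layout) and reproduces the alpha of about .809 for the Figure 2 data:

import numpy as np

def cronbach_alpha(scores):
    # scores: algorithms (rows) x queries (columns) matrix of performance scores.
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]                           # number of items (queries)
    item_vars = X.var(axis=0, ddof=1)        # per-query variance across algorithms
    total_var = X.sum(axis=1).var(ddof=1)    # variance of per-algorithm total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

figure2 = [[0.70, 0.50, 0.60],
           [0.80, 0.60, 0.76],
           [0.94, 0.82, 0.89],
           [0.75, 0.70, 0.50],
           [0.75, 0.80, 0.75]]
print(round(cronbach_alpha(figure2), 5))     # approximately 0.80942, as in Figure 2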
Because it is theoretically derived, CTT allows a number of further analyses, beyond the mere reporting of reliability coefficients. For example, it follows directly from the analytical formula for Cronbach's alpha that the key to a high reliability is high covariance, which makes \hat{\sigma}^2_X greater than \sum_i \hat{\sigma}^2_i. High covariances result when individual items have (1) high variances, and (2) high correlations. Loosely speaking, the requirement of high variances means that all questions should be of medium difficulty. High covariances also result when each query correlates with the others and with the total score for each participant. A negative item-total correlation is strong evidence that the item is testing something completely different from what the test is intending. We will report results of such analysis in the Results section below.

CTT's weakness is that it is a simple model. Besides differences in examinees' true scores, it allows only a single residual term. The model makes no allowance for item variance, or other individuated sources of variance. Also as a result of its simplicity, CTT calculations only regard relative reliability (the meaning of this will become clearer after we show GT's greater flexibility). A second important limitation is that CTT calculations are possible only after the test has been used. Generalizability Theory differs in that it has a built-in method for doing prospective test planning.

2.3 Basic Concepts in Generalizability Theory
GT [9, 10] is a dramatic generalization of CTT. We intend to provide sufficient details to serve as a functional tutorial for anyone interested in using GT. Whether or not testbed designers actually use GT calculations, we believe that GT concepts help to broaden our understanding of the question, how good is a test collection? GT considers that a test has many facets in addition to the examinees, which in this case are the algorithms being tested. In the IR case, let us suppose that these facets include query topics and relevance assessors. The score of a participating algorithm on a single query, as measured using the relevance assessments of one judge, would be modeled as the sum of a grand mean, a person effect, a query effect, an assessor effect, all interactions, and a residual effect. We might denote this as

X_{pia} = \mu + \nu_p + \nu_i + \nu_a + \nu_{pi} + \nu_{pa} + \nu_{ia} + \nu_{pia,e}

The grand mean is a constant, but each of the other separate effects is modeled as a random variable with its own mean and variance. We denote the variances of these effects as \sigma^2_p, \sigma^2_i, \sigma^2_a, \sigma^2_{pi}, \sigma^2_{pa}, \sigma^2_{ia}, \sigma^2_{pia,e}. In GT these are known as "variance components"; the term "error variance" is reserved for variance components that we view as noise (see below). Regarding interpretation, \sigma^2_i, for example, is the expected squared difference between the mean (over participants and assessors) of a single item and the average of all items in the "universe" (like a population). The variance of observed scores is then modeled as the sum of these variance components:

\sigma^2(X_{pia}) = \sigma^2_p + \sigma^2_i + \sigma^2_a + \sigma^2_{pi} + \sigma^2_{pa} + \sigma^2_{ia} + \sigma^2_{pia,e}   (1)

To use GT, one begins by conducting a G-study, which is an analysis of variance of some actual results from a set of algorithms on an IR test collection. The goal of the analysis is to estimate the variance components of equation (1). We can conduct such a G-study analysis on the basis of new data that we collect specifically for that purpose, or we can use available data, e.g. from previous TRECs, as the basis for a G-study analysis.
The G-study does not need to use the same assessors, queries, participants, etc. as will be included in the final test we are planning. The only requirement is that they come from the same "universe of admissible observations" as the test we plan to construct, and to which we ultimately want to generalize. We will assume that the G-study was a sample in which all the facets were crossed, which means that every participant got a score for every assessor on every query. Note that only a relatively small sample of this kind is necessary to perform the G-study. Empirical work has shown reasonably small standard errors for designs with just 2 or 3 elements in each facet [11], and this is the case with our data. Often, such crossed data is available as a subset of a larger set of data. For example, if a pool of available relevance assessors has helped judge a set of queries, then it will not be the case that every query was judged by every assessor. But one might be able to find (say) 5 queries, each of which was judged by (say) 2 assessors. If even this small sample of crossed data is not available, and instead some facets are nested within others, then it is still possible to use GT, but with less flexibility, because some of the variance components will be conflated; for lack of space, we omit these details. Table 1 provides a (real) example of results of a crossed G-study.

Table 1. G-study results based on 50 queries crossed with 2 assessors for 33 participants

Effect                              MSE       Variance   % of total variance
Participant main effect             0.7485    .00751     19.13%
Query main effect                   1.1274    .01596     40.67%
Assessor main effect                0         0          0%
Participant-Query interaction       .0269     .01258     32.04%
Participant-Assessor interaction    .00251    .00002     0%
Query-Assessor interaction          .04884    .00143     3.64%
P-Q-A interaction                   .00176    .00176     4.48%
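For readers who want to reproduce this kind of analysis, the following sketch (ours, not the authors' code) estimates the seven variance components of equation (1) from a fully crossed participant x query x assessor score array, using the standard expected-mean-squares equations for a three-way random-effects design with one observation per cell; negative estimates are truncated to zero, as in the assessor main effect of Table 1. The array shape and dictionary keys are our own conventions.

import numpy as np

def g_study(scores):
    # scores: array of shape (participants, queries, assessors), fully crossed,
    # one observed performance score per (participant, query, assessor) cell.
    X = np.asarray(scores, dtype=float)
    n_p, n_i, n_a = X.shape
    grand = X.mean()
    mp, mi, ma = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    mpi, mpa, mia = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares for each effect.
    ms_p = n_i * n_a * ((mp - grand) ** 2).sum() / (n_p - 1)
    ms_i = n_p * n_a * ((mi - grand) ** 2).sum() / (n_i - 1)
    ms_a = n_p * n_i * ((ma - grand) ** 2).sum() / (n_a - 1)
    ms_pi = n_a * ((mpi - mp[:, None] - mi[None, :] + grand) ** 2).sum() / ((n_p - 1) * (n_i - 1))
    ms_pa = n_i * ((mpa - mp[:, None] - ma[None, :] + grand) ** 2).sum() / ((n_p - 1) * (n_a - 1))
    ms_ia = n_p * ((mia - mi[:, None] - ma[None, :] + grand) ** 2).sum() / ((n_i - 1) * (n_a - 1))
    resid = (X - mpi[:, :, None] - mpa[:, None, :] - mia[None, :, :]
             + mp[:, None, None] + mi[None, :, None] + ma[None, None, :] - grand)
    ms_pia = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1) * (n_a - 1))

    # Solve the expected-mean-squares equations for the variance components.
    v = {"pia,e": ms_pia}
    v["pi"] = max((ms_pi - ms_pia) / n_a, 0.0)
    v["pa"] = max((ms_pa - ms_pia) / n_i, 0.0)
    v["ia"] = max((ms_ia - ms_pia) / n_p, 0.0)
    v["p"] = max((ms_p - ms_pi - ms_pa + ms_pia) / (n_i * n_a), 0.0)
    v["i"] = max((ms_i - ms_pi - ms_ia + ms_pia) / (n_p * n_a), 0.0)
    v["a"] = max((ms_a - ms_pa - ms_ia + ms_pia) / (n_p * n_i), 0.0)
    return v

Fed with the 33-participant x 50-query x 2-assessor array underlying Table 1, such a routine would return estimates corresponding to the Variance column.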
Once we have derived these estimates of variance components, the hard work is done, and we begin to reap the benefits of GT. We proceed now to perform "D-studies". D-studies do not involve collecting any data or administering any tests. They just mean plugging numbers -- the results of the G-study -- into theoretically derived formulas, to calculate the reliability coefficients (defined below) of various test designs that we may be considering. We can consider the reliability of many different test designs, even designs that differ radically from the G-study. To repeat this important point, there is no one-to-one connection between the G-study design and the D-study designs. The G-study results in an estimate of the magnitude of variance components, and we can then use these estimates to calculate the reliability of a wide variety of proposed test collection designs, even those that have no resemblance to the G-study. For example, we might plug the G-study results into formulas to calculate the reliability of a proposed task with 20 queries and the same 2 assessors providing judgments for all 20 (20 queries "crossed with" 2 assessors). Doing that calculation is called a D-study. Then we could compare those results against another proposal to have the same task with (say) 30 queries and 1 different assessor per query (1 assessor "nested within" 30 queries); plugging in those numbers to get a reliability coefficient would constitute a second D-study. In general, a D-study allows us to calculate reliability based on two aspects of a proposed test design: (1) how many items we plan to have in each facet -- e.g. how many queries, how many assessors, etc.; and (2) how the two facets will be related structurally -- e.g. will they be "crossed", in which case each query gets relevance assessments from each of the assessors? Or are "assessors nested within queries", which means there are different assessors for each query? Or vice versa? When we are satisfied by our armchair D-study calculations that we have found a design that meets our cost and reliability criteria, we are finished. We then go out and construct that test collection. Thus, a primary purpose of GT is to help design a test collection a priori, rather than to assess an existing test collection after the fact. The ability to work a priori is just one of many advantages that GT offers.

We now complete the technical and notational details of D-studies. Whereas the G-study was interested in estimating the variance that comes from a single query, assessor, measure, etc., the D-study asks, given those results, what will be the effect of using a set of n queries, a set of m assessors, etc. By convention, these sets are denoted by capital letters such as I, A. The variances of the separate effects are denoted as \sigma^2_p, \sigma^2_I, \sigma^2_A, \sigma^2_{pI}, \sigma^2_{pA}, \sigma^2_{IA}, \sigma^2_{pIA,e}. The meaning of \sigma^2_I, for example, is the expected squared difference between a (single) participant's observable score on a set of items I and his/her grand mean. Note that (only) p is still in lower case, because we are still interested in variance due to each single participant. Each of the set-variances such as \sigma^2_I is calculated by dividing the corresponding G-study result by the number of individuals we are considering for that facet. For example, if our G-study showed a variance \sigma^2_i of .01596 for a single question, the variance \sigma^2_I in a design with 4 queries would be .01596/4. Similar calculations can be done for interactions. The straightforward formulas are:

\sigma^2_I = \sigma^2_i / n'_i,  \sigma^2_A = \sigma^2_a / n'_a,  \sigma^2_{pI} = \sigma^2_{pi} / n'_i,  \sigma^2_{pA} = \sigma^2_{pa} / n'_a,  \sigma^2_{IA} = \sigma^2_{ia} / (n'_i n'_a),  \sigma^2_{pIA,e} = \sigma^2_{pia,e} / (n'_i n'_a)   (2)

where n'_i, n'_a denote the number of queries and assessors in the D-study being considered. Finally, GT defines absolute error variance as

\sigma^2_\Delta = \sigma^2_I + \sigma^2_A + \sigma^2_{pI} + \sigma^2_{pA} + \sigma^2_{IA} + \sigma^2_{pIA,e}   (3)

This is the variance of all effects, except the main participant effect. The corresponding ratio of absolute reliability is

\Phi = \sigma^2_p / (\sigma^2_p + \sigma^2_\Delta)   (4)

A larger reliability coefficient means that for that proposed design, a higher proportion of total variance in observed scores is due to participants, and not to other sources. As an example using the G-study results of Table 1, we plug those variance estimates into the D-study formulas (2)-(4) to estimate what the absolute reliability would be if we create a new test collection with (say) 20 queries and 3 assessors per query:

\sigma^2_\Delta = .01596/20 + 0/3 + .01258/20 + .00002/3 + .00143/(20*3) + .00176/(20*3) = .00149

Then

\Phi = \sigma^2_p / (\sigma^2_p + \sigma^2_\Delta) = .00751 / (.00751 + .00149) = .835

Using the same G-study results and the same D-study formula, we could plug in different numbers for n'_i and n'_a to estimate what the reliability would be if the new collection has instead 50 queries and only one assessor (the answer is .920).
Until now, we have only demonstrated how to calculate the reliability of crossed D-study designs. But GT can also be used to consider the reliability of nested designs, in which, for example, there will be different relevance judges for each query. The formulas are a bit different, and we do not present them here for lack of space.

Another powerful feature of GT is that it distinguishes between absolute and relative error. Absolute error variance, which we have discussed up to here, includes all error components besides the participant main effect. This kind of error is important for absolute decisions, such as whether a production system reaches an acceptable level of performance for deployment in a real setting. But if our main purpose is to gain insights into which methods work better than others, then we care only about relative error. Any source of variance that is equal for all participants -- e.g. variance of item (difficulty), etc. -- does not affect such comparisons, and so is excluded from the relative error variance. To formulate relative error variance, simply remove from the corresponding equation of absolute error variance any component that has no reference to "p". For example, the formula for relative error variance of a fully crossed test design is (compare with eq. (3))

\sigma^2_\delta = \sigma^2_{pI} + \sigma^2_{pA} + \sigma^2_{pIA,e}   (5)

Because \sigma^2_\delta includes a subset of the terms in \sigma^2_\Delta, the relative error variance is always equal to or smaller than the absolute error variance. Finally, the relative reliability coefficient (compare with (4)) is:

E\rho^2 = \sigma^2_p / (\sigma^2_p + \sigma^2_\delta)   (6)

In the example above, with 20 queries and 3 assessors per query in a crossed design,

\sigma^2_\delta = .00066,  and  E\rho^2 = .00751 / (.00751 + .00066) = .919
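The D-study step for crossed designs is thus simple arithmetic on top of the G-study output. The following sketch (ours) implements equations (2)-(6) for a crossed design, using the variance-component keys assumed in the earlier G-study sketch; with the Table 1 values it reproduces the worked examples above:

def d_study_crossed(var, n_i, n_a):
    # var: G-study variance components; n_i, n_a: planned numbers of queries and
    # assessor roles per query in a fully crossed design.
    abs_err = (var["i"] / n_i + var["a"] / n_a                 # eq. (3), via eq. (2)
               + var["pi"] / n_i + var["pa"] / n_a
               + (var["ia"] + var["pia,e"]) / (n_i * n_a))
    rel_err = (var["pi"] / n_i + var["pa"] / n_a               # eq. (5)
               + var["pia,e"] / (n_i * n_a))
    phi = var["p"] / (var["p"] + abs_err)                      # eq. (4): absolute reliability
    e_rho2 = var["p"] / (var["p"] + rel_err)                   # eq. (6): relative reliability
    return phi, e_rho2

table1 = {"p": .00751, "i": .01596, "a": 0.0,
          "pi": .01258, "pa": .00002, "ia": .00143, "pia,e": .00176}
print(d_study_crossed(table1, 20, 3))   # -> approximately (.835, .919)
print(d_study_crossed(table1, 50, 1))   # absolute reliability approximately .920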
This concludes our introduction to GT. GT evaluates the test collection's reliability, without regard to particular performance comparisons. It captures the extent to which the collection captures the variance of interest, as opposed to other sources of variance. It is theory-driven. It accounts for variance from all facets, but unlike the non-theoretical swap rate approach, the ANOVA-based approach calculates reliability with respect to all facets simultaneously. It also differentiates between absolute and relative reliability. Two proposed designs are compared by comparing the reliability of one to the reliability of the other. Additional benefits are the possibility of analytical results; we will see these in the Results section. Finally, it allows calculation of reliability before the test collection exists, based on an analysis of variance with a sample of data. This is the ultimate contrast to the data-driven approach; the two can be used to complement one another.

3. Results
With this basic background in CTT and GT, we proceed to report results on some TREC data. We will introduce additional details of GT as needed to explain results.

3.1 CTT Results
3.1.1 Ex-post reliability of various TRECs
We calculated Cronbach's alpha for the ad hoc task in TREC3-TREC8, and the Web track in TRECs 9 and 10. Table 2 shows the results.

Table 2. Cronbach's alpha for TREC3-TREC10

TREC #    Cronbach's alpha    TREC #     Cronbach's alpha
TREC 3    .9326               TREC 7     .9192
TREC 4    .9007               TREC 8     .9049
TREC 5    .8545               TREC 9     .8720
TREC 6    .8980               TREC 10    .8570

In absolute terms, a reliability coefficient of .80 or higher is considered acceptable in most social science applications, but we are not aware of any similar standard for engineering studies. This result is easy to calculate, but it is ex-post. It can serve as a basic measure for test collection developers, to alert them to any macro-level trends in reliability.

3.1.2 Elaboration: Item Analysis
The more specific insights from CTT come from the item-total correlations, which can help to identify problematic queries as described above. We calculated these correlations for each query in the ad hoc tracks of TREC3-TREC8 and the web tracks of TREC9 and TREC10. On average, there was just over one query per TREC with a negative item-total correlation, including a few with strong negative correlations (data available upon request). We were unable to identify any obvious features that caused any of these queries to be problematic; e.g. none of them had unusually high or low mean (easy or hard queries) or variance. In any case, from the point of view of increasing test reliability, these items should be removed.

One may argue that anomalous (non-correlating or negatively-correlating) queries are useful because they reveal interesting phenomena, and so should not be excluded from use. For example, we may discover that certain low-performing methods suddenly out-perform others on very long queries. But the meaning of such a discovery, and our reaction to it, should be studied with care. If, even after observing the anomaly, we still believe that the task involves a single, indivisible challenge on which algorithms are being judged, then a query with a zero or negative correlation with the total can only indicate that it is some sort of trick question where performance depends on luck, or, in the case of negative correlations, that it especially fools the good performers. In these cases, there is a strong argument that the query should be dropped. On the other hand, upon seeing the anomaly we may re-conceptualize the task as involving multiple independent or inversely related aspects. In this case, the query may be a meaningful test, but of some different aspect of ability than the main one being elicited via the other queries. In this case, the item should preferably be relocated to a separate task, together with other queries that test that same sort of ability. Only the collection's gatekeepers can ultimately make these conceptual and practical judgments. But CTT identifies the questionable queries, and informs us that as long as such a query is retained in the original task, it undermines the reliability of the test.
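The item analysis is easy to script. This sketch (ours) computes a corrected item-total correlation for each query -- the correlation between the per-algorithm scores on that query and the per-algorithm totals over the remaining queries -- and flags the non-positive ones; the threshold and data layout are assumptions:

import numpy as np

def flag_anomalous_queries(scores, threshold=0.0):
    # scores: algorithms (rows) x queries (columns) matrix of performance scores.
    # Returns (query index, corrected item-total correlation) for suspect queries.
    X = np.asarray(scores, dtype=float)
    flagged = []
    for q in range(X.shape[1]):
        rest_total = np.delete(X, q, axis=1).sum(axis=1)   # totals excluding query q
        r = np.corrcoef(X[:, q], rest_total)[0, 1]
        if r <= threshold:
            flagged.append((q, r))
    return flagged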
3.2 GT Results
3.2.1 G-study results
We use Voorhees' specially constructed data set from [4] as the basis for our GT analysis of IR test designs. This data is unusually rich, as three different assessors -- the topic author and two secondary assessors -- judged each query. This gives us a score for each participant (algorithm), for each query, for each of three relevance assessors. Table 3 presents a sample of the identity (by letter) of the judges that assessed various queries. Secondary assessors were assigned on the basis of their time-availability.

Table 3. Multiple assessors per query

Topic   Original   Secondary A   Secondary B
202     A          G             M
203     B          G             N
204     C          G             I
205     D          L             J
Etc.

The assessor facet requires comment. We can approach the assessor facet either in terms of differences between individuals or differences between roles. A pure individuals approach would measure the variance that results from differences in the assessments of two people when they are both serving as secondary assessors, e.g. the assessments of person G and person M on query 202. Such an approach would enable us to calculate reliability as a function of how many secondary assessors are used. A pure roles approach would measure the variance that results from differences in the assessments of a given individual when he/she serves as a primary assessor versus when he/she serves as a secondary assessor. However, a pure roles approach is not possible, since a given individual cannot serve as both a primary and a secondary assessor for the same query. In this paper, we pursue a kind of "impure" roles approach by including in the G-study data from the first two columns of Table 3, i.e. assessments from the primary assessor and one secondary assessor for each query. This approach measures the variance due to differences between roles, but some of the variance may still be due to the different individuals that happened to serve in the two roles. Thus, the G-study result for this facet should be interpreted as variance that comes from the difference between primary and secondary assessor roles, but the result generalizes only to other situations where the individuals are recruited for these roles in a quasi-random manner, as they were here. Similarly, in the D-study results, when we speak of the number of assessors per query, we mean the number of different assessor roles to be used on each query, i.e. whether the design uses only a single kind of assessor (primary or secondary) on each query, or whether it uses both assessor roles (primary and also secondary) to supply judgments for each query.

The G-study analyzes the data to estimate the variance components. Table 1 above showed the results. It is also possible to estimate confidence intervals around such G-study results (see [9], pp. 190 ff.). We do not show these confidence intervals, but they are extremely narrow on this G-study. The G-study results of Table 1 are just intermediate numbers, needed to conduct D-study estimations of the reliability of various proposed test designs. Nevertheless, they give some initial insights. At first glance, it is striking to see that in a test whose primary purpose is to compare the performance of algorithms, only 19% of the total variance is due to variance among participants. But this should not be alarming, because this represents the variance due to a single query. For the very purpose of controlling this source of variance relative to the main participant effect, any reasonable planned test collection includes many queries. The D-studies will calculate this.

3.2.2 D-study Results: Absolute and Relative Reliability of Various Proposed Crossed Designs
Based on the G-study results, we can explore the absolute and relative reliability of various test collection designs in which assessors will be crossed with queries. We do this by plugging the number of queries and assessors we are considering into formulas (3)-(4) and (5)-(6). Table 4 shows sample results for some typical numbers of queries and assessors.

Table 4. Absolute and relative reliability under crossed designs

            50 queries                            100 queries
            1 assessor role   2 assessor roles    1 assessor role   2 assessor roles
Absolute    .920              .925                .958              .960
Relative    .961              .964                .979              .981

The table provides a score of the reliability of each design. Beyond that, it shows that with these levels of queries and assessors, doubling the number of queries is much more productive than doubling the number of assessor roles per query; it has a greater benefit on both absolute and relative reliability. In the following section, we will explain why multiplying queries or assessors by an equal factor is a fair and meaningful comparison. In any case, this result -- that compounding queries gives better bang for the buck than compounding assessors -- can change, depending on the numbers of queries and assessors involved. This can be found using trial and error, or with an analytical approach as in the next section.
3.2.3 D-study Result: Analytical Optimum for Crossed Design
Is there any point at which the returns from adding assessors begin to out-perform the returns from adding queries? We can find this using the trial-and-error approach of the previous section. But if we introduce a cost model, then we can apply a more focused analysis, justify our comparisons, and find an analytically optimal design. This is a further strength of the theory-based nature of GT. Assume that the total cost of constructing a set of relevance judgments is proportional to n'_i * n'_a, i.e. the number of assessors per query times the number of queries. This model, which omits any reference to the number of documents, is appropriate for pooling methods of relevance assessment, where a fixed number of documents gets manually assessed for each query. This cost model motivates the comparison of the previous section, in which we double either assessors or queries and see which has the greater effect on reliability. The doubling exercise supposes that the budget is fixed, and we want to find the ratio n'_i / n'_a that maximizes reliability. In other words, we want to know how to apportion our budget between queries and assessors-per-query in a way that minimizes error variance and thereby maximizes test reliability. Recall that by eqs. (2) and (3) above, absolute error variance is:

\sigma^2_\Delta = \sigma^2_i / n'_i + \sigma^2_a / n'_a + \sigma^2_{pi} / n'_i + \sigma^2_{pa} / n'_a + \sigma^2_{ia} / (n'_i n'_a) + \sigma^2_{pia,e} / (n'_i n'_a)

It can be derived that the optimal ratio [n'_i / n'_a]^* to minimize this type of error is:

[n'_i / n'_a]^* = (\sigma^2_i + \sigma^2_{pi}) / (\sigma^2_a + \sigma^2_{pa}) = 1826

This means that, for a fixed budget and this cost model, the optimal ratio of queries to assessor roles is 1826 to 1. This result means that, starting from the typical values of about 50 queries and 1 assessor role per query, it is much more important to invest in increasing the number of queries before it becomes worthwhile to consider adding a second assessor role per query. A similar approach can also yield an optimal ratio for minimizing relative error, where the result is

[n'_i / n'_a]^* = \sigma^2_{pi} / \sigma^2_{pa} = 805

Of course, these solutions also solve the dual problem of minimizing cost for a given desired value of each coefficient.
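The same optimization can also be carried out numerically, which is convenient when the analytical ratio far exceeds any realistic number of queries. The following sketch (ours) searches integer allocations under the cost model above, using the crossed-design error formulas and the same variance-component keys as in the earlier sketches:

def optimal_allocation(var, budget, relative=False):
    # Search integer (n_i, n_a) pairs with n_i * n_a <= budget and return the pair
    # minimizing error variance (relative or absolute) for a crossed design.
    best = None
    for n_a in range(1, budget + 1):
        n_i = budget // n_a
        if n_i < 1:
            break
        err = var["pi"] / n_i + var["pa"] / n_a + var["pia,e"] / (n_i * n_a)
        if not relative:   # absolute error adds the components that do not involve p
            err += var["i"] / n_i + var["a"] / n_a + var["ia"] / (n_i * n_a)
        if best is None or err < best[0]:
            best = (err, n_i, n_a)
    return best

table1 = {"p": .00751, "i": .01596, "a": 0.0,
          "pi": .01258, "pa": .00002, "ia": .00143, "pia,e": .00176}
print(optimal_allocation(table1, 100))                  # -> (..., 100, 1): spend it all on queries
print(optimal_allocation(table1, 100, relative=True))   # -> (..., 100, 1) here as well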
3.2.4 D-study Result: Crossed versus Nested Designs
Until now, we have considered only the reliability of designs in which assessors are crossed with queries. In this section, we discuss designs in which assessors are nested within queries. Recall from section 3.2.1 that it is possible to analyze the "assessor" facet either in terms of the individuals serving as judges for each query, or in terms of the roles (primary and secondary) used to judge each query. In the first approach, which focuses on individuals, a design with assessors nested within queries means that different individual(s) serve as judge(s) for each query. But we have instead chosen to pursue the other analysis, in terms of roles. In this case, nesting assessors within queries would mean using a different assessor role for each query, e.g. using a primary assessor in one query, and a secondary in the next. With more than two queries, it is not possible to have a genuine nesting of this kind, since there are only two possible roles to choose from, primary or secondary. We therefore omit presenting actual reliability numbers for designs with assessor roles nested within queries. However, we do present an important analytical result that compares nested to crossed designs in general. With the crossed design, we presented above the equation (eq. (5))

\sigma^2_\delta = \sigma^2_{pI} + \sigma^2_{pA} + \sigma^2_{pIA,e} = \sigma^2_{pi} / n'_i + \sigma^2_{pa} / n'_a + \sigma^2_{pia,e} / (n'_i n'_a)

We did not present the full set of formulas for the design with assessors nested within queries, but the corresponding formula is:

\sigma^2_\delta = \sigma^2_{pI} + \sigma^2_{pA:I} = \sigma^2_{pi} / n'_i + (\sigma^2_{pa} + \sigma^2_{pia,e}) / (n'_i n'_a)

It is easy to see that the relative error in the nested design is weakly lower. Thus, for the purpose of distinguishing good methods, it is always advantageous to use different assessors for each query, whether this means different assessor-individuals or different assessor-roles. This may seem counter-intuitive, as one might guess that it is better to stick with the same assessors or assessor types for as many queries as possible. But the intuition fails to distinguish the G-study from the D-study. For the purposes of estimating error, it is indeed helpful to use the same assessors for each query, i.e. a crossed study. We have already noted that doing so allows us to estimate all the individual variance components. But when it comes to controlling the impact of error in the actual test collection, it is better to use different assessors for each query. This is an additional example of the sort of result that GT can offer because it is based in theory.
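The claim that nesting weakly lowers relative error can be checked directly from the two formulas; a small sketch (ours), again with the same variance-component keys:

def relative_error(var, n_i, n_a, nested=False):
    # Relative error variance for a crossed design (eq. (5)), or for assessors
    # nested within queries, where the pa component moves under 1/(n_i * n_a).
    if nested:
        return var["pi"] / n_i + (var["pa"] + var["pia,e"]) / (n_i * n_a)
    return var["pi"] / n_i + var["pa"] / n_a + var["pia,e"] / (n_i * n_a)

table1 = {"pi": .01258, "pa": .00002, "pia,e": .00176}
# Since 1/(n_i * n_a) <= 1/n_a, the nested value can never exceed the crossed value:
print(relative_error(table1, 20, 3), relative_error(table1, 20, 3, nested=True))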
We have reported the following results: CTT reported the overall reliability of the TRECs we studied, and identified a small number of queries that should be re-considered. Turning to GT, we calculated the absolute and relative reliability of crossed test designs, and saw that at the levels of queries being considered, more benefit is gained by investing in additional queries than in additional assessors-per-query. Then we were able to formalize this in an analytical optimal solution, for a given cost function. We were also able to derive an analytical result that, in terms of relative reliability, it is better to use different assessors (individuals or roles) per query, not the same assessors for all queries.

4. Summary and Discussion
This paper proposes a method for answering the question, how good is a test collection? Whereas the data-driven method reports how many performance comparisons were prone to swaps, GT reports the extent to which a planned test design brings out the participant main effect relative to the other sources of score variance. One immediate advantage of GT is that it considers all the sources of variance simultaneously. We are not aware of any attempt within the swap-rate paradigm to consider query and assessor facets simultaneously, although Buckley and Voorhees speak of the interacting effect between the number of queries and the performance measure: "Extrapolating the results here (always dangerous, but there is no other guidance available) doubling the number of queries should suffice" [3]. This is the sort of guidance that GT offers, except with regard to the test facets, not performance measures. Not only does GT support calculation of reliability as a simultaneous function of all facets, it supports comparative statics to directly compare the relative benefits of expanding one facet or the other. As we saw, at the current levels of queries and assessors, it is apparently more worthwhile to invest in additional queries.

The most frequently cited advantage of GT is that it is done ex ante. This is a point worth reiterating. The G-study does require some data, in order to estimate how much variance is contributed by each of the identified facets. This data can be historical, or it can be constructed specially for the purpose of conducting the G-study. But the identity and number of queries, assessors, etc. involved in the historical data does not need to be the same as the identity and number of the queries, assessors, etc. that will be deployed in the test collection being planned. GT does not assume that there will be no drift in relevance assessments between the time of the G-study and the time the final collection is deployed. In fact, GT assumes that the queries being used for the G-study are not the actual queries that are going to be used in the final collection. Rather, GT only assumes that the queries and relevance assessments used for the G-study are representative of the queries and relevance assessments that are being envisioned for the final collection.

One limitation is that a strict application of GT can only investigate the test design as GT defines this, i.e. the number of items in each facet, and the structural relationship between facets. So, for example, if we want to apply GT to study the effect of performance measures on reliability, a pure use of GT could only investigate the effect on reliability of the number of different performance measures used. This would be comparable to the studies we presented above, which investigate the numbers of queries and assessors. To advise which performance measures are more reliable than others -- or which type of queries, if we have two or more query types -- we would need to use an "outer loop" that separately applies GT to each performance measure or query type, and then compares the reliability that can be achieved with each, for a fixed budget.

We have until now ignored one critical facet: documents. But this is not because they are not a source of variance. It is undoubtedly true that some documents present more problems in automatic processing; this is a document main effect. It is undoubtedly true that for a given query, some relevant (non-relevant) documents are more difficult to find (avoid); this is a document-query interaction. It is undoubtedly true that a document may pose more difficulties for some algorithms than for others; this is a document-participant interaction effect. It is undoubtedly true that an assessor's judgments may be tougher on one algorithm than another; this is an assessor-document-participant interaction. And so on. These effects will certainly influence the variance of scores. Why, then, have we avoided this facet? The reason is that the ad hoc task does not give a performance score to a participant with respect to each document, but only a single performance measure with respect to a whole (possibly ranked) list of documents. This makes it impossible and irrelevant to analyze this facet. But in a filtering task, in which algorithms make a binary prediction for each document, it is theoretically possible to give a score (correct/incorrect) for each document. If the task were defined with a per-document performance measure, then we could apply the methods of GT with the additional document facet. This would change the whole picture.
One of the main results reported in this paper -- that a focus on increasing queries is more worthwhile than a focus on increasing assessors -- is to a great extent a result of the fact that the track gives a single performance score for a set of documents, not for each document. Assessors contribute little to total variance in our analysis, because differences between assessors are averaged out over the whole set of documents. But if a task gives scores with respect to each individual document, then the G-study would include the document facet, and results would show increased assessor effects. We could also use the "outer loop" approach outlined above to compare reliability under the two kinds of performance measures. Numerous considerations may be relevant to deciding whether to use a per-document or per-query measure, but GT can inform the choice by reporting test reliability under each choice.

5. Acknowledgments
We thank Ellen Voorhees and Ian Soboroff from NIST for help with the data sets and for helpful comments on an early draft, and Dr. Hoi K. Suen from PSU for early help with Generalizability Theory. Of course, all errors are the authors' sole responsibility.

6. References
1. Zobel, J. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval. 1998, ACM: Melbourne.
2. Voorhees, E. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval. 1998, ACM Press: Melbourne.
3. Buckley, C. and Voorhees, E. Evaluating evaluation measure stability. In Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval. 2000, ACM Press: Athens, Greece.
4. Voorhees, E.M. and Buckley, C. The effect of topic set size on retrieval experiment error. In Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval. 2002, ACM Press: Tampere.
5. Sanderson, M. and Zobel, J. Information retrieval system evaluation: effort, sensitivity, and reliability. In Proceedings of the 28th ACM SIGIR Conference on Research and Development in Information Retrieval. 2005, ACM: Salvador, Brazil.
6. Banks, D., Over, P., and Zhang, N.F. Blind men and elephants: six approaches to TREC data. Information Retrieval, 1999. 1(1-2): p. 7-34.
7. Lange, R., et al. A probabilistic Rasch analysis of question answering evaluation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Annual Meeting, HLT/NAACL 2004. 2004, Association for Computational Linguistics: Boston.
8. Crocker, L. and Algina, J. Introduction to Classical & Modern Test Theory. 1986: Holt, Rinehart, and Winston.
9. Brennan, R.L. Generalizability Theory. Statistics for Social Science and Public Policy, ed. S.E.D. Lievesley and J.R. Feinberg. 2001: Springer-Verlag.
10. Shavelson, R.J. and Webb, N.M. Generalizability Theory: A Primer. 1991, Newbury Park, CA: Sage.
11. Gao, X. and Brennan, L. Variability of estimated variance components and related statistics in a performance assessment. Applied Measurement in Education, 2001. 14(2): p. 191-203.