Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

Ben Carterette
Center for Intelligent Information Retrieval
University of Massachusetts Amherst
Amherst, MA 01003
carteret@cs.umass.edu

Rosie Jones
Yahoo! Research
3333 Empire Ave
Burbank, CA 91504
jonesr@yahoo-inc.com

Abstract

We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods, which can provide only pairwise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence.

¹ Work done while author was at Yahoo! Submitted for confidential review to be considered for publication in NIPS, December 4-9, 2007.

1 Introduction

Web search engine evaluation is an expensive process: it requires relevance judgments that indicate the degree of relevance of each document retrieved for each query in a testing set. In addition, reusing old relevance judgments to evaluate an updated ranking function can be problematic, since documents disappear or become obsolete, and the distribution of queries entered changes [15]. Click data from web searchers, used in aggregate, can provide valuable evidence about the relevance of each document. The general problem with using clicks as relevance judgments is that clicks are biased. They are biased to the top of the ranking [12], to trusted sites, and to attractive abstracts; they are also biased by the type of query and by other things shown on the results page. To cope with this, we introduce a family of models relating clicks to relevance. By conditioning on clicks, we can predict the relevance of a document or a set of documents.

Joachims et al. [12] used eye-tracking devices to track what documents users looked at before clicking. They found that users tend to look at results ranked higher than the one they click on more often than they look at results ranked lower, and this information can in principle be used to train a search engine using these "preference judgments" [10]. The problem with using preference judgments inferred from clicks for learning is that they will tend to learn to reverse the list: a click at the lowest rank is preferred to everything else, while a click at the highest rank is preferred to nothing else. Radlinski and Joachims [13] suggest an antidote to this: randomly swapping adjacent pairs of documents. This ensures that users will not prefer document i to document i + 1 solely because of rank.
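To make the swapping idea concrete, the following minimal Python sketch flips non-overlapping adjacent pairs with probability 1/2 before a list is shown; the function name and the specific coin-flip scheme are our own simplified illustration, not the exact procedure of Radlinski and Joachims [13].

```python
import random

def randomized_adjacent_swap(ranking, seed=None):
    """Randomly flip adjacent pairs of results before presentation, so that a
    user's preference for position i over i + 1 cannot be explained by rank
    alone.  Illustrative sketch only, not the exact algorithm of [13]."""
    rng = random.Random(seed)
    swapped = list(ranking)
    i = 0
    while i + 1 < len(swapped):
        if rng.random() < 0.5:  # flip each non-overlapping adjacent pair with probability 1/2
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        i += 2
    return swapped

# Clicks collected over many impressions of such randomized lists can be
# compared across adjacent positions without a systematic position bias.
print(randomized_adjacent_swap(["d1", "d2", "d3", "d4", "d5"], seed=7))
```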
However, we may not wish to show a suboptimal document ordering in order to acquire data. Our approach instead will be to use discounted cumulative gain (DCG [9]), an evaluation metric commonly used in search engine evaluation. Using click data, we can estimate the confidence that a difference in DCG exists between two rankings without having any relevance judgments for the documents ranked. We will show how a comparison of ranking functions can be performed when clicks are available but complete relevance judgments are not. After an initial training phase with a few relevance judgments, the relevance of unjudged documents can be predicted from clickthrough rates. The confidence in the evaluation can be estimated with the knowledge of which documents are most frequently clicked. Confidence can be dramatically increased with only a few more judiciously chosen relevance judgments.

Our contributions are (1) a formalization of the information retrieval metric DCG as a random variable; (2) analysis of the sign of the difference between two DCGs as an indication that one ranking is better than another; (3) empirical demonstration that combining click-through rates over all results on the page is better at predicting the relevance of the document at position i than just the click-through rate at position i; (4) empirically modeling the relevance of documents using clicks, and using this model to estimate DCG; (5) empirical evaluation of the comparison of different rankings using DCG derived from clicks; and (6) an algorithm for selecting a minimal number of documents for manual relevance judgment to improve the confidence in DCG over the estimate derived from clicks alone.

Section 2 covers previous work on using clickthrough rates and on estimating evaluation metrics. Section 3 describes the evaluation of web retrieval systems using the metric discounted cumulative gain (DCG) and shows how to estimate the confidence that a difference exists when relevance judgments are missing. Our model for predicting relevance from clicks is described in Section 4. We discuss our data in Section 5, and in Section 6 we return to the task of estimating relevance for the evaluation of search engines. Our experiments are conducted in the context of sponsored search, but the methods we use are general enough to translate to general web search engines.

2 Previous Work

There has been a great deal of work on low-cost evaluation in TREC-type settings ([20, 6, 16, 5] are a few), but we are aware of little for the web. As discussed above, Joachims [10, 12] and Radlinski and Joachims [13] conducted seminal work on using clicks to infer user preferences between documents. Agichtein et al. [2, 1] used and applied models of user interaction to predict preference relationships and to improve ranking functions. They use many features beyond clickthrough rate, and show that they can learn preference relationships using these features. Our work is superficially similar, but we explicitly model dependencies among clicks for results at different ranks with the purpose of learning probabilistic relevance judgments. These relevance judgments are a stronger result than a preference ordering, since a preference ordering can be derived from them. In addition, given a strong probabilistic model of relevance from clicks, better combined models can be built. Dupret et al. [7] give a theoretical model for the rank-position effects of click-through rate, and build theoretical models for search engine quality using them.
They do not evaluate estimates of document quality, while we empirically compare relevance estimated from clicks to manual relevance judgments. Joachims [11] investigated the use of clickthrough rates for evaluation, showing that relative differences in performance could be measured by interleaving results from two ranking functions, then observing which function produced results that are more frequently clicked. As we will show, interleaving results can change user behavior, and not necessarily in a way that will lead to the user clicking more relevant documents. Soboroff [15] proposed methods for maintaining the relevance judgments in a corpus that is constantly changing. Aslam et al. [3] investigated minimum-variance unbiased estimators of system performance, and Carterette et al. [5] introduced the idea of treating an evaluation measure as a random variable with a distribution over all possible relevance judgments. This can be used to create an optimal sampling strategy to obtain judgments, and to estimate the confidence in an evaluation measure. We extend their methods to DCG.

3 Evaluating Search Engines

Search results are typically evaluated using Discounted Cumulative Gain (DCG) [9]. DCG is defined as the sum of the "gain" of presenting a particular document times a "discount" for presenting it at a particular rank, up to some maximum rank $\ell$:

DCG_\ell = \sum_{i=1}^{\ell} gain_i \cdot discount_i

For web search, "gain" is typically a relevance score determined from a human labeling, and "discount" is the reciprocal of the log of the rank, so that putting a document with a high relevance score at a low rank results in a much lower discounted gain than putting the same document at a high rank:

DCG_\ell = rel_1 + \sum_{i=2}^{\ell} \frac{rel_i}{\log_2 i}

The constants $rel_i$ are the relevance scores. Human assessors typically judge documents on an ordinal scale, with labels such as "Perfect", "Excellent", "Good", "Fair", and "Bad". These are then mapped to a numeric scale for use in DCG computation. We will denote the five levels of relevance $a_j$, with $a_1 > a_2 > a_3 > a_4 > a_5$. In this section we will show that we can compare ranking functions without having labeled all the documents.

3.1 Estimating DCG from Incomplete Information

DCG requires that the ranked documents have been judged with respect to a query. If the index has recently been updated, or a new algorithm is retrieving new results, we have documents that have not been judged. Rather than ask a human assessor for a judgment, we may be able to infer something about DCG based on the judgments we already have.

Let $X_i$ be a random variable representing the relevance of document $i$. Since relevance is ordinal, the distribution of $X_i$ is multinomial. We will define $p_{ij} = p(X_i = a_j)$ for $1 \le j \le 5$, with $\sum_{j=1}^{5} p_{ij} = 1$. The expectation of $X_i$ is $E[X_i] = \sum_{j=1}^{5} p_{ij} a_j$, and its variance is $Var[X_i] = \sum_{j=1}^{5} p_{ij} a_j^2 - E[X_i]^2$. We can then express $DCG_\ell$ as a random variable:

DCG_\ell = X_1 + \sum_{i=2}^{\ell} \frac{X_i}{\log_2 i}

Its expectation and variance are:

E[DCG_\ell] = E[X_1] + \sum_{i=2}^{\ell} \frac{E[X_i]}{\log_2 i}    (1)

Var[DCG_\ell] = Var[X_1] + \sum_{i=2}^{\ell} \frac{Var[X_i]}{(\log_2 i)^2} + 2\sum_{i=2}^{\ell} \frac{Cov(X_1, X_i)}{\log_2 i} + 2\sum_{2 \le i < j \le \ell} \frac{Cov(X_i, X_j)}{\log_2 i \cdot \log_2 j}
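As a concrete illustration of equation (1) and the variance expression above, the following Python sketch computes the expected DCG of a ranked list from per-document multinomial relevance distributions, a simplified variance that assumes the $X_i$ are independent (dropping the covariance terms retained above), and a Monte Carlo estimate of the probability that one ranking's DCG exceeds another's. The numeric gain values and the example distributions are illustrative assumptions, not values taken from the paper.

```python
import math
import random

# Relevance grades a_1 > ... > a_5 (e.g. Perfect..Bad) mapped to integers.
# These specific numeric gains are illustrative, not the paper's mapping.
GRADES = [4, 3, 2, 1, 0]

def expected_dcg(distributions, depth=None):
    """Expected DCG when document i's relevance is multinomial over GRADES.

    distributions: list of probability vectors p_i, where p_i[j] = p(X_i = a_j).
    A judged document is a point mass (probability 1 on its label).
    Implements E[DCG] = E[X_1] + sum_{i>=2} E[X_i] / log2(i), i.e. equation (1).
    """
    depth = depth or len(distributions)
    total = 0.0
    for rank, p in enumerate(distributions[:depth], start=1):
        e_x = sum(pj * a for pj, a in zip(p, GRADES))
        total += e_x if rank == 1 else e_x / math.log2(rank)
    return total

def variance_dcg_independent(distributions, depth=None):
    """Var[DCG] under the simplifying assumption that the X_i are independent,
    so all covariance terms vanish.  The paper's model keeps those covariances;
    this sketch drops them only to stay short."""
    depth = depth or len(distributions)
    total = 0.0
    for rank, p in enumerate(distributions[:depth], start=1):
        e_x = sum(pj * a for pj, a in zip(p, GRADES))
        var_x = sum(pj * a * a for pj, a in zip(p, GRADES)) - e_x ** 2
        total += var_x if rank == 1 else var_x / (math.log2(rank) ** 2)
    return total

def prob_a_beats_b(dists_a, dists_b, samples=10000, seed=0):
    """Monte Carlo estimate of P(DCG_A > DCG_B), again assuming independence:
    sample a grade for every document and compare the two realized DCGs."""
    rng = random.Random(seed)
    def draw_dcg(dists):
        dcg = 0.0
        for rank, p in enumerate(dists, start=1):
            a = rng.choices(GRADES, weights=p)[0]
            dcg += a if rank == 1 else a / math.log2(rank)
        return dcg
    wins = sum(draw_dcg(dists_a) > draw_dcg(dists_b) for _ in range(samples))
    return wins / samples

# Example: rank 1 is judged (a point mass); ranks 2-3 are unjudged, with
# relevance distributions of the kind a click model would supply.
ranking_a = [[0, 1, 0, 0, 0], [0.1, 0.2, 0.3, 0.2, 0.2], [0.05, 0.1, 0.2, 0.3, 0.35]]
ranking_b = [[0, 0, 1, 0, 0], [0.2, 0.3, 0.3, 0.1, 0.1], [0.1, 0.2, 0.3, 0.2, 0.2]]
print(expected_dcg(ranking_a), variance_dcg_independent(ranking_a))
print(prob_a_beats_b(ranking_a, ranking_b))
```

A point mass for judged documents makes their contribution to the variance zero, which is one way to see why a few well-chosen additional judgments can sharply increase confidence in the sign of a DCG difference.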