Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

Ben Carterette
Center for Intelligent Information Retrieval
University of Massachusetts Amherst
Amherst, MA 01003
carteret@cs.umass.edu

Rosie Jones
Yahoo! Research
3333 Empire Ave
Burbank, CA 91504
jonesr@yahoo-inc.com

Abstract

We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods, which can provide only pairwise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence.

¹ Work done while author was at Yahoo! Submitted for confidential review to be considered for publication in NIPS, December 4-9, 2007.

1 Introduction

Web search engine evaluation is an expensive process: it requires relevance judgments that indicate the degree of relevance of each document retrieved for each query in a testing set. In addition, reusing old relevance judgments to evaluate an updated ranking function can be problematic, since documents disappear or become obsolete, and the distribution of queries entered changes [15]. Click data from web searchers, used in aggregate, can provide valuable evidence about the relevance of each document. The general problem with using clicks as relevance judgments is that clicks are biased. They are biased to the top of the ranking [12], to trusted sites, and to attractive abstracts; they are also biased by the type of query and by other things shown on the results page. To cope with this, we introduce a family of models relating clicks to relevance. By conditioning on clicks, we can predict the relevance of a document or a set of documents.

Joachims et al. [12] used eye-tracking devices to track what documents users looked at before clicking. They found that users tend to look at results ranked higher than the one they click on more often than they look at results ranked lower, and this information can in principle be used to train a search engine using these "preference judgments" [10]. The problem with using preference judgments inferred from clicks for learning is that they will tend to learn to reverse the list: a click at the lowest rank is preferred to everything else, while a click at the highest rank is preferred to nothing else. Radlinski and Joachims [13] suggest an antidote to this: randomly swapping adjacent pairs of documents. This ensures that users will not prefer document i to document i + 1 solely because of rank.
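To make the swapping idea concrete, the following minimal Python sketch flips non-overlapping adjacent pairs with probability 1/2 before a list is shown; the function name and the specific coin-flip scheme are our own simplified illustration, not the exact procedure of Radlinski and Joachims [13].

```python
import random

def randomized_adjacent_swap(ranking, seed=None):
    """Randomly flip adjacent pairs of results before presentation, so that a
    user's preference for position i over i + 1 cannot be explained by rank
    alone.  Illustrative sketch only, not the exact algorithm of [13]."""
    rng = random.Random(seed)
    swapped = list(ranking)
    i = 0
    while i + 1 < len(swapped):
        if rng.random() < 0.5:  # flip each non-overlapping adjacent pair with probability 1/2
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        i += 2
    return swapped

# Clicks collected over many impressions of such randomized lists can be
# compared across adjacent positions without a systematic position bias.
print(randomized_adjacent_swap(["d1", "d2", "d3", "d4", "d5"], seed=7))
```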
However, we may not wish to show a suboptimal document ordering in order to acquire data. Our approach instead will be to use discounted cumulative gain (DCG [9]), an evaluation metric commonly used in search engine evaluation. Using click data, we can estimate the confidence that a difference in DCG exists between two rankings without having any relevance judgments for the documents ranked. We will show how a comparison of ranking functions can be performed when clicks are available but complete relevance judgments are not. After an initial training phase with a few relevance judgments, the relevance of unjudged documents can be predicted from clickthrough rates. The confidence in the evaluation can be estimated with the knowledge of which documents are most frequently clicked. Confidence can be dramatically increased with only a few more judiciously chosen relevance judgments.

Our contributions are (1) a formalization of the information retrieval metric DCG as a random variable; (2) analysis of the sign of the difference between two DCGs as an indication that one ranking is better than another; (3) empirical demonstration that combining click-through rates over all results on the page is better at predicting the relevance of the document at position i than just the click-through rate at position i; (4) empirically modeling the relevance of documents using clicks, and using this model to estimate DCG; (5) empirical evaluation of the comparison of different rankings using DCG derived from clicks; and (6) an algorithm for selecting a minimal number of documents for manual relevance judgment to improve the confidence in DCG over the estimate derived from clicks alone.

Section 2 covers previous work on using clickthrough rates and on estimating evaluation metrics. Section 3 describes the evaluation of web retrieval systems using the metric discounted cumulative gain (DCG) and shows how to estimate the confidence that a difference exists when relevance judgments are missing. Our model for predicting relevance from clicks is described in Section 4. We discuss our data in Section 5, and in Section 6 we return to the task of estimating relevance for the evaluation of search engines. Our experiments are conducted in the context of sponsored search, but the methods we use are general enough to translate to general web search engines.

2 Previous Work

There has been a great deal of work on low-cost evaluation in TREC-type settings ([20, 6, 16, 5] are a few), but we are aware of little for the web. As discussed above, Joachims [10, 12] and Radlinski and Joachims [13] conducted seminal work on using clicks to infer user preferences between documents. Agichtein et al. [2, 1] used and applied models of user interaction to predict preference relationships and to improve ranking functions. They use many features beyond clickthrough rate, and show that they can learn preference relationships using these features. Our work is superficially similar, but we explicitly model dependencies among clicks for results at different ranks with the purpose of learning probabilistic relevance judgments. These relevance judgments are a stronger result than a preference ordering, since a preference ordering can be derived from them. In addition, given a strong probabilistic model of relevance from clicks, better combined models can be built. Dupret et al. [7] give a theoretical model for the rank-position effects of click-through rate, and build theoretical models for search engine quality using them.
They do not evaluate estimates of document quality, while we empirically compare relevance estimated from clicks to manual relevance judgments. Joachims [11] investigated the use of clickthrough rates for evaluation, showing that relative differences in performance could be measured by interleaving results from two ranking functions, then observing which function produced results that are more frequently clicked. As we will show, interleaving results can change user behavior, and not necessarily in a way that will lead to the user clicking more relevant documents. Soboroff [15] proposed methods for maintaining the relevance judgments in a corpus that is constantly changing. Aslam et al. [3] investigated minimum-variance unbiased estimators of system performance, and Carterette et al. [5] introduced the idea of treating an evaluation measure as a random variable with a distribution over all possible relevance judgments. This can be used to create an optimal sampling strategy to obtain judgments, and to estimate the confidence in an evaluation measure. We extend their methods to DCG.

3 Evaluating Search Engines

Search results are typically evaluated using Discounted Cumulative Gain (DCG) [9]. DCG is defined as the sum of the "gain" of presenting a particular document times a "discount" for presenting it at a particular rank, up to some maximum rank $\ell$:

DCG_\ell = \sum_{i=1}^{\ell} gain_i \cdot discount_i

For web search, "gain" is typically a relevance score determined from a human labeling, and "discount" is the reciprocal of the log of the rank, so that putting a document with a high relevance score at a low rank results in a much lower discounted gain than putting the same document at a high rank:

DCG_\ell = rel_1 + \sum_{i=2}^{\ell} \frac{rel_i}{\log_2 i}

The constants $rel_i$ are the relevance scores. Human assessors typically judge documents on an ordinal scale, with labels such as "Perfect", "Excellent", "Good", "Fair", and "Bad". These are then mapped to a numeric scale for use in DCG computation. We will denote the five levels of relevance $a_j$, with $a_1 > a_2 > a_3 > a_4 > a_5$. In this section we will show that we can compare ranking functions without having labeled all the documents.

3.1 Estimating DCG from Incomplete Information

DCG requires that the ranked documents have been judged with respect to a query. If the index has recently been updated, or a new algorithm is retrieving new results, we have documents that have not been judged. Rather than ask a human assessor for a judgment, we may be able to infer something about DCG based on the judgments we already have.

Let $X_i$ be a random variable representing the relevance of document $i$. Since relevance is ordinal, the distribution of $X_i$ is multinomial. We will define $p_{ij} = p(X_i = a_j)$ for $1 \le j \le 5$, with $\sum_{j=1}^{5} p_{ij} = 1$. The expectation of $X_i$ is $E[X_i] = \sum_{j=1}^{5} p_{ij} a_j$, and its variance is $Var[X_i] = \sum_{j=1}^{5} p_{ij} a_j^2 - E[X_i]^2$. We can then express $DCG_\ell$ as a random variable:

DCG_\ell = X_1 + \sum_{i=2}^{\ell} \frac{X_i}{\log_2 i}

Its expectation and variance are:

E[DCG_\ell] = E[X_1] + \sum_{i=2}^{\ell} \frac{E[X_i]}{\log_2 i}    (1)

Var[DCG_\ell] = Var[X_1] + \sum_{i=2}^{\ell} \frac{Var[X_i]}{(\log_2 i)^2} + 2\sum_{i=2}^{\ell} \frac{Cov(X_1, X_i)}{\log_2 i} + 2\sum_{2 \le i < j \le \ell} \frac{Cov(X_i, X_j)}{\log_2 i \cdot \log_2 j}
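As a concrete illustration of equation (1) and the variance expression above, the following Python sketch computes the expected DCG of a ranked list from per-document multinomial relevance distributions, a simplified variance that assumes the $X_i$ are independent (dropping the covariance terms retained above), and a Monte Carlo estimate of the probability that one ranking's DCG exceeds another's. The numeric gain values and the example distributions are illustrative assumptions, not values taken from the paper.

```python
import math
import random

# Relevance grades a_1 > ... > a_5 (e.g. Perfect..Bad) mapped to integers.
# These specific numeric gains are illustrative, not the paper's mapping.
GRADES = [4, 3, 2, 1, 0]

def expected_dcg(distributions, depth=None):
    """Expected DCG when document i's relevance is multinomial over GRADES.

    distributions: list of probability vectors p_i, where p_i[j] = p(X_i = a_j).
    A judged document is a point mass (probability 1 on its label).
    Implements E[DCG] = E[X_1] + sum_{i>=2} E[X_i] / log2(i), i.e. equation (1).
    """
    depth = depth or len(distributions)
    total = 0.0
    for rank, p in enumerate(distributions[:depth], start=1):
        e_x = sum(pj * a for pj, a in zip(p, GRADES))
        total += e_x if rank == 1 else e_x / math.log2(rank)
    return total

def variance_dcg_independent(distributions, depth=None):
    """Var[DCG] under the simplifying assumption that the X_i are independent,
    so all covariance terms vanish.  The paper's model keeps those covariances;
    this sketch drops them only to stay short."""
    depth = depth or len(distributions)
    total = 0.0
    for rank, p in enumerate(distributions[:depth], start=1):
        e_x = sum(pj * a for pj, a in zip(p, GRADES))
        var_x = sum(pj * a * a for pj, a in zip(p, GRADES)) - e_x ** 2
        total += var_x if rank == 1 else var_x / (math.log2(rank) ** 2)
    return total

def prob_a_beats_b(dists_a, dists_b, samples=10000, seed=0):
    """Monte Carlo estimate of P(DCG_A > DCG_B), again assuming independence:
    sample a grade for every document and compare the two realized DCGs."""
    rng = random.Random(seed)
    def draw_dcg(dists):
        dcg = 0.0
        for rank, p in enumerate(dists, start=1):
            a = rng.choices(GRADES, weights=p)[0]
            dcg += a if rank == 1 else a / math.log2(rank)
        return dcg
    wins = sum(draw_dcg(dists_a) > draw_dcg(dists_b) for _ in range(samples))
    return wins / samples

# Example: rank 1 is judged (a point mass); ranks 2-3 are unjudged, with
# relevance distributions of the kind a click model would supply.
ranking_a = [[0, 1, 0, 0, 0], [0.1, 0.2, 0.3, 0.2, 0.2], [0.05, 0.1, 0.2, 0.3, 0.35]]
ranking_b = [[0, 0, 1, 0, 0], [0.2, 0.3, 0.3, 0.1, 0.1], [0.1, 0.2, 0.3, 0.2, 0.2]]
print(expected_dcg(ranking_a), variance_dcg_independent(ranking_a))
print(prob_a_beats_b(ranking_a, ranking_b))
```

A point mass for judged documents makes their contribution to the variance zero, which is one way to see why a few well-chosen additional judgments can sharply increase confidence in the sign of a DCG difference.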