Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents

Harr Chen, MIT CSAIL, Cambridge, MA 02139, USA, harr@csail.mit.edu
David R. Karger, MIT CSAIL, Cambridge, MA 02139, USA, karger@csail.mit.edu

ABSTRACT
Traditionally, information retrieval systems aim to maximize the number of relevant documents returned to a user within some window of the top. For that goal, the probability ranking principle, which ranks documents in decreasing order of probability of relevance, is provably optimal. However, there are many scenarios in which that ranking does not optimize for the user's information need. One example is when the user would be satisfied with some limited number of relevant documents, rather than needing all relevant documents. We show that in such a scenario, an attempt to return many relevant documents can actually reduce the chances of finding any relevant documents. We consider a number of information retrieval metrics from the literature, including the rank of the first relevant result, the %no metric that penalizes a system only for retrieving no relevant results near the top, and the diversity of retrieved results when queries have multiple interpretations. We observe that given a probabilistic model of relevance, it is appropriate to rank so as to directly optimize these metrics in expectation. While doing so may be computationally intractable, we show that a simple greedy optimization algorithm that approximately optimizes the given objectives produces rankings for TREC queries that outperform the standard approach based on the probability ranking principle.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval--Retrieval Models
General Terms: Algorithms
Keywords: Information Retrieval, Formal Models, Machine Learning, Subtopic Retrieval

1. INTRODUCTION
It is a common rule of thumb that the Probability Ranking Principle (PRP) is "optimal." Under reasonable assumptions, one can prove that ranking documents in descending order by their probability of relevance yields the maximum expected number of relevant documents, and thus maximizes the expected values of the well-known precision and recall metrics [14].

Simply returning as many relevant documents as possible, however, is not the only possible goal. For example, since a single relevant result often provides "the answer" to the user's query, we might be concerned only with whether our system returns any relevant results near the top. This is plausible for question answering, or for finding a homepage. It also captures a notion of "bare minimum success" that can be meaningful for hard queries. The TREC robust track, focusing on such hard queries, defines and uses the %no metric--the fraction of test queries on which a system returned no relevant results in the top ten [20]. As we shall argue below, the probability ranking principle is not optimal in such a case. For if the system's model of relevance is wrong, it will be wrong over and over again, returning an entire list of irrelevant documents--one might say that the expected number of relevant documents is large, but the variance in the outcome is also high.

Similarly, it is common wisdom that some queries such as "Trojan horse" can express multiple, distinct information needs--about a type of malware, a Trojan war artifact, or other minor usages. A PRP-based approach may choose one "most likely" interpretation of the query, and provide results that satisfy only that interpretation, leaving users with rarer interpretations unsatisfied.

Given that we have stated a clear metric (success in finding at least one relevant document), we argue that under a probabilistic model of document relevance, there is a particularly natural approach to designing a retrieval algorithm for it--namely, to rank documents so as to optimize the expected value of the metric. In particular, we should rank so as to maximize the probability of finding a relevant document among the top n. While exactly optimizing this quantity is NP-hard, we derive a greedy heuristic for approximately optimizing it. Intriguingly, our greedy algorithm can be seen as a kind of blind negative relevance feedback, in which we fill each position in the ranking by assuming that all previous documents in the ranking are not relevant.

We demonstrate that our approach is effective in practice. We evaluate the performance of our greedy algorithm on queries from various TREC corpora. We show that it retrieves at least one relevant document more often than the traditional ranking (with statistical significance). We give special attention to the robust track, where one of the goals is to minimize the chance of returning no relevant results, and show that our algorithm does well.

In addition to the robust track's %no metric, we consider a number of other standard metrics from the literature. For example, we might be interested in how far down the ranked result list we must go to find the first relevant document. The search length (SL) [4] and reciprocal rank (RR) [15] metrics measure this quantity in different ways. On the other hand, if we believe that the results of a query may have different "subtopics" (facets of the query), or that multiple queriers might have different relevance judgments, we might want to ensure that a result set offers good "coverage" of the different possibilities. The instance recall metric [9, 21] measures the number of different subtopics or queriers who are satisfied by a given result set.

One can apply our approach, of ranking to optimize the expected value of the metric, to all of these metrics. For each metric the exact optimization problem is different, and in each case it appears intractable. In a fortunate coincidence, however, our greedy algorithm for the %no metric is also a natural greedy algorithm for all of the metrics listed above. We report results for all of these metrics over TREC corpora, and show that our greedy algorithm outperforms the PRP baseline on them. Conversely, our analysis leads us to the possibly surprising conclusion that PRP ranking is not optimal for the heavily used mean average precision metric. We also explore the goal of perfect precision, where the objective is to not retrieve any irrelevant documents. We show how blind positive relevance feedback arises naturally as a greedy heuristic for achieving this goal.
To tie together the disparate goals of nonzero precision (at least one relevant document) and perfect precision, we introduce k-call, a class of metrics that ranges smoothly between the two extremes. We argue that it captures a desire to trade "quality" for "diversity," and discuss the application of our approach to these metrics.

The broad applicability of using probabilistic models to optimize for specific metrics suggests a general principle, which we call the Expected Metric Principle (EMP). The EMP states that, in a probabilistic context, one should directly optimize for the expected value of the metric of interest. The PRP is a special case of the EMP for the precision and recall metrics. One possible criticism of the EMP is that it "teaches to the test"--it encourages the algorithm to do well only on the evaluation criterion. This has led to much gaming of the SPECFP benchmarks for numerical computation, for example, as companies incorporated special-purpose code for solving only the SPECFP instances. But this gaming is a consequence of the evaluation criterion failing to accurately measure what is wanted out of the system. We argue that the metrics that we study are truly the metrics that matter in certain cases, so that algorithms optimized for them are desirable.

1.1 Retrieving for Diversity
Intriguingly, while explicitly aiming only to find one relevant document, we demonstrate the unplanned effect of increasing the diversity of documents at the top. This highlights one way in which seeking one relevant document is different from seeking many. If a query has multiple interpretations (as was the case for "Trojan horse" above), or if there are multiple subtopics, it may be hard to decide which is the proper one. PRP ranking puts all its eggs in one basket--it identifies the most likely interpretation, and finds many results for that one. But an algorithm that needs only one relevant document can do better by retrieving one document for each case, thus satisfying the goal whichever interpretation or subtopic is desired. Recent work [3, 21] has developed heuristics for increasing diversity for this precise purpose, but our approach appears to be the first in which diversity arises automatically as a consequence of the objective function rather than being manually optimized as a proxy for the true objective of interest. As a benefit, there are no new "parameter knobs," beyond those already used in probabilistic document models, that must be tweaked in order to tune our algorithms.

We give anecdotal evidence that our approach promotes diversity by looking at ambiguous queries on the Google search engine. We observe that while the probability ranking principle tends to return documents only relevant to the "majority vote" meaning of the query, our approach satisfies that meaning but simultaneously returns results relevant to other, rarer meanings of the query. We follow with more quantitative evidence based on TREC results with multiple raters, where our approach satisfies more raters than PRP, and TREC results with subtopic annotations, where our approach retrieves more subtopics than PRP.

2. RELATED WORK
Our discussion of related work splits into three categories: definitions of and motivations for retrieval metrics, algorithms for optimizing those metrics, and approaches to diversity in result sets.

2.1 Beyond Precision and Recall
The main metric we examine is essentially the %no metric, which is studied by Voorhees [19]. She finds that the metric is less stable than traditional measures. However, this instability does not affect our ability to probabilistically model and optimize for it.

Cooper [4], who introduces the search length metric, argues that trying to retrieve as many documents as possible is not necessarily the appropriate objective for meeting user information need. Cooper explicitly divides an information request into a "relevance description" (i.e., a query) and a quantification that specifies the desired number of relevant results. He defines a class of search length metrics, which measure the number of irrelevant documents a user would have to examine before finding a "sufficient" number of relevant documents. Our paper focuses on the case of "one document sufficiency," though we also touch on the "k document sufficiency" case when we define k-call later in the paper.

Shah and Croft [15] explore the problem of high accuracy retrieval, where the objective is to have high precision in the top document ranks. They argue that mean reciprocal rank is a useful metric for this scenario. As previously mentioned, we also demonstrate the applicability of our heuristics to MRR.

2.2 Algorithms
Our approach fits within a general risk minimization framework propounded by Zhai and Lafferty [22]. They observed that one could define an arbitrary numeric loss function over possible returned document rankings, which measures how unhappy the user is with that set. The loss function generally depends on unavailable knowledge about the relevance of particular documents. But given a probabilistic model, one can compute an expected value for the loss function, or expected loss, and return a result that optimizes the expected loss. Much of our paper deals with the loss function that is (say) -1 when the top ten results contain a relevant document (indicating positive satisfaction) and 0 when they do not.

Like us, Gao et al. [6] follow the approach of letting the metric directly drive the retrieval algorithm. However, instead of using a document model from which the optimal algorithm can be determined through analysis, they train a system to weight document features so as to optimize the metric (average precision in their case) and show that such training leads to an algorithm that achieves good performance on the metric with new queries.

Bookstein [1] proposes a sequential learning retrieval system that bears some similarity to ours. He argues that a retrieval system should sequentially select documents according to the probability of relevance conditioned on the selection and relevance of previous documents (essentially relevance feedback). However, his procedure requires explicit user feedback after each result retrieved, whereas our system proposes an objective function and then uses a sequential document selection algorithm to heuristically optimize that objective without further user input.

Our greedy algorithm for achieving perfect precision seems related to pseudo-relevance feedback, an approach commonly used in the literature to improve overall retrieval performance on standard metrics [2, 5]. Our metric for retrieving at least one relevant document, on the other hand, produces an algorithm that appears to be doing negative pseudo-relevance feedback. In either case, rather than feeding back all of the top documents at once, we progressively feed back more and more of the top documents when selecting later-ranked documents.

2.3 Diversity
In their subtopic retrieval work, Zhai et al. [21] posit, as we do, that there may be more than one meaningful interpretation of a query. They assume that a query may have different subtopic interpretations, and reorder the results so that the top includes some results from each subtopic. Their system involves separate consideration of novelty and redundancy in a result set, which are then combined via a cost function. Our approach, in contrast, aims directly at the goal of maximizing the chances that the user will get an answer to "their" interpretation of the query. Aiming directly is arguably beneficial in that it reduces the number of system elements, such as novelty and redundancy, whose interactions we have to design and tweak. Conversely, it is possible that by modeling novelty and redundancy richly, the Zhai et al. model can outperform our simpler one.

The work of Zhai et al. is in turn based on Carbonell and Goldstein's [3] maximum marginal relevance (MMR) ranking function. They argue for the value of diversity or "relevant novelty" in the results of a query, and propose MMR as an objective that introduces such diversity in a ranked result set. Our greedy heuristic for optimizing the "one relevant document" objective simplifies to a computation that bears some relation to the MMR computation. However, MMR is advanced as a heuristic algorithm for reducing redundancy and achieving the hard-to-define notion of diversity, which in turn is believed to be related to the desired objective. Our ranking algorithm arises naturally from the application of a simple greedy heuristic to the optimization of a clear, natural, formally defined objective function. In addition, while the iterative greedy approach is implicit in the definition of MMR, our greedy approach is simply one heuristic applied to optimizing our well-defined objective function; we expect that better optimization algorithms such as local search would yield improved values for our objective, which should translate into improved retrieval performance.

Our goal of retrieving one relevant document, and its inherent diversifying tendency, bears superficial similarity to clustering, in the sense that clustering is also used as an approach to quickly cover a diverse range of query interpretations [10]. Our technique sidesteps the need for clustering interface machinery, utilizing the standard ranked list of documents instead. Furthermore, we aim to directly optimize the probability that a user finds a relevant document, rather than going through the intermediate notion of separate document clusters. Again, this avoids the need to define and tweak new algorithmic parameters.

3. EVALUATION METRICS
We consider several metrics in this paper. Search length [4] (cf. section 2) is the rank of the first relevant document in a result list minus one, and reciprocal rank [15] is one over the rank of the first relevant result. With multiple queries we can take the mean of both quantities, yielding the mean search length (MSL) and mean reciprocal rank (MRR) metrics.

Introduced for the TREC robust track [20], the %no metric measures the percentage of queries for which no relevant documents are retrieved. Put another way, it assigns value to any result set containing at least one relevant document. In line with Cooper's [4] notion of quantification, we generalize the %no metric with a new class of binary metrics under the name k-call at n. Given a ranked list, k-call at n is one if at least k of the top n documents returned by the retrieval system for the given query are deemed relevant. Otherwise, k-call at rank n is zero. In particular, 1-call is one if a relevant document is found and zero otherwise. Averaging over multiple queries yields mean 1-call, which is just one minus the %no metric used in the TREC robust track. On the other hand, n-call at n is a measure of perfect precision: returning only relevant documents. Varying k between n and 1 offers a way to express "risk tolerance": do we wish to aim for many relevant documents and take the chance of finding none, or will we settle for fewer documents if it improves our chances of finding them?

When we explicitly have a notion of different subtopics of a query, and of different documents covering different subtopics, we can define instance recall [9] at rank n (also called S-recall [21]) as the number of unique subtopics covered by the first n results, divided by the total number of subtopics.
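To make these definitions concrete, the following sketch computes k-call at n, search length, reciprocal rank, and instance recall for a single ranked result list with known relevance judgments. It is purely illustrative: the function names, the toy judgments, and the list-of-subtopic-sets representation are ours, not the paper's or TREC's.

def k_call_at_n(relevance, k, n):
    # 1 if at least k of the top n results are relevant, else 0.
    # `relevance` is a list of 0/1 judgments in rank order.
    return 1 if sum(relevance[:n]) >= k else 0

def search_length(relevance):
    # Number of irrelevant documents seen before the first relevant one,
    # i.e., the rank of the first relevant document minus one.
    for rank, rel in enumerate(relevance):
        if rel:
            return rank
    return len(relevance)  # no relevant document found

def reciprocal_rank(relevance):
    # One over the (1-based) rank of the first relevant document; 0 if none.
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def instance_recall(subtopics_per_result, total_subtopics, n):
    # Unique subtopics covered by the top n results, divided by the total.
    covered = set()
    for subtopics in subtopics_per_result[:n]:
        covered.update(subtopics)
    return len(covered) / total_subtopics

rels = [0, 1, 0, 1, 1]                  # toy judgments for one query
print(k_call_at_n(rels, k=1, n=5))      # 1-call: 1
print(k_call_at_n(rels, k=5, n=5))      # 5-call (perfect precision): 0
print(search_length(rels))              # 1
print(reciprocal_rank(rels))            # 0.5
print(instance_recall([{1}, {1, 2}, set(), {3}, {2}], 3, n=5))  # 1.0

Averaging these per-query values over a set of queries yields mean 1-call (i.e., one minus %no), MSL, and MRR.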
4. BAYESIAN RETRIEVAL
Our work is rooted in standard Bayesian information retrieval techniques [11, 16]. In this approach, we assume that there are two distinct probability distributions that generate the relevant and irrelevant documents respectively. Let d be a document, and r a binary variable indicating the relevance of that document. The probability ranking principle suggests that documents in a corpus should be ranked by Pr[r | d]--that is, the likelihood that a document was generated by the relevant distribution. An application of Bayes' Rule followed by a monotonic transformation gives us a ranking value for documents:

Pr[d | r] / Pr[d | ¬r].    (1)

Here, Pr[d | r] and Pr[d | ¬r] represent respectively the probabilities that the relevant and irrelevant distributions assign to the document. We thus need to compute Pr[d | r] and Pr[d | ¬r]. In our paper we emphasize the objective function, rather than the modeling issues associated with Bayesian retrieval. Therefore we use the familiar and simplistic Naïve Bayes framework, with multinomial models as the family of distributions. A document is thus a set of independent draws from a word distribution over the corpus. A multinomial distribution is described by parameters θi, one for each term (word) i in the corpus. A document's probability is the product of its terms' corresponding θi, normalized so the distribution sums to one. In our experiments, we used the heuristic of applying a log-transformation to the term frequencies (that is, substituting log(1 + ti) for ti), which has been shown in the literature to improve Naïve Bayes performance for text applications [13].

It remains to determine the parameters θi for each distribution. To model the fact that we do not know exactly what terms appear in the relevant and irrelevant distributions, we specify a prior probability distribution over the parameters (a distribution over possible document distributions). The prior reflects our initial beliefs (e.g., that a given θi parameter is likely to be small). We proceed to revise our beliefs about those parameters by incorporating observed data. We use a standard Dirichlet prior, centered on the background word distribution over the entire corpus. We then take the user query as "training data"--a sample from the relevant document distribution that gives evidence about the parameters of that distribution. This evidence leads us to a new posterior estimate of the probability of the parameters of the relevant distribution (in particular, one in which the query terms' parameters are likely to be large). Given these two distributions, we are able to measure the probabilities that certain sets of documents are relevant or irrelevant.

For our baseline PRP model, this is all the training we do. As we will show later, our new EMP-based algorithms will feed back selected corpus documents into the document distributions as additional "training data." Our use of priors creates a "linkage" between documents. Although each document is assumed to be generated independently from its (relevant or irrelevant) distribution, positing one document to be relevant leads us to believe that the parameters associated with that document's words are stronger in the relevant distribution, which in turn leads us to believe that similar documents are more likely to be relevant.
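As a concrete illustration of this model, the sketch below scores documents by the logarithm of ranking value (1), using multinomial term distributions whose parameters are posterior mean estimates under a Dirichlet prior centered on the background corpus distribution, with the query terms treated as observations for the relevant distribution and the log(1 + tf) transformation applied to document term frequencies. The helper names, the toy corpus, and the concentration parameter mu are illustrative assumptions, not the paper's exact configuration.

import math
from collections import Counter

def background_dist(corpus):
    # Background word distribution over the whole corpus; it centers the Dirichlet prior.
    counts = Counter()
    for doc in corpus:
        counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, set(counts)

def posterior_theta(observed, background, vocab, mu=2000.0):
    # Posterior mean of the multinomial parameters theta_i under a Dirichlet
    # prior centered on the background distribution, with concentration mu (assumed).
    total = sum(observed.values())
    return {w: (observed.get(w, 0.0) + mu * background[w]) / (total + mu)
            for w in vocab}

def ranking_value(doc, theta_rel, theta_irr):
    # log of (1): log Pr[d | r] - log Pr[d | not r], using log(1 + tf) counts.
    tf = {w: math.log(1.0 + c) for w, c in Counter(doc).items() if w in theta_rel}
    return sum(c * (math.log(theta_rel[w]) - math.log(theta_irr[w]))
               for w, c in tf.items())

corpus = [["trojan", "horse", "malware", "virus"],      # toy documents
          ["trojan", "horse", "troy", "war"],
          ["troy", "war", "greek", "myth"]]
bg, vocab = background_dist(corpus)
query = ["trojan", "horse"]

theta_rel = posterior_theta(Counter(query), bg, vocab)  # prior + query as training data
theta_irr = posterior_theta(Counter(), bg, vocab)       # prior alone, so far

prp_ranking = sorted(range(len(corpus)),
                     key=lambda i: ranking_value(corpus[i], theta_rel, theta_irr),
                     reverse=True)
print(prp_ranking)   # baseline PRP ordering by ranking value (1)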
5. OBJECTIVE FUNCTION
In many systems, the evaluation metric (e.g., mean reciprocal rank) is different from the objective function used to rank documents (e.g., probability of relevance). The EMP posits that the right objective function is the (expected value of the) evaluation metric itself.

Consider optimizing for the k-call at n metric. Since k-call is always 0 or 1, this is equivalent to maximizing the probability that we find k relevant documents among the first n. Let di denote the ith document of the ranked result set, and ri denote a binary variable indicating that di is relevant. Result numbering is 0-based, so the first result is d0. The k = 1 version of our objective function is the probability that at least one of the first n relevance variables is true, that is:

Pr[r0 ∨ r1 ∨ · · · ∨ rn-1 | d0, d1, . . . , dn-1].    (2)

In general, the objective function for arbitrary k is the probability that at least k documents are relevant, that is:

Pr[at least k of r0, . . . , rn-1 | d0, d1, . . . , dn-1].    (3)

Contrast these objectives with the PRP ranking by Pr[r | d]--by defining an objective in line with the metric, we are explicitly aiming for results that the metric rewards. Note that our objectives are indifferent to the ordering of documents within the top n. This is to be expected--because our metric is insensitive to where the relevant results are within the top n, just that there are enough of them, our objective function will be insensitive to the same conditions. The next two sections discuss the optimization and calculation of the objective function in depth.
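To illustrate what objectives (2) and (3) compute, the sketch below evaluates the expected 1-call and expected k-call of a candidate top-n set under the simplifying (and, for this illustration only, assumed) condition that the relevance variables ri are independent given the documents; in the paper's model the shared priors make them dependent, which is exactly what the following sections exploit.

from itertools import combinations

def expected_one_call(rel_probs):
    # Pr[r0 or r1 or ... | d0, ..., dn-1] under the independence assumption:
    # one minus the probability that every document is irrelevant.
    p_none = 1.0
    for p in rel_probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def expected_k_call(rel_probs, k):
    # Pr[at least k relevant] under the same assumption, by enumerating which
    # documents are relevant. Exponential in n, so only usable for small n.
    n = len(rel_probs)
    total = 0.0
    for size in range(k, n + 1):
        for rel_set in combinations(range(n), size):
            p = 1.0
            for i in range(n):
                p *= rel_probs[i] if i in rel_set else (1.0 - rel_probs[i])
            total += p
    return total

probs = [0.7, 0.4, 0.3]              # hypothetical Pr[ri | di] for a top-3 set
print(expected_one_call(probs))      # 1 - 0.3 * 0.6 * 0.7 = 0.874
print(expected_k_call(probs, k=2))   # probability of at least two relevant results

Note that, as the text observes, both quantities depend only on which documents are in the top n, not on their order.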
6. OPTIMIZATION METHODS
Notably, while the PRP objective could be optimized by selecting each document independently, our new objectives (equations 2 and 3) seem to be more complex, requiring us to consider interactions between multiple documents in the result set. It no longer seems possible to judge each document individually, so more complex optimization algorithms are needed.

One way to perfectly optimize the k-call at rank n objective function for a corpus of m documents would be to evaluate, for each possible returned set of n documents, the probability that that set has at least k relevant documents. For any specific set of n documents this evaluation is tractable, but the tremendous number of (m choose n) distinct subsets makes this approach impractical for most reasonable values of m and n. In general, finding the optimum subset is NP-hard. Space limitations preclude a full proof, but one can show that by assigning specific weights to the distributions, one can reduce the NP-hard clique problem to optimizing the expected k-call. Since solving our problem would let us solve an NP-hard problem, our problem is NP-hard as well, implying that exactly optimizing our objective is intractable.

Therefore we consider a greedy approach that reduces the search space of result sets. A greedy algorithm is one that always selects a locally optimal intermediary to a solution. Greedy algorithms tend to be simple approaches that work well in a variety of contexts. A greedy algorithm for our problem is to successively select each result of the result set. Consider finding the optimal result set for k-call at rank n. We select the first result by applying the conventional probability ranking principle. Each result thereafter is selected in sequence. For the ith result, we hold results 1 through i - 1 to their already selected values, and consider all remaining corpus documents as possibilities for document i. We calculate an expected k-call score for the result set including each such document, and pick the highest scoring document as the ith result. If i < k, we maximize the i-call score instead as a stepping stone towards maximizing k-call.

Unlike PRP ranking, optimizing our objective function exactly may require knowing both k and n. That is, the set of 10 documents optimizing the odds of getting a relevant document in the top 10 need not contain the 5 documents optimizing the odds of getting a relevant document in the top 5. Thus, it might not be possible to simultaneously optimize these two quantities. Our greedy heuristic, on the other hand, is not affected by the value of n that we choose; thus, while it may not be returning the best document subset for any particular n, it may arguably be returning a ranking that is reasonably good for all n.

If our goal were to maximize precision and recall, then the natural greedy approach would be to select each successive document to maximize its probability of relevance (independent of the previously selected documents). This is exactly the PRP ranking mechanism. In this sense, the greedy algorithm we have proposed is a generalization of the greedy algorithm that optimizes for PRP.
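The following sketch spells out this greedy procedure. It assumes two black-box helpers, score_prp(d) for the PRP relevance score of a document and expected_k_call(prefix, k) for the expected k-call of a candidate prefix under whatever probabilistic model is in use (for instance, the model of Section 4); both names are ours, introduced only for illustration.

def greedy_rank(doc_ids, n, k, score_prp, expected_k_call):
    # Greedily build a top-n ranking that heuristically maximizes expected k-call.
    remaining = set(doc_ids)
    first = max(remaining, key=score_prp)   # first result: conventional PRP choice
    selected = [first]
    remaining.remove(first)
    for i in range(1, n):
        # While fewer than k positions are filled, maximize i-call as a
        # stepping stone towards k-call, as described above.
        k_eff = min(k, i + 1)
        best = max(remaining,
                   key=lambda d: expected_k_call(selected + [d], k_eff))
        selected.append(best)
        remaining.remove(best)
    return selected

With k = 1, the inner maximization reduces to the conditional relevance computation derived in Section 7.1 below, so the expected k-call never has to be evaluated by brute force.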
7. APPLYING THE GREEDY APPROACH
In this section we examine how we would use the greedy algorithm described previously to actually optimize our objective function. We focus on the k = 1 and k = n cases, where the greedy algorithm has a particularly simple instantiation. We also touch on the general case for intermediate values of k.

7.1 k = 1
Consider the case where k = 1. The first result is obtained in the conventional fashion--by choosing the document d0 maximizing Pr[r0 | d0]. Having chosen the first document d0, we want to select the second document d1 so as to maximize Pr[r0 ∨ r1 | d0, d1]. This is merely an instantiation of equation 2 with n = 2. We can expand this expression by partitioning the event of interest r0 ∨ r1 into the disjoint events r0 and r1 ∧ ¬r0:

Pr[r0 ∨ r1 | d0, d1] = Pr[r0 | d0, d1] + Pr[r1 ∧ ¬r0 | d0, d1]
                     = Pr[r0 | d0, d1] + Pr[r1 | d0, d1, ¬r0] · Pr[¬r0 | d0, d1]
                     = Pr[r0 | d0] + Pr[r1 | d0, d1, ¬r0] · Pr[¬r0 | d0],

where the simplification in the last line follows because r0 is independent of d1. We wish to choose the document d1 maximizing this quantity. Only one of the three probabilities in the equation depends on d1, however, so it is sufficient to maximize that term:

Pr[r1 | ¬r0, d0, d1].

A similar analysis shows that we can select the third result by maximizing Pr[r2 | ¬r0, ¬r1, d0, d1, d2]. In general, we can select the optimal ith document in the greedy approach by choosing the document di that maximizes:

Pr[ri | ¬r0, . . . , ¬ri-1, d0, . . . , di].

This expression tells us that for each new result, we should assume that all the past results were irrelevant, and find the document of greatest relevance conditioned on that assumption. This makes intuitive sense--if a previous document were able to satisfy the user query (i.e., was relevant), then we would not care about what documents were displayed subsequently. Thus we try to select the best new document assuming that all previous documents were irrelevant. This formula also fits nicely with the Bayesian information retrieval model: the assumption that the previous documents are irrelevant is incorporated in a straightforward fashion as an update to the probability distribution associated with the irrelevant documents; the relevance probabilities of new documents are then computed using that updated irrelevant document distribution.

7.2 k = n
We can perform a similar greedy-algorithm derivation for the case where k = n. In that case, we find that we should select the ith document according to:

Pr[ri | r0, . . . , ri-1, d0, . . . , di].

Again this is intuitive--if we want to maximize the odds of perfect precision, then once we select even one irrelevant document we have failed; thus, we must forge ahead on the assumption that all documents ranked so far are relevant. As in the k = 1 case, this leads to a simple update rule for the prior probability distributions. Because these simplified forms for k = 1 and k = n do not involve addition of probabilities, they have the advantage that we can use the ranking value form of Pr[r | d] (equation 1) rather than the full Bayesian expansion, just as in PRP.

7.3 1
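As a supplement to Sections 7.1 and 7.2, the sketch below shows one way these update rules could sit on top of the Dirichlet-smoothed Naïve Bayes sketch given after Section 4 (it reuses posterior_theta and ranking_value from there). For k = 1 each selected document's words are folded into the irrelevant distribution before the next position is scored; for k = n they are folded into the relevant distribution instead. This is an assumption-laden reading of the description above, not the paper's exact implementation.

from collections import Counter

def greedy_rank_feedback(corpus, query, bg, vocab, n, assume="irrelevant"):
    # assume="irrelevant": Section 7.1 rule for k = 1 (blind negative feedback).
    # assume="relevant":   Section 7.2 rule for k = n (blind positive feedback).
    rel_counts = Counter(query)   # query as training data for the relevant side
    irr_counts = Counter()        # irrelevant side starts from the prior alone
    remaining, ranking = set(range(len(corpus))), []
    for _ in range(n):
        theta_rel = posterior_theta(rel_counts, bg, vocab)
        theta_irr = posterior_theta(irr_counts, bg, vocab)
        best = max(remaining,
                   key=lambda i: ranking_value(corpus[i], theta_rel, theta_irr))
        ranking.append(best)
        remaining.remove(best)
        # Blind feedback: fold the chosen document back into one of the distributions.
        (irr_counts if assume == "irrelevant" else rel_counts).update(corpus[best])
    return ranking

On the first iteration the irrelevant side is just the prior, so the top document is the ordinary PRP choice; every later pick is scored against a distribution that has absorbed everything ranked above it, which is the blind negative (or positive) relevance feedback reading given above.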