On the Consistency of Ranking Algorithms

John C. Duchi (jduchi@cs.berkeley.edu), Lester W. Mackey (lmackey@cs.berkeley.edu)
Computer Science Division, University of California, Berkeley, CA 94720, USA
Michael I. Jordan (jordan@cs.berkeley.edu)
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA

Abstract

We present a theoretical analysis of supervised ranking, providing necessary and sufficient conditions for the asymptotic consistency of algorithms based on minimizing a surrogate loss function. We show that many commonly used surrogate losses are inconsistent; surprisingly, we show inconsistency even in low-noise settings. We present a new value-regularized linear loss, establish its consistency under reasonable assumptions on noise, and show that it outperforms conventional ranking losses in a collaborative filtering experiment.

The goal in ranking is to order a set of inputs in accordance with the preferences of an individual or a population. In this paper we consider a general formulation of the supervised ranking problem in which each training example consists of a query q, a set of inputs x, sometimes called results, and a weighted graph G representing preferences over the results. The learning task is to discover a function that provides a query-specific ordering of the inputs that best respects the observed preferences. This query-indexed setting is natural for tasks like web search, in which a different ranking is needed for each query. Following the existing literature, we assume the existence of a scoring function f(x, q) that gives a score to each result in x; the scores are sorted to produce a ranking (Herbrich et al., 2000; Freund et al., 2003). We assume simply that the observed preference graph G is a directed acyclic graph (DAG). Finally, we cast our work in a decision-theoretic framework in which ranking procedures are evaluated via a loss function L(f(x, q), G).
Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

It is important to distinguish the loss function used for evaluating learning procedures from the loss-like functions used to define specific methods (generally via an optimization algorithm). In prior work the former (evaluatory) loss has often been taken to be a pairwise 0-1 loss that sums the number of misordered pairs of results. Recent work has considered losses that penalize errors on more highly ranked instances more strongly. Järvelin & Kekäläinen (2002) suggest using discounted cumulative gain, which assumes that each result x_i is given a score y_i and that the loss is a weighted sum of the y_i of the predicted order. Rudin (2009) uses a p-norm to emphasize the highest-ranked instances. Here we employ a general graph-based loss L(f(x, q), G) which is equal to zero if f(x, q) obeys the order specified by G--that is, f_i(x, q) > f_j(x, q) for each edge (i → j) ∈ G, where f_i(x, q) is the score assigned to the ith object in x--and is positive otherwise. We make the assumption that L is edgewise, meaning that L depends only on the relative order of the f_i(x, q) rather than on their values. Such losses are natural in settings with feedback in the form of ordered preferences, for example when learning from click data.

Although we might wish to base a learning algorithm on the direct minimization of the loss L, this is generally infeasible due to the non-convexity and discontinuity of L. In practice one instead employs a surrogate loss that lends itself to more efficient minimization. This issue is of course familiar from the classification literature, where a deep theoretical understanding of the statistical and computational consequences of the choice of surrogate loss has emerged (Zhang, 2004; Bartlett et al., 2006). There is a relative paucity of such understanding for ranking.
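The edgewise graph loss just described can be made concrete with a short sketch. This is an illustrative assumption, not the paper's implementation: `graph_loss` and the edge-list encoding `(i, j, a_ij)` for "result i preferred to result j with penalty a_ij" are hypothetical names.

```python
# Sketch (not from the paper's code): an edgewise graph-based loss
# L(f(x,q), G) as a weighted pairwise disagreement count over a DAG.

def graph_loss(scores, edges):
    """Sum of penalties a_ij over edges (i -> j) whose order the scores violate.

    Returns 0 exactly when scores[i] > scores[j] for every edge; only the
    relative order of the scores matters, matching the edgewise property.
    """
    return sum(a for (i, j, a) in edges if scores[i] <= scores[j])

# A preference DAG over three results: 0 -> 1 and 1 -> 2, unit penalties.
dag = [(0, 1, 1.0), (1, 2, 1.0)]
print(graph_loss([3.0, 2.0, 1.0], dag))  # scores obey the DAG: zero loss
print(graph_loss([1.0, 2.0, 3.0], dag))  # both edges violated
```

Note that rescaling the scores leaves the loss unchanged, which is exactly the edgewise property the text requires of L.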
In the current paper we aim to fill this gap, taking a step toward bringing the ranking literature into line with that for classification. We provide a general theoretical analysis of the consistency of ranking algorithms that are based on a surrogate loss function.

The paper is organized as follows. In Section 1, we define the consistency problem formally and present a theorem that provides conditions under which consistency is achieved for ranking algorithms. In Section 2 we show that finding consistent surrogate losses is difficult in general, and we establish results showing that many commonly used ranking loss functions are inconsistent, even in low-noise settings. We complement this in Section 3 by presenting losses that are consistent in these low-noise settings. We finish with experiments and conclusions in Sections 4 and 5.

1. Consistency for Surrogate Losses

Our task is to minimize the risk of the scoring function f. The risk is the expected loss of f across all queries q, result sets x, and preference DAGs G:

    R(f) = E_{X,Q,G} L(f(X, Q), G).    (1)

Given a query q and result set x, we define 𝒢 to be the set of possible preference DAGs and p to be (a version of) the vector of conditional probabilities of each DAG. That is, p = [p_G]_{G ∈ 𝒢} = [P(G | x, q)]_{G ∈ 𝒢}. In what follows, we suppress the dependence of p, G, and 𝒢 on the query q and results x, as they should be clear from context. We assume that the cardinality of any result set x is bounded above by M < ∞. We further define the conditional risk of f given x and q to be

    ℓ(p, f(x, q)) = Σ_{G ∈ 𝒢} p_G L(f(x, q), G) = Σ_{G ∈ 𝒢} P(G | x, q) L(f(x, q), G).    (2)

We overload notation so that α takes the value of f(x, q) in ℓ(p, α). With this definition, we see that the risk of f is equal to E_{X,Q} Σ_{G ∈ 𝒢} P(G | X, Q) L(f(X, Q), G) = E_{X,Q} ℓ(p, f). The minimal risk, or Bayes risk, is the minimal risk over all measurable functions, R* = inf_f R(f) = E_{X,Q} inf_α ℓ(p, α).

It is infeasible to directly minimize the true risk in Eq. (1), as it is non-convex and discontinuous. As is done in classification (Zhang, 2004; Bartlett et al., 2006), we thus consider a bounded-below surrogate φ to minimize in place of L. For each G, we write φ(·, G) : R^{|G|} → R. The φ-risk of the function f is

    R_φ(f) = E_{X,Q,G}[φ(f(X, Q), G)] = E_{X,Q} Σ_{G ∈ 𝒢} P(G | X, Q) φ(f(X, Q), G),

while the optimal φ-risk is R_φ* = inf_f R_φ(f).

To develop a theory of consistency for ranking methods, we pursue a treatment that parallels that of Zhang (2004) for classification. Using the conditional risk in Eq. (2), we define a function to measure the discriminating ability of the surrogate φ. Let 𝒢(m) denote the set of possible DAGs G over m results, noting that |𝒢(m)| ≤ 3^{m(m−1)/2}. Let Δ ⊂ R^{|𝒢(m)|} denote the probability simplex. For ε ≥ 0 we define

    H_m(ε) = inf_{p ∈ Δ, α ∈ R^m} { Σ_{G ∈ 𝒢(m)} p_G φ(α, G) − inf_{α'} Σ_{G ∈ 𝒢(m)} p_G φ(α', G) : ℓ(p, α) − inf_{α'} ℓ(p, α') ≥ ε }.    (3)

H_m measures surrogate risk suboptimality as a function of true risk suboptimality. A reasonable surrogate loss should declare any setting of {p, α} suboptimal that the true loss declares suboptimal, which corresponds to H_m(ε) > 0 whenever ε > 0. We will see soon that this condition is the key to consistency. Define H(ε) = min_{m ≤ M} H_m(ε). We immediately have H ≥ 0, H(0) = 0, and H(ε) non-decreasing on ε ≥ 0, since the individual H_m(ε) are non-decreasing in ε. We have the following lemma (a simple consequence of Jensen's inequality; see Duchi et al., 2010, for proof).

Lemma 1. Let ψ be a convex function such that ψ(ε) ≤ H(ε). Then for all f, ψ(R(f) − R*) ≤ R_φ(f) − R_φ*.

Corollary 26 of Zhang (2004) then shows, as a consequence of Lemma 1, that if H(ε) > 0 for all ε > 0, there is a nonnegative concave function ζ, right continuous at 0 with ζ(0) = 0, such that

    R(f) − R* ≤ ζ(R_φ(f) − R_φ*).    (4)

Clearly, if lim_n R_φ(f^n) = R_φ*, we have consistency: lim_n R(f^n) = R*. Though it is not our focus, it is possible to use Eq. (4) to get strong rates of convergence if ζ grows slowly. The remainder of this paper concentrates on finding conditions relating the surrogate loss φ to the risk that make H(ε) > 0 for ε > 0.

We achieve this goal by using conditions based on the edge structure of the observed DAGs. Given a probability vector p ∈ R^{|𝒢|} over a set of DAGs 𝒢, we recall Eq. (2) and define the set of optimal result scores A(p) to be all α attaining the infimum of ℓ(p, ·):

    A(p) = {α : ℓ(p, α) = inf_{α'} ℓ(p, α')}.    (5)

The infimum is attained since L is edgewise as described earlier, so A(p) is not empty. The following definition captures the intuition that the surrogate loss should maintain ordering information. For this definition and the remainder of the paper, we use the following shorthand for the conditional φ-risk:

    W(p, α) ≜ Σ_{G ∈ 𝒢} p_G φ(α, G),  with  W(p) ≜ inf_α W(p, α).

Definition 2. The surrogate φ is edge-consistent if W(p) < inf{W(p, α) : α ∉ A(p)} for all p; that is, whenever there is an edge (i → j) with α_i ≤ α_j so that the conditional risk ℓ(p, α) is not minimal, the conditional surrogate risk W(p, α) is bounded away from its minimum.

We now provide three lemmas and a theorem showing that if the surrogate loss satisfies edge-consistency, then its minimizer asymptotically minimizes the Bayes risk. As the lemmas are direct analogs of results in Tewari & Bartlett (2007) and Zhang (2004), we put their proofs in Duchi et al. (2010).

Lemma 3. W(p) is continuous on Δ.

Lemma 4. Let φ be edge-consistent. Then W(p, α^(n)) → W(p) implies that ℓ(p, α^(n)) → inf_α ℓ(p, α) and α^(n) ∈ A(p) eventually.

Lemma 5. Let φ be edge-consistent. For every ε > 0 there exists a δ > 0 such that, for all p, ℓ(p, α) − inf_{α'} ℓ(p, α') ≥ ε implies W(p, α) − W(p) ≥ δ.

Theorem 6. Let φ be a continuous, bounded-below loss function, and assume that the size of the result sets is upper bounded by a constant M. Then φ is edge-consistent if and only if the following holds: whenever f^n is a sequence of scoring functions such that R_φ(f^n) → R_φ*, then R(f^n) → R*.

Proof. Lemma 5 shows that if ε > 0, then there is some δ > 0 such that H_m(ε) > 0. Thus H(ε) = min_{m ≤ M} H_m(ε) > 0, and Eq. (4) then immediately implies that R(f^n) → R*. Now suppose that φ is not edge-consistent, that is, there is some p so that W(p) = inf{W(p, α) : α ∉ A(p)}. Let α^(n) ∉ A(p) be a sequence such that W(p, α^(n)) → W(p). If we simply define the risk to be the expected loss on one particular example x and set f^n(x) = α^(n), then R_φ(f^n) = W(p, α^(n)) → R_φ*. Further, by assumption there is some ε > 0 such that ℓ(p, α^(n)) ≥ inf_α ℓ(p, α) + ε for all n. Thus R(f^n) = ℓ(p, α^(n)) does not converge to R* = inf_α ℓ(p, α). ∎

2. The Difficulty of Consistency

In this section, we explore the difficulty of finding edge-consistent ranking losses in practice. We first show that unless P = NP many useful losses cannot be edge-consistent in general. We then show that even in low-noise settings, common losses used for ranking are not edge-consistent. We focus our attention on pairwise losses, which impose a separate penalty for each edge that is ordered incorrectly; this generalizes the disagreement error described by Dekel et al. (2004). We assume we have a set of non-negative penalties a^G_ij, indexed by edge (i → j) and graph G, so that

    L(α, G) = Σ_{i<j} a^G_ij 1(α_i ≤ α_j) + Σ_{i>j} a^G_ij 1(α_i < α_j),    (7)

where the asymmetric treatment of i < j and i > j avoids minor technical issues created by doubly penalizing 1(α_i = α_j). If we define ā_ij ≜ Σ_{G ∈ 𝒢} a^G_ij p_G, then ℓ(p, α) = Σ_{i<j} ā_ij 1(α_i ≤ α_j) + Σ_{i>j} ā_ij 1(α_i < α_j).

In parallel, we consider pairwise surrogate losses of the form

    φ(α, G) = Σ_{(i→j) ∈ G} h(a^G_ij) ϕ(α_i − α_j),    (9)

where ϕ ≥ 0 is a non-increasing function and h is a function of the penalties a^G_ij. In this case, the conditional surrogate risk is W(p, α) = Σ_{i≠j} h̄_ij ϕ(α_i − α_j), where we define h̄_ij ≜ Σ_{G ∈ 𝒢} h(a^G_ij) p_G.

If φ from Eq. (9) is edge-consistent, then ϕ must be differentiable at 0 with ϕ'(0) < 0. This is a consequence of Bartlett et al.'s (2006) analysis of binary classification and the correspondence between binary classification and pairwise ranking; for the binary case, consistency requires ϕ'(0) < 0. Similarly, we must have h ≥ 0 on R+ and strictly increasing. For the remainder of this section, we make the unrestrictive assumption that ϕ decreases more slowly in the positive direction than it increases in the negative. Formally, we use the recession function (Rockafellar, 1970, Thm. 8.5) of ϕ,

    ϕ∞(d) ≜ sup_{t>0} [ϕ(td) − ϕ(0)] / t = lim_{t→∞} [ϕ(td) − ϕ(0)] / t.

The assumption, satisfied for bounded-below ϕ, is

Assumption A. ϕ∞(1) ≤ 0 or ϕ∞(−1) = ∞.

We now define precisely what we mean by a low-noise setting. For any (𝒢, p), let G̃ be the difference graph, that is, the graph with edge weights max{ā_ij − ā_ji, 0} on edges (i → j), where ā_ij = Σ_{G ∈ 𝒢} a^G_ij p_G; if ā_ij > ā_ji, then the edge (i → j) ∈ G̃ (see Fig. 1). We define the following low-noise condition based on self-reinforcement of edges in the difference graph.

Definition 8. We say (𝒢, p) is low-noise when the corresponding difference graph G̃ satisfies the following reverse triangle inequality: whenever there is an edge (i → j) and an edge (j → k) in G̃, the weight ā_ik − ā_ki on (i → k) is greater than or equal to the path weight ā_ij − ā_ji + ā_jk − ā_kj on (i → j → k).

It is not difficult to see that if (𝒢, p) satisfies Def. 8, its difference graph G̃ is a DAG. Indeed, the definition ensures that all global preference information in G̃ (the sum of weights along any path) conforms with and reinforces local preference information (the weight on a single edge). Reasonable ranking methods should be consistent in this setting, but even this is not trivial.

In the lemmas to follow, we consider simple 3-node DAGs that admit unique minimizers for their conditional risks. In particular, we consider DAGs on nodes 1, 2, and 3 that induce only the four penalty values ā_12, ā_13, ā_23, and ā_31 (see Fig. 1). In this case, if ā_13 > ā_31, any α minimizing ℓ(p, α) clearly will have α_1 > α_2 > α_3. We now show, under some very general conditions, that if φ is edge-consistent, then ϕ is non-convex. Let ϕ'(x) denote an element of the subgradient set ∂ϕ(x). The subgradient conditions for optimality of

    W(p, α) = h̄_12 ϕ(α_1 − α_2) + h̄_13 ϕ(α_1 − α_3) + h̄_23 ϕ(α_2 − α_3) + h̄_31 ϕ(α_3 − α_1)    (10)

are that

    0 = h̄_12 ϕ'(α_1 − α_2) + h̄_13 ϕ'(α_1 − α_3) − h̄_31 ϕ'(α_3 − α_1),
    0 = −h̄_12 ϕ'(α_1 − α_2) + h̄_23 ϕ'(α_2 − α_3).    (11)

Figure 1. The two DAGs above occur with probability 1/2 each, giving the difference graph G̃ on the left, assuming that ā_13 > ā_31. [The figure shows two 3-node DAGs with edge weights 2a_12, 2a_23, 2a_13, 2a_31, and the resulting difference graph with weights a_12, a_23, and a_13 − a_31.]

We begin by showing that, under Assumption A on ϕ, there is a finite minimizer of W(p, α). The lemma is technical and its proof is in Duchi et al. (2010).

Lemma 9. There is a constant C < ∞ and a vector α minimizing W(p, α) with ‖α‖∞ ≤ C.

We use the following lemma to prove our main theorem about the inconsistency of pairwise convex losses.

Lemma 10 (Inconsistency of convex losses). Suppose that ā_13 > ā_31 > 0, ā_12 > 0, ā_23 > 0. Let ℓ(p, α) be

    ā_12 1(α_1 ≤ α_2) + ā_13 1(α_1 ≤ α_3) + ā_23 1(α_2 ≤ α_3) + ā_31 1(α_3 < α_1)

and let W(p, α) be defined as in Eq. (10). For ϕ convex with ϕ'(0) < 0, W(p) = inf{W(p, α) : α ∉ A(p)} whenever either of the following conditions is satisfied:

    Condition 1: h̄_23 < h̄_31 h̄_12 / (h̄_13 + h̄_12).
    Condition 2: h̄_12 < h̄_31 h̄_23 / (h̄_13 + h̄_23).

Proof. Lemma 9 shows that the optimal W(p) is attained by some finite α. Thus, we fix an α satisfying Eq. (11), and let α_ij = α_i − α_j and g_ij = ϕ'(α_ij) for i ≠ j. We make strong use of the monotonicity of subgradients, that is, α_ij > α_kl implies g_ij ≥ g_kl (e.g. Rockafellar, 1970, Theorem 24.1). By Eq. (11),

    g_13 − g_12 = (h̄_31/h̄_13) g_31 − (1 + h̄_12/h̄_13) g_12,    (12a)
    g_13 − g_23 = (h̄_31/h̄_13) g_31 − (1 + h̄_23/h̄_13) g_23.    (12b)

Suppose for the sake of contradiction that α ∈ A(p). As α_13 = α_12 + α_23, we have that α_13 > α_12 and α_13 > α_23. The convexity of ϕ implies that if α_13 > α_12, then g_13 ≥ g_12. If g_12 ≥ 0, we thus have that g_13 ≥ 0 and, by Eq. (11), g_31 ≥ 0. This is a contradiction, since α_31 < 0 gives g_31 ≤ ϕ'(0) < 0. Hence g_12 < 0. By identical reasoning, we also have that g_23 < 0. Now, α_23 > 0 > α_31 implies that g_23 ≥ g_31, which combined with Eq. (12a) and the fact that g_23 = (h̄_12/h̄_23) g_12 (by Eq. (11)) gives

    g_13 − g_12 ≤ (h̄_31/h̄_13)(h̄_12/h̄_23) g_12 − (1 + h̄_12/h̄_13) g_12 = (g_12/h̄_13) [h̄_31 h̄_12/h̄_23 − h̄_13 − h̄_12].

Since g_12/h̄_13 < 0, we have that g_13 − g_12 < 0 whenever h̄_31 h̄_12/h̄_23 > h̄_13 + h̄_12. But when α_13 > α_12, we must have g_13 ≥ g_12, which yields a contradiction under Condition 1. Similarly, α_12 > 0 > α_31 implies that g_12 ≥ g_31, which with g_12 = (h̄_23/h̄_12) g_23 and Eq. (12b) gives

    g_13 − g_23 ≤ (h̄_31/h̄_13)(h̄_23/h̄_12) g_23 − (1 + h̄_23/h̄_13) g_23 = (g_23/h̄_13) [h̄_31 h̄_23/h̄_12 − h̄_13 − h̄_23].

Since g_23/h̄_13 < 0, we further have that g_13 − g_23 < 0 whenever h̄_31 h̄_23/h̄_12 > h̄_13 + h̄_23. This contradicts α_13 > α_23 under Condition 2. ∎

Lemma 10 allows us to construct scenarios under which arbitrary pairwise surrogate losses with convex ϕ are inconsistent. Assumption A serves only to guarantee an optimal α with ‖α‖∞ < ∞, and can be weakened to W(p, α) → ∞ as (α_i − α_j) → ∞. The next theorem is our main negative result on the consistency of pairwise surrogate losses.

Theorem 11. Let φ be a loss that can be written as φ(α, G) = Σ_{(i→j) ∈ G} h(a^G_ij) ϕ(α_i − α_j) for h continuous and increasing with h(0) = 0. Even in the low-noise setting, for ϕ convex and satisfying Assumption A, φ is not edge-consistent.

Proof. Assume for the sake of contradiction that φ is edge-consistent. Recall that for ϕ convex, ϕ'(0) < 0; we construct graphs G1 and G2 so that the resulting expected loss satisfies Condition 1 of Lemma 10. Let 𝒢 = {G1, G2}, where G1 = ({1, 2, 3}, {(1 → 2), (1 → 3)}) and G2 = ({1, 2, 3}, {(2 → 3), (3 → 1)}). Fix any weights a^{G1}_12, a^{G1}_13, a^{G2}_31 with a^{G1}_13 > a^{G2}_31 > 0 and a^{G1}_13 > a^{G1}_12 > 0, and let p = (.5, .5). As h is continuous with h(0) = 0, there exists some δ > 0 such that h(δ) < 2 h̄_31 h̄_12 / (h̄_13 + h̄_12), where h̄_ij = Σ_{G ∈ 𝒢} h(a^G_ij) p_G. Take a^{G2}_23 = min{δ, (a^{G1}_13 − a^{G1}_12)/2}. Then we have h̄_23 = h(a^{G2}_23)/2 ≤ h(δ)/2 < h̄_31 h̄_12 / (h̄_13 + h̄_12). Hence Condition 1 of Lemma 10 is satisfied, so φ is not edge-consistent. Moreover, a^{G2}_23 ≤ (a^{G1}_13 − a^{G1}_12)/2 < a^{G1}_13 − a^{G1}_12 implies that G̃ is a DAG satisfying the low-noise condition. ∎

2.3. Margin-based inconsistency

Given the difficulties encountered in the previous section, it is reasonable to consider a reformulation of our surrogate loss. A natural alternative is a margin-based loss, which encodes a desire to separate ranking scores by large margins dependent on the preferences in a graph. Similar losses have been proposed, e.g., by Shashua & Levin (2002). In particular, we now consider losses of the form

    φ(α, G) = Σ_{(i→j) ∈ G} ϕ(α_i − α_j − h(a^G_ij)),    (13)

where h is continuous and h(0) = 0. It is clear from the reduction to binary classification that h must be increasing for the loss in Eq. (13) to be edge-consistent. When ϕ is a decreasing function, this intuitively says that the larger a_ij is, the larger α_i should be when compared to α_j. Nonetheless, as we show below, such a loss is inconsistent even in low-noise settings.

Theorem 12. Let φ be a loss that can be written as φ(α, G) = Σ_{(i→j) ∈ G} ϕ(α_i − α_j − h(a^G_ij)) for h continuous and increasing with h(0) = 0. Even in the low-noise setting, for ϕ convex and satisfying Assumption A, φ is not edge-consistent.

Proof. Assume for the sake of contradiction that φ is edge-consistent. As noted before, ϕ'(0) < 0, and since ϕ is differentiable almost everywhere (Rockafellar, 1970, Theorem 25.3), ϕ is differentiable at −c for some c > 0 in the range of h. Considering the four-graph setting with graphs containing one edge each, G1 = ({1, 2, 3}, {(1 → 2)}), G2 = ({1, 2, 3}, {(2 → 3)}), G3 = ({1, 2, 3}, {(1 → 3)}), and G4 = ({1, 2, 3}, {(3 → 1)}), choose constant edge weights a^{G1}_12 = a^{G2}_23 = a^{G3}_13 = a^{G4}_31 = h^{−1}(c) > 0, and set p = (.25, .01, .5, .24). In this setting,

    W(p, α) = p_{G1} ϕ̃(α_1 − α_2) + p_{G2} ϕ̃(α_2 − α_3) + p_{G3} ϕ̃(α_1 − α_3) + p_{G4} ϕ̃(α_3 − α_1),

for ϕ̃(x) = ϕ(x − c). Notably, ϕ̃ is convex, satisfies Assumption A, and ϕ̃'(0) = ϕ'(−c) < 0. Moreover, ā_13 − ā_31 = h^{−1}(c)(p_{G3} − p_{G4}) ≥ h^{−1}(c)(p_{G1} + p_{G2}) = ā_12 + ā_23 > 0, so G̃ is a DAG satisfying the low-noise condition. However, p_{G2} < p_{G4} p_{G1} / (p_{G3} + p_{G1}). Hence, by Lemma 10, W(p) = inf{W(p, α) : α ∉ A(p)}, a contradiction. ∎
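The difference graph and the reverse triangle inequality of Definition 8 are mechanical to check given the expected penalties. The sketch below assumes the ā_ij values are supplied as a matrix `abar`; the function names are hypothetical, not from the paper.

```python
# Sketch of the difference graph and the low-noise test of Definition 8.
# `abar` is a matrix of expected penalties abar_ij = sum_G a^G_ij p_G.

def difference_graph(abar):
    # Edge (i -> j) with weight abar_ij - abar_ji whenever abar_ij > abar_ji.
    m = len(abar)
    return {(i, j): abar[i][j] - abar[j][i]
            for i in range(m) for j in range(m)
            if i != j and abar[i][j] > abar[j][i]}

def is_low_noise(abar):
    # Reverse triangle inequality: for edges (i -> j) and (j -> k),
    # the direct weight on (i -> k) must dominate the path weight.
    g = difference_graph(abar)
    for (i, j) in g:
        for (j2, k) in g:
            if j == j2 and g.get((i, k), float("-inf")) < g[(i, j)] + g[(j, k)]:
                return False
    return True

# Low-noise example: the weight on (0 -> 2) reinforces the path 0 -> 1 -> 2.
good = [[0, 1, 2.5], [0, 0, 1], [0, 0, 0]]
# Violation: the direct weight on (0 -> 2) is below the path weight 1 + 1.
bad = [[0, 1, 1.5], [0, 0, 1], [0, 0, 0]]
print(is_low_noise(good), is_low_noise(bad))
```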
3. Conditions for Consistency

The prospects for consistent surrogate ranking appear bleak given the results of the previous section. Nevertheless, we demonstrate in this section that there exist surrogate losses that yield consistency under some restrictions on problem noise. We consider a new loss--specifically, a linear loss in which we penalize (α_j − α_i) proportionally to the weight a^G_ij in the given graph G. To keep the loss well-behaved and disallow wild fluctuations, we also regularize the values α. That is, our loss takes the form

    φ(α, G) = Σ_{(i→j) ∈ G} a^G_ij (α_j − α_i) + ν Σ_i r(α_i).    (14)

We assume that r is strictly convex and 1-coercive, that is, that r asymptotically grows faster than any linear function. These conditions imply that the loss of Eq. (14) is bounded below. Moreover, we have the basis for consistency:

Theorem 13. Let the loss L take the form of the generalized disagreement error of Eq. (7) and the surrogate loss φ take the form of Eq. (14), where ν > 0 and r is strictly convex and 1-coercive. If the pair (𝒢, p) induces a difference graph G̃ that is a DAG, then

    W(p) < inf{W(p, α) : α ∉ A(p)}

if and only if Σ_j (ā_ij − ā_ji) > Σ_j (ā_kj − ā_jk) for all i, k such that ā_ik > ā_ki.

Proof. Throughout, we absorb the factor ν into r, which preserves strict convexity and 1-coercivity. We first note that G̃ is a DAG if and only if

    A(p) = {α : α_i > α_j for i < j with ā_ij > ā_ji; α_i ≥ α_j for i > j with ā_ij > ā_ji}

(see Lemma 16 of Duchi et al., 2010, though essentially all we do is write out ℓ(p, α)). We have that

    W(p, α) = Σ_i [ α_i Σ_j (ā_ji − ā_ij) + r(α_i) ].

Standard subgradient calculus gives that at the optimum, ∂r(α_i) ∋ Σ_j (ā_ij − ā_ji). Since r is strictly convex, ∂r is a strictly increasing set-valued map with increasing inverse s(g) = {α : g ∈ ∂r(α)}. Optimality is therefore attained uniquely at

    α_i = s( Σ_j (ā_ij − ā_ji) ).    (15)

Note that for any i, k, we have α_i > α_k if and only if s(Σ_j (ā_ij − ā_ji)) > s(Σ_j (ā_kj − ā_jk)), which in turn occurs if and only if Σ_j (ā_ij − ā_ji) > Σ_j (ā_kj − ā_jk). Hence the optimal α of Eq. (15) is in A(p) if and only if

    Σ_j (ā_ij − ā_ji) > Σ_j (ā_kj − ā_jk) whenever ā_ik > ā_ki.    (16)

Thus W(p) = inf{W(p, α) : α ∉ A(p)} whenever Eq. (16) is violated. On the other hand, suppose Eq. (16) is satisfied. Then for all α' satisfying

    ‖α' − α‖∞ < (1/2) min_{i,k : ā_ik > ā_ki} (α_i − α_k),

we have α' ∈ A(p), and inf{W(p, α') : α' ∉ A(p)} > W(p), since α is the unique global minimum. ∎

We now prove a simple lemma showing that low-noise settings satisfy the conditions of Theorem 13.

Lemma 14. If (𝒢, p) is low noise, then for the associated difference graph G̃, whenever ā_ik > ā_ki,

    Σ_j (ā_ij − ā_ji) > Σ_j (ā_kj − ā_jk).

Proof. Fix (i, k) with ā_ik > ā_ki. There are two cases for a third node j: either ā_ij − ā_ji > 0 or ā_ij − ā_ji ≤ 0. In the first case, there is an edge (i → j) ∈ G̃. If (k → j) ∈ G̃, the low-noise condition applied to the path (i → k → j) implies ā_ij − ā_ji ≥ ā_ik − ā_ki + ā_kj − ā_jk > ā_kj − ā_jk. Otherwise, ā_kj − ā_jk ≤ 0 < ā_ij − ā_ji. In the other case, ā_ij − ā_ji ≤ 0. If the inequality is strict, then (j → i) ∈ G̃, so the low-noise condition applied to (j → i → k) implies ā_ji − ā_ij < ā_ji − ā_ij + ā_ik − ā_ki ≤ ā_jk − ā_kj, or ā_kj − ā_jk < ā_ij − ā_ji. Otherwise, ā_ij = ā_ji, and the low-noise condition guarantees that (k → j) ∉ G̃ (else ā_ij − ā_ji ≥ ā_ik − ā_ki + ā_kj − ā_jk > 0), so ā_kj − ā_jk ≤ 0 = ā_ij − ā_ji. The inequality in the statement of the lemma is strict, because ā_ik − ā_ki > 0 = ā_kk − ā_kk. ∎

The converse of the lemma is, in general, false. Combining the above lemma with Theorem 13, we have

Corollary 15. The linear loss of Eq. (14) is consistent if (𝒢, p) is low noise for all query-result pairs.

With the above corollary, we have a consistent loss: the value-regularized linear loss is edge-consistent (and hence asymptotically consistent) in low-noise settings. It is not difficult to see that the value regularization from r is necessary; if r is not in the objective in Eq. (14), then φ(·, G) can be sent to −∞ with α ∉ A(p).

3.1. Relationship to prior work

One of the main results on consistency to date is due to Cossock & Zhang (2008), who work in a setting in which each item x_i to be ranked is associated with a score y_i.
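With the concrete regularizer r(α) = α² (strictly convex and 1-coercive), the minimizer of Eq. (14)'s conditional surrogate risk has the closed form α_i = Σ_j (ā_ij − ā_ji) / (2ν). The sketch below works under that assumption; the function name and matrix input are hypothetical, not the paper's code.

```python
# Sketch: value-regularized linear surrogate of Eq. (14) with r(alpha) = alpha^2.
# Minimizing W(p, alpha) = sum_i [alpha_i * sum_j (abar_ji - abar_ij) + nu * alpha_i^2]
# coordinate-wise gives alpha_i = sum_j (abar_ij - abar_ji) / (2 * nu).

def linear_loss_minimizer(abar, nu=1e-4):
    m = len(abar)
    return [sum(abar[i][j] - abar[j][i] for j in range(m)) / (2.0 * nu)
            for i in range(m)]

# Low-noise toy example: difference graph 0 -> 1 -> 2 with reinforcing 0 -> 2.
abar = [[0, 1, 2.5], [0, 0, 1], [0, 0, 0]]
alpha = linear_loss_minimizer(abar)
print(alpha)
# Consistent with Theorem 13, the scores respect the difference graph:
print(alpha[0] > alpha[1] > alpha[2])
```

The closed form also illustrates why the regularizer is necessary: as ν → 0 the scores diverge, and with ν = 0 the linear objective is unbounded below.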
In this setting we can show that the resulting graphs (𝒢, p) satisfy our low-noise condition. Indeed, for every observed pair of results and scores (x_i, y_i), (x_j, y_j), we can construct a graph G = ({x_i, x_j}, {(i → j)}) and set a^G_ij = y_i − y_j. Then, in the limit, we have ā_ij = ȳ_i − ȳ_j, where ȳ_i is the true score of item x_i, and clearly ā_ik = ā_ij + ā_jk, so that G̃ satisfies the low-noise condition.

Another related line of work is due to Xia et al. (2008), who introduce a notion of order-preserving probability spaces. These are inherently different from our work, which we show by considering graphs on nodes 1, 2, and 3. First, consider a low-noise setting in which the difference graph G̃ consists of the edges (1 → 2) and (1 → 3). Our losses are indifferent to whether we order result 2 ahead of or behind 3, and this cannot be captured by an order-preserving probability space. Conversely, consider an order-preserving probability space over the three nodes, where the data we receive consist of full orderings of 1, 2, 3. To translate this into our framework, we must convert each of these orderings into a DAG G with associated edge weights. We assume that the weight on each edge is only a function of the distance between the entries in the ordering. Suppose we observe two orderings σ = {1 → 2 → 3} and σ' = {2 → 3 → 1} with probabilities p_σ > p_σ' ≥ 0 and p_σ + p_σ' = 1, which gives an order-preserving probability space (see Def. 3 and Theorem 5 in Xia et al., 2008). If we assume w.l.o.g. that any adjacent pair in the list has edge weight equal to one and that pairs at distance two in the list have edge weight w_2, then there is no way to set w_2 so that the resulting (𝒢, p) satisfies the low-noise condition in Def. 8. The associated difference graph G̃ will have edge weights ā_12 = p_σ − w_2 p_σ', ā_13 = w_2 p_σ − p_σ', and ā_23 = p_σ + p_σ'. To satisfy the low-noise condition in Def. 8 and have the ordering σ minimize the true loss, we must have w_2 p_σ − p_σ' = ā_13 ≥ ā_12 + ā_23 = p_σ − w_2 p_σ' + p_σ + p_σ', so that w_2 p_σ + w_2 p_σ' ≥ 2 p_σ + 2 p_σ'. That is, w_2 ≥ 2. On the other hand, we must have ā_12 > 0, so that p_σ > w_2 p_σ', or w_2 < p_σ / p_σ'; taking p_σ ≈ .5 and p_σ' ≈ .5, we require w_2 ≲ 1. Thus no construction that assigns a fixed weight to edges associated with permutations can transform an order-preserving probability space into graphs satisfying the low-noise conditions here. Nonetheless, our general consistency result, Theorem 6, implicitly handles order-preserving probability spaces, which assume that graphs G contain all results and that the loss L(f(x, q), G) = 0 if f agrees with G on all orderings and is 1 otherwise.

4. Experiments

While the focus of this work is a theoretical investigation of consistency, we have also conducted experiments that study the value-regularized linear loss our analysis suggests. We perform experiments on a collaborative filtering task in which the goal is to recommend movies to a user based on the user's and other users' movie ratings. We use one of the MovieLens datasets (GroupLens Lab, 2008), which consists of 100,000 ratings, on a scale of 1 to 5, for 1682 different movies by 943 users. In this case, our "query" is a user u, and the set of possible results consists of all 1682 movies. We learn a linear model so that f_i(x, u) = w^T d(x_i, u), where d is a mapping from movie x_i and user u to a feature vector. We use features that have proven successful in settings such as the Netflix challenge, including the age of the movie, its genre(s), the average rating of the user for other movies in the same genre(s), the average rating of the movie, and the ratings given to the movie by users similar to and dissimilar from u in their ratings of other movies. To create pairs to train our models, we randomly sample pairs (x_i, x_j) of movies rated by a user. Each sampled pair of rated movies then gets a per-user weight a^u_ij that we set to be the difference in their ratings.
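The pair-construction step just described can be sketched as follows; `sample_weighted_pairs` and the toy ratings dictionary are hypothetical names, not from the paper's code.

```python
import random

# Sketch: sample pairs of movies rated by a user and weight each pair by
# the difference in ratings, so a^u_ij > 0 means user u prefers movie i to j.

def sample_weighted_pairs(ratings_by_user, n_pairs, seed=0):
    """ratings_by_user: {user: {movie: rating}} -> list of (user, i, j, weight)."""
    rng = random.Random(seed)
    users = [u for u, r in ratings_by_user.items() if len(r) >= 2]
    pairs = []
    while len(pairs) < n_pairs:
        u = rng.choice(users)
        i, j = rng.sample(list(ratings_by_user[u]), 2)
        diff = ratings_by_user[u][i] - ratings_by_user[u][j]
        if diff < 0:                  # orient the edge toward the preferred movie
            i, j, diff = j, i, -diff
        if diff > 0:                  # skip ties: they carry no preference edge
            pairs.append((u, i, j, float(diff)))
    return pairs

demo = {"u1": {"m1": 5, "m2": 2, "m3": 4}, "u2": {"m1": 1, "m2": 3}}
print(sample_weighted_pairs(demo, 4)[:2])
```

Because each two-movie pair forms a single-edge graph weighted by a score difference, the resulting (𝒢, p) falls into the low-noise regime of Sec. 3.1.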
As discussed in Sec. 3.1, this guarantees that (𝒢, p) is low noise. We sample across users to get n samples total. We then learn the weight vector w using one of three methods: the value-regularized linear method in this paper, a pairwise hinge loss (Herbrich et al., 2000), and a pairwise logistic loss (Dekel et al., 2004). Specifically, the surrogates are

    Σ_{i,j,u} a^u_ij w^T (d(x_j, u) − d(x_i, u)) + ν Σ_{i,u} (w^T d(x_i, u))²,
    Σ_{i,j,u} a^u_ij [1 − w^T (d(x_i, u) − d(x_j, u))]_+,  and
    Σ_{i,j,u} a^u_ij log(1 + exp(w^T (d(x_j, u) − d(x_i, u)))),

where [z]_+ = max{z, 0}. We set ν = 10^{−4} (it needed simply to be a small number), and we also added ℓ2 regularization in the form of ‖w‖²_2 to each problem. We cross-validated separately for each loss. We partitioned the data into five subsets, and, in each of 15 experiments, we used one subset for validation, one for testing, and three for training. In every experiment, we subsampled 40,000 rated movie pairs from the test set for final evaluation. Once we had learned a vector w for each of the three methods, we computed its average generalized pairwise loss (Eq. (7)).

We show the results in Table 1. The leftmost column contains the number of pairs that were subsampled for training, and the remaining columns show the average pairwise loss on the test set for each of the methods (with standard errors in parentheses). Each number is the mean of 15 independent training runs, and bold denotes the lowest loss.

Table 1. Test losses for different surrogate losses.

    Train pairs | Hinge       | Logistic    | Linear
    20000       | .478 (.008) | .479 (.010) | .465 (.006)
    40000       | .477 (.008) | .478 (.010) | .464 (.006)
    80000       | .480 (.007) | .478 (.009) | .462 (.005)
    120000      | .477 (.008) | .477 (.009) | .463 (.006)
    160000      | .474 (.007) | .474 (.007) | .461 (.004)

It is interesting to note that the linear loss always achieves the lowest test loss averaged across all tests. In fact, it achieved the lowest test loss of all three methods in all but one of our experimental runs. (We use these three losses to focus exclusively on learning in a pairwise setting--Cossock & Zhang (2008) learn using relevance scores, while Xia et al. (2008) require full ordered lists of results as training data rather than pairs.) Finally, we note that there is a closed form for the minimizer of the linear loss, which makes it computationally attractive.

5. Discussion

In this paper we have presented results on both the difficulty and the feasibility of surrogate loss consistency for ranking. We presented the negative result that many natural candidates for surrogate ranking are not consistent in general or even under low-noise restrictions, and we have given a class of surrogate losses that achieve consistency under reasonable noise restrictions. We have also demonstrated the potential usefulness of the new loss functions in practice. This work thus takes a step toward bringing the consistency literature for ranking in line with that for classification. A natural next step in this agenda is to establish rates for ranking algorithms; we believe that our analysis can be extended to the analysis of rates.

Acknowledgments

JD and LM were supported by NDSEG fellowships. We thank the reviewers for their helpful feedback.

References

Bartlett, P., Jordan, M., and McAuliffe, J. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.

Ben-Tal, A. and Nemirovski, A. Lectures on Modern Convex Optimization. SIAM, 2001.

Cossock, D. and Zhang, T. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 16:1274-1286, 2008.

Dekel, O., Manning, C., and Singer, Y. Log-linear models for label ranking. In Advances in Neural Information Processing Systems 16, 2004.

Duchi, J., Mackey, L., and Jordan, M. On the consistency of ranking algorithms. Technical Report EECS-2010-56, UC Berkeley, 2010.

Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. Efficient boosting algorithms for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.

GroupLens Lab. MovieLens dataset, 2008. URL http://www.grouplens.org/taxonomy/term/14.

Herbrich, R., Graepel, T., and Obermayer, K. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers. MIT Press, 2000.

Järvelin, K. and Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.

Karp, R. M. Reducibility among combinatorial problems. In Complexity of Computer Computations, pp. 85-103. Plenum Press, 1972.

Rockafellar, R. T. Convex Analysis. Princeton University Press, 1970.

Rudin, C. The p-norm push: a simple convex ranking algorithm that concentrates at the top of the list. Journal of Machine Learning Research, 10:2233-2271, 2009.

Shashua, A. and Levin, A. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems 15, 2002.

Tewari, A. and Bartlett, P. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, 2007.

Xia, F., Liu, T. Y., Wang, J., Zhang, W., and Li, H. Listwise approach to learning to rank - theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, 2008.

Zhang, T. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.
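Section 4 remarks that the linear surrogate admits a closed-form minimizer. One way to see this: the objective is linear plus quadratic in w, so setting the gradient to zero gives the linear system (2ν Σ_i d_i d_iᵀ) w = −Σ a_ij (d_j − d_i), solvable whenever the quadratic term (plus any added ℓ2 regularization) has full rank. The sketch below is a hypothetical two-feature illustration with a hand-rolled 2×2 solve, not the paper's code.

```python
# Sketch: closed-form minimizer of the linear surrogate
#   sum_pairs a_ij * w^T (d_j - d_i) + nu * sum_i (w^T d_i)^2.
# Gradient condition: (2 * nu * sum_i d_i d_i^T) w = -sum_pairs a_ij (d_j - d_i).

def solve_linear_surrogate(pairs, feats, nu):
    # pairs: list of (i, j, a_ij) meaning item i is preferred to item j;
    # feats: list of 2-d feature vectors d_i.
    c = [0.0, 0.0]                        # c = sum a_ij (d_j - d_i)
    for i, j, a in pairs:
        c[0] += a * (feats[j][0] - feats[i][0])
        c[1] += a * (feats[j][1] - feats[i][1])
    A = [[0.0, 0.0], [0.0, 0.0]]          # A = 2 * nu * sum_i d_i d_i^T
    for d in feats:
        for r in range(2):
            for s in range(2):
                A[r][s] += 2.0 * nu * d[r] * d[s]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    # Cramer's rule for A w = -c (assumes det != 0).
    return [(-c[0] * A[1][1] + c[1] * A[0][1]) / det,
            (-c[1] * A[0][0] + c[0] * A[1][0]) / det]

feats = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
pairs = [(0, 1, 1.0), (2, 1, 0.5)]        # 0 preferred to 1, 2 preferred to 1
w = solve_linear_surrogate(pairs, feats, nu=0.5)
print(w)  # scores w^T d_i then rank item 0 above item 2 above item 1
```

A single linear solve replaces the iterative optimization the hinge and logistic surrogates require, which is the computational attraction noted in Section 4.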