An Efficient Reduction of Ranking to Classification

Nir Ailon
Google Research
76 Ninth Ave, 4th Floor
New York, NY 10011
nailon@google.com

Mehryar Mohri
Courant Institute and Google Research
251 Mercer Street
New York, NY 10012
mohri@cims.nyu.edu

Abstract

This paper describes an efficient reduction of the learning problem of ranking to binary classification. The reduction is randomized and guarantees a pairwise misranking regret bounded by that of the binary classifier, improving on a recent result of Balcan et al. (2007) which ensures only twice that upper bound. Moreover, our reduction applies to a broader class of ranking loss functions, admits a simple proof, and the expected time complexity of our algorithm in terms of the number of calls to a classifier or preference function is also improved from $\Omega(n^2)$ to $O(n \log n)$. In addition, when only the top $k$ ranked elements are required ($k \ll n$), as in many applications in information extraction or search engine design, the time complexity of our algorithm can be further reduced to $O(k \log k + n)$. Our reduction and algorithm are thus practical for realistic applications where the number of points to rank exceeds several thousands. Many of our results also extend beyond the bipartite case previously studied. To further complement them, we also derive lower bounds for any deterministic reduction of ranking to binary classification, proving that randomization is necessary to achieve our reduction guarantees.

1 Introduction

The learning problem of ranking arises in many modern applications, including the design of search engines, information extraction, and movie recommendation systems. In these applications, the ordering of the documents or movies returned is a critical aspect of the system.

The problem has been formulated within two distinct settings. In the score-based setting, the learning algorithm receives a labeled sample of pairwise preferences and returns a scoring function $f: U \to \mathbb{R}$ which induces a linear ordering of the points in the set $U$. Test points are simply ranked according to the values of $f$ for those points. Several ranking algorithms, including RankBoost (Freund et al., 2003; Rudin et al., 2005), SVM-type ranking (Joachims, 2002), and other algorithms such as PRank (Crammer & Singer, 2001; Agarwal & Niyogi, 2005), were designed for this setting. Generalization bounds have been given in this setting for the pairwise misranking error (Freund et al., 2003; Agarwal et al., 2005), including margin-based bounds (Rudin et al., 2005). Stability-based generalization bounds have also been given in this setting for wide classes of ranking algorithms, both in the case of bipartite ranking (Agarwal & Niyogi, 2005) and in the general case (Cortes et al. 2007b; 2007a).

A somewhat different two-stage scenario was considered in other publications starting with (Cohen et al., 1999), and later (Balcan et al., 2007), which we will refer to as the preference-based setting. In the first stage of that setting, a preference function $h: U \times U \to [0, 1]$ is learned, where values of $h(u, v)$ closer to one indicate that $u$ is ranked above $v$ and values closer to zero the opposite. $h$ is typically assumed to be the output of a classification algorithm trained on a sample of labeled pairs, and can be, for example, a convex combination of simpler preference functions as in (Cohen et al., 1999). A crucial difference with the score-based setting is that, in general, the preference function $h$ may not induce a linear ordering.
The relation it induces may be non-transitive; thus we may have, for example, $h(u, v) = h(v, w) = h(w, u) = 1$ for three distinct points $u$, $v$, and $w$. To rank a test subset $V \subseteq U$, in the second stage, the algorithm orders the points in $V$ by making use of the preference function $h$ learned in the first stage. The subset ranking setup examined by Cossock and Zhang (2006), though distinct, also bears some resemblance to this setting.

This paper deals with the preference-based ranking setting just described. The advantage of this setting is that the learning algorithm is not required to return a linear ordering of all points in $U$, which may be impossible to achieve faultlessly in accordance with a general, possibly non-transitive, pairwise preference labeling. This is more likely to be achievable exactly, or with a better approximation, when the algorithm is instead requested to supply a linear ordering only for limited subsets $V \subseteq U$. When the preference function is obtained as the output of a binary classification algorithm, the preference-based setting can be viewed as a reduction of ranking to classification. The second stage specifies how the ranking is obtained using the preference function.

Cohen et al. (1999) showed that, in the second stage of the preference-based setting, the general problem of finding a linear ordering with as few pairwise misrankings as possible with respect to the preference function $h$ is NP-complete. The authors presented a greedy algorithm based on the tournament degree, that is, for a given element $u$, the difference between the number of elements it is preferred to and the number of those preferred to $u$. The bound proven by the authors, formulated in terms of the pairwise disagreement loss $l$ with respect to the preference function $h$, can be written as $l(\sigma_{\mathrm{greedy}}, h) \le 1/2 + l(\sigma_{\mathrm{optimal}}, h)/2$, where $l(\sigma_{\mathrm{greedy}}, h)$ is the loss achieved by the permutation $\sigma_{\mathrm{greedy}}$ returned by their algorithm and $l(\sigma_{\mathrm{optimal}}, h)$ the one achieved by the optimal permutation $\sigma_{\mathrm{optimal}}$ with respect to the preference function $h$. This bound was given for the general case of ranking but, in the particular case of bipartite ranking, a random ordering can achieve a pairwise disagreement loss of $1/2$, and thus the bound is not informative. Note that the algorithm can be viewed as a derandomization technique.

More recently, Balcan et al. (2007) studied the bipartite ranking problem. In this particular case, the loss of an output ranking is measured by counting pairs of ranked elements, one of which is positive and the other negative (based on some ground truth). They showed that sorting the elements of $V$ according to the same tournament degree used by Cohen et al. (1999) guarantees a regret of at most $2r$ using a binary classifier with regret $r$. (The regret is defined as a calibration of the loss function that aligns a theoretical optimum with 0.) However, due to the quadratic nature of the definition of the tournament degree, their algorithm requires $\Omega(n^2)$ calls to the preference function $h$, where $n = |V|$ is the number of objects to rank.
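To make the quadratic cost concrete, here is a minimal sketch (ours, for illustration; not code from the cited papers) of ordering by tournament degree, as in the sort-by-degree procedure discussed above; it must call the preference function on every ordered pair.

```python
def sort_by_degree(V, h):
    """Order V by tournament degree: for each u, the number of elements
    u is preferred to minus the number of elements preferred to u.
    Requires a call to h for every ordered pair, hence Theta(n^2) calls."""
    degree = {u: sum(h(u, v) - h(v, u) for v in V if v != u) for u in V}
    # A larger degree means u "beats" more elements, so u is ranked earlier.
    return sorted(V, key=lambda u: -degree[u])
```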
We describe an efficient randomized algorithm for the second stage of the preference-based setting and thus for reducing the learning problem of ranking to binary classification. We improve on the recent result of Balcan et al. (2007) by guaranteeing a pairwise misranking regret of at most $r$ using a binary classifier with regret $r$, thereby improving the bound by a factor of 2. Our reduction applies, with different constants, to a broader class of ranking loss functions, admits a simple proof, and the expected running time complexity of our algorithm in terms of the number of calls to a classifier or preference function is improved from $\Omega(n^2)$ to $O(n \log n)$. Furthermore, when only the top $k$ ranked elements are required ($k \ll n$), as in many applications in information extraction or search engines, the time complexity of our algorithm can be further reduced to $O(k \log k + n)$. Our reduction and algorithm are thus practical for realistic applications where the number of points to rank exceeds several thousands.

The price paid for this improvement is in resorting to nondeterminism. Indeed, our algorithms are randomized, but this turns out to be necessary. We give a simple proof of a lower bound of $2r$ for any deterministic reduction of ranking to binary classification with classification regret $r$, thereby generalizing to all deterministic reductions a lower bound result of Balcan et al. (2007).

To appreciate our improvement of the reduction bound from a factor of 2 to 1, consider the case of a binary classifier with an error rate of just 25%, which is quite reasonable in many applications. Assume that the Bayes error is close to zero for the classification problem and, similarly, that for the ranking problem the regret and loss approximately coincide. Then, the bound of Balcan et al. (2007) guarantees for the ranking algorithm a pairwise misranking error of at most 50%. But, since a random ranking can achieve 50% pairwise misranking error, the bound turns out not to be informative in that case. Instead, with a factor of 1, the bound ensures a pairwise misranking error of at most 25%.

Many of our results also extend beyond the bipartite case previously studied by Balcan et al. (2007) to the general case of ranking. A by-product of our proofs is a bound on the pairwise disagreement loss with respect to the preference function $h$ that we will compare to the result given by Cohen et al. (1999).

The algorithm used by Balcan et al. (2007) to produce a ranking based on the preference function is known as sort-by-degree and has been recently used in the context of minimizing the feedback arc set in tournaments (Coppersmith et al., 2006). Here, we use a different algorithm, QuickSort, which has also been recently used for minimizing the feedback arc set in tournaments (Ailon et al. 2005; 2007). The techniques presented build upon earlier work by Ailon et al. (2005; 2007) on combinatorial optimization problems over rankings and clustering.

The remainder of the paper is structured as follows. In Section 2, we introduce the definitions and notation used in future sections and introduce a general family of loss functions for ranking. Section 3 describes a simple and efficient algorithm for reducing ranking to binary classification, proves several bounds guaranteeing the quality of the ranking produced by the algorithm, and analyzes the running-time complexity of our algorithm. In Section 4, we derive a lower bound for any deterministic reduction of ranking to binary classification. In Section 5, we discuss the relationship of the algorithm and its proof with previous related work in combinatorial optimization, and discuss key assumptions related to the notion of regret in this context.

2 Preliminaries

This section introduces several preliminary definitions necessary for the presentation of our results.
In what follows, $U$ will denote a universe of elements, e.g., the collection of all possible query-result pairs returned by a web search task, and $V \subseteq U$ will denote a small subset thereof, e.g., a preliminary list of relevant results for a given query. For simplicity of notation, we will assume that $U$ is a set of integers, so that we are always able to choose a minimal canonical element in a finite subset, as we do in (9) below. This arbitrary ordering should not be confused with the ranking problem we are considering.

2.1 General Definitions and Notation

We first briefly discuss the learning setting and assumptions made here and compare them with those of Balcan et al. (2007) and Cohen et al. (1999). In what follows, $V \subseteq U$ represents a finite subset extracted from some arbitrary universe $U$, which is the set we wish to rank at each round. The notation $S(V)$ denotes the set of rankings on $V$, that is the set of injections from $V$ to $[n] = \{1, \ldots, n\}$, where $n = |V|$. If $\sigma \in S(V)$ is such a ranking, then $\sigma(u)$ is the rank of an element $u \in V$, where lower ranks are interpreted as preferable ones. More precisely, we say that $u$ is preferred over $v$ with respect to $\sigma$ if $\sigma(u) < \sigma(v)$. For convenience, and abusing notation, we also write $\sigma(u, v) = 1$ if $\sigma(u) < \sigma(v)$ and $\sigma(u, v) = 0$ otherwise. We let $\binom{V}{k}$ denote the collection of all subsets of size exactly $k$ of $V$. To distinguish between functions taking ordered vs. unordered arguments in what follows, we will use the notation $F_{u_1 u_2 \ldots u_k}$ to denote $k$ unordered arguments for a function $F$ defined on $\binom{V}{k}$, and $F(u_1, u_2, \ldots, u_k)$ to denote $k$ ordered arguments for a function $F$ defined on $V \times \cdots \times V$ ($k$ times).

2.2 Ground Truth

As in standard learning scenarios, at each round there is an underlying unknown ground truth which we wish the output of the learning algorithm to agree with as much as possible. The ground truth is a ranking that we denote by $\sigma^* \in S(V)$, equipped with a function $\omega$ assigning different importance weights to pairs of positions. The combination $(\sigma^*, \omega)$ is extremely expressive, as we shall see below in Section 2.5. It can encode in particular the standard average pairwise misranking or AUC loss assumed by Balcan et al. (2007) in a bipartite setting, but also more sophisticated ones capturing misrankings among the top $k$, and other losses that are close to but distinct from those considered by Clémençon and Vayatis (2007).

2.3 Preference Function

As with both (Cohen et al., 1999) and (Balcan et al., 2007), we assume that a preference function $h: U \times U \to [0, 1]$ is learned in a first learning stage. The convention is that the higher $h(u, v)$ is, the stronger our belief that $u$ should be preferred to $v$. The function $h$ satisfies pairwise consistency, $h(u, v) + h(v, u) = 1$, but need not even be transitive on 3-tuples (cycles may be induced). The second stage uses $h$ to output a proper ranking $\sigma$, as we shall further discuss below. The running time complexity of the second stage is measured with respect to the number of calls to $h$.

2.4 Output of Learning Algorithm

The final output of the second stage of the algorithm, $\sigma$, is a proper ranking of $V$. Its cost is measured differently in (Balcan et al., 2007) and (Cohen et al., 1999). In the former, it is measured against the unknown ground truth and compared to the cost of $h$ against the ground truth. The rationale is that the information encoded in $h$ contains all the pairwise preference information obtainable using state-of-the-art binary classification.
In (Cohen et al., 1999), $\sigma$ is measured against the given preference function $h$ and compared to the theoretically best one can obtain. Thus, there $h$ plays the role of a known ground truth.

2.5 Loss Functions

We are now ready to define the loss functions used to measure the quality of an output ranking $\sigma$ either with respect to $\sigma^*$, as in (Balcan et al., 2007), or with respect to $h$, as in (Cohen et al., 1999).

The following general loss function $L_\omega$ measures the quality of a ranking $\sigma$ with respect to a desired one $\sigma^*$ using a weight function $\omega$ (described below):

$$L_\omega(\sigma, \sigma^*) = \binom{n}{2}^{-1} \sum_{u \neq v} \sigma(u, v)\, \sigma^*(v, u)\, \omega(\sigma^*(u), \sigma^*(v)).$$

The sum is over all pairs $u, v$ in the domain $V$ of the rankings $\sigma, \sigma^*$. It counts the number of inverted pairs $u, v \in V$, weighted by $\omega$, which assigns importance coefficients to pairs based on their positions in the ground truth $\sigma^*$. The function $\omega$ must satisfy the following three natural axioms, which will be necessary in our analysis:

(P1) Symmetry: $\omega(i, j) = \omega(j, i)$ for all $i, j$;
(P2) Monotonicity: $\omega(i, j) \le \omega(i, k)$ if either $i < j < k$ or $i > j > k$;
(P3) Triangle inequality: $\omega(i, j) \le \omega(i, k) + \omega(k, j)$.

This definition is very general and encompasses many useful, well-studied distance functions. Setting $\omega(i, j) = 1$ for all $i \neq j$ yields the unweighted pairwise misranking measure, or the so-called Kemeny distance function. For a fixed integer $k$, the following function

$$\omega(i, j) = \begin{cases} 1 & \text{if } ((i \le k) \vee (j \le k)) \wedge (i \neq j) \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

can be used to emphasize the ranking of the top $k$ elements. Misranking of pairs with at least one element ranked among the top $k$ is penalized by this function. This can be of interest in applications such as information extraction or search engines where the ranking of the top documents matters more. For this emphasis function, all elements ranked below $k$ are in a tie. In fact, it is possible to encode any tie relation using $\omega$.

Bipartite Ranking. In a bipartite ranking scenario, $V$ is partitioned into a positive set $V^+$ and a negative set $V^-$ of sizes $m^+$ and $m^-$ respectively, where $m^+ + m^- = |V| = n$. For this scenario (Balcan et al., 2007; Hanley & McNeil, 1982; Lehmann, 1975), we are often interested in the AUC score of $\sigma \in S(V)$, defined as follows:

$$1 - \mathrm{AUC}(V^+, V^-, \sigma) = \frac{1}{m^+ m^-} \sum_{(u, v) \in V^+ \times V^-} \sigma(v, u).$$

This expression measures the probability, given a random crucial pair of elements, one of which is positive and the other negative, that the pair is misordered in $\sigma$. It is immediate to verify that this is equal to $L_\omega(\sigma, \sigma^*)$, where $\sigma^*$ is any ranking placing $V^+$ ahead of $V^-$, and

$$\omega(i, j) = \frac{\binom{n}{2}}{m^+ m^-} \begin{cases} 1 & (i \le m^+) \wedge (j > m^+) \\ 1 & (j \le m^+) \wedge (i > m^+) \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

Simplified notation. To avoid carrying $\sigma^*$ and $\omega$, we will define for convenience $\tau^*(u, v) = \sigma^*(u, v)\, \omega(\sigma^*(u), \sigma^*(v))$ and

$$L(\sigma, \tau^*) := L_\omega(\sigma, \sigma^*) = \binom{n}{2}^{-1} \sum_{u \neq v} \sigma(u, v)\, \tau^*(v, u).$$

We will formally call $\tau^*$ a generalized ranking, and it will take the role of the ground truth. If $\tau^*$ is obtained as in (2) for some integers $m^+, m^-$ satisfying $m^+ + m^- = n$, then we will say that the corresponding $\tau^*$ is bipartite. It is immediate to verify from the properties of the weight function $\omega$ that for all $u, v, w \in V$,

$$\tau^*(u, v) \le \tau^*(u, w) + \tau^*(w, v). \qquad (3)$$

If $\tau^*$ is bipartite, then additionally,

$$\tau^*(u, v) + \tau^*(v, w) + \tau^*(w, u) = \tau^*(v, u) + \tau^*(w, v) + \tau^*(u, w). \qquad (4)$$
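To illustrate the definitions, the following sketch (ours; the rank-dictionary representation is an assumption made for illustration) computes $L_\omega(\sigma, \sigma^*)$ together with the Kemeny weight and the top-$k$ emphasis weight of (1).

```python
from math import comb

def loss(sigma, sigma_star, omega):
    """General pairwise loss L_omega(sigma, sigma_star). A ranking is a
    dict mapping each element of V to its rank in {1, ..., n}, lower
    rank meaning preferred; omega(i, j) weights a misranked pair by the
    positions i, j of its elements in the ground truth sigma_star."""
    V = list(sigma)
    n = len(V)
    total = 0.0
    for idx, u in enumerate(V):
        for v in V[idx + 1:]:
            # A pair contributes only if sigma and sigma_star disagree
            # on the relative order of u and v (an inverted pair).
            if (sigma[u] < sigma[v]) != (sigma_star[u] < sigma_star[v]):
                total += omega(sigma_star[u], sigma_star[v])
    return total / comb(n, 2)

def kemeny(i, j):
    """Unweighted pairwise misranking: every inverted pair counts as 1."""
    return 1.0

def top_k_weight(k):
    """Weight function (1): only pairs touching the top k positions count."""
    return lambda i, j: 1.0 if (i <= k or j <= k) and i != j else 0.0
```

With kemeny, for instance, loss returns the fraction of discordant pairs between $\sigma$ and $\sigma^*$.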
2.6 Preference Loss Function

We need to extend the definition to measure the loss of a preference function $h$ with respect to $\sigma^*$. In contrast with the loss function just defined, we need to define a preference loss measuring a generalized ranking's disagreements with respect to a preference function $h$ when measured against $\tau^*$. We can readily extend the loss definitions defined above as follows:

$$L(h, \tau^*) = L_\omega(h, \sigma^*) = \binom{n}{2}^{-1} \sum_{u \neq v} h(u, v)\, \tau^*(v, u).$$

As explained above, $L(h, \tau^*)$ is the ideal loss the learning algorithm will aim to achieve with the output ranking hypothesis $\sigma$.

2.7 Input Distribution

The set $V$ we wish to rank, together with the ground truth $\tau^*$, are drawn as a pair from a distribution we denote by $D$. In other words, $\tau^*$ may be a random function of $V$. For our analysis of the loss, though, it is convenient to think of $V$ and $\tau^*$ as fixed, because our bounds will be conditioned on fixed $V$ and will easily generalize to the stochastic setting. Finally, we say that $D$ is bipartite if $\tau^*$ is bipartite with probability 1.

2.8 Regret Functions

The notion of regret is commonly used to measure the difference between the loss incurred by a learning algorithm and that of some best alternative. This section introduces the definitions of regret that we will be using to quantify the quality of a ranking algorithm in this context. We will define a notion of weak and strong regret for both ranking and classification losses as follows.

The weak ranking and classification regret functions $R'_{\mathrm{rank}}$ and $R'_{\mathrm{class}}$ are defined as follows:

$$R'_{\mathrm{rank}}(A, D) = \mathbb{E}_{V, \tau^*, s}[L(A_s(V), \tau^*)] - \mathbb{E}_V \min_{\tilde\sigma \in S(V)} \mathbb{E}_{\tau^*|V}[L(\tilde\sigma, \tau^*)] \qquad (5)$$

$$R'_{\mathrm{class}}(h, D) = \mathbb{E}_{V, \tau^*}[L(h|_V, \tau^*)] - \mathbb{E}_V \min_{\tilde h} \mathbb{E}_{\tau^*|V}[L(\tilde h, \tau^*)], \qquad (6)$$

where $\tau^*|V$ is the random variable $\tau^*$ conditioned on fixed $V$, $A$ is a possibly randomized algorithm using a stream of random bits $s$ (and a pre-learned preference function $h$) to output a ranking $A_s(V) \in S(V)$, and $\cdot|_V$ is a restriction operator on preference functions defined in the natural way.

To define a strong ranking regret, we subtract from the loss function the minimal loss that could have been obtained from a global ranking of $U$. More precisely, we define:

$$R_{\mathrm{rank}}(A, D) = \mathbb{E}_{V, \tau^*, s}[L(A_s(V), \tau^*)] - \min_{\tilde\sigma \in S(U)} \mathbb{E}_{V, \tau^*}[L(\tilde\sigma|_V, \tau^*)], \qquad (7)$$

where $\tilde\sigma|_V \in S(V)$ is defined by restricting the ranking $\tilde\sigma \in S(U)$ to $V$ in a natural way. As for the strong preference loss, it is natural to subtract the minimal loss over all, possibly cyclic, preference functions on $U$. More precisely, we define:

$$R_{\mathrm{class}}(h, D) = \mathbb{E}_{V, \tau^*}[L(h|_V, \tau^*)] - \min_{\tilde h} \mathbb{E}_{V, \tau^*}[L(\tilde h|_V, \tau^*)], \qquad (8)$$

where the minimum is over $\tilde h$, a preference function over $U$.

The difference between $R$ and $R'$ for both ranking and classification is that in their definitions the min operator and the $\mathbb{E}_V$ operator are permuted. The following inequalities follow from the concavity of min and Jensen's inequality: $R_{\mathrm{rank}}(A, D) \le R'_{\mathrm{rank}}(A, D)$ and $R_{\mathrm{class}}(h, D) \le R'_{\mathrm{class}}(h, D)$.

For a fixed $V$ and any $u, v \in V$, let $e(u, v) = \mathbb{E}_{\tau^*|V}[\tau^*(u, v)]$. The reason we work with $R'_{\mathrm{class}}$ is that the preference function $\tilde h$ over $U$ obtaining the min in the definition of $R'_{\mathrm{class}}$ can be determined locally for any $u, v \in U$ by

$$\tilde h(u, v) = \begin{cases} 1 & e(u, v) > e(v, u) \\ 0 & e(v, u) > e(u, v) \\ 1_{u > v} & \text{otherwise.} \end{cases} \qquad (9)$$

Also, equation (3) holds true with $e$ replacing $\tau^*$, and similarly for (4) if $D$ is bipartite (by linearity of expectation). We cannot do a similar thing when working with the strong regret function $R_{\mathrm{class}}$. The reason we work with the weak ranking regret is for compatibility with our choice of weak classification regret, although our upper bounds on $R'_{\mathrm{rank}}$ trivially apply to $R_{\mathrm{rank}}$ by virtue of the inequality $R_{\mathrm{rank}}(A, D) \le R'_{\mathrm{rank}}(A, D)$ above. In Section 5.4, we will discuss certain assumptions under which our results work for the notion of strong regret as well. Note that Balcan et al. (2007) also implicitly use such an assumption in deriving their regret bounds. Our regret bounds (second part of Theorem 2) hold under the same assumption. Our result is thus exactly comparable with theirs.
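To illustrate (9), here is a small sketch (ours; the sampled-pairwise representation of $\tau^*|V$ is an assumption made for illustration) that estimates $e(u, v)$ and evaluates the pointwise minimizer $\tilde h$.

```python
def estimate_e(tau_samples):
    """Estimate e(u, v) = E_{tau*|V}[tau*(u, v)] by averaging sampled
    generalized rankings, each given as a dict mapping ordered pairs
    (u, v) of V to the value tau*(u, v)."""
    e = {}
    for tau in tau_samples:
        for pair, weight in tau.items():
            e[pair] = e.get(pair, 0.0) + weight / len(tau_samples)
    return e

def h_tilde(e, u, v):
    """Pointwise minimizer of the weak classification regret, eq. (9):
    prefer the direction with the larger expected ground-truth weight,
    breaking exact ties by the arbitrary integer order on U."""
    if e.get((u, v), 0.0) > e.get((v, u), 0.0):
        return 1
    if e.get((v, u), 0.0) > e.get((u, v), 0.0):
        return 0
    return 1 if u > v else 0
```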
3 Algorithm for Ranking Using a Preference Function

This section describes and analyzes an algorithm for obtaining a global ranking of a subset using a pre-learned preference function $h$, which corresponds to the second stage of the preference-based setting. Our bound on the loss will be derived using conditional expectation on the preference loss, assuming a fixed subset $V \subseteq U$ and a fixed ground truth $\tau^*$. To further simplify the analysis, we assume that $h$ is binary, that is $h(u, v) \in \{0, 1\}$ for all $u, v \in U$.

3.1 Description

One simple idea for obtaining a global ranking of the points in $V$ consists of using a standard comparison-based sorting algorithm where the comparison operation is based on the preference function. However, since in general the preference function is not transitive, the properties of the resulting permutation are unclear. This section shows, however, that the permutation generated by the standard QuickSort algorithm provides excellent guarantees.

Thus, the algorithm we suggest is the following. Pick a pivot element $u$ uniformly at random from $V$. For each $v \neq u$, place $v$ on the left of $u$ if $h(v, u) = 1$, and to its right otherwise. Proceed recursively with the array to the left of $u$ and the one to its right, and return the concatenation of the permutation returned by the left recursion, $u$, and the permutation returned by the right recursion.

We will denote by $Q^h_s(V)$ the permutation resulting from running QuickSort on $V$ using the preference function $h$, where $s$ is the random stream of bits used by QuickSort for the selection of the pivots. As we shall see in the next two sections, this algorithm produces high-quality global rankings in a time-efficient manner.

3.2 Ranking Quality Guarantees

The following theorems bound the ranking quality of the algorithm described, for both loss and regret, in the general and bipartite cases.

Theorem 1 (Loss bounds in general case) For any fixed subset $V \subseteq U$, preference function $h$ on $V$, and generalized ranking $\tau^*$ on $V$, the following bound holds:

$$\mathbb{E}_s[L(Q^h_s(V), \tau^*)] \le 2\, L(h, \tau^*).$$
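The procedure of Section 3.1 translates directly into code. The following minimal sketch (ours) implements it; because $h$ may contain cycles, different pivot choices can yield different outputs, and Theorem 1 bounds the loss in expectation over this randomness.

```python
import random

def quicksort_rank(V, h, rng=random):
    """Rank V by randomized QuickSort, using the (possibly
    non-transitive) binary preference function h as the comparator.
    Returns the elements from most to least preferred and makes an
    expected O(n log n) calls to h."""
    if len(V) <= 1:
        return list(V)
    pivot = rng.choice(V)
    left = [v for v in V if v != pivot and h(v, pivot) == 1]   # v preferred to pivot
    right = [v for v in V if v != pivot and h(v, pivot) == 0]  # pivot preferred to v
    return quicksort_rank(left, h, rng) + [pivot] + quicksort_rank(right, h, rng)
```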
3.3 Analysis of QuickSort

Assume $V$ is fixed, and let $Q_s = Q^h_s(V)$ be the (random) ranking output by QuickSort on $V$ using the preference function $h$. During the execution of QuickSort, the order between two elements $u, v \in V$ is determined in one of two ways:

- Directly: $u$ (or $v$) was selected as the pivot with $v$ (resp. $u$) present in the same sub-array in a recursive call to QuickSort. We denote by $p_{uv} = p_{vu}$ the probability of that event. In that case, the algorithm orders $u$ and $v$ according to the preference function $h$.

- Indirectly: a third element $w \in V$ is selected as pivot with $w, u, v$ all present in the same sub-array in a recursive call to QuickSort, and $u$ is assigned to the left sub-array and $v$ to the right (or vice-versa). Let $p_{uvw}$ denote the probability of the event that $u$, $v$, and $w$ are present in the same array in a recursive call to QuickSort and that one of them is selected as pivot. Note that conditioned on that event, each of these three elements is equally likely to be selected as the pivot, since the pivot selection is based on a uniform distribution. If (say) $w$ is selected among the three, then $u$ will be placed on the left of $v$ if $h(u, w) = h(w, v) = 1$, and to its right if $h(v, w) = h(w, u) = 1$. In all other cases, the order between $u$ and $v$ will be determined only in a deeper nested call to QuickSort.

Let $X, Y : V \times V \to \mathbb{R}$ be any two functions on ordered pairs $u, v \in V$, and let $Z : \binom{V}{2} \to \mathbb{R}$ be a function on unordered pairs. We define three functions $[X, Y] : \binom{V}{2} \to \mathbb{R}$, $[X] : \binom{V}{3} \to \mathbb{R}$ and $[Z] : \binom{V}{3} \to \mathbb{R}$ as follows:

$$[X, Y]_{uv} = X(u, v)Y(v, u) + X(v, u)Y(u, v),$$

$$[X]_{uvw} = \tfrac{1}{3}\big(h(u, v)h(v, w)X(w, u) + h(w, v)h(v, u)X(u, w)\big) + \tfrac{1}{3}\big(h(v, u)h(u, w)X(w, v) + h(w, u)h(u, v)X(v, w)\big) + \tfrac{1}{3}\big(h(u, w)h(w, v)X(v, u) + h(v, w)h(w, u)X(u, v)\big),$$

$$[Z]_{uvw} = \tfrac{1}{3}\big(h(u, v)h(v, w) + h(w, v)h(v, u)\big)Z_{uw} + \tfrac{1}{3}\big(h(v, u)h(u, w) + h(w, u)h(u, v)\big)Z_{vw} + \tfrac{1}{3}\big(h(u, w)h(w, v) + h(v, w)h(w, u)\big)Z_{uv}.$$

Lemma 3 (QuickSort Decomposition)

1. For any $Z : \binom{V}{2} \to \mathbb{R}$,
$$\sum_{u < v} Z_{uv} = \sum_{u < v} p_{uv} Z_{uv} + \sum_{u < v < w} p_{uvw} [Z]_{uvw}.$$

$$\tilde h(u, v) = \begin{cases} 1 & e(u, v) > e(v, u) \\ 0 & e(u, v) < e(v, u) \\ 1_{u > v} & \text{otherwise (equality).} \end{cases} \qquad (23)$$

Using Lemma 3 and linearity, the LHS of (22) can be rewritten as

$$\binom{n}{2}^{-1} \sum_{u < v} p_{uv}\, [h - \tilde h, e]_{uv} \ge 0.$$

Then, by conditioning on $G_L, G_R$, taking expectations on both sides of (40), and by induction,

$$\mathbb{E}[t_k(G) \mid G_L, G_R] \le n - 1 + c\, n_L + c \min\{k, n_L\} \log \min\{k, n_L\} + c\, n_R\, 1_{n_L < k}.$$

The event $\{n_L \ge n - k\}$ occurs when a vertex of out-degree at least $n - k \ge 7n/8$ is chosen as pivot. For a random pivot $v \in V$, where $V$ is the vertex set of $G$, $\mathbb{E}[\mathrm{outdeg}(v)^2] \le n^2/3 + n/2 \le n^2/2.9$. Indeed, each pair of edges $(v, u_1) \in A$ and $(v, u_2) \in A$ for $u_1 \neq u_2$ gives rise to a triangle, which is counted exactly twice in the cross-terms; hence the $n^2/3$ term, which upper-bounds $2\binom{n}{3}/n$; the $n/2$ term bounds the diagonal. Thus, $\Pr[\mathrm{outdeg}(v) \ge 7n/8] = \Pr[\mathrm{outdeg}(v)^2 \ge 49n^2/64] \le 0.46$ (by Markov's inequality). Plugging this value into our last estimate yields $\mathbb{E}[t_k(G)] \le n - 1 + c(n-1)/2 + c\, k \log k + 0.46 \times cn$, which is at most $cn + c\, k \log k$ for $c \ge 30$, as required.

4 Lower Bounds

Let $r$ denote the classification regret. Balcan et al. (2007) proved a lower bound of $2r$ for the regret of the algorithm MFAT, defined as the solution to the minimum feedback arc-set problem on the tournament $V$ with an edge $(u, v)$ when $h(u, v) = 1$. More precisely, they showed an example of fixed $V$, $h$, and bipartite generalized ranking $\tau^*$ on $V$, such that the classification regret of $h$ tends to $1/2$ of the ranking regret of MFAT on $V, h$. Note that in this case, since $\tau^*$ is a fixed function of $V$, the regret and loss coincide both for classification and for ranking.

Here we give a simple proof of a more general theorem stating that the same bound holds for any deterministic algorithm, including of course MFAT.

Theorem 7 For any deterministic algorithm $A$ taking as input $V \subseteq U$ and a preference function $h$ on $V$ and outputting a ranking $\sigma \in S(V)$, there exists a bipartite distribution $D$ on $(V, \tau^*)$ such that

$$R_{\mathrm{rank}}(A, D) \ge 2\, R_{\mathrm{class}}(h, D). \qquad (43)$$

Note that the theorem implies that, in the bipartite case, no deterministic algorithm converting a preference function into a linear ranking can do better than a randomized algorithm, in expectation. Thus, randomization is essentially necessary in this setting. The proof is based on an adversarial argument. In our construction, the support of $D$ is reduced to a single pair $(V, \tau^*)$ (deterministic input); thus the loss and both the weak and strong regrets coincide, and a similar argument applies to the loss function and the weak regret functions.

Proof: Fix $V = \{u, v, w\}$, and let the support of $D$ be reduced to $(V, \tau^*)$, where the bipartite generalized ranking $\tau^*$ is one that we will select adversarially. Assume a cycle: $h(u, v) = h(v, w) = h(w, u) = 1$. Up to symmetry, there are two options for the output $\sigma$ of $A$ on $V, h$.

1. $\sigma(u) < \sigma(v) < \sigma(w)$: in this case, the adversary can choose $\tau^*$ corresponding to the partition $V^+ = \{w\}$ and $V^- = \{u, v\}$. Clearly, $R_{\mathrm{class}}(h, D)$ now equals $1/2$, since $h$ is penalized only for misranking the pair $(v, w)$, but $R_{\mathrm{rank}}(A, D) = 1$, since $\sigma$ misorders both $(u, w)$ and $(v, w)$.

2. $\sigma(w) < \sigma(v) < \sigma(u)$: in this case, the adversary can choose $\tau^*$ corresponding to the partition $V^+ = \{u\}$ and $V^- = \{v, w\}$. Similarly, $R_{\mathrm{class}}(h, D)$ now equals $1/2$, since $h$ is penalized only for misranking the pair $(u, w)$, while $R_{\mathrm{rank}}(A, D) = 1$, since $\sigma$ misorders both $(u, v)$ and $(u, w)$.

5 Discussion

5.1 History of QuickSort

The textbook algorithm, by now standard, was originally discovered by Hoare (1961). Montague and Aslam (2002) experimented with QuickSort for information retrieval (IR) by aggregating rankings from different sources of retrieval. They claimed an $O(n \log n)$ time bound on the number of comparisons, although the proof seemed to rely on the folklore QuickSort proof without addressing the non-transitivity problem. They proved certain combinatorial bounds on the output of QuickSort and provided an empirical justification of its IR merits. Ailon et al. (2005) also considered the rank aggregation problem and proved theoretical cost bounds for many ranking problems on weighted tournaments. They strengthened these bounds by considering non-deterministic pivoting rules arising from solutions to certain ranking LPs. This work was later extended by Ailon (2007) to deal with rankings with ties, in particular top-$k$ rankings. Hedge et al. (2007) and Williamson and van Zuylen (2007) derandomized the random pivot selection step in QuickSort for many of the combinatorial optimization problems studied by Ailon et al.

5.2 The Decomposition Technique

The technique developed in Lemma 3 is very general and can be used for a wide variety of loss functions and variants of QuickSort involving non-deterministic ordering rules (Ailon et al. 2005; 2007). Such results would typically amount to bounding $[X]_{uvw} / [Z]_{uvw}$ for some carefully chosen functions $X, Z$ depending on the application.
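As a concrete illustration of the decomposition, the following sketch (ours; a numerical check, not part of the paper's analysis) instruments QuickSort to estimate $p_{uv}$ and $p_{uvw}$ by simulation and verifies the identity of Lemma 3 on a small non-transitive instance.

```python
import random
from itertools import combinations

def quicksort_events(V, h, rng):
    """One QuickSort run with preference dict h (h[(u, v)] = 1 iff u is
    preferred to v). Records which unordered pairs were ordered directly
    (one endpoint was the pivot of a shared sub-array) and, for each
    triple, whether all three were present in a sub-array when one of
    them was selected as the pivot."""
    direct, triples = set(), set()

    def rec(arr):
        if len(arr) <= 1:
            return
        p = rng.choice(arr)
        direct.update(frozenset((p, x)) for x in arr if x != p)
        triples.update(frozenset(t) for t in combinations(arr, 3) if p in t)
        rec([x for x in arr if x != p and h[(x, p)] == 1])
        rec([x for x in arr if x != p and h[(x, p)] == 0])

    rec(list(V))
    return direct, triples

def z_bracket(Z, h, u, v, w):
    """[Z]_{uvw} as defined in Section 3.3."""
    return ((h[(u, v)] * h[(v, w)] + h[(w, v)] * h[(v, u)]) * Z[frozenset((u, w))]
            + (h[(v, u)] * h[(u, w)] + h[(w, u)] * h[(u, v)]) * Z[frozenset((v, w))]
            + (h[(u, w)] * h[(w, v)] + h[(v, w)] * h[(w, u)]) * Z[frozenset((u, v))]) / 3.0

rng = random.Random(0)
V = [1, 2, 3, 4]
# A non-transitive preference: 1 beats 2, 2 beats 3, 3 beats 1; 4 loses to all.
h = {(1, 2): 1, (2, 3): 1, (3, 1): 1, (1, 4): 1, (2, 4): 1, (3, 4): 1}
h.update({(v, u): 1 - x for (u, v), x in list(h.items())})
Z = {frozenset(p): rng.random() for p in combinations(V, 2)}

trials = 20000
p_uv = {frozenset(p): 0.0 for p in combinations(V, 2)}
p_uvw = {frozenset(t): 0.0 for t in combinations(V, 3)}
for _ in range(trials):
    direct, triples = quicksort_events(V, h, rng)
    for pair in direct:
        p_uv[pair] += 1.0 / trials
    for t in triples:
        p_uvw[t] += 1.0 / trials

lhs = sum(Z.values())
rhs = (sum(p_uv[pair] * Z[pair] for pair in p_uv)
       + sum(p_uvw[t] * z_bracket(Z, h, *sorted(t)) for t in p_uvw))
print(lhs, rhs)  # the two sides of Lemma 3 should agree up to sampling error
```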
5.3 Combinatorial Optimization vs. Learning of Ranking

QuickSort, sometimes referred to as FAS-Pivot in that context, was used by Ailon et al. (2005; 2007) to approximate certain NP-hard weighted instances of the problem of minimum feedback arc-set in tournaments (Alon, 2006). There is much similarity between the techniques used in that work and those of the analyses of this work, but there is also a significant difference that should be noted.

In the minimum feedback arc-set problem, we are given a tournament $G$ and wish to find an acyclic tournament $H$ on the same vertex set minimizing $\Delta(G, H)$, where $\Delta$ counts the number of edges pointing in opposite directions between $G$ and $H$ (or a weighted version thereof). However, the cost we are considering is $\Delta(G, H_{\sigma^*})$ for some fixed acyclic tournament $H_{\sigma^*}$ induced by some permutation $\sigma^*$ (the ground truth). In this work, we showed in fact that if $G'$ is obtained from $G$ using QuickSort, then $\mathbb{E}[\Delta(G', H_{\sigma^*})] \le 2\Delta(G, H_{\sigma^*})$ for any $\sigma^*$ (Theorem 1). If $H^*$ is the optimal solution to the (weighted) minimum feedback arc-set problem corresponding to $G$, then it is easy to see that

$$\Delta(H^*, H_{\sigma^*}) \le \Delta(G, H^*) + \Delta(G, H_{\sigma^*}) \le 2\Delta(G, H_{\sigma^*}).$$

However, recovering $H^*$ is NP-hard in general. Approximating $\Delta(G, H^*)$ modulo a constant factor $1 + \varepsilon$ using an acyclic tournament $H$, as in the combinatorial optimization world, only guarantees a constant factor of $2 + \varepsilon$:

$$\Delta(H, H_{\sigma^*}) \le \Delta(G, H) + \Delta(G, H_{\sigma^*}) \le (1 + \varepsilon)\Delta(G, H^*) + \Delta(G, H_{\sigma^*}) \le (2 + \varepsilon)\Delta(G, H_{\sigma^*}).$$

Thus, this work also adds a significant contribution to (Ailon et al., 2005; Ailon, 2007; Kenyon-Mathieu & Schudy, 2007).

5.4 Weak vs. Strong Regret Functions

For the proof of the regret bound of Theorem 2, we used the fact that the minimizer $\tilde h$ in the definitions (5)-(6) of $R'_{\mathrm{class}}$ could be determined independently for each pair $u, v \in U$, using (9). This could also be done for strong regrets if the distribution $D$ on $V, \tau^*$ satisfied the following pairwise IIA condition.

Definition 8 A distribution $D$ on subsets $V \subseteq U$ and generalized rankings $\tau^*$ on $V$ satisfies pairwise independence of irrelevant alternatives (pairwise IIA) if for all $u, v \in U$ and for any two subsets $V_1, V_2 \supseteq \{u, v\}$,

$$\mathbb{E}_{\tau^*|V_1}[\tau^*(u, v)] = \mathbb{E}_{\tau^*|V_2}[\tau^*(u, v)].$$

Note: We chose the terminology IIA to match that used in Arrow's seminal work (Arrow, 1950) to describe a similar notion.

When pairwise IIA holds, the average ground truth relation between $u$ and $v$, conditioned on $u, v$ included in $V$, is independent of $V$. Recall that a bipartite $\tau^*$ is derived from a pair $\sigma^*, \omega$, where $\omega$ is defined using a term $1/(m^+ m^-)$, for compatibility with the definition of the AUC. The numbers $m^+$ and $m^-$ depend on the underlying sizes of the positive and negative sets partitioning $V$ and therefore cannot be inferred from $\tau^*(u, v)$ alone. Thus, in the standard bipartite case, the pairwise IIA assumption is not natural. If, however, we replaced our definitions in the bipartite case and used

$$\omega(i, j) = \begin{cases} 1 & (i \le m^+) \wedge (j > m^+) \\ 1 & (j \le m^+) \wedge (i > m^+) \\ 0 & \text{otherwise,} \end{cases} \qquad (44)$$

instead of (2), then it would be reasonable to believe that pairwise IIA does hold in the bipartite case. In fact, it would be reasonable to make the stronger assumption that for any fixed $u, v \in U$ and $V_1, V_2 \supseteq \{u, v\}$, the distributions of the random variables $\tau^*(u, v)|V_1$ and $\tau^*(u, v)|V_2$ are equal. This corresponds to the intuition that, when comparing a pair $u, v$ in the context of a set $V$ containing them, human labelers are not as influenced by the irrelevant information $V \setminus \{u, v\}$ as they would be by $V \setminus \{u\}$ if asked to evaluate single elements $u$. The irrelevant information in $V$ is often referred to as an anchor in experimental psychology and economics (Ariely et al., 2008).

Our regret bounds would still hold if we used (44), but we chose (2) to present our results in terms of the familiar average pairwise misranking error or AUC loss.

Another possible assumption allowing the use of strong regrets is to let the preference function learned in the first stage depend on $V$. This is the assumption implicitly made by Balcan et al. (2007) (based on our private communication). We do not further elaborate on this assumption.

6 Conclusion

We described a reduction of the learning problem of ranking to classification. The efficiency of this reduction makes it practical for large-scale information extraction and search engine applications. A finer analysis of QuickSort is likely to further improve our reduction bound by providing a concentration inequality for the algorithm's deviation from its expected behavior using the confidence scores output by the classifier. Our reduction leads to a competitive ranking algorithm that can be viewed as an alternative to the algorithms previously designed for the score-based setting.

7 Acknowledgments

We thank Alina Beygelzimer and John Langford for helpful discussions. Mehryar Mohri's work was partially funded by the New York State Office of Science Technology and Academic Research (NYSTAR).

References

Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., & Roth, D. (2005). Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6, 393–425.
Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT (pp. 32–47).

Ailon, N. (2007). Aggregation of partial rankings, p-ratings and top-m lists. SODA.

Ailon, N., Charikar, M., & Newman, A. (2005). Aggregating inconsistent information: ranking and clustering. Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005 (pp. 684–693). ACM.

Alon, N. (2006). Ranking tournaments. SIAM J. Discrete Math., 20, 137–142.

Ariely, D., Loewenstein, G., & Prelec, D. (2008). Coherent arbitrariness: Stable demand curves without stable preferences. The Quarterly Journal of Economics, 118, 73–105.

Arrow, K. J. (1950). A difficulty in the concept of social welfare. Journal of Political Economy, 58, 328–346.

Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., & Sorkin, G. B. (2007). Robust reductions from ranking to classification. COLT (pp. 604–619). Springer.

Clémençon, S., & Vayatis, N. (2007). Ranking the best instances. Journal of Machine Learning Research, 8, 2671–2699.

Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. J. Artif. Intell. Res. (JAIR), 10, 243–270.

Coppersmith, D., Fleischer, L., & Rudra, A. (2006). Ordering by weighted number of wins gives a good ranking for weighted tournaments. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).

Cortes, C., Mohri, M., & Rastogi, A. (2007a). An alternative ranking problem for search engines. Proceedings of the 6th Workshop on Experimental Algorithms (WEA 2007) (pp. 1–21). Rome, Italy: Springer-Verlag, Heidelberg, Germany.

Cortes, C., Mohri, M., & Rastogi, A. (2007b). Magnitude-preserving ranking algorithms. Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML 2007). Oregon State University, Corvallis, OR.

Cossock, D., & Zhang, T. (2006). Subset ranking using regression. COLT (pp. 605–619).

Crammer, K., & Singer, Y. (2001). Pranking with ranking. Advances in Neural Information Processing Systems 14 (NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada) (pp. 641–647). MIT Press.

Freund, Y., Iyer, R. D., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.

Hedge, R., Jain, K., Williamson, D. P., & van Zuylen, A. (2007). Deterministic pivoting algorithms for constrained ranking and clustering problems. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA).

Hoare, C. (1961). Quicksort: Algorithm 64. Comm. ACM, 4, 321–322.

Joachims, T. (2002). Optimizing search engines using clickthrough data. KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 133–142). New York, NY, USA: ACM Press.

Kenyon-Mathieu, C., & Schudy, W. (2007). How to rank with few errors. STOC '07: Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing (pp. 95–103). New York, NY, USA: ACM Press.

Lehmann, E. L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco, California: Holden-Day.
Montague, M. H., & Aslam, J. A. (2002). Condorcet fusion for improved retrieval. Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 4-9, 2002 (pp. 538–548). ACM.

Rudin, C., Cortes, C., Mohri, M., & Schapire, R. E. (2005). Margin-based ranking meets boosting in the middle. Learning Theory, 18th Annual Conference on Learning Theory, COLT 2005, Bertinoro, Italy, June 27-30, 2005, Proceedings (pp. 63–78). Springer.

Williamson, D. P., & van Zuylen, A. (2007). Deterministic algorithms for rank aggregation and other ranking and clustering problems. Proceedings of the 5th Workshop on Approximation and Online Algorithms (WAOA) (to appear).