An Efficient Reduction of Ranking to Classification

Nir Ailon
Google Research
76 Ninth Ave, 4th Floor
New York, NY 10011
nailon@google.com

Mehryar Mohri
Courant Institute and Google Research
251 Mercer Street
New York, NY 10012
mohri@cims.nyu.edu

Abstract

This paper describes an efficient reduction of the learning problem of ranking to binary classification. The reduction is randomized and guarantees a pairwise misranking regret bounded by that of the binary classifier, improving on a recent result of Balcan et al. (2007) which ensures only twice that upper bound. Moreover, our reduction applies to a broader class of ranking loss functions, admits a simple proof, and the expected time complexity of our algorithm in terms of the number of calls to a classifier or preference function is also improved from $\Omega(n^2)$ to $O(n \log n)$. In addition, when only the top $k$ ranked elements are required ($k \ll n$), as in many applications in information extraction or search engine design, the time complexity of our algorithm can be further reduced to $O(k \log k + n)$. Our reduction and algorithm are thus practical for realistic applications where the number of points to rank exceeds several thousands. Many of our results also extend beyond the bipartite case previously studied. To further complement them, we also derive lower bounds for any deterministic reduction of ranking to binary classification, proving that randomization is necessary to achieve our reduction guarantees.

1 Introduction

The learning problem of ranking arises in many modern applications, including the design of search engines, information extraction, and movie recommendation systems. In these applications, the ordering of the documents or movies returned is a critical aspect of the system.

The problem has been formulated within two distinct settings. In the score-based setting, the learning algorithm receives a labeled sample of pairwise preferences and returns a scoring function $f: U \to \mathbb{R}$ which induces a linear ordering of the points in the set $U$. Test points are simply ranked according to the values of $f$ for those points. Several ranking algorithms, including RankBoost (Freund et al., 2003; Rudin et al., 2005), SVM-type ranking (Joachims, 2002), and other algorithms such as PRank (Crammer & Singer, 2001; Agarwal & Niyogi, 2005), were designed for this setting. Generalization bounds have been given in this setting for the pairwise misranking error (Freund et al., 2003; Agarwal et al., 2005), including margin-based bounds (Rudin et al., 2005). Stability-based generalization bounds have also been given in this setting for wide classes of ranking algorithms, both in the case of bipartite ranking (Agarwal & Niyogi, 2005) and in the general case (Cortes et al. 2007b; 2007a).

A somewhat different two-stage scenario was considered in other publications starting with (Cohen et al., 1999), and later (Balcan et al., 2007), which we will refer to as the preference-based setting. In the first stage of that setting, a preference function $h: U \times U \to [0, 1]$ is learned, where values of $h(u, v)$ closer to one indicate that $u$ is ranked above $v$ and values closer to zero the opposite. $h$ is typically assumed to be the output of a classification algorithm trained on a sample of labeled pairs, and can be, for example, a convex combination of simpler preference functions as in (Cohen et al., 1999). A crucial difference with the score-based setting is that, in general, the preference function $h$ may not induce a linear ordering.
The relation it induces may be non-transitive; thus we may have, for example, $h(u, v) = h(v, w) = h(w, u) = 1$ for three distinct points $u$, $v$, and $w$. To rank a test subset $V \subseteq U$, in the second stage, the algorithm orders the points in $V$ by making use of the preference function $h$ learned in the first stage. The subset ranking setup examined by Cossock and Zhang (2006), though distinct, also bears some resemblance to this setting.

This paper deals with the preference-based ranking setting just described. The advantage of this setting is that the learning algorithm is not required to return a linear ordering of all points in $U$, which may be impossible to achieve faultlessly in accordance with a general, possibly non-transitive, pairwise preference labeling. This is more likely to be achievable exactly, or with a better approximation, when the algorithm is instead requested to supply a linear ordering only for limited subsets $V \subseteq U$. When the preference function is obtained as the output of a binary classification algorithm, the preference-based setting can be viewed as a reduction of ranking to classification. The second stage specifies how the ranking is obtained using the preference function.

Cohen et al. (1999) showed that, in the second stage of the preference-based setting, the general problem of finding a linear ordering with as few pairwise misrankings as possible with respect to the preference function $h$ is NP-complete. The authors presented a greedy algorithm based on the tournament degree, that is, for a given element $u$, the difference between the number of elements it is preferred to and the number of those preferred to $u$. The bound proven by the authors, formulated in terms of the pairwise disagreement loss $l$ with respect to the preference function $h$, can be written as $l(\sigma_{\mathrm{greedy}}, h) \le 1/2 + l(\sigma_{\mathrm{optimal}}, h)/2$, where $l(\sigma_{\mathrm{greedy}}, h)$ is the loss achieved by the permutation $\sigma_{\mathrm{greedy}}$ returned by their algorithm and $l(\sigma_{\mathrm{optimal}}, h)$ the one achieved by the optimal permutation $\sigma_{\mathrm{optimal}}$ with respect to the preference function $h$. This bound was given for the general case of ranking but, in the particular case of bipartite ranking, a random ordering can achieve a pairwise disagreement loss of $1/2$, and thus the bound is not informative. Note that the algorithm can be viewed as a derandomization technique.

More recently, Balcan et al. (2007) studied the bipartite ranking problem. In this particular case, the loss of an output ranking is measured by counting pairs of ranked elements, one of which is positive and the other negative (based on some ground truth). They showed that sorting the elements of $V$ according to the same tournament degree used by Cohen et al. (1999) guarantees a regret of at most $2r$ using a binary classifier with regret $r$. (The regret is defined as a calibration of the loss function that aligns a theoretical optimum with 0.) However, due to the quadratic nature of the definition of the tournament degree, their algorithm requires $\Omega(n^2)$ calls to the preference function $h$, where $n = |V|$ is the number of objects to rank.
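To make the quadratic cost concrete, here is a minimal sketch (ours, for illustration; not code from the cited papers) of ordering by tournament degree, as in the sort-by-degree procedure discussed above; it must call the preference function on every ordered pair.

```python
def sort_by_degree(V, h):
    """Order V by tournament degree: for each u, the number of elements
    u is preferred to minus the number of elements preferred to u.
    Requires a call to h for every ordered pair, hence Theta(n^2) calls."""
    degree = {u: sum(h(u, v) - h(v, u) for v in V if v != u) for u in V}
    # A larger degree means u "beats" more elements, so u is ranked earlier.
    return sorted(V, key=lambda u: -degree[u])
```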
We describe an efficient randomized algorithm for the second stage of the preference-based setting and thus for reducing the learning problem of ranking to binary classification. We improve on the recent result of Balcan et al. (2007) by guaranteeing a pairwise misranking regret of at most $r$ using a binary classifier with regret $r$, thereby improving the bound by a factor of 2. Our reduction applies, with different constants, to a broader class of ranking loss functions, admits a simple proof, and the expected running time complexity of our algorithm in terms of the number of calls to a classifier or preference function is improved from $\Omega(n^2)$ to $O(n \log n)$. Furthermore, when only the top $k$ ranked elements are required ($k \ll n$), as in many applications in information extraction or search engines, the time complexity of our algorithm can be further reduced to $O(k \log k + n)$. Our reduction and algorithm are thus practical for realistic applications where the number of points to rank exceeds several thousands.

The price paid for this improvement is in resorting to nondeterminism. Indeed, our algorithms are randomized, but this turns out to be necessary. We give a simple proof of a lower bound of $2r$ for any deterministic reduction of ranking to binary classification with classification regret $r$, thereby generalizing to all deterministic reductions a lower bound result of Balcan et al. (2007).

To appreciate our improvement of the reduction bound from a factor of 2 to 1, consider the case of a binary classifier with an error rate of just 25%, which is quite reasonable in many applications. Assume that the Bayes error is close to zero for the classification problem and, similarly, that for the ranking problem the regret and loss approximately coincide. Then, the bound of Balcan et al. (2007) guarantees for the ranking algorithm a pairwise misranking error of at most 50%. But, since a random ranking can achieve 50% pairwise misranking error, the bound turns out not to be informative in that case. Instead, with a factor of 1, the bound ensures a pairwise misranking error of at most 25%.

Many of our results also extend beyond the bipartite case previously studied by Balcan et al. (2007) to the general case of ranking. A by-product of our proofs is a bound on the pairwise disagreement loss with respect to the preference function $h$ that we will compare to the result given by Cohen et al. (1999).

The algorithm used by Balcan et al. (2007) to produce a ranking based on the preference function is known as sort-by-degree and has been recently used in the context of minimizing the feedback arc set in tournaments (Coppersmith et al., 2006). Here, we use a different algorithm, QuickSort, which has also been recently used for minimizing the feedback arc set in tournaments (Ailon et al. 2005; 2007). The techniques presented build upon earlier work by Ailon et al. (2005; 2007) on combinatorial optimization problems over rankings and clustering.

The remainder of the paper is structured as follows. In Section 2, we introduce the definitions and notation used in future sections and introduce a general family of loss functions for ranking. Section 3 describes a simple and efficient algorithm for reducing ranking to binary classification, proves several bounds guaranteeing the quality of the ranking produced by the algorithm, and analyzes the running-time complexity of our algorithm. In Section 4, we derive a lower bound for any deterministic reduction of ranking to binary classification. In Section 5, we discuss the relationship of the algorithm and its proof with previous related work in combinatorial optimization, and discuss key assumptions related to the notion of regret in this context.

2 Preliminaries

This section introduces several preliminary definitions necessary for the presentation of our results.
In what follows, $U$ will denote a universe of elements, e.g., the collection of all possible query-result pairs returned by a web search task, and $V \subseteq U$ will denote a small subset thereof, e.g., a preliminary list of relevant results for a given query. For simplicity of notation, we will assume that $U$ is a set of integers, so that we are always able to choose a minimal canonical element in a finite subset, as we do in (9) below. This arbitrary ordering should not be confused with the ranking problem we are considering.

2.1 General Definitions and Notation

We first briefly discuss the learning setting and assumptions made here and compare them with those of Balcan et al. (2007) and Cohen et al. (1999). In what follows, $V \subseteq U$ represents a finite subset extracted from some arbitrary universe $U$, which is the set we wish to rank at each round. The notation $S(V)$ denotes the set of rankings on $V$, that is the set of injections from $V$ to $[n] = \{1, \ldots, n\}$, where $n = |V|$. If $\sigma \in S(V)$ is such a ranking, then $\sigma(u)$ is the rank of an element $u \in V$, where lower ranks are interpreted as preferable ones. More precisely, we say that $u$ is preferred over $v$ with respect to $\sigma$ if $\sigma(u) < \sigma(v)$. For convenience, and abusing notation, we also write $\sigma(u, v) = 1$ if $\sigma(u) < \sigma(v)$ and $\sigma(u, v) = 0$ otherwise. We let $\binom{V}{k}$ denote the collection of all subsets of size exactly $k$ of $V$. To distinguish between functions taking ordered vs. unordered arguments in what follows, we will use the notation $F_{u_1 u_2 \ldots u_k}$ to denote $k$ unordered arguments for a function $F$ defined on $\binom{V}{k}$, and $F(u_1, u_2, \ldots, u_k)$ to denote $k$ ordered arguments for a function $F$ defined on $V \times \cdots \times V$ ($k$ times).

2.2 Ground Truth

As in standard learning scenarios, at each round there is an underlying unknown ground truth which we wish the output of the learning algorithm to agree with as much as possible. The ground truth is a ranking that we denote by $\sigma^* \in S(V)$, equipped with a function $\omega$ assigning different importance weights to pairs of positions. The combination $(\sigma^*, \omega)$ is extremely expressive, as we shall see below in Section 2.5. It can encode in particular the standard average pairwise misranking or AUC loss assumed by Balcan et al. (2007) in a bipartite setting, but also more sophisticated ones capturing misrankings among the top $k$, and other losses that are close to but distinct from those considered by Clémençon and Vayatis (2007).

2.3 Preference Function

As with both (Cohen et al., 1999) and (Balcan et al., 2007), we assume that a preference function $h: U \times U \to [0, 1]$ is learned in a first learning stage. The convention is that the higher $h(u, v)$ is, the stronger our belief that $u$ should be preferred to $v$. The function $h$ satisfies pairwise consistency, $h(u, v) + h(v, u) = 1$, but need not even be transitive on 3-tuples (cycles may be induced). The second stage uses $h$ to output a proper ranking $\sigma$, as we shall further discuss below. The running time complexity of the second stage is measured with respect to the number of calls to $h$.

2.4 Output of Learning Algorithm

The final output of the second stage of the algorithm, $\sigma$, is a proper ranking of $V$. Its cost is measured differently in (Balcan et al., 2007) and (Cohen et al., 1999). In the former, it is measured against the unknown ground truth and compared to the cost of $h$ against the ground truth. The rationale is that the information encoded in $h$ contains all the pairwise preference information obtainable using state-of-the-art binary classification.
In (Cohen et al., 1999), $\sigma$ is measured against the given preference function $h$ and compared to the theoretically best one can obtain. Thus, there $h$ plays the role of a known ground truth.

2.5 Loss Functions

We are now ready to define the loss functions used to measure the quality of an output ranking $\sigma$ either with respect to $\sigma^*$, as in (Balcan et al., 2007), or with respect to $h$, as in (Cohen et al., 1999).

The following general loss function $L_\omega$ measures the quality of a ranking $\sigma$ with respect to a desired one $\sigma^*$ using a weight function $\omega$ (described below):

$$L_\omega(\sigma, \sigma^*) = \binom{n}{2}^{-1} \sum_{u \neq v} \sigma(u, v)\, \sigma^*(v, u)\, \omega(\sigma^*(u), \sigma^*(v)).$$

The sum is over all pairs $u, v$ in the domain $V$ of the rankings $\sigma, \sigma^*$. It counts the number of inverted pairs $u, v \in V$, weighted by $\omega$, which assigns importance coefficients to pairs based on their positions in the ground truth $\sigma^*$. The function $\omega$ must satisfy the following three natural axioms, which will be necessary in our analysis:

(P1) Symmetry: $\omega(i, j) = \omega(j, i)$ for all $i, j$;
(P2) Monotonicity: $\omega(i, j) \le \omega(i, k)$ if either $i < j < k$ or $i > j > k$;
(P3) Triangle inequality: $\omega(i, j) \le \omega(i, k) + \omega(k, j)$.

This definition is very general and encompasses many useful, well-studied distance functions. Setting $\omega(i, j) = 1$ for all $i \neq j$ yields the unweighted pairwise misranking measure, or the so-called Kemeny distance function. For a fixed integer $k$, the following function

$$\omega(i, j) = \begin{cases} 1 & \text{if } ((i \le k) \vee (j \le k)) \wedge (i \neq j) \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

can be used to emphasize the ranking of the top $k$ elements. Misranking of pairs with at least one element ranked among the top $k$ is penalized by this function. This can be of interest in applications such as information extraction or search engines where the ranking of the top documents matters more. For this emphasis function, all elements ranked below $k$ are in a tie. In fact, it is possible to encode any tie relation using $\omega$.

Bipartite Ranking. In a bipartite ranking scenario, $V$ is partitioned into a positive set $V^+$ and a negative set $V^-$ of sizes $m^+$ and $m^-$ respectively, where $m^+ + m^- = |V| = n$. For this scenario (Balcan et al., 2007; Hanley & McNeil, 1982; Lehmann, 1975), we are often interested in the AUC score of $\sigma \in S(V)$, defined as follows:

$$1 - \mathrm{AUC}(V^+, V^-, \sigma) = \frac{1}{m^+ m^-} \sum_{(u, v) \in V^+ \times V^-} \sigma(v, u).$$

This expression measures the probability, given a random crucial pair of elements, one of which is positive and the other negative, that the pair is misordered in $\sigma$. It is immediate to verify that this is equal to $L_\omega(\sigma, \sigma^*)$, where $\sigma^*$ is any ranking placing $V^+$ ahead of $V^-$, and

$$\omega(i, j) = \frac{\binom{n}{2}}{m^+ m^-} \begin{cases} 1 & (i \le m^+) \wedge (j > m^+) \\ 1 & (j \le m^+) \wedge (i > m^+) \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

Simplified notation. To avoid carrying $\sigma^*$ and $\omega$, we will define for convenience $\tau^*(u, v) = \sigma^*(u, v)\, \omega(\sigma^*(u), \sigma^*(v))$ and

$$L(\sigma, \tau^*) := L_\omega(\sigma, \sigma^*) = \binom{n}{2}^{-1} \sum_{u \neq v} \sigma(u, v)\, \tau^*(v, u).$$

We will formally call $\tau^*$ a generalized ranking, and it will take the role of the ground truth. If $\tau^*$ is obtained as in (2) for some integers $m^+, m^-$ satisfying $m^+ + m^- = n$, then we will say that the corresponding $\tau^*$ is bipartite. It is immediate to verify from the properties of the weight function $\omega$ that for all $u, v, w \in V$,

$$\tau^*(u, v) \le \tau^*(u, w) + \tau^*(w, v). \qquad (3)$$

If $\tau^*$ is bipartite, then additionally,

$$\tau^*(u, v) + \tau^*(v, w) + \tau^*(w, u) = \tau^*(v, u) + \tau^*(w, v) + \tau^*(u, w). \qquad (4)$$
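To illustrate the definitions, the following sketch (ours; the rank-dictionary representation is an assumption made for illustration) computes $L_\omega(\sigma, \sigma^*)$ together with the Kemeny weight and the top-$k$ emphasis weight of (1).

```python
from math import comb

def loss(sigma, sigma_star, omega):
    """General pairwise loss L_omega(sigma, sigma_star). A ranking is a
    dict mapping each element of V to its rank in {1, ..., n}, lower
    rank meaning preferred; omega(i, j) weights a misranked pair by the
    positions i, j of its elements in the ground truth sigma_star."""
    V = list(sigma)
    n = len(V)
    total = 0.0
    for idx, u in enumerate(V):
        for v in V[idx + 1:]:
            # A pair contributes only if sigma and sigma_star disagree
            # on the relative order of u and v (an inverted pair).
            if (sigma[u] < sigma[v]) != (sigma_star[u] < sigma_star[v]):
                total += omega(sigma_star[u], sigma_star[v])
    return total / comb(n, 2)

def kemeny(i, j):
    """Unweighted pairwise misranking: every inverted pair counts as 1."""
    return 1.0

def top_k_weight(k):
    """Weight function (1): only pairs touching the top k positions count."""
    return lambda i, j: 1.0 if (i <= k or j <= k) and i != j else 0.0
```

With kemeny, for instance, loss returns the fraction of discordant pairs between $\sigma$ and $\sigma^*$.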
2.6 Preference Loss Function

We need to extend the definition to measure the loss of a preference function $h$ with respect to $\sigma^*$. In contrast with the loss function just defined, we need to define a preference loss measuring a generalized ranking's disagreements with respect to a preference function $h$ when measured against $\tau^*$. We can readily extend the loss definitions defined above as follows:

$$L(h, \tau^*) = L_\omega(h, \sigma^*) = \binom{n}{2}^{-1} \sum_{u \neq v} h(u, v)\, \tau^*(v, u).$$

As explained above, $L(h, \tau^*)$ is the ideal loss the learning algorithm will aim to achieve with the output ranking hypothesis $\sigma$.

2.7 Input Distribution

The set $V$ we wish to rank, together with the ground truth $\tau^*$, are drawn as a pair from a distribution we denote by $D$. In other words, $\tau^*$ may be a random function of $V$. For our analysis of the loss, though, it is convenient to think of $V$ and $\tau^*$ as fixed, because our bounds will be conditioned on fixed $V$ and will easily generalize to the stochastic setting. Finally, we say that $D$ is bipartite if $\tau^*$ is bipartite with probability 1.

2.8 Regret Functions

The notion of regret is commonly used to measure the difference between the loss incurred by a learning algorithm and that of some best alternative. This section introduces the definitions of regret that we will be using to quantify the quality of a ranking algorithm in this context. We will define a notion of weak and strong regret for both ranking and classification losses as follows.

The weak ranking and classification regret functions $R'_{\mathrm{rank}}$ and $R'_{\mathrm{class}}$ are defined as follows:

$$R'_{\mathrm{rank}}(A, D) = \mathbb{E}_{V, \tau^*, s}[L(A_s(V), \tau^*)] - \mathbb{E}_V \min_{\tilde\sigma \in S(V)} \mathbb{E}_{\tau^*|V}[L(\tilde\sigma, \tau^*)] \qquad (5)$$

$$R'_{\mathrm{class}}(h, D) = \mathbb{E}_{V, \tau^*}[L(h|_V, \tau^*)] - \mathbb{E}_V \min_{\tilde h} \mathbb{E}_{\tau^*|V}[L(\tilde h, \tau^*)], \qquad (6)$$

where $\tau^*|V$ is the random variable $\tau^*$ conditioned on fixed $V$, $A$ is a possibly randomized algorithm using a stream of random bits $s$ (and a pre-learned preference function $h$) to output a ranking $A_s(V) \in S(V)$, and $\cdot|_V$ is a restriction operator on preference functions defined in the natural way.

To define a strong ranking regret, we subtract from the loss function the minimal loss that could have been obtained from a global ranking of $U$. More precisely, we define:

$$R_{\mathrm{rank}}(A, D) = \mathbb{E}_{V, \tau^*, s}[L(A_s(V), \tau^*)] - \min_{\tilde\sigma \in S(U)} \mathbb{E}_{V, \tau^*}[L(\tilde\sigma|_V, \tau^*)], \qquad (7)$$

where $\tilde\sigma|_V \in S(V)$ is defined by restricting the ranking $\tilde\sigma \in S(U)$ to $V$ in a natural way. As for the strong preference loss, it is natural to subtract the minimal loss over all, possibly cyclic, preference functions on $U$. More precisely, we define:

$$R_{\mathrm{class}}(h, D) = \mathbb{E}_{V, \tau^*}[L(h|_V, \tau^*)] - \min_{\tilde h} \mathbb{E}_{V, \tau^*}[L(\tilde h|_V, \tau^*)], \qquad (8)$$

where the minimum is over $\tilde h$, a preference function over $U$.

The difference between $R$ and $R'$ for both ranking and classification is that in their definitions the min operator and the $\mathbb{E}_V$ operator are permuted. The following inequalities follow from the concavity of min and Jensen's inequality: $R_{\mathrm{rank}}(A, D) \le R'_{\mathrm{rank}}(A, D)$ and $R_{\mathrm{class}}(h, D) \le R'_{\mathrm{class}}(h, D)$.

For a fixed $V$ and any $u, v \in V$, let $e(u, v) = \mathbb{E}_{\tau^*|V}[\tau^*(u, v)]$. The reason we work with $R'_{\mathrm{class}}$ is that the preference function $\tilde h$ over $U$ obtaining the min in the definition of $R'_{\mathrm{class}}$ can be determined locally for any $u, v \in U$ by

$$\tilde h(u, v) = \begin{cases} 1 & e(u, v) > e(v, u) \\ 0 & e(v, u) > e(u, v) \\ 1_{u > v} & \text{otherwise.} \end{cases} \qquad (9)$$

Also, equation (3) holds true with $e$ replacing $\tau^*$, and similarly for (4) if $D$ is bipartite (by linearity of expectation). We cannot do a similar thing when working with the strong regret function $R_{\mathrm{class}}$. The reason we work with the weak ranking regret is for compatibility with our choice of weak classification regret, although our upper bounds on $R'_{\mathrm{rank}}$ trivially apply to $R_{\mathrm{rank}}$ by virtue of the inequality $R_{\mathrm{rank}}(A, D) \le R'_{\mathrm{rank}}(A, D)$ above. In Section 5.4, we will discuss certain assumptions under which our results work for the notion of strong regret as well. Note that Balcan et al. (2007) also implicitly use such an assumption in deriving their regret bounds. Our regret bounds (second part of Theorem 2) hold under the same assumption. Our result is thus exactly comparable with theirs.
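To illustrate (9), here is a small sketch (ours; the sampled-pairwise representation of $\tau^*|V$ is an assumption made for illustration) that estimates $e(u, v)$ and evaluates the pointwise minimizer $\tilde h$.

```python
def estimate_e(tau_samples):
    """Estimate e(u, v) = E_{tau*|V}[tau*(u, v)] by averaging sampled
    generalized rankings, each given as a dict mapping ordered pairs
    (u, v) of V to the value tau*(u, v)."""
    e = {}
    for tau in tau_samples:
        for pair, weight in tau.items():
            e[pair] = e.get(pair, 0.0) + weight / len(tau_samples)
    return e

def h_tilde(e, u, v):
    """Pointwise minimizer of the weak classification regret, eq. (9):
    prefer the direction with the larger expected ground-truth weight,
    breaking exact ties by the arbitrary integer order on U."""
    if e.get((u, v), 0.0) > e.get((v, u), 0.0):
        return 1
    if e.get((v, u), 0.0) > e.get((u, v), 0.0):
        return 0
    return 1 if u > v else 0
```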
3 Algorithm for Ranking Using a Preference Function

This section describes and analyzes an algorithm for obtaining a global ranking of a subset using a pre-learned preference function $h$, which corresponds to the second stage of the preference-based setting. Our bound on the loss will be derived using conditional expectation on the preference loss, assuming a fixed subset $V \subseteq U$ and a fixed ground truth $\tau^*$. To further simplify the analysis, we assume that $h$ is binary, that is $h(u, v) \in \{0, 1\}$ for all $u, v \in U$.

3.1 Description

One simple idea for obtaining a global ranking of the points in $V$ consists of using a standard comparison-based sorting algorithm where the comparison operation is based on the preference function. However, since in general the preference function is not transitive, the properties of the resulting permutation are unclear. This section shows, however, that the permutation generated by the standard QuickSort algorithm provides excellent guarantees.

Thus, the algorithm we suggest is the following. Pick a pivot element $u$ uniformly at random from $V$. For each $v \neq u$, place $v$ on the left of $u$ if $h(v, u) = 1$, and to its right otherwise. Proceed recursively with the array to the left of $u$ and the one to its right, and return the concatenation of the permutation returned by the left recursion, $u$, and the permutation returned by the right recursion.

We will denote by $Q^h_s(V)$ the permutation resulting from running QuickSort on $V$ using the preference function $h$, where $s$ is the random stream of bits used by QuickSort for the selection of the pivots. As we shall see in the next two sections, this algorithm produces high-quality global rankings in a time-efficient manner.

3.2 Ranking Quality Guarantees

The following theorems bound the ranking quality of the algorithm described, for both loss and regret, in the general and bipartite cases.

Theorem 1 (Loss bounds in general case) For any fixed subset $V \subseteq U$, preference function $h$ on $V$, and generalized ranking $\tau^*$ on $V$, the following bound holds:

$$\mathbb{E}_s[L(Q^h_s(V), \tau^*)] \le 2\, L(h, \tau^*).$$
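The procedure of Section 3.1 translates directly into code. The following minimal sketch (ours) implements it; because $h$ may contain cycles, different pivot choices can yield different outputs, and Theorem 1 bounds the loss in expectation over this randomness.

```python
import random

def quicksort_rank(V, h, rng=random):
    """Rank V by randomized QuickSort, using the (possibly
    non-transitive) binary preference function h as the comparator.
    Returns the elements from most to least preferred and makes an
    expected O(n log n) calls to h."""
    if len(V) <= 1:
        return list(V)
    pivot = rng.choice(V)
    left = [v for v in V if v != pivot and h(v, pivot) == 1]   # v preferred to pivot
    right = [v for v in V if v != pivot and h(v, pivot) == 0]  # pivot preferred to v
    return quicksort_rank(left, h, rng) + [pivot] + quicksort_rank(right, h, rng)
```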
3.3 Analysis of QuickSort

Assume $V$ is fixed, and let $Q_s = Q^h_s(V)$ be the (random) ranking output by QuickSort on $V$ using the preference function $h$. During the execution of QuickSort, the order between two elements $u, v \in V$ is determined in one of two ways:

- Directly: $u$ (or $v$) was selected as the pivot with $v$ (resp. $u$) present in the same sub-array in a recursive call to QuickSort. We denote by $p_{uv} = p_{vu}$ the probability of that event. In that case, the algorithm orders $u$ and $v$ according to the preference function $h$.

- Indirectly: a third element $w \in V$ is selected as pivot with $w, u, v$ all present in the same sub-array in a recursive call to QuickSort, and $u$ is assigned to the left sub-array and $v$ to the right (or vice-versa). Let $p_{uvw}$ denote the probability of the event that $u$, $v$, and $w$ are present in the same array in a recursive call to QuickSort and that one of them is selected as pivot. Note that conditioned on that event, each of these three elements is equally likely to be selected as the pivot, since the pivot selection is based on a uniform distribution. If (say) $w$ is selected among the three, then $u$ will be placed on the left of $v$ if $h(u, w) = h(w, v) = 1$, and to its right if $h(v, w) = h(w, u) = 1$. In all other cases, the order between $u$ and $v$ will be determined only in a deeper nested call to QuickSort.

Let $X, Y : V \times V \to \mathbb{R}$ be any two functions on ordered pairs $u, v \in V$, and let $Z : \binom{V}{2} \to \mathbb{R}$ be a function on unordered pairs. We define three functions $[X, Y] : \binom{V}{2} \to \mathbb{R}$, $[X] : \binom{V}{3} \to \mathbb{R}$ and $[Z] : \binom{V}{3} \to \mathbb{R}$ as follows:

$$[X, Y]_{uv} = X(u, v)Y(v, u) + X(v, u)Y(u, v),$$

$$[X]_{uvw} = \tfrac{1}{3}\big(h(u, v)h(v, w)X(w, u) + h(w, v)h(v, u)X(u, w)\big) + \tfrac{1}{3}\big(h(v, u)h(u, w)X(w, v) + h(w, u)h(u, v)X(v, w)\big) + \tfrac{1}{3}\big(h(u, w)h(w, v)X(v, u) + h(v, w)h(w, u)X(u, v)\big),$$

$$[Z]_{uvw} = \tfrac{1}{3}\big(h(u, v)h(v, w) + h(w, v)h(v, u)\big)Z_{uw} + \tfrac{1}{3}\big(h(v, u)h(u, w) + h(w, u)h(u, v)\big)Z_{vw} + \tfrac{1}{3}\big(h(u, w)h(w, v) + h(v, w)h(w, u)\big)Z_{uv}.$$

Lemma 3 (QuickSort Decomposition)

1. For any $Z : \binom{V}{2} \to \mathbb{R}$,
$$\sum_{u < v} Z_{uv} = \sum_{u < v} p_{uv} Z_{uv} + \sum_{u < v < w} p_{uvw} [Z]_{uvw}.$$

$$\tilde h(u, v) = \begin{cases} 1 & e(u, v) > e(v, u) \\ 0 & e(u, v) < e(v, u) \\ 1_{u > v} & \text{otherwise (equality).} \end{cases} \qquad (23)$$

Using Lemma 3 and linearity, the LHS of (22) can be rewritten as

$$\binom{n}{2}^{-1} \sum_{u < v} p_{uv}\, [h - \tilde h, e]_{uv} \ge 0.$$

Then, by conditioning on $G_L, G_R$, taking expectations on both sides of (40), and by induction,

$$\mathbb{E}[t_k(G) \mid G_L, G_R] \le n - 1 + c\, n_L + c \min\{k, n_L\} \log \min\{k, n_L\} + c\, n_R\, 1_{n_L < k}.$$

The event $\{n_L \ge n - k\}$ occurs when a vertex of out-degree at least $n - k \ge 7n/8$ is chosen as pivot. For a random pivot $v \in V$, where $V$ is the vertex set of $G$, $\mathbb{E}[\mathrm{outdeg}(v)^2] \le n^2/3 + n/2 \le n^2/2.9$. Indeed, each pair of edges $(v, u_1) \in A$ and $(v, u_2) \in A$ for $u_1 \neq u_2$ gives rise to a triangle, which is counted exactly twice in the cross-terms; hence the $n^2/3$ term, which upper-bounds $2\binom{n}{3}/n$; the $n/2$ term bounds the diagonal. Thus, $\Pr[\mathrm{outdeg}(v) \ge 7n/8] = \Pr[\mathrm{outdeg}(v)^2 \ge 49n^2/64] \le 0.46$ (by Markov's inequality). Plugging this value into our last estimate yields $\mathbb{E}[t_k(G)] \le n - 1 + c(n-1)/2 + c\, k \log k + 0.46 \times cn$, which is at most $cn + c\, k \log k$ for $c \ge 30$, as required.

4 Lower Bounds

Let $r$ denote the classification regret. Balcan et al. (2007) proved a lower bound of $2r$ for the regret of the algorithm MFAT, defined as the solution to the minimum feedback arc-set problem on the tournament $V$ with an edge $(u, v)$ when $h(u, v) = 1$. More precisely, they showed an example of fixed $V$, $h$, and bipartite generalized ranking $\tau^*$ on $V$, such that the classification regret of $h$ tends to $1/2$ of the ranking regret of MFAT on $V, h$. Note that in this case, since $\tau^*$ is a fixed function of $V$, the regret and loss coincide both for classification and for ranking.

Here we give a simple proof of a more general theorem stating that the same bound holds for any deterministic algorithm, including of course MFAT.

Theorem 7 For any deterministic algorithm $A$ taking as input $V \subseteq U$ and a preference function $h$ on $V$ and outputting a ranking $\sigma \in S(V)$, there exists a bipartite distribution $D$ on $(V, \tau^*)$ such that

$$R_{\mathrm{rank}}(A, D) \ge 2\, R_{\mathrm{class}}(h, D). \qquad (43)$$

Note that the theorem implies that, in the bipartite case, no deterministic algorithm converting a preference function into a linear ranking can do better than a randomized algorithm, in expectation. Thus, randomization is essentially necessary in this setting. The proof is based on an adversarial argument. In our construction, the support of $D$ is reduced to a single pair $(V, \tau^*)$ (deterministic input); thus the loss and both the weak and strong regrets coincide, and a similar argument applies to the loss function and the weak regret functions.

Proof: Fix $V = \{u, v, w\}$, and let the support of $D$ be reduced to $(V, \tau^*)$, where the bipartite generalized ranking $\tau^*$ is one that we will select adversarially. Assume a cycle: $h(u, v) = h(v, w) = h(w, u) = 1$. Up to symmetry, there are two options for the output $\sigma$ of $A$ on $V, h$.

1. $\sigma(u) < \sigma(v) < \sigma(w)$: in this case, the adversary can choose $\tau^*$ corresponding to the partition $V^+ = \{w\}$ and $V^- = \{u, v\}$. Clearly, $R_{\mathrm{class}}(h, D)$ now equals $1/2$, since $h$ is penalized only for misranking the pair $(v, w)$, but $R_{\mathrm{rank}}(A, D) = 1$, since $\sigma$ misorders both $(u, w)$ and $(v, w)$.

2. $\sigma(w) < \sigma(v) < \sigma(u)$: in this case, the adversary can choose $\tau^*$ corresponding to the partition $V^+ = \{u\}$ and $V^- = \{v, w\}$. Similarly, $R_{\mathrm{class}}(h, D)$ now equals $1/2$, since $h$ is penalized only for misranking the pair $(u, w)$, while $R_{\mathrm{rank}}(A, D) = 1$, since $\sigma$ misorders both $(u, v)$ and $(u, w)$.

5 Discussion

5.1 History of QuickSort

The textbook algorithm, by now standard, was originally discovered by Hoare (1961). Montague and Aslam (2002) experimented with QuickSort for information retrieval (IR) by aggregating rankings from different sources of retrieval. They claimed an $O(n \log n)$ time bound on the number of comparisons, although the proof seemed to rely on the folklore QuickSort proof without addressing the non-transitivity problem. They proved certain combinatorial bounds on the output of QuickSort and provided an empirical justification of its IR merits. Ailon et al. (2005) also considered the rank aggregation problem and proved theoretical cost bounds for many ranking problems on weighted tournaments. They strengthened these bounds by considering non-deterministic pivoting rules arising from solutions to certain ranking LPs. This work was later extended by Ailon (2007) to deal with rankings with ties, in particular top-$k$ rankings. Hedge et al. (2007) and Williamson and van Zuylen (2007) derandomized the random pivot selection step in QuickSort for many of the combinatorial optimization problems studied by Ailon et al.

5.2 The Decomposition Technique

The technique developed in Lemma 3 is very general and can be used for a wide variety of loss functions and variants of QuickSort involving non-deterministic ordering rules (Ailon et al. 2005; 2007). Such results would typically amount to bounding $[X]_{uvw} / [Z]_{uvw}$ for some carefully chosen functions $X, Z$ depending on the application.
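As a concrete illustration of the decomposition, the following sketch (ours; a numerical check, not part of the paper's analysis) instruments QuickSort to estimate $p_{uv}$ and $p_{uvw}$ by simulation and verifies the identity of Lemma 3 on a small non-transitive instance.

```python
import random
from itertools import combinations

def quicksort_events(V, h, rng):
    """One QuickSort run with preference dict h (h[(u, v)] = 1 iff u is
    preferred to v). Records which unordered pairs were ordered directly
    (one endpoint was the pivot of a shared sub-array) and, for each
    triple, whether all three were present in a sub-array when one of
    them was selected as the pivot."""
    direct, triples = set(), set()

    def rec(arr):
        if len(arr) <= 1:
            return
        p = rng.choice(arr)
        direct.update(frozenset((p, x)) for x in arr if x != p)
        triples.update(frozenset(t) for t in combinations(arr, 3) if p in t)
        rec([x for x in arr if x != p and h[(x, p)] == 1])
        rec([x for x in arr if x != p and h[(x, p)] == 0])

    rec(list(V))
    return direct, triples

def z_bracket(Z, h, u, v, w):
    """[Z]_{uvw} as defined in Section 3.3."""
    return ((h[(u, v)] * h[(v, w)] + h[(w, v)] * h[(v, u)]) * Z[frozenset((u, w))]
            + (h[(v, u)] * h[(u, w)] + h[(w, u)] * h[(u, v)]) * Z[frozenset((v, w))]
            + (h[(u, w)] * h[(w, v)] + h[(v, w)] * h[(w, u)]) * Z[frozenset((u, v))]) / 3.0

rng = random.Random(0)
V = [1, 2, 3, 4]
# A non-transitive preference: 1 beats 2, 2 beats 3, 3 beats 1; 4 loses to all.
h = {(1, 2): 1, (2, 3): 1, (3, 1): 1, (1, 4): 1, (2, 4): 1, (3, 4): 1}
h.update({(v, u): 1 - x for (u, v), x in list(h.items())})
Z = {frozenset(p): rng.random() for p in combinations(V, 2)}

trials = 20000
p_uv = {frozenset(p): 0.0 for p in combinations(V, 2)}
p_uvw = {frozenset(t): 0.0 for t in combinations(V, 3)}
for _ in range(trials):
    direct, triples = quicksort_events(V, h, rng)
    for pair in direct:
        p_uv[pair] += 1.0 / trials
    for t in triples:
        p_uvw[t] += 1.0 / trials

lhs = sum(Z.values())
rhs = (sum(p_uv[pair] * Z[pair] for pair in p_uv)
       + sum(p_uvw[t] * z_bracket(Z, h, *sorted(t)) for t in p_uvw))
print(lhs, rhs)  # the two sides of Lemma 3 should agree up to sampling error
```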
5.3 Combinatorial Optimization vs. Learning of Ranking

QuickSort, sometimes referred to as FAS-Pivot in that context, was used by Ailon et al. (2005; 2007) to approximate certain NP-hard weighted instances of the problem of minimum feedback arc-set in tournaments (Alon, 2006). There is much similarity between the techniques used in that work and those of the analyses of this work, but there is also a significant difference that should be noted.

In the minimum feedback arc-set problem, we are given a tournament $G$ and wish to find an acyclic tournament $H$ on the same vertex set minimizing $\Delta(G, H)$, where $\Delta$ counts the number of edges pointing in opposite directions between $G$ and $H$ (or a weighted version thereof). However, the cost we are considering is $\Delta(G, H_{\sigma^*})$ for some fixed acyclic tournament $H_{\sigma^*}$ induced by some permutation $\sigma^*$ (the ground truth). In this work, we showed in fact that if $G'$ is obtained from $G$ using QuickSort, then $\mathbb{E}[\Delta(G', H_{\sigma^*})] \le 2\Delta(G, H_{\sigma^*})$ for any $\sigma^*$ (Theorem 1). If $H^*$ is the optimal solution to the (weighted) minimum feedback arc-set problem corresponding to $G$, then it is easy to see that

$$\Delta(H^*, H_{\sigma^*}) \le \Delta(G, H^*) + \Delta(G, H_{\sigma^*}) \le 2\Delta(G, H_{\sigma^*}).$$

However, recovering $H^*$ is NP-hard in general. Approximating $\Delta(G, H^*)$ modulo a constant factor $1 + \varepsilon$ using an acyclic tournament $H$, as in the combinatorial optimization world, only guarantees a constant factor of $2 + \varepsilon$:

$$\Delta(H, H_{\sigma^*}) \le \Delta(G, H) + \Delta(G, H_{\sigma^*}) \le (1 + \varepsilon)\Delta(G, H^*) + \Delta(G, H_{\sigma^*}) \le (2 + \varepsilon)\Delta(G, H_{\sigma^*}).$$

Thus, this work also adds a significant contribution to (Ailon et al., 2005; Ailon, 2007; Kenyon-Mathieu & Schudy, 2007).

5.4 Weak vs. Strong Regret Functions

For the proof of the regret bound of Theorem 2, we used the fact that the minimizer $\tilde h$ in the definitions (5)-(6) of $R'_{\mathrm{class}}$ could be determined independently for each pair $u, v \in U$, using (9). This could also be done for strong regrets if the distribution $D$ on $V, \tau^*$ satisfied the following pairwise IIA condition.

Definition 8 A distribution $D$ on subsets $V \subseteq U$ and generalized rankings $\tau^*$ on $V$ satisfies pairwise independence of irrelevant alternatives (pairwise IIA) if for all $u, v \in U$ and for any two subsets $V_1, V_2 \supseteq \{u, v\}$,

$$\mathbb{E}_{\tau^*|V_1}[\tau^*(u, v)] = \mathbb{E}_{\tau^*|V_2}[\tau^*(u, v)].$$

Note: We chose the terminology IIA to match that used in Arrow's seminal work (Arrow, 1950) to describe a similar notion.

When pairwise IIA holds, the average ground truth relation between $u$ and $v$, conditioned on $u, v$ included in $V$, is independent of $V$. Recall that a bipartite $\tau^*$ is derived from a pair $\sigma^*, \omega$, where $\omega$ is defined using a term $1/(m^+ m^-)$, for compatibility with the definition of the AUC. The numbers $m^+$ and $m^-$ depend on the underlying sizes of the positive and negative sets partitioning $V$ and therefore cannot be inferred from $\tau^*(u, v)$ alone. Thus, in the standard bipartite case, the pairwise IIA assumption is not natural. If, however, we replaced our definitions in the bipartite case and used

$$\omega(i, j) = \begin{cases} 1 & (i \le m^+) \wedge (j > m^+) \\ 1 & (j \le m^+) \wedge (i > m^+) \\ 0 & \text{otherwise,} \end{cases} \qquad (44)$$

instead of (2), then it would be reasonable to believe that pairwise IIA does hold in the bipartite case. In fact, it would be reasonable to make the stronger assumption that for any fixed $u, v \in U$ and $V_1, V_2 \supseteq \{u, v\}$, the distributions of the random variables $\tau^*(u, v)|V_1$ and $\tau^*(u, v)|V_2$ are equal. This corresponds to the intuition that, when comparing a pair $u, v$ in the context of a set $V$ containing them, human labelers are not as influenced by the irrelevant information $V \setminus \{u, v\}$ as they would be by $V \setminus \{u\}$ if asked to evaluate single elements $u$. The irrelevant information in $V$ is often referred to as an anchor in experimental psychology and economics (Ariely et al., 2008).

Our regret bounds would still hold if we used (44), but we chose (2) to present our results in terms of the familiar average pairwise misranking error or AUC loss.

Another possible assumption allowing the use of strong regrets is to let the preference function learned in the first stage depend on $V$. This is the assumption implicitly made by Balcan et al. (2007) (based on our private communication). We do not further elaborate on this assumption.

6 Conclusion

We described a reduction of the learning problem of ranking to classification. The efficiency of this reduction makes it practical for large-scale information extraction and search engine applications. A finer analysis of QuickSort is likely to further improve our reduction bound by providing a concentration inequality for the algorithm's deviation from its expected behavior using the confidence scores output by the classifier. Our reduction leads to a competitive ranking algorithm that can be viewed as an alternative to the algorithms previously designed for the score-based setting.

7 Acknowledgments

We thank Alina Beygelzimer and John Langford for helpful discussions. Mehryar Mohri's work was partially funded by the New York State Office of Science Technology and Academic Research (NYSTAR).

References

Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., & Roth, D. (2005). Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6, 393–425.
Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT (pp. 32–47).

Ailon, N. (2007). Aggregation of partial rankings, p-ratings and top-m lists. SODA.

Ailon, N., Charikar, M., & Newman, A. (2005). Aggregating inconsistent information: ranking and clustering. Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005 (pp. 684–693). ACM.

Alon, N. (2006). Ranking tournaments. SIAM J. Discrete Math., 20, 137–142.

Ariely, D., Loewenstein, G., & Prelec, D. (2008). Coherent arbitrariness: Stable demand curves without stable preferences. The Quarterly Journal of Economics, 118, 73–105.

Arrow, K. J. (1950). A difficulty in the concept of social welfare. Journal of Political Economy, 58, 328–346.

Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., & Sorkin, G. B. (2007). Robust reductions from ranking to classification. COLT (pp. 604–619). Springer.

Clémençon, S., & Vayatis, N. (2007). Ranking the best instances. Journal of Machine Learning Research, 8, 2671–2699.

Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. J. Artif. Intell. Res. (JAIR), 10, 243–270.

Coppersmith, D., Fleischer, L., & Rudra, A. (2006). Ordering by weighted number of wins gives a good ranking for weighted tournaments. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA).

Cortes, C., Mohri, M., & Rastogi, A. (2007a). An alternative ranking problem for search engines. Proceedings of the 6th Workshop on Experimental Algorithms (WEA 2007) (pp. 1–21). Rome, Italy: Springer-Verlag, Heidelberg, Germany.

Cortes, C., Mohri, M., & Rastogi, A. (2007b). Magnitude-preserving ranking algorithms. Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML 2007). Oregon State University, Corvallis, OR.

Cossock, D., & Zhang, T. (2006). Subset ranking using regression. COLT (pp. 605–619).

Crammer, K., & Singer, Y. (2001). Pranking with ranking. Advances in Neural Information Processing Systems 14 (NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada) (pp. 641–647). MIT Press.

Freund, Y., Iyer, R. D., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.

Hedge, R., Jain, K., Williamson, D. P., & van Zuylen, A. (2007). Deterministic pivoting algorithms for constrained ranking and clustering problems. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA).

Hoare, C. (1961). Quicksort: Algorithm 64. Comm. ACM, 4, 321–322.

Joachims, T. (2002). Optimizing search engines using clickthrough data. KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 133–142). New York, NY, USA: ACM Press.

Kenyon-Mathieu, C., & Schudy, W. (2007). How to rank with few errors. STOC '07: Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing (pp. 95–103). New York, NY, USA: ACM Press.

Lehmann, E. L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco, California: Holden-Day.
Montague, M. H., & Aslam, J. A. (2002). Condorcet fusion for improved retrieval. Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 4-9, 2002 (pp. 538–548). ACM.

Rudin, C., Cortes, C., Mohri, M., & Schapire, R. E. (2005). Margin-based ranking meets boosting in the middle. Learning Theory, 18th Annual Conference on Learning Theory, COLT 2005, Bertinoro, Italy, June 27-30, 2005, Proceedings (pp. 63–78). Springer.

Williamson, D. P., & van Zuylen, A. (2007). Deterministic algorithms for rank aggregation and other ranking and clustering problems. Proceedings of the 5th Workshop on Approximation and Online Algorithms (WAOA) (to appear).