Query-Level Stability and Generalization in Learning to Rank

Yanyan Lan* (lanyanyan@amss.ac.cn), Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, P. R. China.
Tie-Yan Liu (tyliu@microsoft.com), Microsoft Research Asia, Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing, 100190, P. R. China.
Tao Qin* (qinshitao99@mails.thu.edu.cn), Department of Electronic Engineering, Tsinghua University, Beijing, 100084, P. R. China.
Zhiming Ma (mazm@amt.ac.cn), Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, P. R. China.
Hang Li (hangli@microsoft.com), Microsoft Research Asia, Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing, 100190, P. R. China.

Abstract

This paper is concerned with the generalization ability of learning to rank algorithms for information retrieval (IR). We point out that the key to addressing the learning problem is to look at it from the viewpoint of the query. We define a number of new concepts, including query-level loss, query-level risk, and query-level stability. We then analyze the generalization ability of learning to rank algorithms by deriving query-level generalization bounds for them, using query-level stability as a tool. Such an analysis is very helpful for deriving more advanced algorithms for IR. We apply the proposed theory to the existing algorithms of Ranking SVM and IRSVM. Experimental results on the two algorithms verify the correctness of the theoretical analysis.

1. Introduction

Recently, learning to rank has gained increasing attention in machine learning and information retrieval (IR). When applied to IR, learning to rank is the following task. Given a set of training queries, their associated documents, and the corresponding relevance judgments, a ranking model is created which best represents the relevance of documents with respect to queries. When a user submits a query to the IR system, the trained model assigns a score to each document associated with the query, sorts the documents based on their scores, and presents the top-ranked documents to the user. Average ranking accuracy over a large number of queries is usually used to evaluate the effectiveness of a ranking model. Therefore, from the application's perspective, both training and evaluation should be conducted at the query level.

Many learning to rank algorithms have been proposed in recent years. Examples include pointwise ranking algorithms like McRank (Li et al., 2007), pairwise ranking algorithms like Ranking SVM (Herbrich et al., 1999) and RankBoost (Freund et al., 2003), and listwise ranking algorithms like ListNet (Cao et al., 2007). Analysis of these algorithms in the light of statistical learning theory, however, has been insufficient, particularly with regard to their generalization ability. The pointwise and pairwise approaches transform ranking into classification or regression, so existing theory for classification and regression can be applied; this, however, deviates from the goal of enhancing ranking accuracy at the query level. The listwise approach, furthermore, lacks an analysis of generalization ability. In this paper, we investigate the generalization ability of learning to rank algorithms, in particular from the viewpoint of query-level training and evaluation.

(Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).)
*The work was performed when the first and the third authors were interns at Microsoft Research Asia.

We propose a new probabilistic formulation of learning to rank for IR. The formulation can naturally represent the pointwise, pairwise, and listwise approaches in a unified framework. Within the framework, we introduce the concepts of query-level loss, query-level risk, and, in particular, query-level stability. Query-level stability measures whether the output of a learning algorithm changes largely under small changes to the training queries. With query-level stability as a tool, we can derive query-level generalization bounds for learning algorithms. A query-level generalization bound indicates how well one can enhance the expected ranking accuracy (corresponding to the expected risk) by enhancing the average ranking accuracy in training (corresponding to the empirical risk).

We take the algorithms of Ranking SVM (Joachims, 2002; Herbrich et al., 1999) and IRSVM (Cao et al., 2006; Qin et al., 2007) as examples and apply the proposed theory to them. Our theoretical result shows that the query-level generalization bound of Ranking SVM is not reasonably good, mainly because Ranking SVM is trained at the document-pair level, not the query level. IRSVM, in contrast, does have a better generalization bound than Ranking SVM, due to its stronger query-level stability. We also conducted experiments, and the experimental results agree with the theoretical findings.

The contributions of this paper are as follows. (1) A proposal to analyze learning to rank algorithms at the query level. (2) A new probabilistic formulation of learning to rank. (3) A new methodology for analyzing the generalization ability of learning to rank algorithms on the basis of query-level stability. (4) An application of the proposed theory to the learning to rank algorithms Ranking SVM and IRSVM, with the correctness of the theory verified by experiments.

2. Previous Work

2.1. Ranking in IR

Ranking is a central issue for IR. Many methods for creating ranking models have been proposed, including heuristic and learning-based methods (Baeza-Yates & Ribeiro-Neto, 1999; Herbrich et al., 1999; Joachims, 2002; Freund et al., 2003; Burges et al., 2005; Cao et al., 2007). Typically a ranking model is defined as a function of features of a query-document pair, and is learned from training data containing a number of queries, associated documents, and corresponding relevance judgments. Measures for evaluating the performance of a ranking model, such as Precision, MAP (Baeza-Yates & Ribeiro-Neto, 1999), and NDCG (Järvelin & Kekäläinen, 2002), have been defined and used. All these measures are query-based: if the evaluation measure for a query q is EV(q), then the average of EV(q) over a number of queries is used. From the application's perspective, both training and testing in learning to rank should therefore be conducted at the query level.

2.2. Learning to Rank

So far learning to rank has been addressed by the pointwise, pairwise, and listwise approaches. In the pointwise approach (Li et al., 2007), ranking is transformed into regression or classification, and the loss function in learning is defined on a single document. In the pairwise approach (Herbrich et al., 1999; Joachims, 2002; Freund et al., 2003; Cao et al., 2006), ranking is transformed into pairwise classification, and the loss function is defined on a document pair. In the listwise approach (Cao et al., 2007; Qin et al., 2007), document lists are viewed as learning instances and the loss function is defined on that basis. Although many learning methods have been proposed, theoretical investigation of them has been insufficient. Since training and testing should be conducted at the query level, studies on the query-level generalization ability of learning algorithms are needed; unfortunately, such studies were missing in previous work.

2.3. Stability Theory

The notion of stability (Devroye & Wagner, 1979) was proposed for analyzing the generalization bounds of learning algorithms. Bousquet and Elisseeff (2002) proposed the theory of uniform leave-one-out stability, from which the generalization bounds of classification algorithms such as Support Vector Machines (SVM) can be derived. Agarwal and Niyogi (2005) applied the stability tool to bipartite ranking. One can apply the existing stability theory to obtain document-level and document-pair-level generalization bounds; however, these may not be suitable for the task of IR. In this paper, we propose query-level stability and reveal the relation between query-level stability and query-level generalization bounds.

3. Probabilistic Formulation for Ranking

As explained in Section 2, ranking in IR is evaluated at the query level. Therefore, to design and evaluate a learning to rank algorithm, we should also look at it from the query perspective. To this end, we give a novel probabilistic formulation of ranking for IR, which contains queries and their associates (documents, document pairs, or document sets) in two layers. We then introduce the notions of query-level loss and query-level risk.

Assume that query q is a random sample from the query space Q according to a probability distribution P_Q. For query q, an associate ω^(q) and its ground truth g(ω^(q)) are sampled from the space Ω × G according to a joint probability distribution D_q, where Ω is the space of associates and G is the space of ground truths. The associate ω^(q) can be a single document, a pair of documents, or a set of documents, and correspondingly the ground truth g(ω^(q)) can be a relevance score (or class label), an order on a pair of documents, or a permutation (list) of documents. Let l(f; ω^(q), g(ω^(q))) denote a loss (referred to as the associate-level loss) defined on (ω^(q), g(ω^(q))) and a ranking function f. The expected query-level loss is defined as:

$$L(f; q) = \int_{\Omega \times G} l(f; \omega^{(q)}, g(\omega^{(q)})) \, D_q(d\omega^{(q)}, dg(\omega^{(q)})).$$

The empirical query-level loss is defined as:

$$\hat{L}(f; q) = \frac{1}{n_q} \sum_{j=1}^{n_q} l(f; \omega_j^{(q)}, g(\omega_j^{(q)})),$$

where (ω_j^(q), g(ω_j^(q))), j = 1, ..., n_q, are n_q associates of q, sampled i.i.d. according to D_q. The empirical query-level loss is an estimate of the expected query-level loss, and it can be proven that the estimate is consistent.

The goal of learning to rank is to select the ranking function f that minimizes the expected query-level risk, defined as:

$$R_l(f) = E_Q L(f; q) = \int_Q L(f; q) \, P_Q(dq). \tag{1}$$

In practice, P_Q is unknown. What we have are the training samples (q_1, S_1), ..., (q_r, S_r), where S_i = {(ω_1^(i), g(ω_1^(i))), ..., (ω_{n_i}^(i), g(ω_{n_i}^(i)))}, i = 1, ..., r, and n_i is the number of associates for query q_i. Here q_1, ..., q_r can be viewed as data sampled i.i.d. according to P_Q, and (ω_j^(i), g(ω_j^(i))) as data sampled i.i.d. according to D_{q_i}, j = 1, ..., n_i, i = 1, ..., r. The empirical query-level risk is defined as:

$$\hat{R}_l(f) = \frac{1}{r} \sum_{i=1}^{r} \hat{L}(f; q_i). \tag{2}$$

The empirical query-level risk is an estimate of the expected query-level risk, and it can be proven that the estimate is consistent.
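As an illustration, the two-layer averaging in (2) can be written in a few lines of code. The following Python sketch is ours (the function names are not from the paper); the pooled average shown for contrast is what a purely associate-level treatment would compute instead.

```python
import numpy as np

def empirical_query_level_risk(per_query_losses):
    """Empirical query-level risk, Eq. (2): average the associate-level
    losses within each query first, then average across queries.

    per_query_losses: list of 1-D arrays; entry i holds the losses
    l(f; omega_j, g(omega_j)) for the n_i associates of query q_i.
    """
    per_query_means = [np.mean(losses) for losses in per_query_losses]  # L-hat(f; q_i)
    return float(np.mean(per_query_means))                              # (1/r) sum_i L-hat(f; q_i)

def pooled_associate_level_risk(per_query_losses):
    """For contrast: a pooled average over all associates, which weights
    each query by its number of associates."""
    return float(np.mean(np.concatenate(per_query_losses)))
```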
This probabilistic formulation covers most existing learning to rank algorithms. By letting the associate be a single document, a document pair, or a document set, we can define pointwise, pairwise, or listwise losses, respectively, and develop the corresponding approaches to learning to rank.

(a) Pointwise Case

Let D denote the document space. We use a feature mapping function φ: Q × D → X (= R^d) to create a d-dimensional feature vector for each query-document pair. For each query q, suppose that the feature vector of a document is x^(q) and its relevance score (or class label) is y^(q); then (x^(q), y^(q)) can be viewed as a random sample from X × R according to a probability distribution D_q. If l(f; x^(q), y^(q)) is a pointwise loss (the square loss, for example), then the expected query-level loss becomes:

$$L(f; q) = \int_{X \times R} l(f; x^{(q)}, y^{(q)}) \, D_q(dx^{(q)}, dy^{(q)}).$$

Given training samples (q_1, S_1), ..., (q_r, S_r), where S_i = {(x_1^(i), y_1^(i)), ..., (x_{n_i}^(i), y_{n_i}^(i))}, i = 1, ..., r, the empirical query-level loss of query q_i (i = 1, ..., r) turns out to be:

$$\hat{L}(f; q_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} l(f; x_j^{(i)}, y_j^{(i)}).$$

(b) Pairwise Case

For each query q, z^(q) = (x_1^(q), x_2^(q)) stands for a document pair associated with it. Moreover, y^(q) = 1 if x_1^(q) is ranked above x_2^(q), and y^(q) = -1 otherwise. Let Y = {1, -1}. Then (x_1^(q), x_2^(q), y^(q)) can be viewed as a random sample from X² × Y according to a probability distribution D_q. If l(f; z^(q), y^(q)) is a pairwise loss (the hinge loss, for example (Herbrich et al., 1999)), then the expected query-level loss becomes:

$$L(f; q) = \int_{X^2 \times Y} l(f; z^{(q)}, y^{(q)}) \, D_q(dz^{(q)}, dy^{(q)}).$$

Given training samples (q_1, S_1), ..., (q_r, S_r), where S_i = {(z_1^(i), y_1^(i)), ..., (z_{n_i}^(i), y_{n_i}^(i))}, i = 1, ..., r, the empirical query-level loss of query q_i (i = 1, ..., r) turns out to be:

$$\hat{L}(f; q_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} l(f; z_j^{(i)}, y_j^{(i)}).$$
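For concreteness, here is a small Python sketch (our own illustration, not part of the paper) of how the pairwise associates z^(q) = (x_1^(q), x_2^(q)) with labels y^(q) in {+1, -1} can be enumerated from one query's labeled documents:

```python
from itertools import combinations

def make_pairs(feature_vectors, relevance_labels):
    """Enumerate document pairs z = (x1, x2) for one query, with
    y = +1 if x1 should rank above x2 and y = -1 otherwise.
    Documents with equal relevance yield no pair."""
    pairs = []
    for (x1, y1), (x2, y2) in combinations(zip(feature_vectors, relevance_labels), 2):
        if y1 > y2:
            pairs.append((x1, x2, +1))
        elif y2 > y1:
            pairs.append((x1, x2, -1))
    return pairs
```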
(c) Listwise Case

For each query q, let s^(q) denote a set of m documents associated with it, and let π(s^(q)) denote a permutation of the documents in s^(q) according to their relevance degrees to the query, where Π is the space of all permutations of m documents. Then (s^(q), π(s^(q))) can be viewed as a random sample from X^m × Π according to a probability distribution D_q. If l(f; s^(q), π(s^(q))) is a listwise loss (the cross-entropy loss, for example (Cao et al., 2007)), then the expected query-level loss becomes:

$$L(f; q) = \int_{X^m \times \Pi} l(f; s^{(q)}, \pi(s^{(q)})) \, D_q(ds^{(q)}, d\pi(s^{(q)})).$$

Given training samples (q_1, S_1), ..., (q_r, S_r), where S_i = {(s_1^(i), π(s_1^(i))), ..., (s_{n_i}^(i), π(s_{n_i}^(i)))}, i = 1, ..., r, the empirical query-level loss of query q_i (i = 1, ..., r) turns out to be:

$$\hat{L}(f; q_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} l(f; s_j^{(i)}, \pi(s_j^{(i)})).$$

4. Stability Theory for Query-Level Generalization Bound Analysis

Based on the probabilistic formulation, we propose a novel concept named query-level stability, and we discuss how to use it to analyze the generalization ability of a learning to rank algorithm.

First, we define uniform leave-one-query-out associate-level loss stability. The stability of a learning algorithm represents the degree of change in the loss of prediction when a query and its associates are randomly removed from the training data.

Definition 1. Let A be a learning to rank algorithm, {(q_i, S_i), i = 1, ..., r} be the training set, l be the associate-level loss function, and ν be a function mapping an integer to a real number. We say that A has uniform leave-one-query-out associate-level loss stability with coefficient ν with respect to l if, for all q_j ∈ Q, S_j ∈ (Ω × G)^{n_j}, j = 1, ..., r, and all q ∈ Q, (ω^(q), g(ω^(q))) ∈ Ω × G, the following inequality holds:

$$\left| l(f_{\{(q_i,S_i)\}_{i=1}^r}; \omega^{(q)}, g(\omega^{(q)})) - l(f_{\{(q_i,S_i)\}_{i=1,i\neq j}^r}; \omega^{(q)}, g(\omega^{(q)})) \right| \leq \nu(r).$$

Here {(q_i, S_i)}_{i=1,i≠j}^r stands for the samples (q_1, S_1), ..., (q_{j-1}, S_{j-1}), (q_{j+1}, S_{j+1}), ..., (q_r, S_r), in which (q_j, S_j) is deleted, and f_{{(q_i,S_i)}_{i=1}^r} stands for the ranking function learned from {(q_i, S_i)}_{i=1}^r. We will use these notations hereafter.

With this definition, we obtain the following lemma. It states that if an algorithm has uniform leave-one-query-out associate-level loss stability, it is also stable in terms of the expected query-level loss and the empirical query-level loss. For ease of explanation, we simply call uniform leave-one-query-out associate-level loss stability "query-level stability".

Lemma 1. Let A be a learning to rank algorithm, {(q_i, S_i), i = 1, ..., r} be the training set, and l be the associate-level loss function. If A has leave-one-query-out associate-level loss stability with coefficient ν with respect to l, then for any query q the following inequalities hold:

$$\left| L(f_{\{(q_i,S_i)\}_{i=1}^r}; q) - L(f_{\{(q_i,S_i)\}_{i=1,i\neq j}^r}; q) \right| \leq \nu(r),$$
$$\left| \hat{L}(f_{\{(q_i,S_i)\}_{i=1}^r}; q) - \hat{L}(f_{\{(q_i,S_i)\}_{i=1,i\neq j}^r}; q) \right| \leq \nu(r).$$

Based on the concept of query-level stability, we can derive a query-level generalization bound, as shown in Theorem 1. The theorem states that if an algorithm has query-level stability, then with high probability over the samples, the expected query-level risk can be bounded by the empirical risk plus a term that depends on the number of queries and on the parameters of the algorithm. The theorem thereby quantifies the expected loss on new queries, which is exactly what we mean by query-level generalization.

Theorem 1. Let A be a learning to rank algorithm, (q_1, S_1), ..., (q_r, S_r) be r training samples, and l be the associate-level loss function. Suppose that (1) for all (q_1, S_1), ..., (q_r, S_r) and all q ∈ Q, (ω^(q), g(ω^(q))) ∈ Ω × G, we have l(f_{{(q_i,S_i)}_{i=1}^r}; ω^(q), g(ω^(q))) ≤ B, and (2) A has query-level stability with coefficient ν. Then for any δ ∈ (0, 1), with probability at least 1 - δ over the samples of {(q_i, S_i)}_{i=1}^r in the product space ∏_{i=1}^r {Q × (Ω × G)^{n_i}}, the following inequality holds:

$$R_l(f_{\{(q_i,S_i)\}_{i=1}^r}) \leq \hat{R}_l(f_{\{(q_i,S_i)\}_{i=1}^r}) + 2\nu(r) + (4 r \nu(r) + B)\sqrt{\frac{\ln(1/\delta)}{2r}}.$$

Proof. For clarity, we first introduce the following notation:

$$\Phi(\{(q_i,S_i)\}_{i=1}^r) = R_l(f_{\{(q_i,S_i)\}_{i=1}^r}) - \hat{R}_l(f_{\{(q_i,S_i)\}_{i=1}^r}),$$

$$\Xi_1 = \prod_{i=1}^r \left\{ Q \times (\Omega \times G)^{n_i} \right\}, \qquad \Xi_2 = Q \times (\Omega \times G),$$

$$P_1(d\xi) = D_{q_1}(dS_1) P_Q(dq_1) \cdots D_{q_r}(dS_r) P_Q(dq_r), \qquad P_2(d\zeta) = D_q(d\omega^{(q)}, dg(\omega^{(q)})) P_Q(dq).$$

We then prove the theorem in two steps.

1) Bound the deviation of Φ({(q_i,S_i)}_{i=1}^r) from its expectation ∫_{Ξ_1} Φ({(q_i,S_i)}_{i=1}^r) P_1(dξ).
For this purpose, we first upper-bound the term

$$\left| \Phi(\{(q_i,S_i)\}_{i=1}^r) - \Phi(\{(q_i,S_i)\}_{i=1}^{r,j,q_j'}) \right|,$$

where {(q_i, S_i)}_{i=1}^{r,j,q_j'} means that the sample (q_j, S_j) is replaced by another sample (q_j', S_j'), with S_j' = {(ω_1', g(ω_1')), ..., (ω_{n_j'}', g(ω_{n_j'}'))}. To utilize query-level stability, we split Φ into two terms, Φ = Φ_1 - Φ_2, where

$$\Phi_1(\{(q_i,S_i)\}_{i=1}^r) = R_l(f_{\{(q_i,S_i)\}_{i=1}^r}) = \int_{\Xi_2} l(f_{\{(q_i,S_i)\}_{i=1}^r}; \omega^{(q)}, g(\omega^{(q)})) \, P_2(d\zeta),$$

$$\Phi_2(\{(q_i,S_i)\}_{i=1}^r) = \hat{R}_l(f_{\{(q_i,S_i)\}_{i=1}^r}) = \frac{1}{r}\sum_{i=1}^r \frac{1}{n_i} \sum_{j=1}^{n_i} l(f_{\{(q_i,S_i)\}_{i=1}^r}; \omega_j^{(i)}, g(\omega_j^{(i)})),$$

and discuss each of them separately. Since replacing a query can be decomposed into removing it and then adding the new one, query-level stability implies that for all q_j ∈ Q, S_j ∈ (Ω × G)^{n_j}, j = 1, ..., r, all replacement samples (q_j', S_j'), and all (ω^(q), g(ω^(q))) ∈ Ω × G,

$$\left| l(f_{\{(q_i,S_i)\}_{i=1}^r}; \omega^{(q)}, g(\omega^{(q)})) - l(f_{\{(q_i,S_i)\}_{i=1}^{r,j,q_j'}}; \omega^{(q)}, g(\omega^{(q)})) \right| \leq 2\nu(r). \tag{3}$$

With (3), since Φ_1 is an integral of the associate-level loss, the following inequality holds:

$$\left| \Phi_1(\{(q_i,S_i)\}_{i=1}^r) - \Phi_1(\{(q_i,S_i)\}_{i=1}^{r,j,q_j'}) \right| \leq 2\nu(r). \tag{4}$$

As for Φ_2, each term with i ≠ j changes by at most 2ν(r), while the j-th term, whose loss is bounded by B, contributes at most B/r:

$$\left| \Phi_2(\{(q_i,S_i)\}_{i=1}^r) - \Phi_2(\{(q_i,S_i)\}_{i=1}^{r,j,q_j'}) \right| \leq 2\nu(r) + \frac{B}{r}. \tag{5}$$

Jointly considering (4) and (5), we obtain

$$\left| \Phi(\{(q_i,S_i)\}_{i=1}^r) - \Phi(\{(q_i,S_i)\}_{i=1}^{r,j,q_j'}) \right| \leq 4\nu(r) + \frac{B}{r}.$$

Then, by McDiarmid's inequality (McDiarmid, 1989), with probability at least 1 - δ over the samples of {(q_i, S_i)}_{i=1}^r in the product space ∏_{i=1}^r {Q × (Ω × G)^{n_i}},

$$\Phi(\{(q_i,S_i)\}_{i=1}^r) \leq \int_{\Xi_1} \Phi(\{(q_i,S_i)\}_{i=1}^r) \, P_1(d\xi) + (4 r \nu(r) + B)\sqrt{\frac{\ln(1/\delta)}{2r}}. \tag{6}$$

2) Bound the expectation ∫_{Ξ_1} Φ({(q_i,S_i)}_{i=1}^r) P_1(dξ). We have

$$\int_{\Xi_1} \Phi \, P_1(d\xi) = \int_{\Xi_1} \int_{\Xi_2} \left[ l(f_{\{(q_i,S_i)\}_{i=1}^r}; \omega^{(q)}, g(\omega^{(q)})) - l(f_{\{(q_i,S_i)\}_{i=1}^r}; \omega_j^{(i)}, g(\omega_j^{(i)})) \right] P_2(d\zeta) \, P_1(d\xi)$$
$$= \int_{\Xi_1} \int_{\Xi_2} \left[ l(f_{\{(q_i,S_i)\}_{i=1}^r}; \omega^{(q)}, g(\omega^{(q)})) - l(f_{\{(q_i,S_i)\}_{i=1}^{r,i,q}}; \omega^{(q)}, g(\omega^{(q)})) \right] P_2(d\zeta) \, P_1(d\xi).$$

The last equality holds because the integral is taken over all of the samples, and the samples are i.i.d.; we can therefore exchange the i-th query in the training set for (q, ω^(q), g(ω^(q))). Then, by further using (3), we have

$$\int_{\Xi_1} \Phi(\{(q_i,S_i)\}_{i=1}^r) \, P_1(d\xi) \leq 2\nu(r). \tag{7}$$

Merging (6) and (7) yields the inequality in the theorem.

5. Case Study

Without loss of generality, we take the existing algorithms of Ranking SVM (Joachims, 2002; Herbrich et al., 1999) and IRSVM (Cao et al., 2006; Qin et al., 2007) as examples to show how to analyze the query-level generalization bound of an algorithm with the tool of query-level stability. Both algorithms belong to the pairwise case of our probabilistic formulation. It should be noted that the framework is limited neither to these two algorithms nor to the pairwise case; we leave discussions of other algorithms and other approaches to future work.

5.1. Generalization Bound of Ranking SVM

Ranking SVM is widely used in ranking for IR. It views a document pair as the associate of a query and minimizes

$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} l_h(f; z_i, y_i) + \lambda \|f\|_K^2, \qquad n = \sum_{i=1}^{r} n_i, \tag{8}$$

where l_h(f; z_i, y_i) is the hinge loss and ‖·‖_K is the norm of the Reproducing Kernel Hilbert Space (RKHS) with kernel K.
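To make the pooled, pair-level nature of (8) concrete, here is a minimal Python sketch for a linear scoring function f(x) = ⟨w, x⟩ (a simplifying assumption of ours; the paper's formulation allows arbitrary RKHS functions). Note that all document pairs from all queries are averaged together, so queries with many pairs dominate the objective.

```python
import numpy as np

def pairwise_hinge(w, x1, x2, y):
    """Hinge loss l_h on one document pair: y in {+1, -1} encodes which
    document should receive the higher score f(x) = <w, x>."""
    return max(0.0, 1.0 - y * (np.dot(w, x1) - np.dot(w, x2)))

def ranking_svm_objective(w, pairs, lam):
    """Eq. (8): hinge loss pooled over ALL pairs of ALL queries, plus
    lambda * ||f||^2 (here the squared L2 norm of w, assuming a linear
    kernel). `pairs` is one flat list of (x1, x2, y) triples."""
    return np.mean([pairwise_hinge(w, x1, x2, y) for (x1, x2, y) in pairs]) \
           + lam * np.dot(w, w)
```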
Using the conventional stability theory (Bousquet & Elisseeff, 2002), we can derive the following lemma, which gives the query-level stability of Ranking SVM.

Lemma 2. If K(x, x) ≤ κ² < ∞ for all x ∈ X, then Ranking SVM has query-level stability with coefficient

$$\nu(r) = \frac{4\kappa^2}{\lambda} \cdot \max_{n_1, \dots, n_r} \frac{\max_j n_j}{\sum_{i=1}^{r} n_i}.$$

As for this lemma, we have the following discussions.

(1) When r approaches infinity, suppose the distribution of n_q has mean μ and variance σ². Then by the Law of Large Numbers and Chebyshev's inequality, for any 0 < ε < 1 and δ > 0 there exists R(δ) such that if r > R(δ), with probability at least 1 - δ, the following inequality holds:

$$\max_{n_1, \dots, n_r} \frac{\max_j n_j}{\sum_{i=1}^{r} n_i} \leq \frac{(1+\varepsilon)\mu}{(1-\varepsilon)\mu r} = \frac{1+\varepsilon}{(1-\varepsilon) r}.$$

Therefore ν(r) ≤ (4κ²(1+ε)) / (λ(1-ε)r); that is, ν(r) approaches zero at a rate of O(1/r) as r goes to infinity.

(2) When r is finite (which is the case in practice), we have no reasonable statistical estimate of the term max_j n_j / Σ_i n_i, and we can only use the loose bound ν(r) ≤ 4κ²/λ. That is, when r increases but remains finite, ν(r) does not necessarily decrease.

Based on the above lemma, we can further derive the generalization bound of Ranking SVM. In particular, since the function f_{{(q_i,S_i)}_{i=1}^r} is learned from the training samples (q_1, S_1), ..., (q_r, S_r), there is a constant C such that ‖f_{{(q_i,S_i)}_{i=1}^r}‖_K ≤ C for all (q_1, S_1), ..., (q_r, S_r). Then for all (q_1, S_1), ..., (q_r, S_r), z ∈ Z, y ∈ Y, we have l_h(f_{{(q_i,S_i)}_{i=1}^r}; z, y) ≤ 1 + 2C, so we may take B = 1 + 2C in Theorem 1. By further applying Theorem 1, we obtain the following theorems.

Theorem 2. If K(x, x) ≤ κ² < ∞ for all x ∈ X, then for Ranking SVM, for any δ ∈ (0, 1) and ε > 0 there exists R(δ) such that if r > R(δ), with probability at least 1 - 2δ over the samples of {(q_i, S_i)}_{i=1}^r in the product space ∏_{i=1}^r {Q × (X × X × Y)^{n_i}}, we have:

$$R_{l_h}(f_{\{(q_i,S_i)\}_{i=1}^r}) \leq \hat{R}_{l_h}(f_{\{(q_i,S_i)\}_{i=1}^r}) + \frac{8\kappa^2 (1+\varepsilon)}{\lambda (1-\varepsilon) r} + \left( \frac{16\kappa^2 (1+\varepsilon)}{\lambda (1-\varepsilon)} + 1 + 2C \right) \sqrt{\frac{\ln(1/\delta)}{2r}}.$$

Theorem 3. If K(x, x) ≤ κ² < ∞ for all x ∈ X and no constraint is placed on r, then for Ranking SVM, for any δ ∈ (0, 1), with probability at least 1 - δ over the samples of {(q_i, S_i)}_{i=1}^r in the product space ∏_{i=1}^r {Q × (X × X × Y)^{n_i}}, we only have:

$$R_{l_h}(f_{\{(q_i,S_i)\}_{i=1}^r}) \leq \hat{R}_{l_h}(f_{\{(q_i,S_i)\}_{i=1}^r}) + \frac{8\kappa^2}{\lambda} + \left( \frac{16\kappa^2 r}{\lambda} + 1 + 2C \right) \sqrt{\frac{\ln(1/\delta)}{2r}}.$$

Theorem 2 states that when the number of training queries tends to infinity, with high probability the empirical query-level risk of Ranking SVM converges to its expected query-level risk. However, when the number of training queries is finite, the expected query-level risk and the empirical query-level risk are not necessarily close to each other, and the bound in Theorem 3 quantifies the difference, which is an increasing function of the number of training queries.

5.2. Generalization Bound of IRSVM

In IR applications, the numbers of document pairs associated with different queries vary largely (see LETOR or other public datasets). In consideration of this, IRSVM, studied in (Cao et al., 2006) and (Qin et al., 2007), adapts Ranking SVM to IR applications by minimizing

$$\min_{f \in \mathcal{F}} \; \frac{1}{r} \sum_{i=1}^{r} \frac{1}{n_i} \sum_{j=1}^{n_i} l_h(f; z_j^{(i)}, y_j^{(i)}) + \lambda \|f\|_K^2. \tag{9}$$
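The only difference from the sketch of (8) above is the inner per-query average, which equalizes the influence of queries. A minimal sketch, again under our linear-kernel assumption:

```python
import numpy as np

def irsvm_objective(w, pairs_by_query, lam):
    """Eq. (9): hinge losses are averaged WITHIN each query first, then
    across queries, so every query contributes equally regardless of its
    number of pairs. `pairs_by_query` holds one list of (x1, x2, y)
    triples per training query."""
    per_query = [np.mean([max(0.0, 1.0 - y * (np.dot(w, x1) - np.dot(w, x2)))
                          for (x1, x2, y) in pairs])
                 for pairs in pairs_by_query]
    return np.mean(per_query) + lam * np.dot(w, w)
```

This per-query normalization is exactly what yields the stronger stability coefficient in Lemma 3 below.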
We can prove the query-level stability of IRSVM as shown in Lemma 3; due to space limitations, we omit the proof.

Lemma 3. If K(x, x) ≤ κ² < ∞ for all x ∈ X, then IRSVM has query-level stability with coefficient ν(r) = 4κ² / (λr).

With an analysis similar to that for Ranking SVM, we obtain the following theorem.

Theorem 4. If K(x, x) ≤ κ² < ∞ for all x ∈ X, then for IRSVM, for any δ ∈ (0, 1), with probability at least 1 - δ over the samples of {(q_i, S_i)}_{i=1}^r in the product space ∏_{i=1}^r {Q × (X × X × Y)^{n_i}}, we have:

$$R_{l_h}(f_{\{(q_i,S_i)\}_{i=1}^r}) \leq \hat{R}_{l_h}(f_{\{(q_i,S_i)\}_{i=1}^r}) + \frac{8\kappa^2}{\lambda r} + \left( \frac{16\kappa^2}{\lambda} + 1 + 2C \right) \sqrt{\frac{\ln(1/\delta)}{2r}}.$$

The theorem states that when the number of training queries tends to infinity, with high probability the empirical query-level risk of IRSVM converges to its expected query-level risk. When the number of queries is finite, the bound in the theorem quantifies the difference between the two risks, which is a decreasing function of the number of training queries.

Remark 1. Comparing Theorem 2 with Theorem 4, we find that the convergence rates of the empirical query-level risk to the expected query-level risk for Ranking SVM and IRSVM are the same, i.e., O(1/√r). However, comparing Theorem 3 with Theorem 4, we see that for finite r the bound for IRSVM is much tighter than that for Ranking SVM.

6. Experiments and Discussion

We conducted experiments on Ranking SVM and IRSVM to verify our theoretical results.

6.1. Query-Level Stability

First, we conducted an experiment to compare the stabilities of Ranking SVM and IRSVM. We randomly sampled 1,200 queries from a search engine's data repository, each query associated with hundreds of documents and their relevance labels. There are five labels: "perfect", "excellent", "good", "fair", and "bad". We split the queries into three sets: a training set with 200 queries, a validation set with 500 queries, and a test set with 500 queries (denoted as T). The validation set was used to select the regularization parameter λ for Ranking SVM and IRSVM.

We first trained two ranking models with Ranking SVM and IRSVM, denoted as f_0 and f̄_0 respectively. We then randomly deleted one query from the training set and trained two new models, denoted as f_1 and f̄_1. We repeated this process 30 times, creating the models f_1, f_2, ..., f_30 and f̄_1, f̄_2, ..., f̄_30. On the test set, we compared the associate-level loss of f_0 with that of f_i and obtained the difference Δ_i for Ranking SVM; similarly, we computed Δ̄_i for IRSVM:

$$\Delta_i = \max_{q \in T} \max_{z \in S_q} \left| l_h(f_0; z^{(q)}, y^{(q)}) - l_h(f_i; z^{(q)}, y^{(q)}) \right|,$$
$$\bar{\Delta}_i = \max_{q \in T} \max_{z \in S_q} \left| l_h(\bar{f}_0; z^{(q)}, y^{(q)}) - l_h(\bar{f}_i; z^{(q)}, y^{(q)}) \right|.$$

According to Definition 1, Δ_i bounds from below the query-level stability ν(r) (r = 200) of Ranking SVM; similarly, Δ̄_i bounds from below the query-level stability ν(r) (r = 200) of IRSVM.
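The procedure for estimating these lower bounds can be sketched as follows (a Python illustration of ours; `loss` stands for the associate-level hinge loss, and the model objects are whatever the two trainers return):

```python
def stability_estimates(loss, f_full, f_loo_models, test_queries):
    """Empirical lower bounds on query-level stability (Section 6.1):
    Delta_i = max over test queries q and pairs (z, y) in S_q of
    |loss(f_0, z, y) - loss(f_i, z, y)|, one value per
    leave-one-query-out model f_i.

    loss(f, z, y) -- associate-level loss of model f on pair (z, y)
    f_full        -- model trained on all r training queries
    f_loo_models  -- models trained with one training query removed
    test_queries  -- list of lists of (z, y) pairs, one list per test query
    """
    return [max(abs(loss(f_full, z, y) - loss(f_i, z, y))
                for pairs in test_queries for (z, y) in pairs)
            for f_i in f_loo_models]
```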
In this regard, we can compare the stabilities of Ranking SVM and IRSVM by comparing Δ_i and Δ̄_i. All 30 values of Δ_i and Δ̄_i are listed in Table 1. From the table, we can see that Δ_i is always much larger than Δ̄_i. The mean (resp. maximum) value of Δ_i over the 30 trials is 1.23 (resp. 4.53), more than ten times the mean (resp. maximum) value of Δ̄_i, which is only 0.12 (resp. 0.27). Furthermore, the variance of Δ_i (0.72) is also larger than that of Δ̄_i (0.003). These results indicate that the query-level stability of Ranking SVM is not as good as that of IRSVM. (Note that Lemmas 2 and 3 hold for any number r of training queries; we simply set r = 200.)

Table 1. Comparison of query-level stability.

  i   Δ_i   Δ̄_i |  i   Δ_i   Δ̄_i |  i   Δ_i   Δ̄_i |  i   Δ_i   Δ̄_i |  i   Δ_i   Δ̄_i
  1   3.59  0.07 |  7   0.89  0.18 | 13   0.56  0.11 | 19   1.04  0.08 | 25   0.50  0.18
  2   1.14  0.07 |  8   1.30  0.06 | 14   1.43  0.13 | 20   0.86  0.05 | 26   0.88  0.08
  3   0.88  0.06 |  9   0.90  0.09 | 15   1.42  0.14 | 21   0.43  0.09 | 27   4.53  0.12
  4   0.81  0.06 | 10   1.42  0.08 | 16   1.01  0.11 | 22   0.51  0.20 | 28   0.99  0.09
  5   1.84  0.05 | 11   1.38  0.11 | 17   1.13  0.06 | 23   0.64  0.27 | 29   1.13  0.21
  6   1.15  0.24 | 12   1.39  0.15 | 18   1.34  0.11 | 24   0.92  0.14 | 30   0.62  0.14

6.2. Query-Level Generalization Bounds

Next, we compared the performance of Ranking SVM and IRSVM, to verify the theoretical results on their query-level generalization bounds.

From Theorems 3 and 4, we can see that the bound for Ranking SVM is much looser than that for IRSVM, especially when the number of training queries r is large but finite. We interpret this result as follows. The actual empirical and expected risks optimized by Ranking SVM are:

$$\hat{R}_{l_h}(f) = \frac{1}{n} \sum_{i=1}^{n} l_h(f; z^{(i)}, y^{(i)}), \qquad n = \sum_{i=1}^{r} n_i,$$
$$R_{l_h}(f) = \int_{X^2 \times Y} l_h(f; z, y) \, P(dz, dy).$$

In these definitions only document pairs appear, with no notion of query, and thus we call them pair-level risks. For comparison, we also list the query-level risks for the learning to rank problem (see Section 3), with the hinge loss as the associate-level loss:

$$\hat{R}^Q_{l_h}(f) = \frac{1}{r} \sum_{i=1}^{r} \frac{1}{n_i} \sum_{j=1}^{n_i} l_h(f; z_j^{(i)}, y_j^{(i)}),$$
$$R^Q_{l_h}(f) = \int_Q \int_{X^2 \times Y} l_h(f; z^{(q)}, y^{(q)}) \, D_q(dz^{(q)}, dy^{(q)}) \, P_Q(dq).$$

By comparing the above formulas, we can clearly see that what is optimized in Ranking SVM (the pair-level risk) is not equal to what should be optimized (the query-level risk), unless every training query has the same number of document pairs, which is not true in practice. In contrast, it is easy to verify that what is optimized in IRSVM is exactly the query-level risk. It is therefore not surprising that IRSVM has the better query-level generalization bound.

In summary, the theoretical results indicate that the performance of Ranking SVM on the test set, in terms of a query-level measure, should not be as good as that of IRSVM. We verified this through experiments. We tested the ranking performance of Ranking SVM (RankSVM for short) and IRSVM on the test set in terms of Precision and NDCG; the results are shown in Figure 1. Furthermore, MAP (computed by treating "perfect", "excellent", and "good" as relevant, and "fair" and "bad" as irrelevant) is 0.39 for Ranking SVM and 0.41 for IRSVM. From the results, we can see that IRSVM achieves better ranking performance than RankSVM in terms of all the query-level measures. This is also consistent with the results reported in (Cao et al., 2006) and (Qin et al., 2007).

[Figure 1. Accuracies of Ranking SVM and IRSVM: (a) NDCG@1-5; (b) Precision@1-5.]

7. Conclusions

In this paper, we have studied the generalization ability of learning to rank algorithms for IR. A probabilistic formulation for ranking has been proposed, which covers ranking algorithms belonging to the pointwise, pairwise, and listwise approaches. The tool of query-level stability has been developed and used to analyze the generalization bound of a ranking algorithm. We have applied the tool to two existing ranking algorithms (Ranking SVM and IRSVM) and obtained theoretical results, whose correctness we have also verified by experiments. As far as we know, this is the first work on query-level generalization bounds for learning to rank algorithms.

There are still many issues to investigate. (1) We have taken SVM-based ranking algorithms as examples;
we will try to obtain similar results for other algorithms, such as RankBoost. (2) We have focused on the pairwise approach; the proposed formulation for ranking and the tool of query-level stability can also be used to analyze other approaches. (3) It is worth checking whether new learning to rank algorithms can be derived under the guidance of this theoretical study.

References

Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. Proc. of COLT '05 (pp. 32-47).

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison Wesley.

Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499-526.

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. ICML '05 (pp. 89-96).

Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y., & Hon, H.-W. (2006). Adapting ranking SVM to document retrieval. SIGIR '06 (pp. 186-193).

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: from pairwise approach to listwise approach. ICML '07 (pp. 129-136).

Devroye, L., & Wagner, T. (1979). Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25, 601-604.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933-969.

Herbrich, R., Obermayer, K., & Graepel, T. (1999). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers (pp. 115-132).

Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20, 422-446.

Joachims, T. (2002). Optimizing search engines using clickthrough data. KDD '02 (pp. 133-142).

Li, P., Burges, C., & Wu, Q. (2007). McRank: Learning to rank using multiple classification and gradient boosting. NIPS 2007.

McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, Cambridge University Press.

Qin, T., Zhang, X.-D., Tsai, M.-F., Wang, D.-S., Liu, T.-Y., & Li, H. (2007). Query-level loss functions for information retrieval. Information Processing & Management.