Generalization Analysis of Listwise Learning-to-Rank Algorithms

Yanyan Lan* (lanyanyan@amss.ac.cn)
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, P.R. China; Microsoft Research Asia, 4F, Sigma Center, No. 49, Zhichun Road, Beijing, 100190, P.R. China

Tie-Yan Liu (tyliu@microsoft.com)
Microsoft Research Asia, 4F, Sigma Center, No. 49, Zhichun Road, Beijing, 100190, P.R. China

Zhiming Ma (mazm@amt.ac.cn)
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, P.R. China

Hang Li (hangli@microsoft.com)
Microsoft Research Asia, 4F, Sigma Center, No. 49, Zhichun Road, Beijing, 100190, P.R. China

Abstract

This paper presents a theoretical framework for ranking and demonstrates how to perform generalization analysis of listwise ranking algorithms using the framework. Many learning-to-rank algorithms have been proposed in recent years. Among them, the listwise approach has shown higher empirical ranking performance than the other approaches. However, as far as we know, there has been no theoretical study on the listwise approach. In this paper, we propose a theoretical framework for ranking that can naturally describe various listwise learning-to-rank algorithms. With this framework, we prove a theorem that gives a generalization bound of a listwise ranking algorithm, on the basis of the Rademacher Average of the class of compound functions. The compound functions take listwise loss functions as outer functions and ranking models as inner functions. We then compute the Rademacher Averages for the existing listwise algorithms ListMLE, ListNet, and RankCosine. We also discuss the tightness of the bounds in different situations, with regard to the list length and the transformation function.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).
*The work was performed when the first author was an intern at Microsoft Research Asia.

1. Introduction

Ranking is an important problem in various applications, such as information retrieval, natural language processing, computational biology, and the social sciences. In many real scenarios, the ranking problem is defined as follows. Given a group of objects, a ranking model sorts the objects with respect to each other in the group, according to their degrees of relevance, importance, or preference. For example, in information retrieval, a "group" corresponds to a query, and "objects" correspond to the documents associated with the query. In recent years, machine learning technologies have been developed to learn ranking models, and a new research branch named "learning to rank" has emerged. In the training phase, several groups of objects are given to facilitate the creation of a ranking model; in testing, the ranking model is used to predict the ranked list for a new group of objects.

Many learning-to-rank algorithms have been proposed in the literature, and they can be categorized into three groups: the pointwise, pairwise, and listwise approaches. The pointwise and pairwise approaches (Li et al., 2007; Herbrich et al., 1999) respectively transform ranking into (ordinal) regression or classification on single objects and on object pairs. The listwise approach solves the ranking problem by minimizing a loss function defined on object lists. Representative listwise ranking algorithms include ListMLE (Xia et al., 2008), ListNet (Cao et al., 2007), and RankCosine (Qin et al., 2007). According to previous studies (Cao et al., 2007; Qin et al., 2007; Xia et al., 2008), the listwise approach can outperform the other two approaches on benchmark datasets.
Although some works have discussed generalization bounds for pairwise ranking algorithms, such as (Agarwal & Niyogi, 2005) and (Freund et al., 2003), these studies are not sufficient, especially for handling the real settings in applications like information retrieval. Ranking in these real applications employs a special two-layer data structure: group and object (e.g., in information retrieval, query and document). While the ranking model operates at the object level, the evaluation of a ranking model is usually performed at the group level. Due to these key differences from conventional learning tasks, many existing generalization theories in machine learning cannot be directly applied. Without loss of generality, we take information retrieval as the example application in this paper. In such a scenario, a meaningful generalization bound for a learning-to-rank algorithm should be defined at the query level. As far as we know, (Lan et al., 2008) was the only work that investigated the query-level generalization ability of learning-to-rank algorithms. The problem with their approach, however, was that the proposed framework could only be used in the generalization analysis of the pairwise approach, but not the listwise approach [1]. Therefore, how to conduct generalization analysis of listwise ranking algorithms was still an open issue. The goal of this paper is exactly to tackle this challenge.

First, to conduct generalization analysis of listwise ranking algorithms, we extend the query-level ranking framework proposed in (Lan et al., 2008). In our framework, a query is directly assumed to be a random variable, represented as the set of documents associated with it. However, there is no further assumption on stochastic generation of documents from the set, once it is given and fixed. In this way, the framework can more naturally characterize the listwise ranking algorithms.
Second, with the extended framework, we employ the Rademacher Average technique (Mendelson, 2001; Bartlett & Mendelson, 2003; Mendelson, 2003; Bousquet et al., 2004) to perform the generalization analysis. Specifically, we prove that if we define a compound function whose outer function is the listwise loss function and whose inner function is the ranking model of a listwise algorithm, then the generalization bound of the listwise algorithm will be a function of the Rademacher Average of this compound function class. We then demonstrate how to compute the Rademacher Average, taking the existing algorithms ListMLE, ListNet, and RankCosine as examples. We also discuss the tightness of the generalization bounds in different situations, in order to provide guidelines on algorithm design and parameter tuning. For instance, we discuss how to set the parameters of an algorithm, how to select a transformation function [2], and how to choose the most suitable algorithm in a given setting.

To our knowledge, this paper is the first study on the generalization ability of listwise ranking algorithms. The major contributions of the paper include: 1) the proposal of the extended query-level ranking framework, which enables theoretical analysis of the listwise approach; 2) the proof of a theorem that gives a generalization bound of a listwise ranking algorithm on the basis of the Rademacher Average; 3) the derivation of the Rademacher Averages for three listwise ranking algorithms; 4) investigations on the tightness of the generalization bounds, with respect to different listwise loss functions and transformation functions.

Footnote 1: The authors claimed that their framework can handle the listwise case. However, their "listwise" formulation is more like an extension of the pairwise approach in which an m-document subset is utilized, and thus it is not the same as the conventional listwise approach.

Footnote 2: In the existing listwise ranking algorithms, a transformation function is first applied to the ranking scores of the documents, and then a listwise loss is defined based on the transformed scores and the ground truth.
2. Query-Level Ranking Framework

Given the special characteristics of ranking compared with classification and regression (e.g., the notion of "query" exists in both learning and evaluation), a new framework is needed to facilitate theoretical research on the problem. The framework proposed in (Lan et al., 2008) was based on the same motivation. Unfortunately, it can only explain the pointwise and pairwise approaches, but not the listwise approach, which includes ListNet (Cao et al., 2007) and ListMLE (Xia et al., 2008). In this work, we extend their framework and provide a more reasonable formalization of the listwise ranking algorithms.

For ease of explanation, let us first look at the query-level ranking framework proposed in (Lan et al., 2008). Let Q be the query space, D be the document space, and X (= R^d) be the d-dimensional feature space. A mapping Φ : Q × D → X creates a feature vector from a query-document pair. Let q be a random variable defined on the query space with an unknown probability distribution P_Q. Let f denote the real-valued ranking function, which assigns each document a score f(x). The scores of the documents associated with the same query are used to rank these documents. We measure the loss of ranking the documents of query q using function f with a loss function L(f; q). The goal of ranking is to minimize the expected risk of f, defined as

R(f) = ∫_Q L(f; q) P_Q(dq).

As the distribution P_Q is unknown, we use the empirical risk to approximate the expected risk, which is defined as

R̂(f) = (1/n) Σ_{i=1}^n L(f; q_i),

where q_1, ..., q_n are i.i.d. observations of q.

Different ways of defining the loss function L(f; q) lead to different formalizations. Lan et al. (2008) give a definition of L(f; q) for the pointwise, pairwise, and "listwise" approaches. Specifically, L(f; q) is defined as a function based on a probability distribution of documents, document pairs, or m-document subsets. We point out that this might not be appropriate, because for the existing listwise algorithms such as ListNet and ListMLE, once a query is given, the set of all the documents associated with the query is fixed. There is no further random sampling of documents from the set, and the query-level loss function does not need to depend on such a random sampling. Therefore, it is more accurate to define the loss function L(f; q) as a function of the fixed set of all the associated documents. This is exactly the way in which we define the loss function L(f; q) in this paper.

For each query, the number of associated documents is assumed to be m. This is for simplicity, and it is exactly the case when we perform re-ranking of the top-ranked m documents in practice. The goal of learning in the listwise approach is to best predict the ranked list of the m documents given a query. We represent query q by (z, y), where z = (x_1, ..., x_m) and y stands for the ground-truth permutation of the m documents; y(i) stands for the index of the document whose rank is i in the permutation y. We call m the list length and assume that m ≥ 3. Let Z = X^m be the input space, whose elements are the m feature vectors corresponding to the m documents, and let Y be the output space, whose elements are permutations of the m documents. We regard (z, y) as a random variable on the space Z × Y distributed according to an unknown probability distribution P(·, ·).

Let l(f; z, y) denote a listwise loss function (e.g., the likelihood loss in (Xia et al., 2008), the cross entropy loss in (Cao et al., 2007), and the cosine loss in (Qin et al., 2007)) defined on the random variable (z, y) and the ranking model f ∈ F, where F is a real-valued function class. Then, in our extended framework, L(f; q) and P_Q can be set as L(f; q) = l(f; z, y) and P_Q = P(·, ·), respectively. In this way, the expected risk of the listwise approach with respect to the loss function l is defined as

R_l(f) = ∫_{Z×Y} l(f; z, y) P(dz, dy).

The empirical risk of the listwise approach with respect to the loss function l is defined as

R_l(f; S) = (1/n) Σ_{i=1}^n l(f; z_i, y_i),

where (z_i, y_i), i = 1, ..., n, denote the training data sampled i.i.d. with (z, y) from the space Z × Y, S = {(z_1, y_1), ..., (z_n, y_n)}, and z_i = (x_1^(i), ..., x_m^(i)).

With the theoretical framework above, one can perform various theoretical analyses of the listwise approach. One of the most important is the analysis of the generalization ability of listwise ranking algorithms. Mathematically, it amounts to finding a tight upper bound on sup_{f∈F} (R_l(f) − R_l(f; S)), which is called the generalization bound in this paper.

3. Generalization Analysis on Listwise Ranking Algorithms

In this section, we give the generalization bound of a listwise algorithm using the ranking framework and the Rademacher Average [3]. To the best of our knowledge, this is the first generalization analysis on the listwise approach to learning to rank.

3.1. Rademacher Average Based Generalization Bound

The Rademacher Average measures how much a function class can fit random noise. It is defined as follows.

Definition 1. For a function class G, the empirical Rademacher Average is defined as

R̂(G) = E_σ [ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(X_i) ],

where X_i, i = 1, ..., n, are i.i.d. random variables, and σ_i, i = 1, ..., n, are i.i.d. random variables taking the values +1 and −1 each with probability 1/2; σ stands for {σ_1, ..., σ_n}.

Footnote 3: There are two types of Rademacher Averages: the empirical and the expected Rademacher Average (Bartlett & Mendelson, 2003). We use the empirical Rademacher Average in this paper.
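Definition 1 can be checked numerically. For the linear ranking class used later in Section 3.2, F = {x ↦ w·x : ‖w‖ ≤ B}, the supremum over F has a closed form by the Cauchy-Schwarz inequality, so the empirical Rademacher Average can be estimated by Monte Carlo over the σ variables alone. The sketch below is ours (not part of the paper) and assumes NumPy; the function name is illustrative.

```python
import numpy as np

def empirical_rademacher_linear(X, B, n_trials=500, seed=0):
    """Monte Carlo estimate of the empirical Rademacher Average of the
    linear class F = {x -> w.x : ||w|| <= B} on a fixed sample X (n x d).

    For this class the supremum in Definition 1 has a closed form:
        sup_{||w||<=B} (1/n) sum_i sigma_i (w . x_i) = (B/n) ||sum_i sigma_i x_i||.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)  # i.i.d. Rademacher variables
        total += B / n * np.linalg.norm(sigma @ X)  # closed-form supremum
    return total / n_trials
```

When every row of X has norm at most M, the estimate should stay below the bound R̂(F) ≤ 2BM/√n quoted in Section 3.2.2.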
According to (Bartlett & Mendelson, 2003), the Rademacher Average has the following three properties.

1) For any c ∈ R, where R stands for the space of real numbers, define cG ≜ {h : ∃g ∈ G s.t. h = cg}. Then

R̂(cG) = |c| R̂(G).   (1)

2) If φ is a Lipschitz function with Lipschitz constant L_φ (i.e., for all x_1, x_2 in the domain of φ, |φ(x_1) − φ(x_2)| ≤ L_φ |x_1 − x_2|) and φ(0) = 0, define φ∘G ≜ {h : ∃g ∈ G s.t. h = φ∘g}, where φ∘g stands for the compound function whose outer function is φ and inner function is g. Then

R̂(φ∘G) ≤ L_φ R̂(G).   (2)

3) Given G_i, i = 1, ..., n, define Σ_{i=1}^n G_i ≜ {g : ∃g_i ∈ G_i s.t. g = Σ_{i=1}^n g_i}. Then

R̂(Σ_{i=1}^n G_i) ≤ Σ_{i=1}^n R̂(G_i).   (3)

Applying the theory of the Rademacher Average (Bartlett & Mendelson, 2003; Bousquet et al., 2004) to the listwise approach, we obtain the generalization bound of the approach as shown in the following theorem. For the detailed proof, please refer to (Liu & Lan, 2008).

Theorem 1. Let A denote a listwise ranking algorithm, and let l_A(f; z, y) ∈ [0, 1] be its listwise loss. Given the training data S = {(z_i, y_i), i = 1, ..., n}, with probability at least 1 − δ, the following inequality holds:

sup_{f∈F} (R_{l_A}(f) − R_{l_A}(f; S)) ≤ 2 R̂(l_A∘F) + √(2 ln(2/δ) / n),

where R̂(l_A∘F) = E_σ [ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i l_A(f; z_i, y_i) ].

Theorem 1 shows that the generalization bound of a listwise ranking algorithm is related to R̂(l_A∘F), i.e., the Rademacher Average of the class of compound functions whose outer functions are the listwise loss functions and inner functions are the ranking models. In order to apply this theorem to a specific listwise ranking algorithm, we need to compute the Rademacher Average of the corresponding compound function class.

3.2. Rademacher Averages of Listwise Ranking Algorithms

We demonstrate how to compute the Rademacher Averages for three existing listwise ranking algorithms: ListMLE, ListNet, and RankCosine.

3.2.1. Listwise Ranking Algorithms

We first show the listwise loss functions of the three algorithms:

ListMLE: l(f; z, y) = −log P(y|z; f), where

P(y|z; f) = Π_{i=1}^m [ ψ(f(x_{y(i)})) / Σ_{j=i}^m ψ(f(x_{y(j)})) ].

ListNet: l(f; z, y) = −Σ_{π∈Y} P(π|z; g_y) log P(π|z; f), where

P(π|z; g_y) = Π_{i=1}^m [ ψ(g_y(x_{π(i)})) / Σ_{j=i}^m ψ(g_y(x_{π(j)})) ],
P(π|z; f) = Π_{i=1}^m [ ψ(f(x_{π(i)})) / Σ_{j=i}^m ψ(f(x_{π(j)})) ].

RankCosine: l(f; z, y) = (1/2) ( 1 − ψ(g_y(z))^T ψ(f(z)) / (‖ψ(g_y(z))‖ ‖ψ(f(z))‖) ).

Here g_y(x) is the score of x given in the ground truth, and ψ(·) is the transformation function, which is an increasing and strictly positive function. In this paper, we assume ψ to be differentiable [4]. These algorithms actually minimize the following empirical risk to learn the ranking model:

R_l(f; S) = (1/n) Σ_{i=1}^n l(f; z_i, y_i).

Footnote 4: Most transformation functions used in the literature, such as the linear, exponential, and sigmoid functions, satisfy this requirement.

To conduct a meaningful comparison among the algorithms, we normalize their listwise loss functions to the same range, e.g., [0, 1]. We further make two assumptions on the feature vectors and the ranking model. Let x be the feature vector of a query-document pair; we assume that ∀x ∈ X, ‖x‖ ≤ M. Furthermore, following the common practice in (Bartlett & Mendelson, 2003), we assume that the ranking model f to be learned is from the linear function class F = {x ↦ w·x : ‖w‖ ≤ B} [5]. Therefore, ∀x ∈ X and ∀f ∈ F, |f(x)| ≤ BM. Then, for a listwise ranking algorithm A (i.e., ListMLE, ListNet, or RankCosine), we normalize its original listwise loss function l(f; z, y) to l_A(f; z, y) = l(f; z, y) / Z_A, where

Z_ListMLE = Z_ListNet = m(log m + log(ψ(BM)/ψ(−BM))), and Z_RankCosine = 1.

With the normalization, the algorithms actually minimize the following empirical risk in learning:

R_{l_A}(f; S) = (1/n) Σ_{i=1}^n l_A(f; z_i, y_i).

Footnote 5: In this paper, we take the linear ranking model as an example, and leave the investigation of more complicated ranking models to future work.

Note that since the normalizer is a constant, the normalization only changes the empirical and expected risks of an algorithm, but does not change their optimal solutions.

3.2.2. Upper Bounds of R̂(l_A∘F)

According to Theorem 1, the generalization ability of a listwise ranking algorithm depends on its Rademacher Average R̂(l_A∘F). Hereafter, we discuss the upper bound of R̂(l_A∘F) for each algorithm under investigation. The results are summarized in Theorem 2.

Theorem 2. The upper bounds of R̂(l_A∘F) for ListMLE, ListNet, and RankCosine can be represented in the following form:

R̂(l_A∘F) ≤ C_A(ψ) N(ψ) R̂(F),

where A stands for a listwise ranking algorithm, N(ψ) = sup_{x∈[−BM,BM]} ψ'(x), and C_A(ψ) is defined as follows:

C_ListMLE(ψ) = 2 / ( ψ(−BM)(log m + log(ψ(BM)/ψ(−BM))) ),
C_ListNet(ψ) = 2m! / ( ψ(−BM)(log m + log(ψ(BM)/ψ(−BM))) ),
C_RankCosine(ψ) = m / (2ψ(−BM)).

Proof. Due to space limitations, we take ListMLE as an example to illustrate the proof sketch. Substituting the listwise loss function of ListMLE into the definition of R̂(l_A∘F), and using the properties of the Rademacher Average (Eq. 1 and Eq. 3), we obtain the following inequality:

R̂(l_ListMLE∘F) ≤ [1 / (m(log m + log(ψ(BM)/ψ(−BM))))] Σ_{j=1}^m { E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i log ψ(f(x^(i)_{y_i(j)}))] + E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i log(Σ_{s=j}^m ψ(f(x^(i)_{y_i(s)})))] }.   (4)

First, let us consider the term E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i log ψ(f(x^(i)_{y_i(j)}))]. Define φ(t) = log(1+t) for t ∈ [ψ(−BM)−1, ψ(BM)−1]; then φ(0) = 0, and φ'(t) = 1/(1+t) ≤ 1/ψ(−BM). With the property of the Rademacher Average (Eq. 2), the following inequality holds:

E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i log ψ(f(x^(i)_{y_i(j)}))] ≤ (1/ψ(−BM)) E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i (ψ(f(x^(i)_{y_i(j)})) − 1)].

Furthermore, since E σ_i = 0 by the definition of σ_i, we have

E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i log ψ(f(x^(i)_{y_i(j)}))] ≤ (1/ψ(−BM)) E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i ψ(f(x^(i)_{y_i(j)}))].   (5)

Second, let us consider the term E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i log(Σ_{s=j}^m ψ(f(x^(i)_{y_i(s)})))]. With a technique similar to that used for the first term and the property of the Rademacher Average (Eq. 3), the following inequality holds:

E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i log(Σ_{s=j}^m ψ(f(x^(i)_{y_i(s)})))] ≤ (1/((m−j+1)ψ(−BM))) E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i Σ_{s=j}^m ψ(f(x^(i)_{y_i(s)}))] ≤ (1/((m−j+1)ψ(−BM))) Σ_{s=j}^m E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i ψ(f(x^(i)_{y_i(s)}))].   (6)

Here the x^(i)_{y_i(s)} are i.i.d. across i, since they are documents from different queries occupying the same position in the ground-truth permutations. Combining Eq. 4, Eq. 5, and Eq. 6, we obtain the following inequality:

R̂(l_ListMLE∘F) ≤ [2 / (ψ(−BM)(log m + log(ψ(BM)/ψ(−BM))))] R̂(ψ∘F) = C_ListMLE(ψ) R̂(ψ∘F),

where R̂(ψ∘F) = E_σ[sup_{f∈F} (1/n) Σ_{i=1}^n σ_i ψ(f(x_i))]. Furthermore, using the fact that ψ is differentiable and the property of the Rademacher Average (Eq. 2), we have R̂(ψ∘F) ≤ N(ψ) R̂(F), where N(ψ) = sup_{x∈[−BM,BM]} ψ'(x). This concludes the proof sketch.

The theorem shows that R̂(l_A∘F) of an algorithm A is determined by the following three factors:

1. C_A(ψ), an algorithm-dependent factor, which is a function of both the listwise loss function and the transformation function ψ. C_A(ψ) is also related to the range of ψ(x), x ∈ [−BM, BM]. More detailed discussions of C_A(ψ) are given in the next section.

2. N(ψ), an algorithm-independent factor, which measures the smoothness of the transformation function ψ. We also discuss this factor in the next section.

3. R̂(F), the Rademacher Average of the ranking function class. According to (Bartlett & Mendelson, 2003), in the case of a linear ranking function class, we have R̂(F) ≤ 2BM/√n.
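The listwise loss functions of Section 3.2.1 can be sketched in a few lines of NumPy. The sketch below is ours: function names are illustrative, ψ defaults to the exponential function (under which the permutation probability reduces to the Plackett-Luce/softmax form), and the ListNet loss enumerates all m! permutations, so it is feasible only for small m.

```python
import numpy as np
from itertools import permutations

def perm_prob(scores, pi, psi=np.exp):
    """P(pi | z) under the permutation probability model:
    product over positions i of psi(s_{pi(i)}) / sum_{j>=i} psi(s_{pi(j)})."""
    vals = psi(scores[np.asarray(pi)])
    tails = np.cumsum(vals[::-1])[::-1]  # tails[i] = sum_{j>=i} psi(s_{pi(j)})
    return float(np.prod(vals / tails))

def listmle_loss(scores, y, psi=np.exp):
    """ListMLE: negative log-likelihood of the ground-truth permutation y."""
    return -np.log(perm_prob(scores, y, psi))

def listnet_loss(scores, gt_scores, psi=np.exp):
    """ListNet: cross entropy between the permutation distributions induced
    by the ground-truth scores g_y and the model scores f."""
    m = len(scores)
    loss = 0.0
    for pi in permutations(range(m)):  # m! terms: small m only
        loss -= perm_prob(gt_scores, pi, psi) * np.log(perm_prob(scores, pi, psi))
    return loss

def rankcosine_loss(scores, gt_scores, psi=np.exp):
    """RankCosine: half of one minus the cosine similarity between the
    transformed ground-truth and model score vectors."""
    a, b = psi(gt_scores), psi(scores)
    return 0.5 * (1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

As a sanity check, `perm_prob` sums to one over all m! permutations, and the ListMLE loss is smaller for a permutation that agrees with the score order than for the reversed one.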
3.3. Discussion

Combining Theorem 1 and Theorem 2, we obtain that with probability at least 1 − δ, the following inequality holds:

sup_{f∈F} (R_{l_A}(f) − R_{l_A}(f; S)) ≤ (4BM/√n) C_A(ψ) N(ψ) + √(2 ln(2/δ) / n).

From the above inequality, we can see that the generalization bound depends on C_A(ψ), N(ψ), and the number of training samples n. Since C_A(ψ) and N(ψ) are independent of n (according to Theorem 2), when n approaches infinity, the generalization bound vanishes at the rate of O(1/√n). Furthermore, in general, a smaller C_A(ψ)N(ψ) leads to a tighter generalization bound. With some derivations, we obtain the concrete forms of N(ψ) and C_A(ψ) shown in Table 1, with respect to different transformation functions.

We first look at the bounds of different algorithms when the transformation function is fixed. Then we discuss the bound of a given algorithm when the following transformation functions are used:

Linear function: ψ_L(x) = ax + b, x ∈ [−BM, BM].
Exponential function: ψ_E(x) = e^{ax}, x ∈ [−BM, BM].
Sigmoid function: ψ_S(x) = 1/(1 + e^{−ax}), x ∈ [−BM, BM].

Note that in the above definitions, we assume that a > 0 and b > aBM + ε (where ε > 0 is a constant), to guarantee that the transformation function is increasing and strictly positive. Furthermore, a should not be too small; otherwise ax ≈ 0 for all x, which would make the learning process unreasonable.

3.3.1. Generalization Bounds of Different Algorithms

For the same transformation function ψ, the generalization bounds of the algorithms only depend on C_A(ψ). According to Table 1, we have the following results with regard to this factor. The notations are the same as in Theorem 1 and Theorem 2.

· The generalization bound of ListMLE is much tighter than that of ListNet, especially when m, the length of the list, is large. Actually, the listwise loss of ListNet can be regarded as a weighted sum of ListMLE losses with respect to different target permutations. In this regard, it is easy to prove that C_ListNet(ψ) = m! C_ListMLE(ψ).

· The bound of ListMLE decreases monotonically, while the bounds of ListNet and RankCosine increase monotonically, with respect to m. For RankCosine, m can be regarded as the dimension of the list representation (each dimension corresponds to a document). When the dimension is high, there are many terms to sum up when computing the Rademacher Average; considering Eq. 3, the bound of RankCosine increases monotonically with respect to m. For ListMLE, m determines Z_ListMLE; from the property of the Rademacher Average (Eq. 1), we can see that the bound decreases monotonically with respect to m. For ListNet, C_ListNet(ψ) = m! C_ListMLE(ψ); since m! increases faster than Z_ListMLE, the bound eventually increases monotonically with respect to m.

· When m ≥ 6, the generalization bound of ListMLE is always tighter than that of RankCosine; otherwise, certain conditions need to be satisfied to ensure that the generalization bound of ListMLE is still tighter than that of RankCosine. As shown above, when m becomes larger, the generalization bound of ListMLE becomes tighter while that of RankCosine becomes looser. More precisely, when m ≥ 6, ListMLE has a tighter bound than RankCosine. If m < 6 and we still want the bound of ListMLE to be tighter, then we need to carefully choose ψ and the other parameters: the corresponding conditions take the form of upper and lower bounds on b/(aBM) for ψ_L, and of lower bounds on aBM (relative to log m) for ψ_E and ψ_S. Otherwise, the bound of RankCosine will be tighter.

In practice, when we employ listwise ranking algorithms, the average length of the lists (i.e., the number of training documents per query) is usually not small. Take the benchmark LETOR dataset as an example: the average length of the lists in the TD2003 and TD2004
datasets is about 1000, and the average length of the lists in the OHSUMED dataset is about 150. In this case, the upper bound on the generalization ability of ListMLE is the best among the three listwise ranking algorithms under investigation.

Table 1. N(ψ) and C_A(ψ) for the listwise ranking algorithms.

  ψ    | N(ψ)                | C_ListMLE(ψ)                                  | C_ListNet(ψ)                                   | C_RankCosine(ψ)
  ψ_L  | a                   | 2 / ((b−aBM)(log m + log((b+aBM)/(b−aBM))))  | 2m! / ((b−aBM)(log m + log((b+aBM)/(b−aBM)))) | m / (2(b−aBM))
  ψ_E  | a e^{aBM}           | 2e^{aBM} / (log m + 2aBM)                    | 2m! e^{aBM} / (log m + 2aBM)                  | m e^{aBM} / 2
  ψ_S  | a / (1+e^{−aBM})²   | 2(1+e^{aBM}) / (log m + aBM)                 | 2m!(1+e^{aBM}) / (log m + aBM)                | m(1+e^{aBM}) / 2

3.3.2. Generalization Bounds w.r.t. Different Transformation Functions

By jointly considering N(ψ) and C_A(ψ), we obtain the results presented in Tables 2 and 3. Here we use ψ_1 ≺ ψ_2 to represent the case in which the generalization bound of an algorithm with transformation function ψ_1 is tighter than that with transformation function ψ_2. Due to space limitations, we omit the detailed proofs and only make some observations here.

Table 2. Results of different ψ for RankCosine (where K = e^{−aBM}). The relative order of ψ_L, ψ_S, and ψ_E is determined by where b − aBM falls with respect to the thresholds K² and K²(1+K), with further conditions on log m in the intermediate cases; ψ_S ≺ ψ_E always holds, and ψ_L ≺ ψ_S ≺ ψ_E holds when b − aBM > K²(1+K).

Table 3. Results of different ψ for ListMLE and ListNet when b > K²(1+K) + aBM (where K = e^{−aBM} and N = log((b+aBM)/(b−aBM))). The relative order of ψ_L, ψ_S, and ψ_E is determined by how log m compares with thresholds such as (2aBM·K² − (b−aBM)N) / (b−aBM − e^{−2aBM}) and aBM(e^{aBM} − 1); when m is not very small, ψ_L ≺ ψ_S ≺ ψ_E.

· For RankCosine, the sigmoid transformation function is always better than the exponential transformation function in terms of the generalization bound. As seen from Table 1, C_RankCosine(ψ_E)N(ψ_E) = (1 + e^{aBM}) C_RankCosine(ψ_S)N(ψ_S). Since aBM > 0, we have 1 + e^{aBM} > 2. Therefore, we always have ψ_S ≺ ψ_E for RankCosine.
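The identity just used, and the other Table 1 comparisons, can be checked numerically. The sketch below is ours (the function name is illustrative, and the formulas are those of Table 1 as we read it):

```python
import math

def bound_constants(phi, a, b, B, M, m):
    """C_A(psi) * N(psi) from Table 1 for psi in {"L", "E", "S"}
    (linear ax+b, exponential e^{ax}, sigmoid 1/(1+e^{-ax}))."""
    aBM = a * B * M
    if phi == "L":
        N = a
        lo, hi = b - aBM, b + aBM                  # psi(-BM), psi(BM)
    elif phi == "E":
        N = a * math.exp(aBM)
        lo, hi = math.exp(-aBM), math.exp(aBM)
    elif phi == "S":
        N = a / (1 + math.exp(-aBM)) ** 2
        lo = 1 / (1 + math.exp(aBM))
        hi = 1 / (1 + math.exp(-aBM))
    c_mle = 2 / (lo * (math.log(m) + math.log(hi / lo)))
    return {
        "ListMLE": c_mle * N,
        "ListNet": math.factorial(m) * c_mle * N,  # C_ListNet = m! * C_ListMLE
        "RankCosine": m / (2 * lo) * N,
    }
```

For instance, with a = 1, M = 1, B = 10, b = 11 (the example used below), the ListMLE constants order as ψ_L < ψ_S < ψ_E, and the RankCosine identity C_RankCosine(ψ_E)N(ψ_E) = (1 + e^{aBM}) C_RankCosine(ψ_S)N(ψ_S) holds exactly.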
· For RankCosine, the linear transformation function is better than the sigmoid transformation function in terms of the generalization bound in most cases. Note that B and M are fixed, and a is not very small (as discussed above); in this case, aBM will not be very small either. Since e^{−u} decreases rapidly as u increases, both e^{−2aBM} and e^{−2aBM}(1 + e^{−aBM}) will be very close to zero. Therefore, given b > aBM + ε, it is very likely that we also have b > e^{−2aBM}(1 + e^{−aBM}) + aBM. Referring to the results in Table 2, we obtain that ψ_L ≺ ψ_S holds for RankCosine in such a case.

· For ListMLE and ListNet, the linear transformation function is the best choice in terms of the generalization bound in most cases. For ListMLE and ListNet, when b > e^{−2aBM}(1 + e^{−aBM}) + aBM (which is likely to be true in most cases, as discussed above), we have the results listed in Table 3. Generally speaking, when m is not very small (which is true in practice, as discussed above), we have ψ_L ≺ ψ_S ≺ ψ_E. For example, suppose a = 1, M = 1, B = 10, and b = 11; then as long as m ≥ 1, we have ψ_L ≺ ψ_S ≺ ψ_E.

In summary, the above discussions imply that for all three algorithms, the use of the linear transformation function results in tighter generalization bounds in most cases. Actually, the compound of the transformation function ψ and the ranking function f can be viewed as a new ranking function class. For the same f, ψ determines the complexity of the new ranking function. Intuitively, the use of simpler ranking functions results in better generalization ability. Since the linear transformation function is the simplest among the three transformation functions, we can draw the above conclusion.

4. Conclusions and Future Work

In this paper, we have proposed a framework for ranking and analyzed the generalization ability of listwise ranking algorithms within the framework. The analysis is based on the theory of the Rademacher Average. We have proved a generalization bound for a listwise ranking algorithm that depends on the Rademacher Average of the class of compound functions. The compound functions take listwise loss functions as outer functions and ranking models as inner functions. We have computed the Rademacher Averages for three listwise ranking algorithms, ListMLE, ListNet, and RankCosine, and discussed the tightness of the generalization bounds in different situations. To our knowledge, this is the first work that formally discusses the generalization ability of listwise ranking algorithms.

The major findings of this work are as follows: (1) when the number of training samples approaches infinity, the generalization bounds of the three listwise ranking algorithms all converge to zero at the same rate of O(1/√n); (2) when the length of the list is larger than or equal to six, the generalization ability of ListMLE is possibly the best among the three algorithms; (3) in most cases, a linear transformation function is the best choice in terms of generalization ability.

As future work, we plan to investigate the approximation error, and thus the generalization bound between the true expected risk (based on a true loss of ranking) and the empirical risk (based on the listwise loss).

References

Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. Proceedings of the 18th Annual Conference on Learning Theory (pp. 32-47).

Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463-482.

Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. Advanced Lectures on Machine Learning, 169-207.

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. Proceedings of the 24th International Conference on Machine Learning (pp. 129-136).

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933-969.

Herbrich, R., Obermayer, K., & Graepel, T. (1999). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, 115-132.

Lan, Y., Liu, T.-Y., Qin, T., Ma, Z., & Li, H. (2008). Query-level stability and generalization in learning to rank. Proceedings of the 25th International Conference on Machine Learning (pp. 512-519).

Li, P., Burges, C., & Wu, Q. (2007). McRank: Learning to rank using multiple classification and gradient boosting. Proceedings of the 21st Annual Conference on Neural Information Processing Systems (pp. 845-852).

Liu, T.-Y., & Lan, Y. (2008). Generalization analysis of listwise learning-to-rank algorithms using Rademacher average (Technical Report MSR-TR-2008-155). Microsoft Research.

Mendelson, S. (2001). Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE Transactions on Information Theory, 48(1), 251-263.

Mendelson, S. (2003). A few notes on statistical learning theory. Advanced Lectures in Machine Learning, 1-40.

Qin, T., Zhang, X.-D., Tsai, M.-F., Wang, D.-S., Liu, T.-Y., & Li, H. (2007). Query-level loss functions for information retrieval. Information Processing & Management, 44(2), 838-855.

Xia, F., Liu, T.-Y., Wang, J., Zhang, W., & Li, H. (2008). Listwise approach to learning to rank: Theory and algorithm. Proceedings of the 25th International Conference on Machine Learning (pp. 1192-1199).