COFIRANK - Maximum Margin Matrix Factorization for Collaborative Ranking

Markus Weimer, Telecooperation Group, TU Darmstadt, Germany, mweimer@tk.informatik.tu-darmstadt.de
Alexandros Karatzoglou, Department of Statistics, Vienna University of Technology, alexis@ci.tuwien.ac.at
Quoc Viet Le, Computer Science Department, Stanford University, Stanford, CA 94305, quoc.le@stanford.edu
Alex Smola, SML, NICTA, Northbourne Av. 218, Canberra 2601, ACT, Australia, alex.smola@nicta.com.au

Abstract

In this paper, we consider collaborative filtering as a ranking problem. We present a method which uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating. We employ structured output prediction to optimize directly for specific non-uniform ranking scores. Experimental results show that our method achieves very good ranking scores and scales well on collaborative filtering tasks.

1 Introduction

Collaborative filtering has gained much attention in the machine learning community due to the need for it in webshops such as those of Amazon, Apple and Netflix. Webshops typically offer personalized recommendations to their customers, and the quality of these suggestions is crucial to their overall success. However, suggesting the right items is a highly nontrivial task: (1) there are many items to choose from, and (2) customers are willing to consider only very few (typically on the order of ten) recommendations. Collaborative filtering addresses this problem by learning the suggestion function for a user from ratings provided by this and other users on the items offered in the webshop. These ratings are typically collected on a five-star ordinal scale.

Learning the suggestion function can be treated either as a rating (classification) problem or as a ranking problem. In rating, one predicts the actual rating of an item that a customer has not rated yet. In ranking, one predicts a preference ordering over the yet unrated items. Given the limited size of the suggestion list shown to the customer, both approaches are used to compile a top-N list of recommendations. Such a list is the direct outcome of a ranking algorithm; it can also be computed from the results of a rating algorithm by sorting the items according to their predicted ratings.

We argue that rating algorithms solve the wrong problem, and indeed a harder one: the absolute value of a rating is highly biased across users, while the ranking is far less prone to this problem. Consider a pessimistic user who rates the item he likes best with four stars out of five, and an optimistic user who never rates lower than three stars. The ratings of these two users obviously cannot coincide, yet they may still agree on the same ranking of the items. Another approach is to solve the rating problem using regression; for the Netflix prize, for instance, which uses root mean squared error as its evaluation criterion, regression is the most straightforward approach. (We conjecture that this criterion was chosen to keep the rules simple, since ranking scores are somewhat nontrivial to define, and there are many different ways to evaluate a ranking, as we will see in the following.) However, the same arguments as above apply to regression. We therefore present an algorithm that solves the ranking problem directly, without first computing ratings.
For collaborative rating, Maximum Margin Matrix Factorization (MMMF) [11, 12, 10] has proven to be an effective means of estimating the rating function. MMMF takes advantage of collaborative effects: rating patterns of other users are used to estimate ratings for the current user. A key advantage of this approach is that it works without feature extraction. Feature extraction is domain specific; e.g., procedures developed for movies cannot be applied to books. It is therefore very hard to come up with a consistent feature set in applications with many different types of items, such as Amazon.

Our algorithm builds on this idea of MMMF but optimizes ranking measures instead of rating measures. Given that only the top-ranked items are actually presented to the user, it is much more important to rank the first items correctly than the last ones. In other words, it is more important to predict what a user likes than what she dislikes. In more technical terms, the loss we incur in estimation is not uniform over the ratings: in general, we want to estimate the popular items well, while ensuring that our estimates for unpopular items are low enough that we never end up suggesting them. All of these considerations lead to the following goals:

- The algorithm needs to optimize ranking scores directly and to adapt to different scores, such as Normalized Discounted Cumulative Gain.
- The algorithm should not require any features beyond the ratings in the collaborative ranking matrix.
- The algorithm needs to scale well and parallelize, so as to deal with millions of ratings arising from thousands of items and users with an acceptable memory footprint.

We achieve these goals by combining (a) recent results in optimization, in particular the application of bundle methods to convex optimization problems [14], (b) techniques for representing functions on matrices, in particular maximum margin matrix factorization [10, 11, 12], and (c) the application of structured estimation to ranking problems. This combination yields significant improvements in estimation performance on publicly available data sets. The remainder of this paper describes COFIRANK, using NDCG as the running example of a rank criterion; the method can easily be applied to other criteria.

2 Problem Definition

Assume that we have m items and u users. X^train is a set of (item, user) pairs for which a rating is available; the ratings are collected in a matrix Y ∈ {1, ..., r}^{m×u}, where r is the maximal score. In rating, one estimates the missing values of Y directly by means of regression or classification. As discussed above, the problem is better treated as a ranking task. A classical choice of objective is the uniform pairwise ranking error [4]; however, the correct ranking of highly ranked items is more important than that of lower ranked ones. This argument leads to a ranking measure called Normalized Discounted Cumulative Gain [16]:

Definition 1 ((Normalized) Discounted Cumulative Gain) Denote by y ∈ {1, ..., r}^n a vector of ratings sorted in decreasing order, and let π be a permutation of {1, ..., n}, where π_i denotes the position of item i. Moreover, let k ∈ N be a truncation threshold. The Discounted Cumulative Gain (DCG@k) score [5] and its normalized variant (NDCG@k) are given by

DCG@k(π, y) = Σ_{i=1}^{k} (2^{y_i} − 1) / log(π_i + 1)   and   NDCG@k(π, y) = DCG@k(π, y) / DCG@k(argsort(y), y).
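To make the definition concrete, the following minimal NumPy sketch (the function names are ours) evaluates DCG@k and NDCG@k exactly as written above, assuming natural logarithms and a rating vector y that is already sorted in decreasing order:

    import numpy as np

    def dcg_at_k(pi, y, k):
        """DCG@k from Definition 1. y: ratings sorted in decreasing order,
        pi[i]: 1-based position assigned to item i, k: truncation threshold."""
        pi = np.asarray(pi, dtype=float)
        y = np.asarray(y, dtype=float)
        return np.sum((2.0 ** y[:k] - 1.0) / np.log(pi[:k] + 1.0))

    def ndcg_at_k(pi, y, k):
        """NDCG@k: DCG@k normalized by its value at the identity permutation,
        which is optimal because y is sorted in decreasing order."""
        identity = np.arange(1, len(y) + 1)
        return dcg_at_k(pi, y, k) / dcg_at_k(identity, y, k)

    # a permutation that pushes the best item to the last position hurts the score
    y = np.array([5, 4, 3, 2, 1])
    print(ndcg_at_k(np.arange(1, 6), y, k=3))            # 1.0
    print(ndcg_at_k(np.array([5, 2, 3, 4, 1]), y, k=3))  # < 1.0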
If π sorts y in descending order, the DCG@k score is maximized, since any other permutation increases the denominator for high ratings. The truncation threshold k reflects the fact that users are only willing to consider a few recommendations. NDCG is a normalized version of DCG, so the score is bounded to [0, 1]. Unlike classification and regression measures, DCG is defined on permutations, not on absolute values of the ratings. Departing from traditional pairwise ranking measures, DCG is position dependent: higher positions have more influence on the score than lower positions. Optimizing DCG has gained much interest in the machine learning and information retrieval communities (e.g. [2]); here, however, we present the first effort to optimize this measure for collaborative filtering.

To perform estimation, we need a recipe for obtaining the permutations π. Since we want our system to be scalable, we need a method that scales not much worse than linearly in the number of items to be ranked. The avenue we pursue is to estimate a matrix F ∈ R^{m×u} and to use its entries to rank the items for each user. Given a set X of (item, user) pairs, such as X^train or X^test, we define the performance of F with respect to X as

R(X, F, Y) := Σ_i NDCG(π^i, Y^i),   (1)

where the sum runs over all users i with ratings in X, π^i is the ranking of the items rated by user i in X obtained by sorting the corresponding entries of F, and Y^i is the vector of the corresponding ratings. (Matrices are written in upper case, vectors in lower case.) While we would like to maximize R(X^test, F, Y), we only have access to R(X^train, F, Y). Hence, we need to restrict the complexity of F to ensure good performance on the test set when maximizing the score on the training set.

3 Structured Estimation for Ranking

A further difficulty is that R(X, F, Y) is non-convex; in fact, it is piecewise constant and therefore clearly not amenable to any type of smooth optimization. To address this issue we take recourse to structured estimation [13, 15]. Note that the score decomposes into a sum over individual users' scores, hence we only need to show how maximizing NDCG(π, y) for a single user can be replaced by minimizing a convex upper bound on its regret; summing over the users then provides a convex bound on all terms. (This decomposition also opens the possibility of parallelizing the implementation of the algorithm.) Our conversion works in three steps:

1. Convert NDCG(π, y) into a loss by computing the regret with respect to the optimal permutation argsort(y).
2. Denote by π a permutation of the n items a user might want to see, and let f ∈ R^n be an estimated score vector. We design a mapping ψ(π, f) ∈ R which is linear in f and is maximized over π at argsort(f).
3. Use the convex upper-bounding technique described by [15] to combine regret and linear map into a convex upper bound which we can minimize efficiently.

We assume throughout that y and f are sorted in descending order of relevance, which allows us to avoid double-index sorting in the derivation. Moreover, 1_n denotes the identity permutation.

Step 1 (Regret conversion) Instead of maximizing NDCG(π, y) we may equivalently minimize the regret

Δ(π, y) := NDCG(1_n, y) − NDCG(π, y).   (2)

Δ(π, y) is nonnegative and vanishes for π = 1_n, which is the optimal permutation we could obtain.
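A direct transcription of (2), reusing ndcg_at_k from the sketch above; by construction the value is nonnegative and vanishes at the identity permutation:

    def regret(pi, y, k):
        """Delta(pi, y) = NDCG(1_n, y) - NDCG(pi, y) of Eq. (2);
        y must be sorted in decreasing order so the identity is optimal."""
        identity = np.arange(1, len(y) + 1)
        return ndcg_at_k(identity, y, k) - ndcg_at_k(pi, y, k)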
Step 2 (Linear mapping) The key to our construction is the Polya-Littlewood-Hardy inequality: for any two vectors a, b ∈ R^n, their inner product is maximized by sorting a and b in the same order, that is, ⟨a, b⟩ ≤ ⟨sort(a), sort(b)⟩. This allows us to encode the permutation π̄ = argsort(f) in the following fashion: denote by c ∈ R^n a decreasing, nonnegative sequence. Then the function

ψ(π, f) := ⟨c, f_π⟩   (3)

is linear in f, and since c is decreasing by construction, the Polya-Littlewood-Hardy inequality implies that ψ(π, f) is maximized with respect to π at π = argsort(f). We found that choosing c_i = (i + 1)^{−0.25} produced good results in our experiments; we did not formally optimize this parameter.

Step 3 (Convex upper bound) We adapt a result of [15] which describes how to find convex upper bounds on nonconvex optimization problems.

Lemma 2 Assume that ψ is defined as in (3), and let π̄ := argsort(f) be the ranking induced by f. Then the loss function

l(f, y) := max_π [Δ(π, y) + ⟨c, f_π − f⟩]   (4)

is convex in f and satisfies l(f, y) ≥ Δ(π̄, y).

Proof We show convexity first. For each permutation π, the argument of the maximization is a linear and thus convex function of f. Taking the maximum over a set of convex functions is convex itself, which proves the first claim. To see that l is an upper bound, we use

l(f, y) ≥ Δ(π̄, y) + ⟨c, f_π̄ − f⟩ ≥ Δ(π̄, y).   (5)

The second inequality follows from the fact that π̄ maximizes ⟨c, f_π⟩.
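For illustration, the following brute-force sketch evaluates (4) by enumerating all permutations; this is exponential in n and only meant for toy sizes (Section 5 gives the efficient computation via linear assignment). It reuses regret from above and the choice c_i = (i + 1)^{−0.25}; the final assertion checks Lemma 2 numerically:

    import itertools
    import numpy as np

    def loss_upper_bound(f, y, k, c):
        """l(f, y) = max over permutations of [Delta(pi, y) + <c, f_pi - f>], Eq. (4).
        pi[i] is the 1-based position of item i; f_pi lists the scores by position."""
        n = len(f)
        best = -np.inf
        for perm in itertools.permutations(range(1, n + 1)):
            pi = np.array(perm)
            f_pi = f[np.argsort(pi)]   # score of the item placed at each position
            best = max(best, regret(pi, y, k) + c @ (f_pi - f))
        return best

    f = np.array([1.0, 1.4, 0.5, 0.8])    # estimated scores
    y = np.array([4.0, 3.0, 3.0, 1.0])    # ratings, sorted in decreasing order
    c = (np.arange(1, 5) + 1.0) ** -0.25
    pi_bar = np.empty(4, dtype=int)
    pi_bar[np.argsort(-f)] = np.arange(1, 5)   # ranking induced by f
    assert loss_upper_bound(f, y, k=3, c=c) >= regret(pi_bar, y, k=3)  # Lemma 2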
4 Maximum Margin Matrix Factorization

Loss The reasoning in the previous section showed how to replace the ranking score with a convex upper bound on a regret. This allows us to replace the problem of maximizing R(X, F, Y) by that of minimizing a convex function of F, namely

L(X, F, Y) := Σ_i l(F^i_{π_i}, sort(Y^i))  where  π_i = argsort(Y^i).   (6)

Here F^i and Y^i denote the predicted scores and the ratings on the items rated by user i in X, and F^i_{π_i} means that the permutation π_i is applied to F^i. This step ensures that the input to the convex upper bound l is properly sorted.

Matrix regularization Having addressed the non-convexity of the performance score, we need an efficient way of performing capacity control on F, since we only have L(X^train, F, Y) at our disposal, whereas we would like to do well on L(X^test, F, Y). We overcome this problem by means of a regularizer on F, namely the one proposed for Maximum Margin Matrix Factorization by Srebro and coworkers [10, 11, 12]. The key idea in their reasoning is to regularize F via

Ω[F] := ½ min_{M,U} [tr M^⊤M + tr U^⊤U]  subject to  MU = F.   (7)

More specifically, [12] show that Ω[F] is a proper norm on F. While Ω[F] could be handled by a semidefinite program as suggested in [11] (in that formulation one optimizes over a block matrix [A, F; F^⊤, B] ⪰ 0, with Ω[F] replaced by ½[tr A + tr B]), the latter is intractable for anything but the smallest problems. Instead, we replace F by MU and solve

minimize_{M,U}  L(X^train, MU, Y) + (λ/2)(tr M^⊤M + tr U^⊤U).   (8)

Note that this matrix factorization approach effectively allows us to learn an item matrix M and a user matrix U which store the specific properties of items and users respectively; in a way, the approach learns the features of the items and the users. The dimensionalities of M ∈ R^{m×d} and U ∈ R^{d×u} are chosen mainly based on computational concerns, since a full representation would require d = min(m, u). In particular, on large problems such as Netflix, the storage requirements for the user matrix can be enormous, and it is convenient to choose d = 10 or d = 100 in practice.

Algorithm While (8) is no longer jointly convex in M and U, it is still convex in each of M and U whenever the other is kept fixed. We use this insight to perform alternating subspace descent as proposed by [10]. The algorithm does not guarantee global convergence, which is a small price to pay for computational tractability:

repeat
  For fixed M, minimize (8) with respect to U.
  For fixed U, minimize (8) with respect to M.
until no more progress is made or a maximum iteration count has been reached.

Note that on problems of the size of Netflix, the matrix Y has on the order of 10^8 entries, which means that the number of iterations is typically limited by the available time. We now discuss a general optimization method for solving the regularized convex subproblems; for more details see [14].

5 Optimization

Bundle methods We discuss the optimization over the user matrix U first, that is, the problem of minimizing

R(U) := L(X^train, MU, Y) + (λ/2) tr U^⊤U.   (9)

The regularizer tr U^⊤U is rather simple to compute and minimize. On the other hand, L is expensive to compute, since it involves solving many linear assignment problems. Bundle methods, as proposed in [14], aim to overcome this problem by performing successive Taylor approximations of L and using them as lower bounds. In other words, they exploit the fact that

L(X^train, MU, Y) ≥ L(X^train, MU′, Y) + tr (U − U′)^⊤ ∂_U L(X^train, MU′, Y)  for all U, U′.

Since this holds for arbitrary U′, we may pick a set of locations U_j and lower-bound L by the maximum over the Taylor approximations at these locations. Subsequently, we minimize this piecewise linear lower bound in combination with (λ/2) tr U^⊤U to obtain a new location at which to compute the next Taylor approximation, and iterate until convergence. Algorithm 1 provides further details.

Algorithm 1 BundleMethod(ε)
  Initialize t = 0, W_0 = 0, b_0 = 0 and H = ∞
  repeat
    Find the minimizer U_t and the value L_t of the optimization problem
      minimize_U  max_{0 ≤ j ≤ t} [tr W_j^⊤U + b_j] + (λ/2) tr U^⊤U
    Compute W_{t+1} = ∂_U L(X^train, MU_t, Y)
    Compute b_{t+1} = L(X^train, MU_t, Y) − tr W_{t+1}^⊤U_t
    if H̄ := tr W_{t+1}^⊤U_t + b_{t+1} + (λ/2) tr U_t^⊤U_t ≤ H then
      update H ← H̄
    end if
    t ← t + 1
  until H − L_t ≤ ε

As we proceed with the optimization, we obtain increasingly tight lower bounds on L(X^train, MU, Y). One may show [14] that the algorithm converges to precision ε with respect to the minimizer of R(U) in O(1/ε) steps; moreover, the initial distance from the optimal solution enters the bound only logarithmically. After solving the optimization problem in U we switch to optimizing over the item matrix M. That algorithm is virtually identical, except that the regularizer is now on M instead of U. We find experimentally that a small number of alternations (fewer than 10) is more than sufficient for convergence.

Computing the loss So far we have simply used the loss l(f, y) of (4) to define a convex bound without any concern for its computability. To implement Algorithm 1, however, we need to be able to solve the maximization over the set of permutations efficiently. We begin by rewriting Δ(π, y) in a more convenient form:

Δ(π, y) = NDCG@k(1_n, y) − NDCG@k(π, y) = const. − Σ_{i=1}^{n} a_{π_i} b(y)_i = const. − ⟨a_π, b(y)⟩,   (10)

where one may check that the coefficients a and the function b(y) are given by

a_i = 1/log(i + 1)  and  b(y)_i = (2^{y_i} − 1) / DCG@k(1_n, y).

This means that we may write the loss l(f, y) as

l(f, y) = const. + max_π [⟨c, f_π⟩ − ⟨a_π, b(y)⟩],

where const. collects the terms that do not depend on π. This maximization is a linear assignment problem in terms of the permutation π: we optimize over the positions π_i to which the items are assigned, and the overall score is a sum of the gains of the individual assignments. Efficient algorithms [7] based on the Hungarian Marriage algorithm (also referred to as the Kuhn-Munkres algorithm) [8] exist for this problem and can be used to obtain the maximizer of (10); it turns out that this integer programming problem can be solved by invoking a linear program. This in turn allows us to compute l(f, y) efficiently.
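As a sketch of this reduction (the helper name is ours), one can assemble the gain matrix of placing item i at position j and hand it to an off-the-shelf solver; SciPy's linear_sum_assignment implements a Jonker-Volgenant-style method in the spirit of [7]. We assume f, c, a and b have been precomputed as above; the returned positions are 0-based:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def argmax_over_permutations(f, b, c, a):
        """Solve max_pi [<c, f_pi> - <a_pi, b>] from (10) as a linear assignment:
        gain[i, j] = c[j] * f[i] - a[j] * b[i] for placing item i at position j."""
        gain = np.outer(f, c) - np.outer(b, a)
        items, positions = linear_sum_assignment(gain, maximize=True)
        pi_bar = np.empty(len(f), dtype=int)
        pi_bar[items] = positions   # 0-based position assigned to each item
        return gain[items, positions].sum(), pi_bar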
Computing the gradients The second ingredient needed for the bundle method is the gradient of L(X, F, Y) with respect to F, from which the gradients with respect to M and U follow by the chain rule:

∂_M L(X, MU, Y) = [∂_F L(X, F, Y)|_{F=MU}] U^⊤  and  ∂_U L(X, MU, Y) = M^⊤ [∂_F L(X, F, Y)|_{F=MU}].

L decomposes into losses on individual users as described in (6), and for each user i only the corresponding score vector f := F^i_{π_i} matters. It follows that ∂_F L(X, F, Y) is assembled from the per-user gradients of l(f, sort(y)). For l defined as in (4) we know that

∂_f l(f, y) = c_{π̄^{−1}} − c,

where π̄ denotes the maximizer in the loss (4) and c_{π̄^{−1}} denotes the application of the inverse permutation π̄^{−1} to the vector c.
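The per-user gradient and the chain rule above can then be sketched as follows, reusing argmax_over_permutations from the previous snippet (whose 0-based positions index directly into c):

    def loss_gradient(f, b, c, a):
        """d l / d f = c_{pibar^{-1}} - c: each item picks up the c-weight of the
        position pibar assigns to it, minus the weight of its current slot."""
        _, pi_bar = argmax_over_permutations(f, b, c, a)
        return c[pi_bar] - c

    def gradients_wrt_factors(M, U, dF):
        """Chain rule for F = M @ U: dL/dM = dF @ U.T and dL/dU = M.T @ dF."""
        return dF @ U.T, M.T @ dF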
6 Experiments

We evaluated our system in two real-world evaluation settings, "weak" and "strong" generalization [9], on three publicly available data sets: EachMovie, MovieLens and Netflix. Their statistics are given in Table 1.

Table 1: Data set statistics
Dataset     Users    Movies   Ratings
EachMovie   61265    1623     2811717
MovieLens   983      1682     100000
Netflix     480189   17770    100480507

Weak generalization is evaluated by predicting the ranks of unrated items for users known at training time. To do so, we randomly select N = 10, 20, 50 ratings per user for training and evaluate on the remaining ratings. We compare our approach, COFIRANK-NDCG, to COFIRANK-Ordinal, COFIRANK-Regression and MMMF [10]; the results are shown in Table 2. COFIRANK-Ordinal applies the algorithm described above to preference ranking by optimizing the preference ranking loss; similarly, COFIRANK-Regression optimizes for regression using the root mean squared loss. For all COFIRANK experiments we chose λ = 10; we did not optimize this parameter. The MMMF results were obtained using the MATLAB code available from the homepage of the authors of [10]; there we used the regularization values reported to yield the best results for MMMF, λ = 1.19 for EachMovie and λ = 1.16 for MovieLens. In all experiments, we chose the dimensionality of U and M to be 100. All COFIRANK experiments, and those of MMMF on MovieLens, were repeated ten times. Unfortunately, we underestimated the runtime and memory requirements of MMMF on EachMovie; thus, we cannot report results for MMMF on this data set. Additionally, we performed experiments on the Netflix data set. We cannot compare to any of the other methods on that data set since, to the best of our knowledge, COFIRANK is the first ranking algorithm to be applied to it, presumably because of its large size.

Table 2: Results for the weak generalization experiments. We report the NDCG@10 accuracy for various numbers N of training ratings per user. For most results we report the mean and standard deviation over 10 runs; we also report the p-value of the best vs. the second best score.

EachMovie            NDCG@10, N=10      NDCG@10, N=20      NDCG@10, N=50
COFIRANK-NDCG        0.6673 ± 0.0015    0.7589 ± 0.0006    0.7291 ± 0.0040
COFIRANK-Ordinal     0.5592 ± 0.0040    0.6619 ± 0.0062    0.6415 ± 0.0086
COFIRANK-Regression  0.5349 ± 0.0214    0.6291 ± 0.0161    0.6354 ± 0.0142
p-value              2.7e-14            3.0e-12            6.3e-10

MovieLens            NDCG@10, N=10      NDCG@10, N=20      NDCG@10, N=50
COFIRANK-NDCG        0.6364 ± 0.0064    0.7153 ± 0.0038    0.6943 ± 0.0017
COFIRANK-Ordinal     0.6291 ± 0.0004    0.6601 ± 0.0013    0.62066 ± 0.0006
COFIRANK-Regression  0.6404 ± 0.0057    0.7015 ± 0.0056    0.6655 ± 0.0050
MMMF                 0.6061 ± 0.0037    0.6937 ± 0.0039    0.6989 ± 0.0051
p-value              0.011              6e-5               0.038

Netflix              NDCG@10, N=10      NDCG@10, N=20
COFIRANK-NDCG        0.6081             0.6204
COFIRANK-Regression  0.6082             0.6287

Strong generalization is evaluated on users that were not present at training time. We follow the procedure described in [17]: movies with fewer than 50 ratings are discarded, the 100 users with the most rated movies are selected as the test set, and the methods are trained on the remaining users. For evaluation, 10, 20 or 50 ratings of each of the 100 test users are selected; on those ratings, the per-user training procedure is applied to optimize U, while M is kept fixed to the values obtained during training. The remaining ratings are tested using the same procedure as for the weak generalization. We repeat the whole process 10 times and again use λ = 10 and a dimensionality of 100. We compare our method to Gaussian Process Ordinal Regression (GPOR) [3], Gaussian Process Regression (GPR), their collaborative extensions (CGPR, CGPOR) [17], and MMMF. Table 3 shows our results alongside those from [17].

Table 3: NDCG@10 accuracy (mean and standard deviation over 10 runs) for the strong generalization setting.

EachMovie        NDCG@10, N=10      NDCG@10, N=20      NDCG@10, N=50
COFIRANK-NDCG    0.6367 ± 0.001     0.6619 ± 0.0022    0.6771 ± 0.0019
GPR              0.4558 ± 0.015     0.4849 ± 0.0066    0.5375 ± 0.0089
CGPR             0.5734 ± 0.014     0.5989 ± 0.0118    0.6341 ± 0.0114
GPOR             0.3692 ± 0.002     0.3678 ± 0.0030    0.3663 ± 0.0024
CGPOR            0.3789 ± 0.011     0.3781 ± 0.0056    0.3774 ± 0.0041
MMMF             0.4746 ± 0.034     0.4786 ± 0.0139    0.5478 ± 0.0211

MovieLens        NDCG@10, N=10      NDCG@10, N=20      NDCG@10, N=50
COFIRANK-NDCG    0.6237 ± 0.0241    0.6711 ± 0.0065    0.6455 ± 0.0103
GPR              0.4937 ± 0.0108    0.5020 ± 0.0089    0.5088 ± 0.0141
CGPR             0.5101 ± 0.0081    0.5249 ± 0.0073    0.5438 ± 0.0063
GPOR             0.4988 ± 0.0035    0.5004 ± 0.0046    0.5011 ± 0.0051
CGPOR            0.5053 ± 0.0047    0.5089 ± 0.0044    0.5049 ± 0.0035
MMMF             0.5521 ± 0.0183    0.6133 ± 0.0180    0.6651 ± 0.0190

COFIRANK performs strongly compared to the other tested methods; particularly in the strong generalization setting, it outperforms the existing methods in almost all settings. Note that all methods except COFIRANK and MMMF use additional extracted features, either provided with the data set or extracted from the IMDB; MMMF and COFIRANK rely only on the rating matrix. In the weak generalization experiments on MovieLens, COFIRANK performs best for N = 20 but is marginally outperformed by MMMF for N = 10 and N = 50. We believe that with proper parameter tuning, COFIRANK would perform better in these cases as well.

7 Discussion and Summary

COFIRANK is a novel approach to collaborative filtering which directly solves the ranking problem faced by webshops. It does so faster and at higher accuracy than approaches which learn a rating in order to produce a ranking.
COFIRANK is adaptable to different ranking measures, such as NDCG or ERU, in a plug-and-play manner. Additionally, COFIRANK is well suited for privacy-concerned applications, as the optimization itself does not need the ratings of the users, but only gradients.

Our results, which we obtained without parameter tuning, are on par with or better than those of several of the most successful approaches to collaborative filtering, such as MMMF, even when those are used with tuned parameters. COFIRANK performs best on data sets of realistic size such as EachMovie and significantly outperforms the other approaches in the strong generalization setting.

In our experiments, COFIRANK proved to be very fast. For example, training on EachMovie with N = 10 completes in less than ten minutes and uses less than 80 MB of memory on a laptop. For N = 20, COFIRANK obtained an NDCG@10 of 0.72 after the first iteration, which also took less than ten minutes; this is the highest NDCG@10 score on that data set we are aware of, apart from the result of COFIRANK after convergence. A runtime comparison to MMMF is difficult, as MMMF is implemented in MATLAB and COFIRANK in C++; still, COFIRANK is more than ten times faster than MMMF while using far less memory. In the future, we will exploit the fact that the algorithm is easily parallelizable to obtain even better performance on current multi-core hardware as well as on computer clusters. Even the current implementation allowed us to report the first results on the Netflix data set for direct ranking optimization.

References

[1] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pages 43–52, 1998.
[2] C. J. Burges, Q. V. Le, and R. Ragno. Learning to rank with nonsmooth cost functions. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, 2007.
[3] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. J. Mach. Learn. Res., 6:1019–1041, 2005.
[4] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115–132, Cambridge, MA, 2000. MIT Press.
[5] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In ACM Special Interest Group in Information Retrieval (SIGIR), pages 41–48. New York: ACM, 2002.
[6] T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2006.
[7] R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325–340, 1987.
[8] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
[9] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.
[10] J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. Intl. Conf. Machine Learning, 2005.
[11] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge, MA, 2005. MIT Press.
[12] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm.
In P. Auer and R. Meir, editors, Proc. Annual Conf. Computational Learning Theory, number 3559 in Lecture Notes in Artificial Intelligence, pages 545–560. Springer-Verlag, June 2005.
[13] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32, Cambridge, MA, 2004. MIT Press.
[14] C. H. Teo, Q. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In Conference on Knowledge Discovery and Data Mining, 2007.
[15] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
[16] E. Voorhees. Overview of the TREC 2001 question answering track. In Text REtrieval Conference (TREC) Proceedings. Department of Commerce, National Institute of Standards and Technology, 2001. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001).
[17] S. Yu, K. Yu, V. Tresp, and H. P. Kriegel. Collaborative ordinal regression. In W. W. Cohen and A. Moore, editors, Proc. Intl. Conf. Machine Learning, pages 1089–1096. ACM, 2006.

A Ranking Losses

The main body of the paper describes the procedures to compute the loss value l(f, y) and its gradient ∂_f l(f, y) (the 'bundle') for the NDCG loss. This is the information needed by the bundle method to minimize the empirical risk. Algorithm 1 can in fact be used in combination with any loss; this section describes how to compute the loss value and its gradient efficiently for a number of other ranking losses.

A.1 Regression

For least mean squared regression, the loss and the gradient are

l(f, y) = ½ ‖f − y‖²  and  ∂_f l(f, y) = f − y.

A.2 Ordinal Regression

For ordinal regression, we would like to have f_i − f_j < 0 whenever y_i < y_j. Whenever this relationship is not satisfied, we incur a cost C(y_i, y_j) for preferring item i to item j. Denote by m_i the number of items j for which y_j = i. In this case, there are m² − Σ_{i=1}^{n} m_i² pairs (y_i, y_j) for which y_i ≠ y_j. Normalizing by the total number of comparisons, we may write the overall cost of the estimator as

(1/M) Σ_{y_i < y_j} C(y_i, y_j) 1{f_i > f_j}  where  M = ½ (m² − Σ_{i=1}^{n} m_i²).   (13)

As in Section 3, we majorize the step function by a soft-margin loss, which yields the convex upper bound

l(f, y) = (1/M) Σ_{y_i < y_j} C(y_i, y_j) max(0, 1 + f_i − f_j).   (14)

This can be computed as follows: sort f in ascending order and traverse it while keeping track of how many items with a lower value y_j are no more than 1 apart in terms of their value f_i. This way we may compute the count statistics efficiently. Algorithm 2 describes the details, generalizing the results of [6]. Again, its runtime is O(m log m), thus allowing for efficient computation.
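As a reference point, here is a naive O(m²) sketch of the bound (14) and its gradient (the function name is ours, and we assume integer ratings so that they can index the score matrix C); Algorithm 2 below computes the same quantities in O(m log m):

    import numpy as np

    def ordinal_loss_naive(f, y, C):
        """Naive evaluation of the ordinal bound (14) and its gradient.
        Assumes integer ratings so that C can be indexed as C[y[i], y[j]]."""
        m = len(f)
        _, counts = np.unique(y, return_counts=True)
        M = 0.5 * (m ** 2 - np.sum(counts ** 2))  # number of pairs with y_i != y_j
        l, g = 0.0, np.zeros(m)
        for i in range(m):
            for j in range(m):
                # a pair is (softly) violated when f_i is not at least 1 below f_j
                if y[i] < y[j] and f[j] - f[i] < 1:
                    w = C[y[i], y[j]] / M
                    l += w * (1 + f[i] - f[j])
                    g[i] += w
                    g[j] -= w
        return l, g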
A.3 Preference Relations

In general, our loss may be described by means of a set of preference relations j ≻ i for arbitrary pairs (i, j) ∈ {1, ..., m}², with an associated cost C(i, j) which is incurred whenever i is ranked above j. This set of preferences may or may not form a partial or a total order on the domain of all observations. When it does, efficient computations along the lines of Algorithm 2 exist. In general this is not the case, and we need to rely on the fact that the set P containing all preferences is sufficiently small to be enumerated efficiently. The loss is then given by

(1/|P|) Σ_{(i,j) ∈ P} C(i, j) 1{f_i > f_j}.   (15)

Again, the same majorization argument as before allows us to write the convex upper bound

l(f, y) = (1/|P|) Σ_{(i,j) ∈ P} C(i, j) max(0, 1 + f_i − f_j)   (16)

with gradient

∂_f l(f, y) = (1/|P|) Σ_{(i,j) ∈ P} C(i, j) g_{ij},  where  g_{ij} = 0 if f_j − f_i ≥ 1 and g_{ij} = e_i − e_j otherwise,   (17)

and where e_i denotes the i-th unit vector. The implementation is straightforward, as given in Algorithm 3.

Algorithm 2 (l, g) = Ordinal(f, y, C)
  input: vectors f and y, score matrix C
  output: loss l := l(f, y) and gradient g := ∂_f l(f, y)
  for i = 1 to n do h_i = 0 and u_i = m_i end for
  set c = [f − ½, f + ½] ∈ R^{2m} (concatenation of the two vectors)
  compute M = ½ (m² − ‖u‖²)
  rescale C ← C/M
  index = QuickSort(c)
  initialize l = 0 and g = 0
  for i = 1 to 2m do
    j = index(i) mod m and z = y_j
    if index(i) ≤ m then
      for k = 1 to z − 1 do
        l ← l − C(k, z) u_k c_j
        g_j ← g_j − C(k, z) u_k
      end for
      h_z ← h_z + 1
    else
      for k = z + 1 to n do
        l ← l + C(z, k) h_k c_{j+m}
        g_j ← g_j + C(z, k) h_k
      end for
      u_z ← u_z − 1
    end if
  end for

Algorithm 3 (l, g) = Preference(f, P, C)
  input: vector f, preference set P, score matrix C
  output: loss l := l(f, y) and gradient g := ∂_f l(f, y)
  initialize l = 0 and g = 0
  for all (i, j) ∈ P do
    if f_j − f_i < 1 then
      l ← l + C(i, j)(1 + f_i − f_j)
      g_i ← g_i + C(i, j) and g_j ← g_j − C(i, j)
    end if
  end for

A.4 Expected Rank Utility

Like the NDCG ranking measure, Expected Rank Utility is a permutation-dependent and position-dependent ranking measure.

Definition 3 (Expected Rank Utility) Under the same assumptions on y and π as in Definition 1, the Expected Rank Utility (ERU) and its normalized variant (NERU) are defined as [1]

ERU(π, y) = Σ_{i=1}^{k} 2^{(1 − π_i)/(α − 1)} max(y_i − d, 0)  and  NERU(π, y) = ERU(π, y) / ERU(argsort(y), y).   (18)

Here d is a "neutral" vote and α is the viewing halflife. This means that the top positions carry an exponentially higher weight than positions lower on the list; in other words, d is a cutoff beyond which we stop caring about the particular item.

Due to the similarities between NDCG and NERU, the optimization procedure described for NDCG can be applied directly to NERU. We merely have to use new settings for the vectors a and b:

a_i := 2^{(1 − i)/(α − 1)}  and  b_i := max(y_i − d, 0).   (19)
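Finally, a small sketch of Definition 3 under the same conventions as the NDCG snippet (1-based positions, y sorted in decreasing order so that the identity permutation is optimal); alpha and d are the viewing halflife and the neutral vote:

    import numpy as np

    def eru_at_k(pi, y, k, alpha, d):
        """ERU from (18): exponentially decaying position weights
        2^((1 - pi_i)/(alpha - 1)) applied to the clipped ratings
        max(y_i - d, 0) of the first k items."""
        pi = np.asarray(pi, dtype=float)
        y = np.asarray(y, dtype=float)
        w = 2.0 ** ((1.0 - pi[:k]) / (alpha - 1.0))
        return np.sum(w * np.maximum(y[:k] - d, 0.0))

    def neru_at_k(pi, y, k, alpha, d):
        identity = np.arange(1, len(y) + 1)
        return eru_at_k(pi, y, k, alpha, d) / eru_at_k(identity, y, k, alpha, d)

    # plugging a_i = 2^((1 - i)/(alpha - 1)) and b_i = max(y_i - d, 0) into the
    # assignment machinery of Section 5 optimizes NERU instead of NDCG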