A Simple Algorithm for Nuclear Norm Regularized Problems

Martin Jaggi (jaggi@inf.ethz.ch), Marek Sulovský (smarek@inf.ethz.ch)
Institute of Theoretical Computer Science, ETH Zurich, CH-8092 Zurich, Switzerland

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Optimization problems with a nuclear norm regularization, such as e.g. low norm matrix factorizations, have seen many applications recently. We propose a new approximation algorithm building upon the recent sparse approximate SDP solver of (Hazan, 2008). The experimental efficiency of our method is demonstrated on large matrix completion problems such as the Netflix dataset. The algorithm comes with strong convergence guarantees, and can be interpreted as a first theoretically justified variant of Simon-Funk-type SVD heuristics. The method is free of tuning parameters, and very easy to parallelize.

1. Introduction

This paper considers large scale convex optimization problems with a nuclear norm regularization, as for instance low norm matrix factorizations. Such formulations occur in many machine learning and compressed sensing applications such as dimensionality reduction, matrix classification, multi-task learning and matrix completion (Srebro et al., 2004; Candes & Tao, 2009). Matrix completion by using matrix factorizations of either low rank or low norm has gained a lot of attention in the area of recommender systems (Koren et al., 2009) with the recently ended Netflix Prize competition. Our new method builds upon the recent first-order optimization scheme for semi-definite programs (SDP) of (Hazan, 2008) and has strong convergence guarantees.

We consider the following convex optimization problems over matrices:

    \min_{X \in \mathbb{R}^{n \times m}} \; f(X) + \mu \|X\|_*                              (1)

and the corresponding constrained variant

    \min_{X \in \mathbb{R}^{n \times m},\ \|X\|_* \le t/2} \; f(X)                          (2)

where f(X) is any differentiable convex function (usually called the loss function), and ||.||_* is the nuclear norm of a matrix, also known as the trace norm (the sum of the singular values, or the l1-norm of the spectrum). Here µ > 0 and t > 0 are given parameters, each usually called the regularization parameter of its respective formulation.

When choosing f(X) := ||A(X) - b||_2^2 for some linear map A : R^{n×m} → R^p, formulation (1) is the matrix generalization of the problem min_{x ∈ R^n} ||Ax - b||_2^2 + µ||x||_1, the important l1-regularized least squares problem, also known as the basis pursuit de-noising problem in the compressed sensing literature. The analogous vector variant of (2) is the Lasso problem (Tibshirani, 1996), which is min_{x ∈ R^n} { ||Ax - b||_2^2 : ||x||_1 ≤ t }.
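To make the two formulations concrete, the following minimal numpy sketch (purely illustrative; the observed entries, the loss and the parameters µ and t are arbitrary toy values, not taken from the paper) evaluates the nuclear norm as the l1-norm of the spectrum, the trade-off objective (1), and the constraint of (2).

```python
import numpy as np

def nuclear_norm(X):
    # Nuclear (trace) norm: the sum of the singular values of X.
    return np.linalg.svd(X, compute_uv=False).sum()

# Toy squared loss on a few observed entries of a 4 x 3 matrix (hypothetical data).
rows = np.array([0, 1, 3])
cols = np.array([2, 0, 1])
vals = np.array([1.0, -0.5, 2.0])

def loss(X):
    return ((X[rows, cols] - vals) ** 2).sum()

X = np.random.randn(4, 3)
mu, t = 0.1, 5.0
objective_1 = loss(X) + mu * nuclear_norm(X)   # trade-off formulation (1)
feasible_2 = nuclear_norm(X) <= t / 2          # constraint of formulation (2)
print(objective_1, feasible_2)
```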
Recently, (Toh & Yun, 2009; Liu et al., 2009) and (Ji & Ye, 2009) independently proposed algorithms that obtain an ε-accurate solution to (1) in O(1/√ε) steps, by improving the algorithm of (Cai et al., 2008). More recently, (Mazumder et al., 2009) and (Ma et al., 2009) proposed further algorithms in this line of so-called singular value thresholding methods, but without a guaranteed convergence speed. Each step of all those algorithms requires the computation of the singular value decomposition (SVD) of a matrix of the same size as the solution matrix, which is expensive even with the currently available fast methods such as PROPACK. Both (Toh & Yun, 2009) and (Ji & Ye, 2009) show that the primal error of their algorithm is smaller than ε after O(1/√ε) steps, using an analysis in the spirit of (Nesterov, 1983).

We present a much simpler algorithm to solve problems of the form (2), which does not need any SVD computations. We achieve this by transforming the problem to a convex optimization problem over positive semi-definite matrices, and then using the approximate SDP solver of Hazan (2008). Hazan's algorithm can be interpreted as the generalization of the coreset approach to problems on symmetric matrices. The algorithm has a strong approximation guarantee, in the sense of obtaining an ε-small primal-dual error (not only a small primal error). With the resulting approximate solution X, our algorithm also gives a matrix factorization X = UV^T of rank O(1/ε) (with the desired bounded nuclear norm). Compared to (Nesterov, 1983), a moderately increased number of steps is needed in total, O(1/ε), which is the price for the very severe simplification of each individual step of our method on the one hand, and the improved (low) rank on the other hand. We demonstrate that our new algorithm improves the state of the art nuclear norm methods on standard datasets, and scales to large problems such as matrix factorizations on the Netflix dataset. Furthermore, the algorithm is easy to implement and parallelize, as it only uses the power method (or Lanczos steps) to approximate the largest eigenvalue of a matrix. Our method can also be interpreted as a modified, theoretically justified variant of Simon Funk's popular SVD heuristic (Webb, 2006), making it suitable for low norm matrix factorization. To our knowledge this is the first guaranteed convergence result for this class of SVD-like gradient descent algorithms. Unlike most other comparable algorithms, our general method is free of tuning parameters (apart from the regularization parameter).

Notation. For arbitrary real matrices, the standard inner product is defined as <A, B> := Tr(A^T B), and the (squared) Frobenius matrix norm ||A||_Fro^2 := <A, A> is the sum of all squared entries of the matrix. By S^{d×d} we denote the set of symmetric d×d matrices. A matrix A ∈ R^{d×d} is called positive semi-definite (PSD), written A ⪰ 0, iff v^T A v ≥ 0 for all v ∈ R^d.

2. Hazan's Algorithm

Our main ingredient is the following simple gradient-descent-type algorithm of (Hazan, 2008), which obtains sparse solutions to any convex optimization problem of the form

    \min_{Z \in S} \; f(Z) ,                                                                (3)

where S := { Z ∈ S^{d×d} | Z ⪰ 0, Tr(Z) = 1 } is the set of PSD matrices of unit trace. The set S is sometimes called the spectrahedron and is a generalization of the unit simplex to the space of symmetric matrices. The algorithm guarantees an ε-small primal-dual error after at most O(1/ε) iterations, where each iteration only involves the calculation of a single approximate eigenvector of a matrix M ∈ S^{d×d}. In practice, for example Lanczos' method or the power method can be used.

Algorithm 1  Hazan's Algorithm
  Input: Convex f with curvature constant C_f, target accuracy ε.
  Initialize Z^(1) := v_0 v_0^T for an arbitrary unit vector v_0.
  for k = 1 to ⌈4 C_f / ε⌉ do
      Compute v_k := ApproxEV( -∇f(Z^(k)), C_f / k^2 )
      Let α_k := 1/k
      Set Z^(k+1) := Z^(k) + α_k ( v_k v_k^T - Z^(k) )
  end for

Here ApproxEV(M, ε') is a sub-routine that delivers an approximate largest eigenvector of a matrix M with the desired accuracy ε', meaning a unit length vector v such that v^T M v ≥ λ_max(M) - ε'. Note that since our convex function f takes a symmetric matrix Z as an argument, its gradient ∇f(Z) is a symmetric matrix.

The actual running time for a given convex function f : S^{d×d} → R depends on its curvature constant C_f (also called the modulus of convexity), defined as

    C_f := \sup_{Z, V \in S,\ \alpha \in [0,1],\ Z' = Z + \alpha (V - Z)} \ \frac{1}{\alpha^2} \left( f(Z') - f(Z) - \langle Z' - Z, \nabla f(Z) \rangle \right) ,

which turns out to be small for many applications. (An overview of values of C_f for several classes of functions f can be found in (Clarkson, 2008).)

The algorithm can be seen as a matrix generalization of the sparse greedy approximation algorithm of (Clarkson, 2008) for vectors in the unit simplex, called the coreset method, which has seen many successful applications in a variety of areas ranging from clustering to support vector machine training, smallest enclosing ball/ellipsoid, boosting and others. Here sparsity just gets replaced by low rank. The same Algorithm 1 with a well-crafted function f can also be used to solve arbitrary SDPs in feasibility form.
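To illustrate the structure of Algorithm 1 on small dense instances, here is a minimal Python/numpy sketch (function names and the toy objective are ours, not from the paper; a plain power iteration with a diagonal shift stands in for the ApproxEV sub-routine):

```python
import numpy as np

def approx_ev(M, n_iter=100):
    # Power iteration for an approximate largest eigenvector of the symmetric matrix M.
    # Shifting the diagonal keeps the largest (signed) eigenvalue dominant in magnitude.
    d = M.shape[0]
    shift = np.abs(M).sum(axis=1).max()          # crude upper bound on the spectral radius
    A = M + shift * np.eye(d)
    v = np.ones(d) / np.sqrt(d)
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

def hazan_algorithm(grad_f, d, C_f, eps):
    # Algorithm 1: minimize a convex f over the spectrahedron {Z : Z >= 0, Tr(Z) = 1}.
    v0 = np.ones(d) / np.sqrt(d)
    Z = np.outer(v0, v0)
    for k in range(1, int(np.ceil(4 * C_f / eps)) + 1):
        v = approx_ev(-grad_f(Z))                # approximate top eigenvector of -grad f(Z^(k))
        alpha = 1.0 / k
        Z = Z + alpha * (np.outer(v, v) - Z)     # convex rank-1 update, stays in the spectrahedron
    return Z

# Toy usage: f(Z) = 0.5 * ||Z - B||_Fro^2 for a fixed symmetric B, so grad f(Z) = Z - B.
d = 6
B = np.random.randn(d, d)
B = (B + B.T) / 2
Z_approx = hazan_algorithm(lambda Z: Z - B, d, C_f=1.0, eps=0.05)
```

The dense sketch is only meant to convey the update rule; on the transformed problems of Section 3 the gradient has a special sparse block structure that is exploited later.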
3. Transformation to convex problems

3.1. Motivation: Formulating Matrix Factorizations as Convex Optimization Problems

Approximate matrix factorization refers to the setting of approximating a given matrix Y ∈ R^{n×m} (typically given only partially) by a product X = UV^T, under an additional low rank or low norm constraint, such that some error function f(X) is minimized.

Most of the currently known gradient-descent-type algorithms for matrix factorization suffer from the following problem: even if the loss function f(X) is convex in X, the same function expressed as a function f(UV^T) of both factor variables U and V usually becomes a non-convex problem (consider for example U, V ∈ R^{1×1} together with the identity function f(x) = x). Therefore many of the popular methods, such as for example (Rennie & Srebro, 2005; Lin, 2007), can get stuck in local minima and so are neither theoretically nor practically well justified, see also (DeCoste, 2006).

These shortcomings can be overcome as follows: one can equivalently transform any low-norm matrix factorization problem (which is usually not convex in its two factor variables) into an optimization problem over symmetric matrices. For any function f on R^{n×m}, the optimization problem

    \min_{U \in \mathbb{R}^{n \times r},\ V \in \mathbb{R}^{m \times r}} \; f(UV^T) \quad \text{s.t.} \quad \|U\|_{Fro}^2 + \|V\|_{Fro}^2 = t                     (4)

is equivalent to

    \min_{Z \in S^{(n+m) \times (n+m)},\ \mathrm{rank}(Z) \le r} \; \hat{f}(Z) \quad \text{s.t.} \quad Z \succeq 0,\ \mathrm{Tr}(Z) = t ,                         (5)

where "equivalent" means that for any feasible solution of each problem, there is a feasible solution of the other problem with the same objective value. Here \hat{f} is the same function as f, just acting on the corresponding off-diagonal rectangle of the larger symmetric matrices Z ∈ S^{(n+m)×(n+m)}. Formally,

    \hat{f}(Z) = \hat{f}\begin{pmatrix} Z_1 & Z_2 \\ Z_2^T & Z_3 \end{pmatrix} := f(Z_2) .

The equivalence holds simply because every PSD matrix Z can be written as some product

    Z = \begin{pmatrix} U \\ V \end{pmatrix} \begin{pmatrix} U^T & V^T \end{pmatrix} = \begin{pmatrix} UU^T & UV^T \\ VU^T & VV^T \end{pmatrix}

for U ∈ R^{n×r} and V ∈ R^{m×r}, when r ≥ rank(Z). On the other hand, any arbitrary valued matrices U, V give rise to a PSD matrix Z of this form. Furthermore, Tr(Z) = Tr(UU^T) + Tr(VV^T) = ||U||_Fro^2 + ||V||_Fro^2 holds by definition.

The main advantage of this reformulation is that if the rank r := n + m is not restricted, the new problem (5) is a convex problem over a nice, well-studied convex domain (the cone of PSD matrices of fixed trace), whereas the original formulation (4) is usually not convex in both arguments U and V.

3.2. Nuclear Norm Regularized Problems

In the same spirit, we obtain that any nuclear norm regularized problem of the form (2) is equivalent to the convex problem given by the following Corollary 2.

Lemma 1. For any non-zero matrix X ∈ R^{n×m} and t ∈ R: ||X||_* ≤ t/2 if and only if there exist symmetric matrices A ∈ S^{n×n} and B ∈ S^{m×m} such that

    \begin{pmatrix} A & X \\ X^T & B \end{pmatrix} \succeq 0 \quad \text{and} \quad \mathrm{Tr}(A) + \mathrm{Tr}(B) = t .

Proof. This is a slight variation of the argument of (Fazel et al., 2001; Srebro et al., 2004). From the characterization ||X||_* = min_{UV^T = X} (1/2)(||U||_Fro^2 + ||V||_Fro^2) we get that there exist U, V with UV^T = X such that ||U||_Fro^2 + ||V||_Fro^2 = Tr(UU^T) + Tr(VV^T) ≤ t; in other words, we have found a PSD matrix (UU^T, X; X^T, VV^T) of trace, say, s ≤ t. If s < t, we add (t - s) to the top-left entry of A, i.e. we add to A the PSD matrix (t - s) e_1 e_1^T (which again gives a PSD matrix). Conversely, as the block matrix is symmetric and PSD, it can be (Cholesky) factorized as (U; V)(U; V)^T with UV^T = X and t = Tr(UU^T) + Tr(VV^T) = ||U||_Fro^2 + ||V||_Fro^2, therefore ||X||_* ≤ t/2.

Corollary 2. Any nuclear norm regularized problem of the form (2) is equivalent to

    \min_{Z \in S^{(n+m) \times (n+m)},\ Z \succeq 0,\ \mathrm{Tr}(Z) = t} \; \hat{f}(Z) .                                                                       (6)
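As a quick numerical check of the reformulation and of Lemma 1 (random toy factors; nothing here is from the paper's experiments), one can assemble Z from a factorization (U; V), confirm the trace identity, and verify the nuclear norm bound:

```python
import numpy as np

n, m, r = 5, 4, 2
U = np.random.randn(n, r)
V = np.random.randn(m, r)
X = U @ V.T

# Z = (U; V)(U; V)^T is PSD, and its top-right off-diagonal block equals X.
W = np.vstack([U, V])
Z = W @ W.T
assert np.allclose(Z[:n, n:], X)

# Trace identity: Tr(Z) = ||U||_Fro^2 + ||V||_Fro^2.
t = np.trace(Z)
assert np.isclose(t, np.linalg.norm(U)**2 + np.linalg.norm(V)**2)

# Lemma 1: with Tr(A) + Tr(B) = t for this PSD block matrix, ||X||_* <= t/2.
nuclear = np.linalg.svd(X, compute_uv=False).sum()
assert nuclear <= t / 2 + 1e-9
```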
Note that both transformations in this section are equivalent formulations and not just relaxations. As already mentioned above, an explicit factorization of any feasible solution to (5) or (6), if needed, can always be obtained directly since Z ⪰ 0. Alternatively, algorithms for solving the transformed problem (5) or (6) can directly maintain the approximate solution Z in a factorized representation, as achieved for example by Hazan's algorithm.

3.3. Two Variants of Regularization

The two original problem formulations (1) and (2) are very closely related, and are used interchangeably in many applications: if X* is an optimal solution to the trade-off variant (1), then the same solution is also optimal for (2) when the norm constraint is chosen as t/2 := ||X*||_*. On the other hand, (1) is just the Lagrangian version of (2), with µ being the Lagrange multiplier belonging to the single constraint. This is the same change in formulation as when going from the regularized least squares formulation (the vector analogue of (1)) to the Lasso problem corresponding to (2), and vice versa.

4. Solving Nuclear Norm Regularized Problems

By the equivalent reformulation of the previous section, as in Corollary 2, we can now solve both general nuclear norm regularized problems and low norm matrix factorizations by using Hazan's algorithm.

Algorithm 2  Nuclear Norm Regularized Solver
  1. Consider the transformed problem for \hat{f} given by Corollary 2.
  2. Adjust the function \hat{f} by re-scaling all matrix entries by 1/t.
  3. Run Algorithm 1 for \hat{f}(Z).

The following theorem shows that Algorithm 2 runs in time linear in the number N_f of non-zero entries of the gradient ∇f. This makes it very attractive in particular for recommender systems applications and matrix completion, where ∇f is a sparse matrix (with the same sparsity pattern as the observed entries).

Theorem 3. Algorithm 2 obtains an approximate solution of primal-dual error ≤ ε for problems of the form (2) after at most ⌈4 C_f / ε⌉ many steps (or, in other words, approximate eigenvector computations). In the k-th call of ApproxEV(), it is sufficient to perform O(k) iterations of Lanczos' method. The overall running time is then O(N_f / ε^2), or equivalently O(1/ε^2) many sparse matrix-vector multiplications.

Proof. We use Corollary 2 and then rescale all matrix entries by 1/t. The running time then follows from Theorem 2 of (Hazan, 2008).
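A compact dense sketch of Algorithm 2 for a toy matrix completion instance (all names and data are ours; an exact eigensolver stands in for ApproxEV, and the rescaling by 1/t is expressed through the recovery X = t * Z_2; a positive rescaling of the gradient does not change the top eigenvector, so only its direction matters here):

```python
import numpy as np

def nuclear_norm_solver(grad_f, n, m, t, steps=200):
    # Algorithm 2 (sketch): approximately solve min f(X) s.t. ||X||_* <= t/2
    # by running Hazan's algorithm on the transformed, rescaled symmetric problem.
    d = n + m
    v0 = np.ones(d) / np.sqrt(d)
    Z = np.outer(v0, v0)                    # PSD with unit trace
    for k in range(1, steps + 1):
        X = t * Z[:n, n:]                   # current candidate X = t * Z_2
        G = grad_f(X)                       # gradient of f at X, an n x m matrix
        M = np.zeros((d, d))                # proportional to -grad of the transformed function
        M[:n, n:] = -G
        M[n:, :n] = -G.T
        v = np.linalg.eigh(M)[1][:, -1]     # exact largest eigenvector (stands in for ApproxEV)
        Z = Z + (1.0 / k) * (np.outer(v, v) - Z)
    return t * Z[:n, n:]

# Toy completion problem: squared loss on a handful of observed entries.
n, m = 6, 5
rows, cols = np.array([0, 2, 4]), np.array([1, 3, 0])
vals = np.array([1.0, -2.0, 0.5])

def grad_f(X):
    G = np.zeros((n, m))
    G[rows, cols] = X[rows, cols] - vals    # gradient of 0.5 * sum of squared residuals
    return G

X_hat = nuclear_norm_solver(grad_f, n, m, t=4.0)
```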
The fact that each iteration of our algorithm is computationally very cheap, consisting only of the computation of an approximate eigenvector, strongly contrasts with the existing "singular value thresholding" methods, which in each step need to compute an entire SVD. A single such incomplete SVD computation (finding the first k singular vectors) amounts to the same computational cost as an entire run of our algorithm (for k steps). Furthermore, those existing methods which come with a theoretical guarantee, i.e. (Toh & Yun, 2009; Liu et al., 2009; Ji & Ye, 2009; Ma et al., 2009), assume in their analysis that all SVDs used during the algorithm are exact, which is not feasible in practice. By contrast, our analysis is rigorous even if the used eigenvectors are only ε-approximate.

Another nice property of Hazan's method is that the returned solution is guaranteed to be simultaneously of low rank (rank k after k steps), and that by incrementally adding the rank-1 matrices v_k v_k^T, the algorithm automatically maintains a matrix factorization of the approximate solution. Also, Hazan's algorithm is designed to automatically stay within the feasible region S, whereas most of the existing approximate SDP-like methods need a projection step to get back to the feasible region (as e.g. (Lin, 2007; Liu et al., 2009)), which makes both their theoretical analysis and implementation much more complicated.

4.1. The Structure of the Eigenvalue Problem

For the actual computation of the approximate largest eigenvector in ApproxEV(-∇\hat{f}(Z^(k)), C_f / k^2), either Lanczos' method or the power method (as in PageRank, see e.g. (Berkhin, 2005)) can be used. Both methods are known to scale well to very large problems and can be parallelized easily, as each iteration consists of just one matrix-vector multiplication. However, we have to be careful to obtain the eigenvector for the largest eigenvalue, which is not necessarily the principal one (largest in absolute value). In that case the spectrum can be shifted by adding an appropriate constant to the diagonal of the matrix. (Hazan, 2008) made use of the fact that Lanczos' method, which is theoretically better understood, provably obtains the required approximation quality in a bounded number of steps if the matrix is PSD (Arora et al., 2005).

For an arbitrary loss function f, the gradient ∇\hat{f}(Z), which is the matrix whose largest eigenvector (of its negative) we have to compute in the algorithm, is always a symmetric matrix of the block form

    \nabla \hat{f}(Z) = \begin{pmatrix} 0 & G \\ G^T & 0 \end{pmatrix} \quad \text{for} \quad G = \nabla f(Z_2), \quad \text{when} \quad Z = \begin{pmatrix} Z_1 & Z_2 \\ Z_2^T & Z_3 \end{pmatrix} .

In other words, ∇\hat{f}(Z) is the adjacency matrix of a weighted bipartite graph. One vertex class corresponds to the n rows of the original matrix Z_2 (the users in recommender systems), the other class corresponds to the m columns (the items or movies). The spectrum of ∇\hat{f} is always symmetric: whenever (v; w) is an eigenvector for some eigenvalue λ, then (v; -w) is an eigenvector for -λ. Hence, we have exactly the same setting as in the established hubs and authorities (HITS) model (Kleinberg, 1999): the first part of any eigenvector is always an eigenvector of the hub matrix GG^T, and the second part is an eigenvector of the authority matrix G^T G.

Repeated squaring. In the special case that the matrix X is very rectangular (n ≪ m or n ≫ m), one of the two matrices G^T G or GG^T is very small. Then it is known that one can obtain an exponential speed-up in the power method by repeatedly squaring the smaller of the two matrices: in other words, we can perform O(log(1/ε)) many matrix-matrix multiplications instead of O(1/ε) matrix-vector multiplications.
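Since only the off-diagonal block G is ever non-zero, a power iteration never needs to form the full (n+m) x (n+m) matrix; a sketch of the implicit matrix-vector product together with the diagonal shift mentioned above (helper names are ours):

```python
import numpy as np

def block_matvec(G, x):
    # Multiply [[0, G], [G^T, 0]] with x = (x_top; x_bottom) without building the big matrix.
    n, m = G.shape
    return np.concatenate([G @ x[n:], G.T @ x[:n]])

def shifted_power_iteration(G, n_iter=100):
    # Approximate largest eigenvector of A = [[0, G], [G^T, 0]]. Because the spectrum of A
    # is symmetric around zero, a diagonal shift by c >= sigma_max makes the +sigma_max
    # eigenvalue dominant; the Frobenius norm of G is a valid (crude) choice for c.
    n, m = G.shape
    c = np.linalg.norm(G)
    v = np.ones(n + m) / np.sqrt(n + m)
    for _ in range(n_iter):
        v = block_matvec(G, v) + c * v
        v /= np.linalg.norm(v)
    return v  # first n entries: row/user part (hub scores); last m entries: column/item part
```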
4.2. Application to Matrix Completion and Low Norm Matrix Factorizations

For matrix factorization problems as they appear for example in recommender systems (Koren et al., 2009), our algorithm is particularly suitable as it retains the sparsity of the observations and constructs the solution in a factorized way. In the setting of a partially observed matrix such as in the Netflix case, the loss function f(X) only depends on the observed positions, which are very sparse, so ∇f(X), which is all we need for our algorithm, is also sparse.

We again suppose that we want to approximate a partially given matrix Y (let P be the set of known entries of the matrix) by a product X = UV^T such that some convex loss function f(X) is minimized. By T we denote the unknown test entries of the matrix that we want to predict.

Our algorithm applies to any convex loss function on a low norm matrix factorization problem, and we will only mention two cases in particular. It directly applies to Maximum Margin Matrix Factorization (MMMF) (Srebro et al., 2004), whose original (soft margin) formulation is the trade-off formulation (1) with f(X) := Σ_{ij ∈ P} |X_ij - y_ij| being the hinge or l1-loss. Because this is not differentiable, the authors recommend using the differentiable smoothed hinge loss instead.

When using the standard squared loss function f(X) := Σ_{ij ∈ P} (X_ij - y_ij)^2, the problem is known as Regularized Matrix Factorization (Wu, 2007), and our algorithm directly applies. This loss function is widely used in practice, has a nice gradient structure, and is just the natural matrix generalization of the l2-loss (notice the analogous Lasso and regularized least squares formulations). The same function underlies the root mean squared error, which was the quality measure used in the Netflix competition. We write RMSE_train and RMSE_test for the root mean squared error on the training ratings P and the test ratings T respectively.

Running time and memory. From Theorem 3 we obtain that the running time of our Algorithm 2 is linear in the size of the input: each matrix-vector multiplication in Lanczos' or the power method costs exactly |P| (the number of observed positions of the matrix) operations, and we know that in total we need at most O(1/ε^2) many such matrix-vector multiplications. The same holds for the memory requirement: there is no need to store the entire factorization of X^(k) (meaning all the vectors v_k); instead we only update and store the prediction values X^(k)_ij for ij ∈ P ∪ T in each step. This, together with the known ratings y_ij, determines the sparse gradient matrix ∇f(X^(k)) during the algorithm. Therefore, the total memory requirement is only |P ∪ T| (the size of the output) plus the size n + m of a single feature vector v_k. A sketch of this bookkeeping is given after Lemma 4 below.

The constant C_f in the running time of Hazan's algorithm.

Lemma 4. For the squared error f(X) = (1/2) Σ_{ij ∈ P} (X_ij - y_ij)^2, it holds that C_{\hat{f}} ≤ 1.

Proof. It is known that the constant C_{\hat{f}} is upper bounded by the largest eigenvalue of the Hessian ∇^2 \hat{f}(Z) (here we consider \hat{f} as a function on vectors). One can directly compute that the diagonal entries of ∇^2 \hat{f}(Z) are 1 at the entries corresponding to P, and zero everywhere else, hence C_{\hat{f}} ≤ 1.
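The bookkeeping referred to above can be sketched as follows (toy index arrays; all names are ours): only the predictions on the observed training positions P and test positions T are stored, the rank-1 update of step k touches exactly those values, and the sparse gradient consists of the residuals on P.

```python
import numpy as np

# Hypothetical observed training positions P and held-out test positions T.
P_rows, P_cols, y_train = np.array([0, 1, 2]), np.array([1, 0, 2]), np.array([4.0, 3.0, 5.0])
T_rows, T_cols, y_test = np.array([0, 2]), np.array([2, 0]), np.array([2.0, 4.0])

def rank_one_update(pred, rows, cols, u, w, alpha, t):
    # X^(k+1) = (1 - alpha) X^(k) + alpha * t * u w^T, evaluated only on the stored positions,
    # where the unit feature vector v_k = (u; w) is split into its row and column parts.
    return (1 - alpha) * pred + alpha * t * u[rows] * w[cols]

def sparse_gradient(pred_train):
    # Non-zero entries of grad f(X): the residuals on the training positions only.
    return P_rows, P_cols, pred_train - y_train

def rmse(pred, y):
    return np.sqrt(np.mean((pred - y) ** 2))

pred_train = np.zeros(len(y_train))
pred_test = np.zeros(len(y_test))
```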
4.3. Two Improved Variants of Algorithm 1

The optimum on the line segment. Instead of fixing the step width to α_k := 1/k in Algorithm 1, the α_k ∈ [0, 1] of best improvement in the objective function f can be found by line search. (Hazan, 2008) has already proposed binary search to find better values for α_k. In many cases, however, we can even compute it analytically in a straightforward manner: consider f_α := f(Z^(k+1)(α)) = f(Z^(k) + α (v_k v_k^T - Z^(k))) and compute

    0 = \frac{\partial}{\partial \alpha} f_\alpha = \left\langle \nabla f\big(Z^{(k+1)}(\alpha)\big),\ v_k v_k^T - Z^{(k)} \right\rangle .                       (7)

If this equation can be solved for α, then the optimal such α_k can directly be used as the step size, and the convergence guarantee of Theorem 3 still holds. For the squared error f(X) = (1/2) Σ_{ij ∈ P} (X_ij - y_ij)^2, when we write \bar{v} for the approximate eigenvector v_k in step k, the optimality condition (7) is equivalent to

    \alpha_k = \frac{\sum_{ij \in P} (X_{ij} - y_{ij})(X_{ij} - \bar{v}_i \bar{v}_j)}{\sum_{ij \in P} (X_{ij} - \bar{v}_i \bar{v}_j)^2} .                        (8)
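For the squared loss, this closed-form step can be evaluated directly from the stored predictions on P; a short sketch (our own names; rank_one_train denotes the values \bar{v}_i \bar{v}_j on the training positions, including any trace scaling):

```python
import numpy as np

def optimal_step_size(pred_train, y_train, rank_one_train):
    # Minimizer of the squared loss along X(alpha) = X + alpha * (v v^T - X) on the observed
    # entries, as in (8): alpha = sum (X_ij - y_ij)(X_ij - v_i v_j) / sum (X_ij - v_i v_j)^2.
    diff = pred_train - rank_one_train
    denom = np.dot(diff, diff)
    if denom == 0.0:
        return 1.0
    alpha = np.dot(pred_train - y_train, diff) / denom
    return float(np.clip(alpha, 0.0, 1.0))   # stay on the feasible segment [0, 1]
```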
Immediate Feedback in the Power Method. As a second small improvement, we propose a heuristic to speed up the eigenvector computation in ApproxEV(-∇f(Z^(k)), ε'): instead of multiplying the current candidate vector v_k with the matrix ∇f(Z^(k)) in each power iteration, we multiply with (1/2)(∇f(Z^(k)) + ∇f(Z^(k) + α_k v_k v_k^T)), i.e. the average of the old and the new gradient. This means we immediately take into account the effect of the new feature vector v_k. This heuristic (which unfortunately does not fall under our current theoretical guarantee) is inspired by stochastic gradient descent as in Simon Funk's method, which we describe in the following.

4.4. Relation to Simon Funk's SVD Method

Interestingly, our proposed framework can also be seen as a theoretically justified variant of Simon Funk's (Webb, 2006) and related approximate SVD methods, which were used as a building block by most of the teams participating in the Netflix competition (including the winning team). Those methods have been further investigated by (Paterek, 2007; Takács et al., 2009) and also (Kurucz et al., 2007), who already proposed a heuristic using the HITS formulation. These approaches are algorithmically extremely similar to our method, although they are aimed at a slightly different optimization problem and do not directly guarantee bounded nuclear norm. Very recently, (Salakhutdinov & Srebro, 2010) observed that Funk's algorithm can be seen as stochastic gradient descent for optimizing (1) when the regularization term is replaced by a weighted variant of the nuclear norm.

Simon Funk's method considers the standard squared loss function f(X) = (1/2) Σ_{ij ∈ P} (X_ij - y_ij)^2, and finds the new rank-1 estimate (or feature) v by iterating

    v := v + \lambda \left( -\nabla \hat{f}(Z)\, v - K v \right) , \quad \text{or equivalently} \quad v := \lambda \left( -\nabla \hat{f}(Z) + \left( \tfrac{1}{\lambda} - K \right) I \right) v ,      (9)

a fixed number of times. Here λ is a small fixed constant called the learning rate. Additionally, a decay rate K > 0 is used for regularization, i.e. to penalize the magnitude of the resulting feature v. Clearly this matrix multiplication formulation (9) is equivalent to a step of the power method applied within our framework, and for small enough learning rates λ the resulting feature vector will converge to the largest eigenvector of -∇\hat{f}(Z). A further difference of our method to Simon Funk's lies in the stochastic gradient descent nature of the latter, i.e. its "immediate feedback": during each matrix multiplication, it already takes the modified current feature v into account when calculating the loss \hat{f}(Z), whereas our Algorithm 1 alters Z only after the eigenvector computation is finished.

However, in Funk's method the magnitude of each new feature strongly depends on the starting vector v_0, the number of iterations, the learning rate λ as well as the decay K, making the convergence very sensitive to these parameters. This might be one of the reasons that so far no results on the convergence speed could be obtained. Our method is free of these parameters; the k-th new feature vector is always a unit vector scaled by a fixed factor determined by the step size α_k = 1/k. Also, we keep the Frobenius norm ||U||_Fro^2 + ||V||_Fro^2 of the obtained factorization exactly fixed during the algorithm, whereas in Funk's method (which has a different optimization objective) this norm strictly increases with every newly added feature.

Our described framework therefore gives a solid theoretical foundation for a modified variant of the experimentally successful method (Webb, 2006) and its related variants such as (Kurucz et al., 2007; Paterek, 2007; Takács et al., 2009), with proved approximation quality and running time.

5. Experimental Results

We run our algorithm on the following standard datasets for matrix completion problems (available from www.grouplens.org and archive.ics.uci.edu/ml), using the squared error function.

    dataset          #ratings   n        m
    MovieLens 100k   10^5       943      1682
    MovieLens 1M     10^6       6040     3706
    MovieLens 10M    10^7       69878    10677
    Netflix          10^8       480189   17770

Any eigenvector method can be used as a black-box in our algorithm. To keep the experiments simple, we used the power method, starting with the uniform unit vector, and performed 0.2 · k power iterations in step k. (Half of the approximate eigenvalue corresponding to the previously obtained feature v_{k-1} was added to the matrix diagonal to ensure good convergence.) If not stated otherwise, the only optimization we used is the improvement by averaging the old and new gradient as explained in Section 4.3. All results were obtained by our (single-thread) implementation in Java 6 on a 2.4 GHz Intel C2D laptop.

Sensitivity. The generalization performance of our method is relatively stable under different choices of the regularization parameter, see Figure 1.

[Figure 1 (RMSE_test at k = 1000 versus the trace regularization t): Sensitivity of the method on the choice of the regularization parameter t in (2), on MovieLens 1M.]

Movielens. Table 1 reports the running times of our algorithm on the three MovieLens datasets. Our algorithm gives an about 5.6 fold speed increase over (Toh & Yun, 2009), which is a very similar method to (Ji & Ye, 2009). (Toh & Yun, 2009) already improves over the "singular value thresholding" methods (Cai et al., 2008) and (Ma et al., 2009). For MMMF, (Rennie & Srebro, 2005) report an optimization time of about 5 hours on the 1M dataset, but use the different smoothed hinge loss function, so that the results cannot be directly compared. (Ma et al., 2009), (Srebro & Jaakkola, 2003) and (Ji & Ye, 2009) only obtained results on much smaller datasets.
Table 1. Running times t_our (in seconds) of our algorithm on the three MovieLens datasets, compared to the reported timings t_TY of (Toh & Yun, 2009), who used Matlab/PROPACK on an Intel Xeon 3.20 GHz processor. The ratings {1, ..., 5} were used as-is and not normalized to any user and/or movie means. In accordance with (Toh & Yun, 2009), 50% of the ratings were used for training, the others were used as the test set. Here NMAE is the mean absolute error scaled by 1/(5 - 1), over the total set of ratings; k is the number of iterations of our algorithm, #mm is the total number of sparse matrix-vector multiplications performed, and tr is the used trace parameter t in (2).

    dataset   NMAE    t_TY    t_our    k    #mm   tr
    100k      0.205   7.39    0.156    15   33    9975
    1M        0.176   24.5    1.376    35   147   36060
    10M       0.164   202     36.10    65   468   281942

In all the following experiments we pre-normalized all training ratings to the simple average (µ_i + µ_j)/2 of the user and movie mean values, for the sake of being consistent with the comparable literature. For MovieLens 10M, we used partition r_b provided with the dataset (10 test ratings per user). The regularization parameter t was set to 48333. We obtained an RMSE_test of 0.8617 after k = 400 steps, in a total running time of 52 minutes (16291 matrix multiplications). Our best RMSE_test value was 0.8573, compared to 0.8543 obtained by (Lawrence & Urtasun, 2009) using their non-linear improvement of MMMF.

Algorithm Variants. Comparing the proposed algorithm variants from Section 4.3, Figure 2 demonstrates moderate improvements compared to our original Algorithm 2.

[Figure 2 (RMSE versus the number of steps k on MovieLens 10M, partition r_b; train and test curves for the fixed step size 1/k, the best step on the line segment, and the gradient interpolation variant): Improvements for the two algorithm variants described in Section 4.3, when running on MovieLens 10M.]

Netflix. Table 2 shows an about 13 fold speed increase of our method over the "Hard Impute" singular value thresholding algorithm of (Mazumder et al., 2009) on the Netflix dataset, where they used Matlab/PROPACK on an Intel Xeon 3 GHz processor.

Table 2. Running times t_our (in hours) of our algorithm on the Netflix dataset compared to the reported timings t_M of (Mazumder et al., 2009).

    RMSE_test   t_M    t_our   k     #mm    tr
    0.986       3.3    0.144   20    50     99592
    0.977       5.8    0.306   30    109    99592
    0.965       6.6    0.504   40    185    99592
    0.9478      n.a.   13.6    200   4165   99592

Note that the primary goal of this experimental section is not to compete with the prediction quality of the best engineered recommender systems (which are usually ensemble methods). We just demonstrate that our method solves nuclear norm regularized problems of the form (2) on large sample datasets, obtaining strong performance improvements.

6. Conclusion

We have introduced a new method to solve arbitrary convex problems with a nuclear norm regularization, which is simple to implement and parallelize and scales very well. The method is parameter-free and comes with a convergence guarantee. This is, to our knowledge, the first guaranteed convergence speed result for the class of Simon-Funk-type algorithms.

Further interesting questions include whether a similar algorithm could be used if a strict low-rank constraint as in (4), (5) is simultaneously applied; this corresponds to fixing the sparsity of a solution in the coreset setting. Also, it remains to investigate whether our algorithm can be applied to other matrix factorization problems such as (potentially only partially observed) kernel matrices as e.g. in PSVM (Chang et al., 2007), PCA or [p]LSA, because our method could exploit the even simpler form of f for symmetric matrices.

7. Acknowledgements

We would like to thank Arkadi Nemirovski and Elad Hazan for helpful discussions. M. Jaggi is supported by a Google Research Award and the Swiss National Science Foundation (SNF Project 20PA21-121957). M. Sulovský is supported by the Swiss National Science Foundation (SNF Project 200020-125027).
References

Arora, S., Hazan, E., and Kale, S. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. FOCS, pp. 339-348, 2005.

Berkhin, P. A survey on PageRank computing. Internet Mathematics, 2(1):73, 2005.

Cai, J.-F., Candes, E., and Shen, Z. A singular value thresholding algorithm for matrix completion. arXiv, math.OC, 2008.

Candes, E. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory (to appear). arXiv, cs.IT, 2009.

Chang, E., Zhu, K., Wang, H., Bai, H., Li, J., Qiu, Z., and Cui, H. PSVM: Parallelizing support vector machines on distributed computers. NIPS, 20:257-264, 2007.

Clarkson, K. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. SODA, 2008.

DeCoste, D. Collaborative prediction using ensembles of maximum margin matrix factorizations. ICML, 2006.

Fazel, M., Hindi, H., and Boyd, S. A rank minimization heuristic with application to minimum order system approximation. Proc. American Control Conference, 6:4734-4739, 2001.

Hazan, E. Sparse approximate solutions to semidefinite programs. LATIN, pp. 306-316, 2008.

Ji, S. and Ye, J. An accelerated gradient method for trace norm minimization. ICML, 2009.

Kleinberg, J. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 1999.

Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.

Kurucz, M., Benczur, A., and Csalogany, K. Methods for large scale SVD with missing values. SIGKDD, 2007.

Lawrence, N. and Urtasun, R. Non-linear matrix factorization with Gaussian processes. ICML, 2009.

Lin, C.-J. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756-2779, 2007.

Liu, Y.-J., Sun, D., and Toh, K.-C. An implementable proximal point algorithmic framework for nuclear norm minimization. Optimization Online, 2009.

Ma, S., Goldfarb, D., and Chen, L. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 2009.

Mazumder, R., Hastie, T., and Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. Submitted to JMLR, 2009.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372-376, 1983.

Paterek, A. Improving regularized singular value decomposition for collaborative filtering. SIGKDD, 2007.

Rennie, J. and Srebro, N. Fast maximum margin matrix factorization for collaborative prediction. ICML, 2005.

Salakhutdinov, R. and Srebro, N. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. arXiv, cs.LG, Feb 2010.

Srebro, N. and Jaakkola, T. Weighted low-rank approximations. ICML, 2003.

Srebro, N., Rennie, J., and Jaakkola, T. Maximum-margin matrix factorization. NIPS, 17:1329-1336, 2004.

Takács, G., Pilászy, I., Németh, B., and Tikk, D. Scalable collaborative filtering approaches for large recommender systems. JMLR, 10, 2009.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, pp. 267-288, 1996.

Toh, K. and Yun, S. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Optimization Online, 2009.

Webb, B. Netflix update: Try this at home. Simon Funk's personal blog, 2006. http://sifter.org/~simon/journal/20061211.html.

Wu, M. Collaborative filtering via ensembles of matrix factorizations. SIGKDD, 2007.