An Accelerated Gradient Method for Trace Norm Minimization

Shuiwang Ji shuiwang.ji@asu.edu
Jieping Ye jieping.ye@asu.edu
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA

Abstract

We consider the minimization of a smooth loss function regularized by the trace norm of the matrix variable. Such formulation finds applications in many machine learning tasks including multi-task learning, matrix classification, and matrix completion. The standard semidefinite programming formulation for this problem is computationally expensive. In addition, due to the non-smooth nature of the trace norm, the optimal first-order black-box method for solving such a class of problems converges as $O(\frac{1}{\sqrt{k}})$, where $k$ is the iteration counter. In this paper, we exploit the special structure of the trace norm, based on which we propose an extended gradient algorithm that converges as $O(\frac{1}{k})$. We further propose an accelerated gradient algorithm, which achieves the optimal convergence rate of $O(\frac{1}{k^2})$ for smooth problems. Experiments on multi-task learning problems demonstrate the efficiency of the proposed algorithms.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

1. Introduction

The problem of minimizing the rank of a matrix variable subject to certain constraints arises in many fields including machine learning, automatic control, and image compression. For example, in collaborative filtering we are given a partially filled rating matrix and the task is to predict the missing entries. Since it is commonly believed that only a few factors contribute to an individual's tastes, it is natural to approximate the given rating matrix by a low-rank matrix. However, the matrix rank minimization problem is NP-hard in general due to the combinatorial nature of the rank function.

A commonly-used convex relaxation of the rank function is the trace norm (nuclear norm) (Fazel et al., 2001), defined as the sum of the singular values of the matrix, since it is the convex envelope of the rank function over the unit ball of the spectral norm. A number of recent works have shown that the low-rank solution can be recovered exactly by minimizing the trace norm under certain conditions (Recht et al., 2008a; Recht et al., 2008b; Candès & Recht, 2008).

In practice, the trace norm relaxation has been shown to yield low-rank solutions, and it has been used widely in many scenarios. In (Srebro et al., 2005; Rennie & Srebro, 2005; Weimer et al., 2008a; Cai et al., 2008; Ma et al., 2008) the matrix completion problem was formulated as a trace norm minimization problem. In problems where multiple related tasks are learned simultaneously, the models for different tasks can be constrained to share certain information. Recently, this constraint has been expressed as trace norm regularization on the weight matrix in the context of multi-task learning (Abernethy et al., 2006; Argyriou et al., 2008; Abernethy et al., 2009; Obozinski et al., 2009), multi-class classification (Amit et al., 2007), and multivariate linear regression (Yuan et al., 2007; Lu et al., 2008). For two-dimensional data such as images, the matrix classification formulation (Tomioka & Aihara, 2007; Bach, 2008) applies a weight matrix, regularized by its trace norm, to the data. It was shown (Tomioka & Aihara, 2007) that such a formulation leads to improved performance over conventional methods.

A practical challenge in employing trace norm regularization is to develop efficient algorithms for solving the resulting non-smooth optimization problems. It is well known that the trace norm minimization problem can be formulated as a semidefinite program (Fazel et al., 2001; Srebro et al., 2005). However, such a formulation is computationally expensive. To overcome this limitation, a number of algorithms have been developed recently (Rennie & Srebro, 2005; Weimer et al., 2008a; Weimer et al., 2008b; Cai et al., 2008; Ma et al., 2008). In these algorithms some form of approximation is usually employed to deal with the non-smooth trace norm term, and a fast global convergence rate is difficult to guarantee.

Due to the non-smooth nature of the trace norm, a simple approach to these problems is the subgradient method (Bertsekas, 1999; Nesterov, 2003), which converges as $O(\frac{1}{\sqrt{k}})$, where $k$ is the iteration counter. It is known from the complexity theory of convex optimization (Nemirovsky & Yudin, 1983; Nesterov, 2003) that this convergence rate is already optimal for non-smooth optimization under the first-order black-box model, in which only function values and first-order derivatives are used.

In this paper we propose efficient algorithms with fast global convergence rates for trace norm regularized problems. Specifically, we show that by exploiting the special structure of the trace norm, the classical gradient method for smooth problems can be adapted to solve the trace norm regularized non-smooth problems. This results in an extended gradient algorithm with the same convergence rate of $O(\frac{1}{k})$ as that for smooth problems. Following Nesterov's method for accelerating the gradient method (Nesterov, 1983; Nesterov, 2003), we show that the extended gradient algorithm can be further accelerated to converge as $O(\frac{1}{k^2})$, which is the optimal convergence rate for smooth problems. Hence, the non-smoothness effect of the trace norm regularization is effectively removed. The proposed algorithms extend the algorithms in (Nesterov, 2007; Tseng, 2008; Beck & Teboulle, 2009) to the matrix case. Experiments on multi-task learning problems demonstrate the efficiency of the proposed algorithms in comparison with existing ones.
Note that while the present paper was under review, we became aware of a recent preprint by Toh and Yun (2009), who independently developed an algorithm similar to ours.

2. Problem Formulation

In this paper we consider the following problem:

$$\min_W F(W) = f(W) + \lambda \|W\|_*, \qquad (1)$$

where $W \in \mathbb{R}^{m \times n}$ is the decision matrix, $f(\cdot)$ represents the loss induced by some convex smooth (differentiable) loss function $\ell(\cdot, \cdot)$, and $\|\cdot\|_*$ denotes the trace norm, defined as the sum of the singular values. We assume that the gradient of $f(\cdot)$, denoted as $\nabla f(\cdot)$, is Lipschitz continuous with constant $L$, i.e.,

$$\|\nabla f(X) - \nabla f(Y)\|_F \le L \|X - Y\|_F, \quad \forall X, Y \in \mathbb{R}^{m \times n},$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Such a formulation arises in many machine learning tasks, including the following:

- Multi-task learning (Argyriou et al., 2008): $f(W) = \sum_{i=1}^{n} \sum_{j=1}^{s_i} \ell(y_i^j, w_i^T x_i^j)$, where $n$ is the number of tasks, $(x_i^j, y_i^j) \in \mathbb{R}^m \times \mathbb{R}$ is the $j$th sample in the $i$th task, $s_i$ is the number of samples in the $i$th task, and $W = [w_1, \cdots, w_n] \in \mathbb{R}^{m \times n}$.
- Matrix classification (Tomioka & Aihara, 2007; Bach, 2008): $f(W) = \sum_{i=1}^{s} \ell(y_i, \mathrm{Tr}(W^T X_i))$, where $(X_i, y_i) \in \mathbb{R}^{m \times n} \times \mathbb{R}$ is the $i$th sample.
- Matrix completion (Srebro et al., 2005; Candès & Recht, 2008; Recht et al., 2008a; Ma et al., 2008): $f(W) = \sum_{(i,j) \in \Omega} \ell(M_{ij}, W_{ij})$, where $M \in \mathbb{R}^{m \times n}$ is the partially observed matrix with the entries in $\Omega$ being observed.

Since the trace norm term in the objective function in Eq. (1) is non-smooth, a natural approach for solving this problem is the subgradient method, in which a sequence of approximate solutions is generated as

$$W_k = W_{k-1} - \frac{1}{t_k} F'(W_{k-1}), \qquad (2)$$

where $W_k$ is the approximate solution at the $k$th iteration, $\frac{1}{t_k}$ is the step size, $F'(W) \in \partial F(W)$ is a subgradient of $F(W)$ at $W$, and $\partial F(W)$ denotes the subdifferential (Bertsekas, 1999; Nesterov, 2003) of $F(W)$ at $W$. It is known (Nesterov, 2003) that the subgradient method converges as $O(\frac{1}{\sqrt{k}})$, i.e.,

$$F(W_k) - F(W^*) \le \frac{c}{\sqrt{k}} \qquad (3)$$

for some constant $c$, where $W^* = \arg\min_W F(W)$. It is known from the complexity theory of convex optimization (Nemirovsky & Yudin, 1983; Nesterov, 2003) that this convergence rate is already optimal for non-smooth problems under the first-order black-box model. Hence, the convergence rate cannot be improved if a black-box model, which does not exploit any special structure of the objective function, is employed. We show in the following that by exploiting the structure of the trace norm, its non-smoothness can be effectively overcome, and the convergence rate of the algorithm for solving the trace norm regularized problem in Eq. (1) can be improved significantly.

3. An Extended Gradient Method

First, consider the minimization of the smooth loss function without the trace norm regularization:

$$\min_W f(W). \qquad (4)$$

It is known (Bertsekas, 1999) that the gradient step

$$W_k = W_{k-1} - \frac{1}{t_k} \nabla f(W_{k-1}) \qquad (5)$$

for solving this smooth problem can be reformulated equivalently as a proximal regularization of the linearized function $f(W)$ at $W_{k-1}$:

$$W_k = \arg\min_W P_{t_k}(W, W_{k-1}), \qquad (6)$$

where

$$P_{t_k}(W, W_{k-1}) = f(W_{k-1}) + \langle W - W_{k-1}, \nabla f(W_{k-1}) \rangle + \frac{t_k}{2} \|W - W_{k-1}\|_F^2, \qquad (7)$$

and $\langle A, B \rangle = \mathrm{Tr}(A^T B)$ denotes the matrix inner product. It has been shown (Nesterov, 2003) that the convergence rate of this algorithm is $O(\frac{1}{k})$. Note that the function $P_{t_k}$ defined in Eq. (7) can be considered as a linear approximation of the function $f$ at the point $W_{k-1}$, regularized by a quadratic proximal term.

Based on this equivalence relationship, we propose to solve the optimization problem in Eq. (1) by the following iterative step:

$$W_k = \arg\min_W Q_{t_k}(W, W_{k-1}) \triangleq P_{t_k}(W, W_{k-1}) + \lambda \|W\|_*. \qquad (8)$$

A key motivation for this formulation is that if the optimization problem in Eq. (8) can be easily solved by exploiting the structure of the trace norm, the convergence rate of the resulting algorithm is expected to be the same as that of the gradient method, since no approximation of the non-smooth term is employed. By ignoring terms that do not depend on $W$, the objective in Eq. (8) can be expressed equivalently as

$$\frac{t_k}{2} \left\| W - \left( W_{k-1} - \frac{1}{t_k} \nabla f(W_{k-1}) \right) \right\|_F^2 + \lambda \|W\|_*. \qquad (9)$$

It turns out that the minimization of the objective in Eq. (9) can be solved by first computing the singular value decomposition (SVD) of $W_{k-1} - \frac{1}{t_k} \nabla f(W_{k-1})$ and then applying soft-thresholding to the singular values. This is summarized in the following theorem (Cai et al., 2008).

Theorem 3.1. Let $C \in \mathbb{R}^{m \times n}$ and let $C = U \Sigma V^T$ be the SVD of $C$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ have orthonormal columns, $\Sigma \in \mathbb{R}^{r \times r}$ is diagonal, and $r = \mathrm{rank}(C)$. Then

$$T_\lambda(C) \triangleq \arg\min_W \left\{ \frac{1}{2} \|W - C\|_F^2 + \lambda \|W\|_* \right\} \qquad (10)$$

is given by $T_\lambda(C) = U \Sigma_\lambda V^T$, where $\Sigma_\lambda$ is diagonal with $(\Sigma_\lambda)_{ii} = \max\{0, \Sigma_{ii} - \lambda\}$.

The proof of this theorem is in the Appendix. The above discussion shows that the problem in Eq. (8) can be readily solved by SVD. Furthermore, we show in the following that if the step size $\frac{1}{t_k}$ of the gradient method is chosen properly, we can achieve the same convergence rate as in the smooth case, i.e., $O(\frac{1}{k})$, despite the presence of the non-smooth trace norm regularization.

3.1. Step Size Estimation

To choose an appropriate step size, we impose a condition on the relationship between the function values of $F$ and $Q_{t_k}$ at a certain point in Lemma 3.1. We show in Theorem 3.2 below that once this condition is satisfied at each step by choosing an appropriate step size, the convergence rate of the resulting sequence can be guaranteed.

Lemma 3.1. Let

$$p_\mu(Y) = \arg\min_X Q_\mu(X, Y), \qquad (11)$$

where $Q$ is defined in Eq. (8). Assume that the following inequality holds:

$$F(p_\mu(Y)) \le Q_\mu(p_\mu(Y), Y). \qquad (12)$$

Then for any $X \in \mathbb{R}^{m \times n}$ we have

$$F(X) - F(p_\mu(Y)) \ge \frac{\mu}{2} \|p_\mu(Y) - Y\|_F^2 + \mu \langle Y - X, p_\mu(Y) - Y \rangle. \qquad (13)$$

The proof of this lemma is in the Appendix. At each step of the algorithm we need to find an appropriate value for $\mu$ such that $W_k = p_\mu(W_{k-1})$ and the condition

$$F(W_k) \le Q_\mu(W_k, W_{k-1}) \qquad (14)$$

is satisfied. Note that since the gradient of $f(\cdot)$ is Lipschitz continuous with constant $L$, we have (Nesterov, 2003)

$$f(X) \le f(Y) + \langle X - Y, \nabla f(Y) \rangle + \frac{L}{2} \|X - Y\|_F^2, \quad \forall X, Y.$$

Hence, when $\mu \ge L$ we have

$$F(p_\mu(Y)) \le P_\mu(p_\mu(Y), Y) + \lambda \|p_\mu(Y)\|_* = Q_\mu(p_\mu(Y), Y).$$

This shows that the condition in Eq. (14) is always satisfied if the update rule

$$W_k = p_L(W_{k-1}) \qquad (15)$$

is applied. However, $L$ may not be known, or it may be expensive to compute in practice. We therefore employ the following step size estimation strategy to ensure the condition in Eq. (14): given an initial estimate $L_0$ of $L$, we increase this estimate by a multiplicative factor $\gamma > 1$ repeatedly until the condition in Eq. (14) is satisfied. This results in the extended gradient method in Algorithm 1 for solving the problem in Eq. (1).

Algorithm 1 Extended Gradient Algorithm
Initialize $L_0$, $\gamma$, $W_0 \in \mathbb{R}^{m \times n}$
Iterate:
1. Set $\bar{L} = L_{k-1}$
2. While $F(p_{\bar{L}}(W_{k-1})) > Q_{\bar{L}}(p_{\bar{L}}(W_{k-1}), W_{k-1})$, set $\bar{L} := \gamma \bar{L}$
3. Set $L_k = \bar{L}$ and $W_k = p_{L_k}(W_{k-1})$

Since the condition in Eq. (14) is always satisfied when $L_k \ge L$, we have

$$L_k \le \gamma L, \quad \forall k. \qquad (16)$$

Note that the sequence of function values generated by this algorithm is non-increasing, as

$$F(W_k) \le Q_{L_k}(W_k, W_{k-1}) \le Q_{L_k}(W_{k-1}, W_{k-1}) = F(W_{k-1}).$$

3.2. Convergence Analysis

We show in the following theorem that when the condition in Eq. (14) is satisfied at each iteration, the extended gradient algorithm converges as $O(\frac{1}{k})$.

Theorem 3.2. Let $\{W_k\}$ be the sequence generated by Algorithm 1. Then for any $k \ge 1$ we have

$$F(W_k) - F(W^*) \le \frac{\gamma L \|W_0 - W^*\|_F^2}{2k}, \qquad (17)$$

where $W^* = \arg\min_W F(W)$. The proof of this theorem is in the Appendix.

4. An Accelerated Gradient Method

It is known (Nesterov, 1983; Nesterov, 2003) that when the objective function is smooth, the gradient method can be accelerated to achieve the optimal convergence rate of $O(\frac{1}{k^2})$. It was shown recently (Nesterov, 2007; Tseng, 2008; Beck & Teboulle, 2009) that a similar scheme can be applied to accelerate optimization problems where the objective function consists of a smooth part and a non-smooth part, provided that the non-smooth part is "simple". In particular, it was shown that $\ell_1$-norm regularized problems can be accelerated even though they are not smooth. In this section we show that the extended gradient method in Algorithm 1 can also be accelerated to achieve the optimal convergence rate of smooth problems, even though the trace norm is not smooth. This results in the accelerated gradient method in Algorithm 2.

Algorithm 2 Accelerated Gradient Algorithm
Initialize $L_0$, $\gamma$, $W_0 = Z_1 \in \mathbb{R}^{m \times n}$, $\alpha_1 = 1$
Iterate:
1. Set $\bar{L} = L_{k-1}$
2. While $F(p_{\bar{L}}(Z_k)) > Q_{\bar{L}}(p_{\bar{L}}(Z_k), Z_k)$, set $\bar{L} := \gamma \bar{L}$
3. Set $L_k = \bar{L}$ and update

$$W_k = p_{L_k}(Z_k),$$
$$\alpha_{k+1} = \frac{1 + \sqrt{1 + 4\alpha_k^2}}{2}, \qquad (18)$$
$$Z_{k+1} = W_k + \left( \frac{\alpha_k - 1}{\alpha_{k+1}} \right) (W_k - W_{k-1}). \qquad (19)$$

4.1. Discussion

In the accelerated gradient method, two sequences $\{W_k\}$ and $\{Z_k\}$ are updated recursively. In particular, $W_k$ is the approximate solution at the $k$th step and $Z_k$ is called the search point (Nesterov, 1983; Nesterov, 2003), which is constructed as a linear combination of the latest two approximate solutions $W_{k-1}$ and $W_{k-2}$. The key difference between the extended and the accelerated algorithms is that the gradient step is performed at the current approximate solution $W_k$ in the extended algorithm, while it is performed at the search point $Z_k$ in the accelerated scheme. The idea of constructing the search point is motivated by the investigation of information-based complexity (Nemirovsky & Yudin, 1983; Nesterov, 2003), which reveals that for smooth problems the convergence rate of the gradient method is not optimal, and thus methods with a faster convergence rate should exist. The derivation of the search point is based on the concept of an estimate sequence, and more details can be found in (Nesterov, 2003). Note that the sequence $\{\alpha_k\}$ can be updated in many ways as long as certain conditions are satisfied (Nesterov, 2003). Indeed, it was shown in (Tseng, 2008) that other schemes of updating $\alpha_k$ can lead to better practical performance, though the theoretical convergence rate remains the same.
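As a concrete illustration of the proximal step in Theorem 3.1 and the backtracking search of Algorithm 1, the following is a minimal NumPy sketch (not the authors' implementation; the helper names `prox_trace_norm` and `extended_gradient` are illustrative assumptions):

```python
import numpy as np

def prox_trace_norm(C, lam):
    """Singular value shrinkage T_lam(C): the closed-form minimizer of
    0.5*||W - C||_F^2 + lam*||W||_* (Theorem 3.1)."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def extended_gradient(f, grad_f, lam, W0, L0=1.0, gamma=2.0, max_iter=100):
    """Algorithm 1 sketch: proximal gradient steps with a multiplicative
    backtracking search on the Lipschitz estimate L."""
    def F(W):  # full non-smooth objective of Eq. (1)
        return f(W) + lam * np.linalg.norm(W, 'nuc')
    def Q(W, Y, L):  # quadratic model P_L(W, Y) + lam*||W||_* from Eq. (8)
        return (f(Y) + np.sum(grad_f(Y) * (W - Y))
                + 0.5 * L * np.linalg.norm(W - Y, 'fro')**2
                + lam * np.linalg.norm(W, 'nuc'))
    W, L = W0, L0
    for _ in range(max_iter):
        while True:
            W_new = prox_trace_norm(W - grad_f(W) / L, lam / L)
            if F(W_new) <= Q(W_new, W, L) + 1e-12:  # tolerance for float ties
                break
            L *= gamma  # increase the estimate until Eq. (14) holds
        W = W_new
    return W
```

For $f(W) = \frac{1}{2}\|W - M\|_F^2$ the minimizer of Eq. (1) is exactly $T_\lambda(M)$, which gives a quick correctness check of the sketch.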
Note that the sequence of objective values generated by the accelerated scheme may increase. It can, however, be made non-increasing by a simple modification of the algorithm, as in (Nesterov, 2005).

4.2. Convergence Analysis

We show in the following that by performing the gradient step at the search point $Z_k$ instead of at the approximate solution $W_k$, the convergence rate of the gradient method can be accelerated to $O(\frac{1}{k^2})$. This result is summarized in the following theorem.

Theorem 4.1. Let $\{W_k\}$ and $\{Z_k\}$ be the sequences generated by Algorithm 2. Then for any $k \ge 1$ we have

$$F(W_k) - F(W^*) \le \frac{2 \gamma L \|W_0 - W^*\|_F^2}{(k+1)^2}. \qquad (20)$$

The proof of this theorem follows the same strategy as in (Beck & Teboulle, 2009) and is in the Appendix.

5. Experiments

We evaluate the proposed extended gradient method (EGM) and the accelerated gradient method (AGM) on four multi-task data sets. The yeast data set was derived from a yeast gene classification problem consisting of 14 tasks. The letters and digits data sets are handwritten words and digits data sets (Obozinski et al., 2009), which consist of 8 and 10 tasks, respectively. The dmoz data set is a text categorization data set obtained from DMOZ (http://www.dmoz.org/), in which each of the 10 tasks corresponds to one of the subcategories of the Arts category. For each data set we randomly sample 5% and 10% of the data from each task for training. To evaluate the efficiency of the proposed formulations, we also report the computation time of the multi-task feature learning (MFL) algorithm (Argyriou et al., 2008), as MFL involves an equivalent formulation. For all methods, we terminate the algorithms when the relative change in the objective is below $10^{-8}$, since the objective values of MFL and EGM/AGM are not directly comparable.

Table 1. Comparison of the three multi-task learning algorithms (EGM, AGM, and MFL) in terms of computation time (in seconds). In each case, the computation time reported is the time used to train the model for a given parameter value obtained by cross validation, and the average training time over ten random trials is reported.

Data set | Percentage | EGM    | AGM  | MFL
yeast    | 5%         | 2.24   | 0.34 | 2.33
yeast    | 10%        | 3.37   | 0.49 | 17.27
letters  | 5%         | 4.74   | 0.62 | 2.49
letters  | 10%        | 5.67   | 0.91 | 9.66
digits   | 5%         | 62.51  | 2.41 | 15.50
digits   | 10%        | 29.59  | 2.39 | 42.64
dmoz     | 5%         | 133.21 | 1.59 | 74.24
dmoz     | 10%        | 146.58 | 1.42 | 31.49

The average computation time over ten random trials for each method is reported in Table 1. We can observe that AGM is by far the most efficient method in all cases. The relative efficiency of EGM and MFL differs significantly across data sets, while the performance of AGM is very stable across problems.

Figure 1. The convergence of EGM and AGM on the yeast data set when 5% (left) and 10% (right) of the data are used for training (objective value versus iteration). On the first data set EGM and AGM take 1122 and 81 iterations, respectively, to converge, while on the second data set they take 773 and 108 iterations, respectively.
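The qualitative gap between the two methods can be reproduced on synthetic data. Below is a hedged NumPy sketch (the problem, sizes, and names are illustrative assumptions; it uses a fixed step size $1/L$ in place of the backtracking search):

```python
import numpy as np

# Toy trace norm regularized least squares: F(W) = 0.5*||A W - B||_F^2 + lam*||W||_*.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
B = A @ rng.standard_normal((20, 5)) @ rng.standard_normal((5, 15))  # low-rank target
lam = 1.0
L = np.linalg.norm(A.T @ A, 2)  # Lipschitz constant of the gradient of f

def shrink(C, tau):
    """Singular value shrinkage T_tau(C) from Theorem 3.1."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def F(W):
    return 0.5 * np.linalg.norm(A @ W - B, 'fro')**2 + lam * np.linalg.norm(W, 'nuc')

def grad(W):
    return A.T @ (A @ W - B)

K = 500
W0 = np.zeros((20, 15))

# EGM: proximal gradient step taken at the current iterate W_k.
W = W0
egm_obj = [F(W)]
for _ in range(K):
    W = shrink(W - grad(W) / L, lam / L)
    egm_obj.append(F(W))

# AGM: the same step taken at the search point Z_k (Eqs. (18)-(19)).
W_prev = W_curr = W0
Z, alpha = W0, 1.0
for _ in range(K):
    W_prev, W_curr = W_curr, shrink(Z - grad(Z) / L, lam / L)
    alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha**2)) / 2.0
    Z = W_curr + ((alpha - 1.0) / alpha_next) * (W_curr - W_prev)
    alpha = alpha_next
agm_obj = F(W_curr)
```

With $L$ set to the true Lipschitz constant, the EGM objective values are guaranteed to be non-increasing, while AGM typically reaches a given accuracy in far fewer iterations, mirroring Figure 1.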
In order to investigate the convergence behaviors of EGM and AGM, we plot the objective values of these two methods on the yeast data set in Figure 1. We can observe that in both cases AGM converges much faster than EGM, especially in early iterations. This is consistent with our theoretical results and confirms that the proposed accelerated scheme can approach the optimal objective value rapidly.

6. Conclusion and Discussion

In this paper we propose efficient algorithms to solve trace norm regularized problems. We show that by exploiting the special structure of the trace norm, the convergence rate of $O(\frac{1}{\sqrt{k}})$ for general non-smooth problems can be improved to $O(\frac{1}{k})$. We further show that this convergence rate can be accelerated to $O(\frac{1}{k^2})$ by employing Nesterov's method. Experiments on multi-task learning problems demonstrate the efficiency of the proposed algorithms. As pointed out in the paper, another important application of trace norm regularization is in matrix completion problems. We plan to apply the proposed formulations to this problem in the future. The proposed algorithms require the computation of an SVD, which may be computationally expensive for large-scale problems. We will investigate approximate SVD techniques in the future to further reduce the computational cost.

Appendix

Proof of Theorem 3.1

Proof. Since the objective function in Eq. (10) is strongly convex, a unique solution exists for this problem, and hence it remains to show that the solution is $T_\lambda(C)$. Recall that $Z \in \mathbb{R}^{m \times n}$ is a subgradient of a convex function $h: \mathbb{R}^{m \times n} \to \mathbb{R}$ at $X_0$ if

$$h(X) \ge h(X_0) + \langle Z, X - X_0 \rangle \qquad (21)$$

for any $X$. The set of subgradients of $h$ at $X_0$ is called the subdifferential of $h$ at $X_0$, and it is denoted as $\partial h(X_0)$. It is well known (Nesterov, 2003) that $W$ is the optimal solution to the problem in Eq. (10) if and only if $0 \in \mathbb{R}^{m \times n}$ is a subgradient of the objective function at $W$, i.e.,

$$0 \in W - C + \lambda \, \partial \|W\|_*. \qquad (22)$$

Let $W = P_1 \Delta P_2^T$ be the SVD of $W$, where $P_1 \in \mathbb{R}^{m \times s}$ and $P_2 \in \mathbb{R}^{n \times s}$ have orthonormal columns, $\Delta \in \mathbb{R}^{s \times s}$ is diagonal, and $s = \mathrm{rank}(W)$. It can be verified that (Bach, 2008; Recht et al., 2008a)

$$\partial \|W\|_* = \left\{ P_1 P_2^T + S : S \in \mathbb{R}^{m \times n}, \; P_1^T S = 0, \; S P_2 = 0, \; \|S\|_2 \le 1 \right\}, \qquad (23)$$

where $\|\cdot\|_2$ denotes the spectral norm of a matrix. Decompose the SVD of $C$ as $C = U_0 \Sigma_0 V_0^T + U_1 \Sigma_1 V_1^T$, where $U_0 \Sigma_0 V_0^T$ corresponds to the part of the SVD with singular values greater than $\lambda$. Then the SVD of $T_\lambda(C)$ is $T_\lambda(C) = U_0 (\Sigma_0 - \lambda I) V_0^T$, and thus

$$C - T_\lambda(C) = \lambda \left( U_0 V_0^T + S \right),$$

where $S = \frac{1}{\lambda} U_1 \Sigma_1 V_1^T$. It follows from the facts that $U_0^T S = 0$, $S V_0 = 0$, and $\|S\|_2 \le 1$ that

$$C - T_\lambda(C) \in \lambda \, \partial \|T_\lambda(C)\|_*,$$

which shows that $T_\lambda(C)$ is an optimal solution.

Proof of Lemma 3.1

Proof. Since both the loss function $f$ and the trace norm are convex, we have

$$f(X) \ge f(Y) + \langle X - Y, \nabla f(Y) \rangle,$$
$$\|X\|_* \ge \|p_\mu(Y)\|_* + \langle X - p_\mu(Y), g(p_\mu(Y)) \rangle,$$

where $g(p_\mu(Y)) \in \partial \|p_\mu(Y)\|_*$ is a subgradient of the trace norm at $p_\mu(Y)$. Summing the two inequalities (with the second multiplied by $\lambda$), we obtain

$$F(X) \ge f(Y) + \langle X - Y, \nabla f(Y) \rangle + \lambda \|p_\mu(Y)\|_* + \langle X - p_\mu(Y), \lambda g(p_\mu(Y)) \rangle. \qquad (24)$$

By combining the condition in Eq. (12), the result in Eq. (24), and the relation $Q_\mu(p_\mu(Y), Y) = P_\mu(p_\mu(Y), Y) + \lambda \|p_\mu(Y)\|_*$, we obtain

$$F(X) - F(p_\mu(Y)) \ge F(X) - Q_\mu(p_\mu(Y), Y)$$
$$\ge \langle X - p_\mu(Y), \nabla f(Y) + \lambda g(p_\mu(Y)) \rangle - \frac{\mu}{2} \|p_\mu(Y) - Y\|_F^2$$
$$= \mu \langle X - p_\mu(Y), Y - p_\mu(Y) \rangle - \frac{\mu}{2} \|p_\mu(Y) - Y\|_F^2$$
$$= \mu \langle Y - X, p_\mu(Y) - Y \rangle + \frac{\mu}{2} \|p_\mu(Y) - Y\|_F^2$$

for any $X$, where the first equality follows from the fact that $p_\mu(Y)$ is a minimizer of $Q_\mu(X, Y)$ as in Eq. (11), and thus

$$\nabla f(Y) + \mu (p_\mu(Y) - Y) + \lambda g(p_\mu(Y)) = 0.$$

This completes the proof of the lemma.

Proof of Theorem 3.2

Proof. Applying Lemma 3.1 with $(X = W^*, Y = W_n, \mu = L_{n+1})$ and making use of the fact that for any three matrices $A$, $B$, and $C$ of the same size

$$\|B - A\|_F^2 + 2 \langle B - A, A - C \rangle = \|B - C\|_F^2 - \|A - C\|_F^2, \qquad (25)$$

we obtain

$$\frac{2}{L_{n+1}} \left( F(W^*) - F(W_{n+1}) \right) \ge \|W_{n+1} - W^*\|_F^2 - \|W_n - W^*\|_F^2.$$

Summing this inequality over $n = 0, \cdots, k-1$ and making use of the inequality in Eq. (16), we get

$$\frac{2}{\gamma L} \sum_{n=0}^{k-1} \left( F(W_{n+1}) - F(W^*) \right) \le \|W_0 - W^*\|_F^2 - \|W_k - W^*\|_F^2.$$

It follows from $F(W_{n+1}) \le F(W_n)$ and $F(W_n) \ge F(W^*)$ that

$$k \left( F(W_k) - F(W^*) \right) \le \sum_{n=0}^{k-1} \left( F(W_{n+1}) - F(W^*) \right) \le \frac{\gamma L}{2} \|W_0 - W^*\|_F^2,$$

which leads to Eq. (17).

Proof of Theorem 4.1

Proof. Denote

$$v_k = F(W_k) - F(W^*), \qquad U_k = \alpha_k W_k - (\alpha_k - 1) W_{k-1} - W^*.$$

Applying Lemma 3.1 with $(X = W_k, Y = Z_{k+1}, \mu = L_{k+1})$ and $(X = W^*, Y = Z_{k+1}, \mu = L_{k+1})$, respectively, we obtain the following two inequalities:

$$\frac{2}{L_{k+1}} (v_k - v_{k+1}) \ge \|W_{k+1} - Z_{k+1}\|_F^2 + 2 \langle W_{k+1} - Z_{k+1}, Z_{k+1} - W_k \rangle, \qquad (26)$$
$$-\frac{2}{L_{k+1}} v_{k+1} \ge \|W_{k+1} - Z_{k+1}\|_F^2 + 2 \langle W_{k+1} - Z_{k+1}, Z_{k+1} - W^* \rangle. \qquad (27)$$

Multiplying both sides of Eq. (26) by $(\alpha_{k+1} - 1)$ and adding it to Eq. (27), we get

$$\frac{2}{L_{k+1}} \left( (\alpha_{k+1} - 1) v_k - \alpha_{k+1} v_{k+1} \right) \ge \alpha_{k+1} \|W_{k+1} - Z_{k+1}\|_F^2 + 2 \langle W_{k+1} - Z_{k+1}, \alpha_{k+1} Z_{k+1} - (\alpha_{k+1} - 1) W_k - W^* \rangle.$$

Multiplying this inequality by $\alpha_{k+1}$ and making use of the equality $\alpha_k^2 = \alpha_{k+1}^2 - \alpha_{k+1}$ derived from Eq. (18), we get

$$\frac{2}{L_{k+1}} \left( \alpha_k^2 v_k - \alpha_{k+1}^2 v_{k+1} \right) \ge \|\alpha_{k+1} (W_{k+1} - Z_{k+1})\|_F^2 + 2 \alpha_{k+1} \langle W_{k+1} - Z_{k+1}, \alpha_{k+1} Z_{k+1} - (\alpha_{k+1} - 1) W_k - W^* \rangle.$$

Applying the equality in Eq. (25) to the right-hand side, we get

$$\frac{2}{L_{k+1}} \left( \alpha_k^2 v_k - \alpha_{k+1}^2 v_{k+1} \right) \ge \|\alpha_{k+1} W_{k+1} - (\alpha_{k+1} - 1) W_k - W^*\|_F^2 - \|\alpha_{k+1} Z_{k+1} - (\alpha_{k+1} - 1) W_k - W^*\|_F^2.$$

It follows from Eq. (19) and the definition of $U_k$ that

$$\frac{2}{L_{k+1}} \left( \alpha_k^2 v_k - \alpha_{k+1}^2 v_{k+1} \right) \ge \|U_{k+1}\|_F^2 - \|U_k\|_F^2,$$

which, combined with $L_{k+1} \ge L_k$, leads to

$$\frac{2}{L_k} \alpha_k^2 v_k - \frac{2}{L_{k+1}} \alpha_{k+1}^2 v_{k+1} \ge \|U_{k+1}\|_F^2 - \|U_k\|_F^2. \qquad (28)$$

Applying Lemma 3.1 with $(X = W^*, Y = Z_1, \mu = L_1)$, we obtain

$$F(W^*) - F(W_1) = F(W^*) - F(p_{L_1}(Z_1)) \ge \frac{L_1}{2} \|p_{L_1}(Z_1) - Z_1\|_F^2 + L_1 \langle Z_1 - W^*, p_{L_1}(Z_1) - Z_1 \rangle$$
$$= \frac{L_1}{2} \|W_1 - Z_1\|_F^2 + L_1 \langle Z_1 - W^*, W_1 - Z_1 \rangle = \frac{L_1}{2} \left( \|W_1 - W^*\|_F^2 - \|Z_1 - W^*\|_F^2 \right).$$

Hence, we have

$$\frac{2}{L_1} v_1 \le \|Z_1 - W^*\|_F^2 - \|W_1 - W^*\|_F^2. \qquad (29)$$

It follows from Eqs. (28) and (29), together with $Z_1 = W_0$ and $U_1 = W_1 - W^*$, that

$$\frac{2}{L_k} \alpha_k^2 v_k \le \|W_0 - W^*\|_F^2,$$

which, combined with $\alpha_k \ge (k+1)/2$ and $L_k \le \gamma L$, yields

$$F(W_k) - F(W^*) \le \frac{2 L_k \|W_0 - W^*\|_F^2}{(k+1)^2} \le \frac{2 \gamma L \|W_0 - W^*\|_F^2}{(k+1)^2}.$$

This completes the proof of the theorem.
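The optimality argument in the proof of Theorem 3.1 can be checked numerically: since the objective in Eq. (10) is strongly convex, $T_\lambda(C)$ must dominate every nearby point, and by Eq. (23) the residual $C - T_\lambda(C)$ must have spectral norm at most $\lambda$. A small NumPy sketch (illustrative only, not part of the original material):

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((8, 6))
lam = 0.7

# T_lam(C): soft-threshold the singular values (Theorem 3.1).
U, s, Vt = np.linalg.svd(C, full_matrices=False)
T = U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def obj(W):
    return 0.5 * np.linalg.norm(W - C, 'fro')**2 + lam * np.linalg.norm(W, 'nuc')

# Residual lies in lam * (subdifferential of the trace norm), Eq. (23),
# so its spectral norm cannot exceed lam.
assert np.linalg.norm(C - T, 2) <= lam + 1e-9

# Strong convexity: any perturbation of the minimizer increases the objective.
base = obj(T)
for _ in range(100):
    D = rng.standard_normal(C.shape)
    assert obj(T + 1e-3 * D / np.linalg.norm(D)) >= base - 1e-12
```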
Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, NIH R01-HG002516, and NGA HM1582-08-1-0016.

References

Abernethy, J., Bach, F., Evgeniou, T., & Vert, J.-P. (2006). Low-rank matrix factorization with attributes (Technical Report N24/06/MM). Ecole des Mines de Paris.

Abernethy, J., Bach, F., Evgeniou, T., & Vert, J.-P. (2009). A new approach to collaborative filtering: Operator estimation with spectral regularization. J. Mach. Learn. Res., 10, 803-826.

Amit, Y., Fink, M., Srebro, N., & Ullman, S. (2007). Uncovering shared structures in multiclass classification. In Proceedings of the International Conference on Machine Learning, 17-24.

Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73, 243-272.

Bach, F. R. (2008). Consistency of trace norm minimization. J. Mach. Learn. Res., 9, 1019-1048.

Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183-202.

Bertsekas, D. P. (1999). Nonlinear programming. 2nd edition. Athena Scientific.

Cai, J.-F., Candès, E. J., & Shen, Z. (2008). A singular value thresholding algorithm for matrix completion (Technical Report 08-77). UCLA Computational and Applied Math.

Candès, E. J., & Recht, B. (2008). Exact matrix completion via convex optimization (Technical Report 08-76). UCLA Computational and Applied Math.

Fazel, M., Hindi, H., & Boyd, S. P. (2001). A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 4734-4739.

Lu, Z., Monteiro, R. D. C., & Yuan, M. (2008). Convex optimization methods for dimension reduction and coefficient estimation in multivariate linear regression. Submitted to Mathematical Programming.

Ma, S., Goldfarb, D., & Chen, L. (2008). Fixed point and Bregman iterative methods for matrix rank minimization (Technical Report 08-78). UCLA Computational and Applied Math.

Nemirovsky, A. S., & Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. John Wiley & Sons Ltd.

Nesterov, Y. (1983). A method for solving a convex programming problem with convergence rate O(1/k^2). Soviet Math. Dokl., 27, 372-376.

Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103, 127-152.

Nesterov, Y. (2007). Gradient methods for minimizing composite objective function (Technical Report 2007/76). CORE, Université catholique de Louvain.

Obozinski, G., Taskar, B., & Jordan, M. I. (2009). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing. In press.

Recht, B., Fazel, M., & Parrilo, P. (2008a). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Submitted to SIAM Review.

Recht, B., Xu, W., & Hassibi, B. (2008b). Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization. In Proceedings of the 47th IEEE Conference on Decision and Control, 3065-3070.

Rennie, J. D. M., & Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the International Conference on Machine Learning, 713-719.

Srebro, N., Rennie, J. D. M., & Jaakkola, T. S. (2005). Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, 1329-1336.

Toh, K.-C., & Yun, S. (2009). An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Preprint, Department of Mathematics, National University of Singapore, March 2009.

Tomioka, R., & Aihara, K. (2007). Classifying matrices with a spectral regularization. In Proceedings of the International Conference on Machine Learning, 895-902.

Tseng, P. (2008). On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization.

Weimer, M., Karatzoglou, A., Le, Q., & Smola, A. (2008a). COFIrank - maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems, 1593-1600.

Weimer, M., Karatzoglou, A., & Smola, A. (2008b). Improving maximum margin matrix factorization. Machine Learning, 72, 263-276.

Yuan, M., Ekici, A., Lu, Z., & Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B, 69, 329-346.