Stochastic Methods for ℓ1 Regularized Loss Minimization

Shai Shalev-Shwartz  SHAI@TTI-C.ORG
Ambuj Tewari  TEWARI@TTI-C.ORG
Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, IL 60637, USA

Abstract

We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration, while the second method updates the entire weight vector but uses only a single training example at each iteration. In both methods, the choice of feature/example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.

1. Introduction

We present optimization procedures for solving problems of the form:

  $\min_{w \in \mathbb{R}^d} \; \frac{1}{m} \sum_{i=1}^{m} L(\langle w, x_i \rangle, y_i) + \lambda \|w\|_1$ ,   (1)

where $(x_1, y_1), \ldots, (x_m, y_m) \in ([-1,+1]^d \times \mathcal{Y})^m$ is a sequence of training examples, $L : \mathbb{R} \times \mathcal{Y} \to [0,\infty)$ is a non-negative loss function, and λ > 0 is a regularization parameter. This generic problem includes as special cases the Lasso (Tibshirani, 1996), in which $L(a,y) = \frac{1}{2}(a-y)^2$, and logistic regression, in which $L(a,y) = \log(1 + \exp(-ya))$. Our methods can also be adapted to deal with additional box constraints of the form $w_i \in [a_i, b_i]$, which enables us to utilize them for solving the dual problem of the Support Vector Machine (Cristianini & Shawe-Taylor, 2000). For concreteness, and due to lack of space, we focus on the formulation given in Eq. (1).

Throughout the paper, we assume that L is convex w.r.t. its first argument. This implies that Eq. (1) is a convex optimization problem, which can therefore be solved using standard optimization techniques, such as interior point methods. However, standard methods scale poorly with the size of the problem (i.e., m and d). In recent years, machine learning methods have proliferated in data-laden domains such as text and web processing, in which data sets of millions of training examples or features are not uncommon. Since traditional methods for solving Eq. (1) generally scale very poorly with the size of the problem, their usage in data-laden domains is inappropriate. In this paper, we discuss how to overcome this difficulty using stochastic methods, and we describe and analyze two practical methods for solving Eq. (1) even when the size of the problem is very large.

The first method we propose is a stochastic version of the familiar coordinate descent approach. The coordinate descent approach for solving ℓ1 regularized problems is not new (as we survey below in Section 1.1). At each iteration of coordinate descent, a single element of w is updated. The only twist we propose here concerns the way one should choose the next feature to update: we suggest choosing features uniformly at random from the set $[d] = \{1, \ldots, d\}$. This simple modification enables us to show that the runtime required to achieve (expected) accuracy ε is upper bounded by

  $O\!\left( \frac{\beta\, m\, d\, \|w^\star\|_2^2}{\epsilon} \right)$ ,   (2)

where β is a constant that depends only on the loss function (e.g., β = 1 for the quadratic loss) and $w^\star$ is the optimal solution.
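For concreteness, here is a minimal NumPy sketch (our illustration; the function names and the dense data layout are our own assumptions) that evaluates the objective of Eq. (1) for the two running examples of losses:

```python
import numpy as np

def squared_loss(a, y):
    # Lasso loss: L(a, y) = (1/2) (a - y)^2
    return 0.5 * (a - y) ** 2

def logistic_loss(a, y):
    # Logistic loss: L(a, y) = log(1 + exp(-y a)); log1p improves stability
    return np.log1p(np.exp(-y * a))

def l1_objective(w, X, y, lam, loss=squared_loss):
    """Objective of Eq. (1): average loss over the m examples plus lam * ||w||_1."""
    margins = X @ w  # <w, x_i> for all i
    return loss(margins, y).mean() + lam * np.abs(w).sum()
```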
This bound tells us that the runtime grows only linearly with the size of the problem. Furthermore, the stochastic method we propose is parameter-free and very simple to implement.

Another well known stochastic method, which has been successfully applied to loss minimization problems, is stochastic gradient descent (e.g., Bottou & LeCun, 2005; Shalev-Shwartz et al., 2007). In stochastic gradient descent, at each iteration we pick one example from the training set, uniformly at random, and update the weight vector based on the chosen example. The attractiveness of stochastic gradient descent methods is that their runtime does not depend at all on the number of examples, and can even sometimes decrease with the number of examples (see Bottou & Bousquet, 2008; Shalev-Shwartz & Srebro, 2008). Regretfully, the stochastic gradient descent method fails to produce sparse solutions, which makes the algorithm both slower and less attractive, as sparsity is one of the major reasons to use ℓ1 regularization. To overcome this obstacle, two variants were recently proposed. First, Duchi et al., 2008 suggested replacing the ℓ1 regularization term with a constraint of the form $\|w\|_1 \le B$, and then using a stochastic gradient projection procedure. Another solution, which uses the regularization form given in Eq. (1), was proposed by Langford et al., 2009 and is called truncated gradient descent. In this approach, the elements of w that cross 0 after the stochastic gradient step are truncated to 0, hence sparsity is achieved. The disadvantage of both the Duchi et al., 2008 and Langford et al., 2009 methods is that, in some situations, their runtime might grow quadratically with the dimension d, even if the optimal predictor $w^\star$ is very sparse (see Section 1.1 below for details). This quadratic dependence on d can be avoided if one uses mirror descent updates (Beck & Teboulle, 2003), such as the exponentiated gradient approach (Littlestone, 1988; Kivinen & Warmuth, 1997; Beck & Teboulle, 2003). However, this approach again fails to produce sparse solutions.

In this paper, we combine the idea of truncating the gradient (Langford et al., 2009) with another variant of stochastic mirror descent, which is based on p-norm updates (Grove et al., 2001; Gentile, 2003). The resulting algorithm both produces sparse solutions and has an $\tilde{O}(d)$ dependence on the dimension. We call the algorithm SMIDAS, for "Stochastic MIrror Descent Algorithm made Sparse". We provide runtime guarantees for SMIDAS as well. In particular, for the logistic-loss and the squared-loss we obtain the following upper bound on the runtime to achieving expected accuracy ε:

  $O\!\left( \frac{d\, \log(d)\, \|w^\star\|_1^2}{\epsilon^2} \right)$ .   (3)

Comparing the above with the runtime bound of the stochastic coordinate descent method given in Eq. (2), we note three major differences. First, while the bound in Eq. (2) depends on the number of examples, m, the runtime of SMIDAS does not depend on m at all. On the flip side, the dependence of stochastic coordinate descent on the dimension is better, both because of the lack of the term log(d) and because $\|w^\star\|_2^2$ is always smaller than $\|w^\star\|_1^2$ (the ratio is at most d). Last, the dependence on 1/ε is linear in Eq. (2) and quadratic in Eq. (3). If ε is of the same order as the objective value at $w^\star$, it is possible to improve the dependence on 1/ε; we omit the better bound due to lack of space. Finally, we would like to point out that while the stochastic coordinate descent method is parameter-free, the success of SMIDAS, and of the method of Langford et al., 2009, depends on a careful tuning of a learning rate parameter.

1.1. Related Work

We now survey several existing methods and, in particular, show how our stochastic twist enables us to give superior runtime guarantees.

Coordinate descent methods for ℓ1 regularization. Following the Gauss-Seidel approach of Zhang & Oles, 2001, Genkin et al., 2007 described a coordinate descent method (called BBR) for minimizing ℓ1 regularized objectives. This approach is similar to our method, with three main differences. First, and most important, at each iteration we choose a coordinate uniformly at random. This allows us to provide theoretical runtime guarantees.
We note that no theoretical guarantees are provided by Zhang & Oles, 2001, or Genkin et al., 2007. Second, we use only gradient information, which makes our algorithm parameter-free and extremely simple to implement. In contrast, the Gauss-Seidel approach is more complicated and involves second order information, or a line search procedure, or a trust region Newton step. Last, the generality of our derivation allows us to tackle a more general problem; for example, it is easy to deal with additional box constraints.

Friedman et al., 2008 generalized the approach of Genkin et al., 2007 to include the case of elastic-net regularization. In a series of experiments, they observed that cyclical coordinate descent outperforms many alternative popular methods, such as LARS (Efron et al., 2004), an interior point method called l1lognet (Koh et al., 2007), and the Lasso Penalized Logistic (LPL) program (Wu & Lange, 2008). However, no theoretical guarantees are provided in Friedman et al., 2008 either. Our analysis can partially explain the experimental results of Friedman et al., 2008, since updating the coordinates in a cyclical order can in practice be very similar to stochastic updates.

Luo & Tseng, 1992 established a linear convergence result for coordinate descent algorithms. This convergence result tells us that, after an unspecified number of iterations, the algorithm converges very fast to the optimal solution. However, this analysis is useless in data-laden domains, as it can be shown that the initial unspecified number of iterations depends at least quadratically on the number of training examples. In an attempt to improve the dependence on the size of the problem, Tseng & Yun, 2009 recently studied other variants of block coordinate descent for optimizing 'smooth plus separable' objectives. In particular, ℓ1 regularized loss minimization (Eq. (1)) is of this form, provided that the loss function is smooth. The algorithm proposed by Tseng & Yun, 2009 is not stochastic. Translated to our notation, the runtime bound given in Tseng & Yun, 2009 is of order¹ $\frac{\beta\, m\, d^2\, \|w^\star\|_2^2}{\epsilon}$. This bound is inferior to our runtime bound for stochastic coordinate descent, given in Eq. (2), by a factor of the dimension d.

¹ To see this, note that the iteration bound in Equation (21) of Tseng & Yun, 2009 scales with $\frac{\|w^\star\|_2^2}{\nu\, \epsilon}$, and using Equation (25) in their Section 6, we can set the value of ν to be ν = 1/d (since in our case there are no linear constraints). The complexity bound now follows from the fact that the cost of each iteration is O(d m).

Coordinate descent methods for ℓ1 domain constraints. A different, but related, optimization problem is to minimize the loss, $\frac{1}{m}\sum_i L(\langle w, x_i \rangle, y_i)$, subject to a domain constraint of the form $\|w\|_1 \le B$. Many authors presented a forward greedy selection algorithm (a.k.a. Boosting) for this problem; we refer the reader to Frank & Wolfe, 1956; Zhang, 2003; Clarkson, 2008. These authors derived the upper bound $O(\|w^\star\|_1^2 / \epsilon)$ on the number of iterations required by this algorithm to find an ε-accurate solution.
Since at each iteration of the algorithm one needs to calculate the gradient of the loss at w, the runtime of each iteration is Θ(m d). Therefore, the total runtime becomes $O\!\left(\frac{m\, d\, \|w^\star\|_1^2}{\epsilon}\right)$. Note that this bound is better than the bound given by Tseng & Yun, 2009, since for any vector in $\mathbb{R}^d$ we have $\|w\|_1 \le \sqrt{d}\, \|w\|_2$. However, the boosting bound given above is still inferior to our bound given in Eq. (2), since $\|w^\star\|_1 \ge \|w^\star\|_2$. Furthermore, in the extreme case we have $\|w^\star\|_1^2 = d\, \|w^\star\|_2^2$, so our bound can be better than the boosting bound by a factor of d. It is possible to show (the proof is omitted due to lack of space) that the iteration bound (not runtime) of any algorithm cannot be smaller than $\Omega(\|w^\star\|_1^2 / \epsilon)$. This seems to imply that any deterministic method, which goes over the entire data at each iteration, will induce a runtime that is inferior to the runtime we derive for stochastic coordinate descent.

Stochastic Gradient Descent and Mirror Descent. Stochastic gradient descent (SGD) is considered to be one of the best methods for large scale loss minimization, when we measure how fast a method achieves a certain generalization error. This has been observed in experiments (Bottou, Web Page) and has also been analyzed theoretically by Bottou & Bousquet, 2008, and Shalev-Shwartz & Srebro, 2008. As mentioned before, one can apply SGD for solving Eq. (1); however, SGD fails to produce sparse solutions. Langford et al., 2009 proposed an elegant and simple modification of the SGD update rule that yields a variant of SGD with sparse intermediate solutions. They also provide bounds on the runtime of the resulting algorithm. In the general case (i.e., without assuming low objective relative to ε), their analysis implies the following runtime bound:

  $O\!\left( \frac{d\, \|w^\star\|_2^2\, X_2^2}{\epsilon^2} \right)$ ,   (4)

where $X_2^2 = \frac{1}{m}\sum_i \|x_i\|_2^2$ is the average squared norm of an instance. Comparing this bound with our bound in Eq. (3), we observe that neither bound dominates the other; their relative performance depends on properties of the training set and of the optimal solution $w^\star$. Specifically, if $w^\star$ has only $k \ll d$ non-zero elements and each $x_i$ is dense (say, $x_i \in \{-1,+1\}^d$), then the ratio between the above bound of SGD and the bound in Eq. (3) becomes $\frac{d}{k \log(d)} \gg 1$. On the other hand, if each $x_i$ has only k non-zeros while $w^\star$ is dense, then the ratio between the bounds can be $\frac{k}{d \log(d)} \ll 1$. Although the relative performance is data dependent, in most applications, if one prefers ℓ1 regularization over ℓ2 regularization, one should also believe that $w^\star$ is sparse, and thus our runtime bound in Eq. (3) is likely to be superior².

² One important exception is the large scale text processing application described in Langford et al., 2009, where the dimension is so large that ℓ1 regularization is used simply because a dense weight vector cannot be stored in memory.

The reader familiar with the online learning and mirror descent literature will not be surprised by the above discussion. Bounds that involve $\|w^\star\|_1$ and $\|x_i\|_\infty$, as in Eq. (3), are well known, and the relative performance discussed above was pointed out in the context of additive vs. multiplicative updates (see, e.g., Kivinen & Warmuth, 1997). However, the most popular algorithm for obtaining bounds of the form given in Eq. (3) is the EG approach (Kivinen & Warmuth, 1997), which involves the exponential potential, and this algorithm cannot yield intermediate sparse solutions. One of the contributions of this paper is to show that, with a different potential, called the p-norm potential, one can obtain the bound given in Eq. (3) while still enjoying sparse intermediate solutions.

2. Stochastic Coordinate Descent

In this section we describe and analyze a simple stochastic coordinate descent algorithm for solving ℓ1 regularized problems of the form given in Eq. (1). We first present the following equivalent optimization problem:

  $\min_{w \in \mathbb{R}^{2d}} \; \frac{1}{m} \sum_{i=1}^{m} L(\langle w, \hat{x}_i \rangle, y_i) + \lambda \sum_{j=1}^{2d} w_j \quad \text{s.t.} \quad w \ge 0$ ,   (5)

where $\hat{x}_i = [x_i; -x_i]$ is the concatenation of $x_i$ and $-x_i$. It is easy to verify that if $v \in \mathbb{R}^{2d}$ minimizes Eq. (5), then the vector $w \in \mathbb{R}^d$ defined by $w_i = v_i - v_{d+i}$ minimizes Eq. (1). Furthermore, if the objective of Eq. (5) at v is at most ε away from the optimum of Eq. (5), then the objective of Eq. (1) at w, with $w_i = v_i - v_{d+i}$, is at most ε away from the optimum of Eq. (1). Therefore, it suffices to present an algorithm for approximately solving Eq. (5), and the output of the algorithm immediately translates to an approximate solution of Eq. (1).
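The reduction itself is mechanical; the following short Python sketch (ours; the helper names are hypothetical) builds the duplicated representation $\hat{x}_i = [x_i; -x_i]$ and maps a non-negative solution of Eq. (5) back to a solution of Eq. (1):

```python
import numpy as np

def duplicate_features(X):
    # Each row x_i becomes x_hat_i = [x_i; -x_i]; an m x d matrix becomes m x 2d
    return np.hstack([X, -X])

def recover_w(v):
    # Map a non-negative v in R^{2d} back to w in R^d via w_i = v_i - v_{d+i},
    # so that <w, x_i> = <v, x_hat_i>
    d = v.shape[0] // 2
    return v[:d] - v[d:]
```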
To further simplify the presentation, and for the sake of generality, we derive and analyze the algorithm for optimization problems of the form

  $\min_{w \in \mathbb{R}^{2d}} R(w) \quad \text{s.t.} \quad w \ge 0$ ,   (6)

where $R : \mathbb{R}^{2d} \to \mathbb{R}$ is convex. If convenient, one may think of R as being the function

  $R(w) = \frac{1}{m} \sum_{i=1}^{m} L(\langle w, \hat{x}_i \rangle, y_i) + \lambda \sum_{j=1}^{2d} w_j$ .   (7)

We are now ready to present the stochastic coordinate descent algorithm. The algorithm initializes w to be 0. At each iteration, we pick a coordinate j uniformly at random from [2d]. Then, the derivative of R(w) w.r.t. the jth element of w, $g_j = (\nabla R(w))_j$, is calculated. For example, if R(w) is as defined in Eq. (7), then $g_j = \frac{1}{m} \sum_{i=1}^{m} L'(\langle w, \hat{x}_i \rangle, y_i)\, \hat{x}_{i,j} + \lambda$, where L' is the derivative of the loss function w.r.t. its first argument. Simple calculus yields

  $L'(a,y) = \begin{cases} a - y & \text{for the squared-loss} \\ \frac{-y}{1 + \exp(ya)} & \text{for the logistic-loss} \end{cases}$   (8)

Next, a step size is determined based on the value of $g_j$ and on a parameter of the loss function, denoted β, which is an upper bound on the second derivative of the loss. Again, for our running examples we have

  $\beta = \begin{cases} 1 & \text{for the squared-loss} \\ 1/4 & \text{for the logistic-loss} \end{cases}$   (9)

The step size is trimmed so as to ensure that, after performing the update, the constraint $w_j \ge 0$ is not violated. Finally, we update $w_j$ according to the calculated step size.

Algorithm 1 Stochastic Coordinate Descent (SCD)
  let w = 0
  for t = 1, 2, ...
    sample j uniformly at random from {1, ..., 2d}
    let g_j = (∇R(w))_j
    let η = max{−w_j, −g_j/β}
    let w_j = w_j + η
  end
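In code, Algorithm 1 is only a few lines. The sketch below (our own; a naive dense implementation that recomputes the margins at every step, deliberately ignoring the bookkeeping of Section 2.1) implements it for R of the form in Eq. (7):

```python
import numpy as np

def scd(X_hat, y, lam, loss_grad, beta, T, rng=np.random.default_rng(0)):
    """Algorithm 1 (SCD) for Eq. (7).

    X_hat: m x 2d matrix whose rows are x_hat_i = [x_i; -x_i];
    loss_grad(a, y): the derivative L' of Eq. (8);
    beta: upper bound on the second derivative of the loss, Eq. (9).
    """
    m, two_d = X_hat.shape
    w = np.zeros(two_d)
    for _ in range(T):
        j = rng.integers(two_d)             # coordinate chosen uniformly at random
        margins = X_hat @ w                 # <w, x_hat_i> for all i (naive O(md) step)
        g_j = loss_grad(margins, y) @ X_hat[:, j] / m + lam
        eta = max(-w[j], -g_j / beta)       # trimmed step keeps w_j >= 0
        w[j] += eta
    return w
```

For the squared-loss one would pass loss_grad = lambda a, y: a - y with beta = 1, matching Eq. (8) and Eq. (9); Section 2.1 removes the naive margin recomputation.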
2.1. Efficient Implementation

We now present an efficient implementation of Algorithm 1, assuming that R is of the form given in Eq. (7). The simple idea is to maintain a vector $z \in \mathbb{R}^m$ such that $z_i = \langle w, \hat{x}_i \rangle$. Once we have this vector, calculating $g_j$ on average requires O(s m) operations, where

  $s = \frac{|\{(i,j) : \hat{x}_{i,j} \neq 0\}|}{m\, d}$   (10)

is the fraction of non-zero entries in our training set (so that s m is the average number of non-zeros per feature). Concretely, we obtain the following procedure for the logistic-loss and the squared-loss.

Algorithm 2 SCD for Logistic-loss and Squared-loss
  let w = 0 ∈ ℝ^{2d}, z = 0 ∈ ℝ^m
  for t = 1, 2, ...
    sample j uniformly at random from {1, ..., 2d}
    let L' and β be as defined in Eq. (8) and Eq. (9)
    let $g_j = \frac{1}{m} \sum_{i : \hat{x}_{i,j} \neq 0} L'(z_i, y_i)\, \hat{x}_{i,j} + \lambda$
    let η = max{−w_j, −g_j/β}
    let w_j = w_j + η
    for all i s.t. $\hat{x}_{i,j} \neq 0$: let $z_i = z_i + \eta\, \hat{x}_{i,j}$
  end

2.2. Runtime Guarantee

Theorem 1. Let $R : \mathbb{R}^{2d} \to \mathbb{R}$ be a convex objective function and assume that there exists β > 0 such that for every vector w, scalar η, and index j we have

  $R(w + \eta e_j) \le R(w) + \eta\, (\nabla R(w))_j + \frac{\beta\, \eta^2}{2}$ .

Let $w^\star$ be a minimizer of Eq. (6) and let $w_o$ be the output of Algorithm 1 after performing T iterations. Then,

  $\mathbb{E}[R(w_o)] - R(w^\star) \le \frac{d\, \left( \beta\, \|w^\star\|_2^2 + 2 R(0) \right)}{2 T}$ .

Proof Sketch. We define the following 'double potential': $\Phi(w) = \frac{\beta}{2} \|w - w^\star\|_2^2 + R(w)$. The main component of the proof is to show that, for any vector w and index j, if η is defined as in Algorithm 1, then $\Phi(w) - \Phi(w + \eta e_j) \ge (w_j - w^\star_j)\, g_j$. We show this using the assumption on the smoothness of R and algebraic manipulations. Denote by $w_t$ the value of w at iteration t of the algorithm. Taking expectation of the aforementioned inequality w.r.t. the choice of j, we obtain $\mathbb{E}[\Phi(w_t) - \Phi(w_{t+1})] \ge \frac{1}{d}\, \mathbb{E}[\langle w_t - w^\star, \nabla R(w_t) \rangle]$. But since R is convex, the right-hand side of the above upper bounds $\mathbb{E}[\epsilon_t]/d$, where $\epsilon_t = R(w_t) - R(w^\star)$ is the sub-optimality at iteration t. Summing over t, and using the fact that $\epsilon_t$ is a monotonically non-increasing sequence, we obtain

  $T\, \mathbb{E}[\epsilon_{T+1}] \le \mathbb{E}\!\left[ \sum_t \epsilon_t \right] \le d\, (\Phi(w_1) - \Phi(w_{T+1}))$ ,

which concludes our proof.

Next, we specify the runtime bound for the case of ℓ1 regularized logistic regression and squared-loss. First, it is easy to show that, for R as defined in Eq. (7), if the second derivative of L is bounded by β, then the condition on R given in Theorem 1 holds. Additionally, for the logistic-loss we have $R(0) \le 1$. Therefore, for the logistic-loss, after performing $\frac{d\, (\frac{1}{4} \|w^\star\|_2^2 + 2)}{2 \epsilon}$ iterations of Algorithm 2, we have $\mathbb{E}[R(w_o)] - R(w^\star) \le \epsilon$. Since the average cost of each iteration is s m, with s as defined in Eq. (10), we end up with a total runtime of $\frac{s\, m\, d\, (\frac{1}{4} \|w^\star\|_2^2 + 2)}{2 \epsilon}$. For the squared-loss we have $R(0) = \frac{1}{m} \sum_i \frac{1}{2} y_i^2$. Assuming that the targets are normalized so that $R(0) \le 1$, a similar derivation yields the total runtime bound $\frac{s\, m\, d\, (\|w^\star\|_2^2 + 2)}{2 \epsilon}$.
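The incremental bookkeeping of Algorithm 2 is also straightforward to express in code. Below is a sketch (ours; the CSC sparse layout and the function name are our assumptions) for the logistic-loss, which keeps $z = \hat{X} w$ up to date so that each iteration touches only the non-zeros of a single column:

```python
import numpy as np
import scipy.sparse as sp

def scd_logistic(X_hat, y, lam, T, rng=np.random.default_rng(0)):
    """Algorithm 2 for the logistic loss (beta = 1/4); X_hat is m x 2d in CSC format."""
    m, two_d = X_hat.shape
    beta = 0.25
    w = np.zeros(two_d)
    z = np.zeros(m)                          # z_i = <w, x_hat_i>, maintained incrementally
    for _ in range(T):
        j = rng.integers(two_d)
        col = X_hat.getcol(j)                # only the non-zeros of column j
        rows, vals = col.indices, col.data
        lp = -y[rows] / (1.0 + np.exp(y[rows] * z[rows]))   # L'(z_i, y_i), Eq. (8)
        g_j = (lp @ vals) / m + lam
        eta = max(-w[j], -g_j / beta)
        w[j] += eta
        z[rows] += eta * vals                # single pass to keep z consistent
    return w
```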
3. Stochastic Mirror Descent made Sparse

In this section we describe our mirror descent approach for ℓ1 regularized loss minimization, which maintains intermediate sparse solutions. To simplify the notation throughout this section, we rewrite the problem in Eq. (1) using the notation

  $\min_{w \in \mathbb{R}^d} \; P(w), \qquad P(w) \equiv \underbrace{\frac{1}{m} \sum_{i=1}^{m} L(\langle w, x_i \rangle, y_i)}_{C(w)} + \lambda \|w\|_1$ .   (11)

Mirror descent algorithms maintain two weight vectors: a primal vector w and a dual vector θ. The connection between the two vectors is via a link function θ = f(w), where $f : \mathbb{R}^d \to \mathbb{R}^d$. The link function is invertible, and therefore $w = f^{-1}(\theta)$. In our mirror descent variant, we use the p-norm link function. That is, the jth element of f is

  $f_j(w) = \frac{\mathrm{sign}(w_j)\, |w_j|^{q-1}}{\|w\|_q^{q-2}}$ ,

where $\|w\|_q = (\sum_j |w_j|^q)^{1/q}$ and $q = p/(p-1)$. The inverse function is (see, e.g., Gentile, 2003)

  $f_j^{-1}(\theta) = \frac{\mathrm{sign}(\theta_j)\, |\theta_j|^{p-1}}{\|\theta\|_p^{p-2}}$ .   (12)

We first describe how mirror descent algorithms can be applied to the objective C(w) without the ℓ1 regularization term. At each iteration of the algorithm, we first sample a training example i uniformly at random from {1, ..., m}. We then estimate the gradient of C(w) by calculating the vector $v = L'(\langle w, x_i \rangle, y_i)\, x_i$. Note that the expectation of v over the random choice of i is $\mathbb{E}[v] = \nabla C(w)$; that is, v is an unbiased estimator of the gradient of C(w). Next, we update the dual vector according to θ = θ − η v. If the link function were the identity mapping, this step would be identical to the update of stochastic gradient descent. However, in our case f is not the identity function, and it is important to distinguish between θ and w; the above update of θ translates to an update of w by applying the link function $w = f^{-1}(\theta)$.

So far, we have ignored the additional ℓ1 regularization term. The simplest way to take this term into account is to also subtract from θ the gradient of the term λ‖w‖₁. (More precisely, since the ℓ1 norm is not differentiable, we use a subgradient of ‖w‖₁ instead, e.g., the vector whose jth element is sign(w_j), where we interpret sign(0) = 0.) Therefore, we could have redefined the update of θ to be $\theta_j = \theta_j - \eta\, (v_j + \lambda\, \mathrm{sign}(w_j))$. Regretfully, as noted in Langford et al., 2009, this update leads to a dense vector θ, which in turn leads to a dense vector w. The solution proposed in Langford et al., 2009 breaks the update into three phases. First, we let $\tilde{\theta} = \theta - \eta v$. Second, we let $\hat{\theta} = \tilde{\theta} - \eta \lambda\, \mathrm{sign}(\tilde{\theta})$. Last, if in the second step an element crossed the zero value, i.e., $\mathrm{sign}(\hat{\theta}_j) \neq \mathrm{sign}(\tilde{\theta}_j)$, then we truncate the jth element to zero. Intuitively, the goal of the first step is to decrease the value of C(w), and this is done by a (mirror) gradient step, while the goal of the second and third steps is to decrease the value of λ‖w‖₁. By truncating at zero, we make the value of ‖w‖₁ even smaller.

Algorithm 3 Stochastic Mirror Descent Algorithm made Sparse (SMIDAS)
  parameter: η > 0
  let p = 2 ln(d) and let f⁻¹ be as in Eq. (12)
  let θ = 0
  for t = 1, 2, ...
    let w = f⁻¹(θ)
    sample i uniformly at random from {1, ..., m}
    let v = L'(⟨w, x_i⟩, y_i) x_i   (L' is the derivative of L; see, e.g., Eq. (8))
    let θ̃ = θ − η v
    ∀j, let θ_j = sign(θ̃_j) max{0, |θ̃_j| − ηλ}
  end

3.1. Efficient Implementation

A naive implementation of Algorithm 3 leads to an O(d) runtime per iteration. We now present two improvements for two specific scenarios. First, assume that the data is sparse, that is, each $x_i$ contains only s non-zero elements on average (see Eq. (10)). In this case, we can implement each iteration of Algorithm 3 in time O(s). The idea is to maintain two additional scalars, $\alpha_1$ and $\alpha_2$, such that $\alpha_1 = \sum_j |\theta_j|^p$ and $\alpha_2 = \alpha_1^{1 - 2/p}$. Since $w_j = f_j^{-1}(\theta)$, we can then rewrite $w_j = \mathrm{sign}(\theta_j)\, |\theta_j|^{p-1} / \alpha_2$. Therefore, instead of maintaining w, we can maintain a vector z, where $z_j = \mathrm{sign}(\theta_j)\, |\theta_j|^{p-1}$, and the scalar $\alpha_2$. Clearly, we can calculate $\langle w, x_i \rangle$ based on z and $\alpha_2$ by making a single pass over the non-zero elements of $x_i$. Additionally, we can update θ, z, and $\alpha_1$ by making an additional single pass over the non-zero elements of $x_i$. Finally, we calculate $\alpha_2$ from $\alpha_1$ using O(1) operations. Therefore, each iteration of Algorithm 3 requires two passes over the non-zero elements of $x_i$, and the total runtime is O(s).

Next, we describe a different scenario, in which the main bottleneck is to calculate the values of the features. In this case, when calculating $\langle w, x_i \rangle$, we only need to calculate the features of $x_i$ for which $w_j \neq 0$. Since the algorithm maintains sparse intermediate solutions, this can be much faster than calculating all the features of $x_i$. Next, when updating θ, we note that if $\|x_i\|_\infty \le 1$ and $|L'(\langle w, x_i \rangle, y_i)| \le \lambda$, then any element of θ which is currently zero will remain zero after the update. So, again, in this case we do not need to calculate all the features of $x_i$.
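For reference, here is a direct O(d)-per-iteration rendering of Algorithm 3 in NumPy (our own sketch, with hypothetical names, and without the O(s) tricks just described; for the guarantee of Theorem 2 below, one would output the primal iterate of a uniformly random round rather than the last one):

```python
import numpy as np

def p_norm_link_inv(theta, p):
    """Eq. (12): f_j^{-1}(theta) = sign(theta_j) |theta_j|^{p-1} / ||theta||_p^{p-2}."""
    norm_p = np.linalg.norm(theta, ord=p)
    if norm_p == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm_p ** (p - 2)

def smidas(X, y, lam, eta, loss_grad, T, rng=np.random.default_rng(0)):
    """Algorithm 3 (SMIDAS): p-norm mirror step followed by truncation."""
    m, d = X.shape
    p = 2.0 * np.log(d)
    theta = np.zeros(d)
    for _ in range(T):
        w = p_norm_link_inv(theta, p)
        i = rng.integers(m)                       # sample one training example
        v = loss_grad(X[i] @ w, y[i]) * X[i]      # unbiased estimate of grad C(w)
        theta_tilde = theta - eta * v             # mirror (dual) gradient step
        # shrink by eta*lam and truncate entries that cross zero
        theta = np.sign(theta_tilde) * np.maximum(0.0, np.abs(theta_tilde) - eta * lam)
    return p_norm_link_inv(theta, p)
```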
3.2. Runtime Guarantee

We now provide runtime guarantees for Algorithm 3. We introduce two types of assumptions on the loss function:

  $|L'(a, y)| \le \rho$ ,   (13)

  $|L'(a, y)|^2 \le \rho\, L(a, y)$ .   (14)

In the above, L' is the derivative w.r.t. the first argument, and it can also be a subgradient of L if L is not differentiable. It is easy to verify that Eq. (14) holds for the squared-loss with ρ = 4, and that Eq. (13) holds for the hinge-loss, $L(a,y) = \max\{0, 1 - ya\}$, with ρ = 1. Interestingly, for the logistic-loss, both Eq. (13) holds with ρ = 1 and Eq. (14) holds with ρ = 1/2.
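These constants are easy to sanity-check numerically. The following small script (ours, purely illustrative) verifies Eq. (13) with ρ = 1 and Eq. (14) with ρ = 1/2 for the logistic-loss on a grid of values:

```python
import numpy as np

# Numerical sanity check of Eq. (13) (rho = 1) and Eq. (14) (rho = 1/2)
# for the logistic loss L(a, y) = log(1 + exp(-y a)).
a = np.linspace(-30.0, 30.0, 200001)
for y in (-1.0, 1.0):
    L = np.log1p(np.exp(-y * a))
    Lp = -y / (1.0 + np.exp(y * a))            # derivative w.r.t. the first argument
    assert np.all(np.abs(Lp) <= 1.0)           # Eq. (13) with rho = 1
    assert np.all(Lp ** 2 <= 0.5 * L + 1e-12)  # Eq. (14) with rho = 1/2
print("Eq. (13) and Eq. (14) hold on the grid.")
```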
Theorem 2. Let $w^\star$ be a minimizer of Eq. (11), and let L satisfy Eq. (13) or Eq. (14). Then, if Algorithm 3 is run with $\eta = \frac{\|w^\star\|_1}{\rho} \sqrt{\frac{2}{(p-1)\, e\, T}}$ for T iterations, and $w_o = w_r$ for r chosen uniformly at random from [T], we have

  $\mathbb{E}[P(w_o)] - P(w^\star) \le O\!\left( \rho\, \|w^\star\|_1 \sqrt{\frac{\log(d)}{T}} \right)$ .

Proof Sketch. Due to lack of space, we only sketch the proof for the case that Eq. (13) holds. Let $\theta_t, \tilde{\theta}_t$ be the values of $\theta, \tilde{\theta}$ at iteration t of the algorithm. Let $w_t = f^{-1}(\theta_t)$ and $\tilde{w}_t = f^{-1}(\tilde{\theta}_t)$. Define the Bregman divergence potential $\Psi(w) = \frac{1}{2} \|w^\star\|_q^2 - \frac{1}{2} \|w\|_q^2 - \langle f(w), w^\star - w \rangle$. We first rewrite $\Psi(w_t) - \Psi(w_{t+1}) = (\Psi(w_t) - \Psi(\tilde{w}_t)) + (\Psi(\tilde{w}_t) - \Psi(w_{t+1}))$. Standard analysis of mirror descent (see, e.g., Beck & Teboulle, 2003) yields

  $\Psi(w_t) - \Psi(\tilde{w}_t) \ge \eta\, \big( L(\langle w_t, x_i \rangle, y_i) - L(\langle w^\star, x_i \rangle, y_i) \big) - \frac{\eta^2 (p-1)}{2} \|v_t\|_p^2$ .

From Eq. (13) we obtain that $\|v_t\|_p^2 \le \rho^2\, d^{2/p} = \rho^2 e$. Thus,

  $\Psi(w_t) - \Psi(\tilde{w}_t) \ge \eta\, \big( L(\langle w_t, x_i \rangle, y_i) - L(\langle w^\star, x_i \rangle, y_i) \big) - \frac{\eta^2 (p-1)\, \rho^2 e}{2}$ .

The more involved part of the proof is to show that $\Psi(\tilde{w}_t) - \Psi(w_{t+1}) \ge \eta \lambda\, (\|w_{t+1}\|_1 - \|w^\star\|_1)$. To show this, we use the definition of Ψ and the non-negativity of the Bregman divergence to get that $\Psi(\tilde{w}_t) - \Psi(w_{t+1}) \ge \langle w_{t+1} - w^\star,\, \tilde{\theta}_t - \theta_{t+1} \rangle$. Using the definition of $\theta_{t+1}$ and $w_{t+1}$, we have $\langle w_{t+1},\, \tilde{\theta}_t - \theta_{t+1} \rangle = \eta \lambda\, \|w_{t+1}\|_1$.³ In addition, Hölder's inequality implies that $\langle w^\star,\, \tilde{\theta}_t - \theta_{t+1} \rangle \le \eta \lambda\, \|w^\star\|_1$. Combining all the above and taking expectation w.r.t. i, we get

  $\mathbb{E}[P(w_t)] - P(w^\star) \le \frac{1}{\eta}\, \mathbb{E}[\Psi(w_t) - \Psi(w_{t+1})] + \lambda\, \mathbb{E}[\|w_t\|_1 - \|w_{t+1}\|_1] + \frac{\eta\, (p-1)\, \rho^2 e}{2}$ .

Summing over t, rearranging, and optimizing over η concludes our proof.

³ Note that this equality does not hold for the Bregman potential corresponding to exponentiated gradient.

The bound in the above theorem can be improved if Eq. (14) holds and the desired accuracy ε is of the same order as $P(w^\star)$. We omit the details due to lack of space.

4. Experiments

We consider four datasets for our experiments: REUTERS, ARCENE, MAGIC04S, and MAGIC04D. REUTERS is a dataset obtained from the Reuters RCV1 collection. Examples in this collection can have more than one label; we created a binary classification dataset out of it by treating any example labelled CCAT as having label +1, and the rest as having label −1. This gave us a 378,452 dimensional dataset with 806,791 examples. There were 9.8 × 10⁷ non-zero entries in the example matrix, corresponding to a sparsity level of 0.03%. ARCENE is a dataset from the UCI Machine Learning Repository, where the task is to distinguish cancer patterns from normal ones based on 10,000 mass-spectrometric features. Of these, 3,000 features are synthetic, as this dataset was designed for the NIPS 2003 variable selection workshop. There are 100 examples in this dataset, and the example matrix contains 5.4 × 10⁵ non-zero entries, corresponding to a sparsity level of 54%. The datasets MAGIC04S and MAGIC04D were obtained by adding 1,000 random features to the MAGIC Gamma Telescope dataset from the UCI Machine Learning Repository. The original dataset has 19,020 examples with 10 features. This is also a binary classification dataset, and the task is to distinguish high-energy gamma particles from background using a gamma telescope. Following the experimental setup of Langford et al., 2009, we added 1,000 random features, each of which takes the value 0 with probability 0.95 or 1 with probability 0.05, to create a sparse dataset, MAGIC04S. We also created a dense dataset, MAGIC04D, in which the random features took the value −1 or +1, each with probability 0.5. MAGIC04S and MAGIC04D have sparsity levels of 5.81% and 100%, respectively.

We ran four algorithms on these datasets: SCD, DETCD, SMIDAS, and TruncGrad. SCD is the stochastic coordinate descent algorithm given in Section 2 above. DETCD is the corresponding deterministic version of the same algorithm, in which the coordinate to be updated at each iteration is chosen in a deterministic manner, to maximize a lower bound on the guaranteed decrease in the objective function. This type of deterministic criterion for choosing features is common in Boosting approaches. Since choosing a coordinate (or feature, in our case) in a deterministic manner involves significant computation for large datasets, we expect the deterministic algorithm to converge much more slowly than the stochastic algorithm. We also tried a deterministic coordinate descent algorithm that works with the domain constraint formulation discussed in Section 1.1; it was also slower than SCD, although we do not report its results here, as it does not work directly with the regularized loss. SMIDAS is the mirror descent algorithm given in Section 3 above. TruncGrad is the truncated gradient algorithm of Langford et al., 2009. (In fact, Langford et al., 2009 suggest another way to truncate the gradient; here, we refer to the variant corresponding to SMIDAS.) Of these four, the first two are parameter-free algorithms, while the latter two require a parameter η. In our experiments, we ran SMIDAS and TruncGrad for a range of different values of η and chose the one that yielded the minimum value of the objective function (i.e., the regularized loss).

[Figure 1: Performance of SCD, DETCD, SMIDAS, and TruncGrad on the four datasets (see Section 4 for details). The panels plot the ℓ1 regularized loss (and, for MAGIC04D, also the density of w) against the number of data accesses, for REUTERS (λ = 10⁻⁶), ARCENE (λ = 10⁻² and λ = 10⁻⁶), MAGIC04S (λ = 10⁻⁴), and MAGIC04D (λ = 10⁻⁴).]

The top-left plot in Figure 1 is for the REUTERS dataset. It shows the regularized loss objective achieved by the four algorithms as a function of the number of times they access the data matrix $(x_{i,j})$ of the REUTERS dataset. We choose to use this measure, as opposed to, say, CPU time, because it is implementation independent; moreover, the actual time taken by these algorithms will be roughly proportional to this quantity. The regularization parameter λ was set to 10⁻⁶ (but similar results hold for other values of λ as well).
The parameter η of SMIDAS and TruncGrad was searched over the range 10⁻⁶ to 10⁻¹ in exponentially increasing step sizes. It is clear that SCD outperforms the other three algorithms. DETCD is much slower than SCD because, as mentioned above, it spends a lot of time finding the best coordinate to update. The two algorithms having a tunable parameter do much worse here, and the situation is even worse if we add up the time needed to perform several runs of these algorithms for tuning η. This illustrates a problem practitioners have to deal with when faced with large datasets: a parameter-free algorithm that is also quick to decrease the objective function is clearly preferred in such situations.

The bottom-left and top-middle plots in Figure 1 are for the ARCENE dataset. We have plotted the regularized loss objective against the number of data matrix accesses for a small and a large value of λ (10⁻⁶ and 10⁻², respectively). SMIDAS does much better than TruncGrad for the large value, and is worse, but still competitive, for the small value. Note that SCD does well, and DETCD is quite slow, in both scenarios. For the MAGIC datasets, SMIDAS does much better than TruncGrad on the MAGIC04D dataset (where the example vectors are dense), while TruncGrad is competitive on the MAGIC04S dataset (where the example vectors are sparse). This is illustrated in the bottom-middle and bottom-right plots in Figure 1. Note that this behavior is consistent with the bounds in Eq. (3) and Eq. (4) given above: these bounds suggest that if the true solution has a low ℓ1 norm, SMIDAS will require fewer iterations than TruncGrad when the examples are dense. For MAGIC04D, we also plotted the density (the ℓ0 norm) of the weight vector w as a function of the number of accesses to the data matrix for all four algorithms; this is shown in the top-right plot of Figure 1. It is interesting to note that SCD not only minimizes the objective quickly, but also maintains sparsity along the way. If we compare SCD with TruncGrad in the two plots on the right, we find that TruncGrad is better at minimizing the objective, while SCD finds slightly sparser w's. All plots for the MAGIC datasets are for λ = 10⁻³.

5. Discussion

Focusing on the problem of ℓ1 regularized loss minimization, we showed how stochastic approaches can yield simple and practical methods for large data sets. Furthermore, the stochastic methods described in this paper outperform their corresponding deterministic approaches. To the best of our knowledge, the only known provably correct deterministic coordinate method for ℓ1 regularization is that of Tseng & Yun, 2009, which both has a worse runtime and is much more complicated to implement.⁴

⁴ The deterministic coordinate descent method for loss minimization with an ℓ1 constraint is perhaps the best deterministic competitor; but, as we showed, it is both theoretically and empirically inferior to our stochastic coordinate descent algorithm.

There are several possible extensions of this work. For example, the p-norm algorithm was originally suggested as an interpolation between additive and multiplicative updates; naturally, one can use the SMIDAS algorithm with different values of p, yielding an interpolation between Langford et al., 2009 and SMIDAS. Another possible extension of this work is to use our analysis for better understanding stochastic coordinate descent methods for optimizing the dual of Support Vector Machines.
References

Beck, A., & Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31, 167-175.

Bottou, L. (Web Page). Stochastic gradient descent examples. http://leon.bottou.org/projects/sgd.

Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. Advances in Neural Information Processing Systems 20 (pp. 161-168).

Bottou, L., & LeCun, Y. (2005). On-line learning for very large datasets. Applied Stochastic Models in Business and Industry, 21, 137-151.

Clarkson, K. (2008). Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 922-931).

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge University Press.

Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the ℓ1-ball for learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning (pp. 272-279).

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407-499.

Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3, 95-110.

Friedman, J., Hastie, T., & Tibshirani, R. (2008). Regularized paths for generalized linear models via coordinate descent (Technical Report). Department of Statistics, Stanford University.

Genkin, A., Lewis, D., & Madigan, D. (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics, 49, 291-304.

Gentile, C. (2003). The robustness of the p-norm algorithms. Machine Learning, 53, 265-299.

Grove, A. J., Littlestone, N., & Schuurmans, D. (2001). General convergence results for linear discriminant updates. Machine Learning, 43, 173-210.

Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1-64.

Koh, K., Kim, S., & Boyd, S. (2007). An interior-point method for large-scale ℓ1-regularized logistic regression. Journal of Machine Learning Research, 8, 1519-1555.

Langford, J., Li, L., & Zhang, T. (2009). Sparse online learning via truncated gradient. Advances in Neural Information Processing Systems 21 (pp. 905-912).

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285-318.

Luo, Z., & Tseng, P. (1992). On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72, 7-35.

Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Proceedings of the 24th International Conference on Machine Learning (pp. 807-814).

Shalev-Shwartz, S., & Srebro, N. (2008). SVM optimization: Inverse dependence on training set size. Proceedings of the 25th International Conference on Machine Learning (pp. 928-935).

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Tseng, P., & Yun, S. (2009). A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140, 513-535.

Wu, T. T., & Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2, 224-244.

Zhang, T. (2003). Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49, 682-691.

Zhang, T., & Oles, F. J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval, 4, 5-31.