A Quasi-Newton Approach to Nonsmooth Convex Optimization

Jin Yu (jin.yu@anu.edu.au)
S.V.N. Vishwanathan (svn.vishwanathan@nicta.com.au)
Simon Günter (guenter_simon@hotmail.com)
Nicol N. Schraudolph (nic@schraudolph.org)

Canberra Research Laboratory, NICTA, Canberra, Australia
Research School of Information Sciences & Engineering, Australian National University, Canberra, Australia

Abstract

We extend the well-known BFGS quasi-Newton method and its limited-memory variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We apply the resulting subLBFGS algorithm to L2-regularized risk minimization with binary hinge loss, and its direction-finding component to L1-regularized risk minimization with logistic loss. In both settings our generic algorithms perform comparably to, or better than, their counterparts in specialized state-of-the-art solvers.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

The (L)BFGS quasi-Newton method (Nocedal and Wright, 1999) is widely regarded as the workhorse of smooth nonlinear optimization due to its combination of computational efficiency with good asymptotic convergence. Given a smooth objective function J: ℝ^d → ℝ and a current iterate w_t ∈ ℝ^d, BFGS forms a local quadratic model of J:

    Q_t(p) := J(w_t) + ½ pᵀB_t⁻¹p + ∇J(w_t)ᵀp,    (1)

where B_t ≻ 0 is a positive-definite estimate of the inverse Hessian of J. Minimizing Q_t(p) gives the quasi-Newton direction

    p_t := −B_t ∇J(w_t),    (2)

which is used for the parameter update:

    w_{t+1} = w_t + η_t p_t.    (3)

The step size η_t ∈ ℝ⁺ is normally determined by a line search obeying the Wolfe conditions:

    J(w_{t+1}) ≤ J(w_t) + c_1 η_t ∇J(w_t)ᵀp_t
    and  ∇J(w_{t+1})ᵀp_t ≥ c_2 ∇J(w_t)ᵀp_t,    (4)

with 0 < c_1 < c_2 < 1. The matrix B_t is then modified via the incremental rank-two update

    B_{t+1} = (I − ρ_t s_t y_tᵀ) B_t (I − ρ_t y_t s_tᵀ) + ρ_t s_t s_tᵀ,    (5)

where s_t := w_{t+1} − w_t and y_t := ∇J(w_{t+1}) − ∇J(w_t) denote the most recent step along the optimization trajectory in parameter and gradient space, respectively, and ρ_t := (y_tᵀs_t)⁻¹. Given a descent direction p_t, the Wolfe conditions ensure that (∀t) s_tᵀy_t > 0, and hence B_0 ≻ 0 ⇒ (∀t) B_t ≻ 0.

Limited-memory BFGS (LBFGS) is a variant of BFGS designed for solving large-scale optimization problems where the O(d²) cost of storing and updating B_t would be prohibitively expensive. LBFGS approximates the quasi-Newton direction (2) directly from the last m pairs of s_t and y_t via a matrix-free approach. This reduces the cost to O(md) space and time per iteration, with m freely chosen (Nocedal and Wright, 1999).

Smoothness of the objective function is essential for standard (L)BFGS because both the local quadratic model (1) and the Wolfe conditions (4) require the existence of the gradient ∇J at every point. Even though nonsmooth convex functions are differentiable everywhere except on a set of Lebesgue measure zero (Hiriart-Urruty and Lemaréchal, 1993), in practice (L)BFGS often fails to converge on such problems (Lukšan and Vlček, 1999; Haarala, 2004). Various subgradient-based approaches, such as subgradient descent (Nedich and Bertsekas, 2000) or bundle methods (Teo et al., 2007), are therefore preferred.

Although a convex function might not be differentiable everywhere, a subgradient always exists. Let w be a point where a convex function J is finite. Then a subgradient is the normal vector of any tangential supporting hyperplane of J at w.
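To make this supporting-hyperplane picture concrete, the following sketch (our own illustration, not part of the paper; the helper names `hinge` and `hinge_subgradient` are ours) numerically checks that at a kink of the binary hinge loss, used later in Section 3, every element of the subdifferential defines a hyperplane lying below the loss everywhere:

```python
import numpy as np

def hinge(w, x, z):
    """Binary hinge loss l(w.x, z) = max(0, 1 - z w.x)."""
    return max(0.0, 1.0 - z * (w @ x))

def hinge_subgradient(w, x, z, beta=0.0):
    """One subgradient of the hinge loss at w.  At a margin point
    (z w.x == 1) the subdifferential is the whole segment
    {-beta * z * x : beta in [0, 1]}; beta selects an element."""
    margin = 1.0 - z * (w @ x)
    if margin > 0:
        return -z * x
    if margin == 0:
        return -beta * z * x
    return np.zeros_like(x)

# At the kink, every beta in [0, 1] yields a supporting hyperplane:
# hinge(w') >= hinge(w) + (w' - w).g for all w'.
```

Each such g is a valid "gradient surrogate" in exactly the sense formalized next.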
Formally, g is called a subgradient of J at w if and only if

    J(w′) ≥ J(w) + (w′ − w)ᵀg  ∀w′.    (6)

The set of all subgradients at a point is called the subdifferential, and is denoted by ∂J(w). If this set is not empty then J is said to be subdifferentiable at w. If it contains exactly one element, i.e., ∂J(w) = {∇J(w)}, then J is differentiable at w.

In this paper we systematically modify the standard (L)BFGS algorithm so as to make it amenable to subgradients. This results in sub(L)BFGS, a new subgradient quasi-Newton method which is applicable to a wide variety of nonsmooth convex optimization problems encountered in machine learning. In the next section we describe our new algorithm generically, before we discuss its application to L2-regularized risk minimization with hinge loss in Section 3. Section 4 compares and contrasts our work with other recent efforts in this area. Encouraging experimental results are reported in Section 5. We conclude with an outlook and discussion in Section 6.

2. Subgradient BFGS Method

We modify the standard BFGS algorithm to derive our new algorithm (subBFGS, Algorithm 1) for nonsmooth convex optimization.

Algorithm 1 Subgradient BFGS (subBFGS)
 1: Initialize: t := 0, w_0 = 0, B_0 = I;
 2: Set direction-finding tolerance ε and iteration limit k_max;
 3: Compute subgradient g_0 ∈ ∂J(w_0);
 4: while not converged do
 5:   p_t = descentDirection(g_t, ε, k_max);  (Algorithm 2)
 6:   if p_t = failure then
 7:     return w_t;
 8:   end if
 9:   Find η_t that obeys (14);  (e.g., Algorithm 3)
10:   s_t = η_t p_t;
11:   w_{t+1} = w_t + s_t;
12:   Compute subgradient g_{t+1} ∈ ∂J(w_{t+1});
13:   y_t = g_{t+1} − g_t;
14:   Update B_{t+1} via (5);
15:   t := t + 1;
16: end while

These modifications can be grouped into three areas, which we elaborate on in turn: generalizing the local quadratic model, finding a descent direction, and finding a step size that obeys a subgradient reformulation of the Wolfe conditions.

2.1. Generalizing the Local Quadratic Model

Recall that BFGS assumes the objective function J is differentiable everywhere, so that at the current iterate w_t we can construct a local quadratic model (1) of J(w_t). For a nonsmooth objective function, such a model becomes ambiguous at non-differentiable points (Figure 1). To resolve the ambiguity, we could simply replace the gradient ∇J(w_t) in (1) with some subgradient g_t ∈ ∂J(w_t). However, as will be discussed later, the resulting quasi-Newton direction p_t := −B_t g_t is not necessarily a descent direction. To address this fundamental modeling problem, we first generalize the local quadratic model as follows:

    Q_t(p) := J(w_t) + M_t(p),  where
    M_t(p) := ½ pᵀB_t⁻¹p + sup_{g∈∂J(w_t)} gᵀp.    (7)

Figure 1. Quadratic models (dashed) vs. tightest pseudo-quadratic fit (7) (bold dashes) to the objective function (solid line) at a subdifferentiable point (solid disk).

Note that where J is differentiable, (7) reduces to the familiar BFGS quadratic model (1). At non-differentiable points, however, the model is no longer quadratic, as the supremum may be attained at different elements of ∂J(w_t) for different directions p. Instead it can be viewed as the tightest pseudo-quadratic fit to J at w_t (Figure 1). Ideally, we would like to minimize Q_t(p), or equivalently M_t(p), in (7) to obtain the best search direction,

    p* := arginf_{p∈ℝ^d} M_t(p).    (8)

This is generally intractable due to the presence of a supremum over the entire subdifferential set ∂J(w_t). In many machine learning problems, however, the set ∂J(w_t) has some special structure that simplifies calculation of the supremum in (7).
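When ∂J(w_t) is the convex hull of finitely many subgradients (as it is for the hinge loss of Section 3), both the failure of the naive choice p_t = −B_t g_t and the aggregation idea of Section 2.2 can be illustrated with a short sketch. The code below is our own simplified rendering of the convex-combination update of Algorithm 2 under that finite-set assumption, with function names of our choosing:

```python
import numpy as np

def descent_direction(G, B, kmax=50):
    """Seek a quasi-Newton descent direction when the subdifferential at
    w_t is the convex hull of the rows of G, so the sup-oracle
    argsup_{g in dJ(w_t)} g.p is a plain argmax over rows.  Every iterate
    maintains p = -B gbar for an aggregated subgradient gbar, so the model
    value M_t(p) = 0.5 p'B^{-1}p + sup_g g'p equals -0.5 gbar'p + sup_g g'p.
    Returns the best descent direction found, or None if none exists
    (i.e. 0 lies in the subdifferential)."""
    G = np.asarray(G, dtype=float)
    gbar = G[0].copy()                  # aggregated subgradient
    p = -B @ gbar
    best_p, best_M = None, np.inf
    for _ in range(kmax):
        sup = np.max(G @ p)             # oracle: sup_g g'p
        g_new = G[np.argmax(G @ p)]
        M = -0.5 * (gbar @ p) + sup     # model value at current iterate
        if sup < 0 and M < best_M:      # p is a descent direction
            best_M, best_p = M, p.copy()
        d = g_new - gbar
        denom = d @ B @ d
        if denom <= 1e-12:
            break
        mu = min(1.0, (d @ p) / denom)  # optimal combination coefficient
        if mu <= 0:
            break                       # no further dual progress
        gbar = (1 - mu) * gbar + mu * g_new
        p = -B @ gbar                   # == (1 - mu) p - mu B g_new
    return best_p

# At a kink of J(w) = |w_1| + w_2 the extreme subgradients are (1, 1)
# and (-1, 1).  The naive direction p = -g for g = (1, 1) has
# sup_g g'p = 0, so it does not descend; the iteration recovers one:
G = np.array([[1.0, 1.0], [-1.0, 1.0]])
p = descent_direction(G, np.eye(2))     # a descent direction, here (0., -1.)
```

At the minimum of |w_1| + |w_2|, whose subdifferential contains 0, the same routine correctly reports failure.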
In what follows, we develop an iteration that is guaranteed to find a quasi-Newton descent direction, assuming an oracle that supplies argsup_{g∈∂J(w_t)} gᵀp for a given direction p ∈ ℝ^d. In Section 3.1 we provide an efficient implementation of such an oracle for L2-regularized risk minimization with the hinge loss.

2.2. Finding a Descent Direction

A direction p_t is a descent direction if and only if gᵀp_t < 0 ∀g ∈ ∂J(w_t) (Belloni, 2005), or equivalently

    sup_{g∈∂J(w_t)} gᵀp_t < 0.    (9)

In particular, for a smooth convex function the quasi-Newton direction (2) is always a descent direction because ∇J(w_t)ᵀp_t = −∇J(w_t)ᵀB_t∇J(w_t) < 0 holds due to the positivity of B_t. For nonsmooth functions, however, the quasi-Newton direction p_t := −B_t g_t for a given g_t ∈ ∂J(w_t) may not fulfill the descent condition (9), making it impossible to find a step size that obeys (4), and thus causing the line search to fail.

We now present an iterative approach to finding a quasi-Newton descent direction. Inspired by bundle methods (Teo et al., 2007), we build the following convex lower bound on M_t(p):

    M_t^(i)(p) := ½ pᵀB_t⁻¹p + sup_{j≤i} g^(j)ᵀp,    (10)

where i, j ∈ ℕ. Given a p^(i) ∈ ℝ^d, the lower bound (10) is successively tightened by computing

    g^(i+1) := argsup_{g∈∂J(w_t)} gᵀp^(i),    (11)

such that M_t^(i)(p) ≤ M_t^(i+1)(p) ≤ M_t(p) ∀p ∈ ℝ^d. Here we set g^(1) ∈ ∂J(w_t), and assume that g^(i+1) is provided by an oracle. To solve inf_{p∈ℝ^d} M_t^(i)(p), we rewrite it as a constrained optimization problem:

    inf_{p,ξ} ½ pᵀB_t⁻¹p + ξ  s.t.  g^(j)ᵀp ≤ ξ  ∀j ≤ i.    (12)

This problem can be solved exactly via quadratic programming, but doing so may incur substantial computational expense. Instead we adopt an alternative approach (Algorithm 2) which does not solve inf_{p∈ℝ^d} M_t^(i)(p) to optimality. The key idea is to write the proposed descent direction at iteration i+1 as a convex combination of p^(i) and −B_t g^(i+1). The optimal combination coefficient μ* can be computed exactly (Step 5 of Algorithm 2) using an argument based on maximizing dual progress. Finally, to derive an implementable stopping criterion, we define

    ε^(i) := min_{j≤i} [ p^(j)ᵀg^(j+1) − ½ ( p^(j)ᵀḡ^(j) + p^(i)ᵀḡ^(i) ) ],    (13)

where ḡ^(i) is an aggregated subgradient (Step 6 of Algorithm 2) which lies in the convex hull of the g^(j) ∈ ∂J(w_t) ∀j ≤ i. ε^(i) is monotonically decreasing, and upper-bounds the distance from the optimal value of the dual of M_t(p), leading us to a practical stopping criterion (Step 4 of Algorithm 2) for our direction-finding procedure. Yu et al. (2008) provide details, and prove that Algorithm 2 converges to the optimal dual objective value with precision ε at an O(1/ε) rate.

Algorithm 2 p_t = descentDirection(g^(1), ε, k_max)
input subgradient g^(1) ∈ ∂J(w_t), tolerance ε ∈ ℝ⁺, iteration limit k_max;
output descent direction p_t;
 1: Initialize: i := 1, ḡ^(1) = g^(1), p^(1) = −B_t g^(1);
 2: g^(2) = argsup_{g∈∂J(w_t)} gᵀp^(1);
 3: Calculate ε^(1) via (13);
 4: while (g^(i+1)ᵀp^(i) > 0 or ε^(i) > ε) and i < k_max do
 5:   μ* := min[ 1, ((g^(i+1) − ḡ^(i))ᵀp^(i)) / ((g^(i+1) − ḡ^(i))ᵀB_t(g^(i+1) − ḡ^(i))) ];
 6:   ḡ^(i+1) = (1 − μ*) ḡ^(i) + μ* g^(i+1);
 7:   p^(i+1) = (1 − μ*) p^(i) − μ* B_t g^(i+1);
 8:   g^(i+2) = argsup_{g∈∂J(w_t)} gᵀp^(i+1);
 9:   Calculate ε^(i+1) via (13);
10:   i := i + 1;
11: end while
12: if g^(i+1)ᵀp^(i) > 0 then
13:   return failure;
14: else
15:   return argmin_{j≤i} M_t(p^(j));
16: end if

2.3. Subgradient Line Search

Given the current iterate w_t and a search direction p_t, the task of a line search is to find a step size η ∈ ℝ⁺ which decreases the objective function along the line w_t + ηp_t, i.e., which reduces J(w_t + ηp_t) =: Φ(η). The Wolfe conditions (4) are used in line search routines to enforce a sufficient decrease in the objective value, and to exclude unnecessarily small step sizes (Nocedal and Wright, 1999). However, the original Wolfe conditions require the objective function to be smooth. To extend them to nonsmooth convex problems, we propose the following subgradient reformulation:

    J(w_{t+1}) ≤ J(w_t) + c_1 η_t sup_{g∈∂J(w_t)} gᵀp_t
    and  sup_{g∈∂J(w_{t+1})} gᵀp_t ≥ c_2 sup_{g∈∂J(w_t)} gᵀp_t,    (14)

where 0 < c_1 < c_2 < 1. Figure 2 illustrates how these conditions enforce acceptance of non-trivial step sizes that decrease the objective value. Yu et al. (2008) formally show that for any given descent direction we can always find a positive step size that satisfies (14).

Figure 2. Geometric interpretation of the subgradient Wolfe conditions (14). Solid disks are subdifferentiable points; the slopes of the dashed lines are indicated.

2.4. Limited-Memory Subgradient BFGS

It is straightforward to implement an LBFGS variant of our subBFGS algorithm: We simply modify Algorithms 1 and 2 to compute all products of B_t with a vector by means of the standard LBFGS matrix-free scheme (Nocedal and Wright, 1999).

3. sub(L)BFGS Implementation for L2-Regularized Risk Minimization

Many machine learning algorithms can be viewed as minimizing the L2-regularized risk

    J(w) := c/2 ‖w‖² + 1/n ∑_{i=1}^n l(wᵀx_i, z_i),    (15)

where the x_i ∈ X ⊆ ℝ^d are the training instances, the z_i ∈ Z ⊆ ℝ the corresponding labels, and the loss l is a non-negative convex function of w which measures the discrepancy between z_i and the predictions arising from w via wᵀx_i. A loss function commonly used for binary classification is the hinge loss

    l(wᵀx, z) := max(0, 1 − z wᵀx),    (16)

where z ∈ {±1}. L2-regularized risk minimization with the binary hinge loss is a convex but nonsmooth optimization problem; in this section we show how sub(L)BFGS (Algorithm 1) can be applied to it. Differentiating (15) after plugging in (16) yields

    ∂J(w) = c w − 1/n ∑_{i=1}^n β_i z_i x_i = w̄ − 1/n ∑_{i∈M} β_i z_i x_i,    (17)

where w̄ := c w − 1/n ∑_{i∈E} z_i x_i and

    β_i := 1 if i ∈ E,  β_i ∈ [0,1] if i ∈ M,  β_i := 0 if i ∈ W,

with E := {i: 1 − z_i wᵀx_i > 0}, M := {i: 1 − z_i wᵀx_i = 0}, and W := {i: 1 − z_i wᵀx_i < 0}. E, M, and W index the sets of points which are in error, on the margin, and well-classified, respectively.

3.1. Realizing the Direction-Finding Method

Recall that our sub(L)BFGS algorithm requires an oracle that provides argsup_{g∈∂J(w_t)} gᵀp for a given direction p. For L2-regularized risk minimization with the binary hinge loss we can implement such an oracle at computational cost linear in the number |M_t| of current margin points. (Normally |M_t| ≪ n.) Towards this end we use (17) to obtain

    sup_{g∈∂J(w_t)} gᵀp = sup_{β_i∈[0,1], i∈M_t} ( w̄_t − 1/n ∑_{i∈M_t} β_i z_i x_i )ᵀp
                        = w̄_tᵀp − 1/n inf_{β_i∈[0,1], i∈M_t} ∑_{i∈M_t} β_i z_i x_iᵀp.    (18)

Since for a given p the first term of the right-hand side of (18) is a constant, the supremum is attained when we set the β_i ∀i ∈ M_t via the following strategy:

    β_i := 0 if z_i x_iᵀp_t ≥ 0,  β_i := 1 if z_i x_iᵀp_t < 0.    (19)

3.2. Implementing the Line Search

The one-dimensional convex function Φ obtained by restricting (15) to a line can be evaluated efficiently. To see this, rewrite the objective as

    J(w) := c/2 ‖w‖² + 1/n 1ᵀ max(0, 1 − z · Xw),    (20)

where 0 and 1 are column vectors of zeros and ones, respectively, · denotes the Hadamard (component-wise) product, and z ∈ ℝ^n collects the labels corresponding to the rows of the data matrix X := [x_1, x_2, ..., x_n]ᵀ ∈ ℝ^{n×d}. Given a search direction p_t at an iterate w_t, this allows us to write

    Φ(η) := J(w_t + ηp_t)
          = c/2 ‖w_t‖² + cη w_tᵀp_t + cη²/2 ‖p_t‖² + 1/n δ(η)ᵀ[1 − (f + ηΔf)],    (21)

where f := z · Xw_t, Δf := z · Xp_t, and

    δ_i(η) := 1 if f_i + ηΔf_i < 1,  δ_i(η) ∈ [0,1] if f_i + ηΔf_i = 1,  δ_i(η) := 0 if f_i + ηΔf_i > 1,    (22)

for 1 ≤ i ≤ n. We cache f and Δf, expending O(nd) computational effort.
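With f and Δf cached, Φ(η) in (21) can be evaluated cheaply for any candidate step size. A minimal sketch of this evaluation (our own code, with function names of our choosing; for simplicity it computes the cached scalar terms inside the closure rather than threading them through explicitly):

```python
import numpy as np

def make_phi(w, p, X, z, c):
    """Return phi(eta) = J(w + eta p) for objective (20), evaluated per
    (21) from the cached vectors f = z*(Xw) and df = z*(Xp)."""
    f, df = z * (X @ w), z * (X @ p)
    w2, wp, p2 = w @ w, w @ p, p @ p            # cached scalar terms
    n = len(z)
    def phi(eta):
        # the max() zeroes exactly the terms that delta(eta) deselects
        hinge = np.maximum(0.0, 1.0 - (f + eta * df))
        return c / 2 * w2 + c * eta * wp + c * eta**2 / 2 * p2 + hinge.sum() / n
    return phi
```

Note that `np.maximum` realizes δ(η)ᵀ[1 − (f + ηΔf)] for any valid choice of δ_i at a kink, since the corresponding terms vanish there.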
We also cache c/2 ‖w_t‖², c w_tᵀp_t, and c/2 ‖p_t‖², each of which requires O(d) work. The evaluation of δ(η) and its inner product with 1 − (f + ηΔf) both take O(n) effort. All other terms in (21) can be computed in constant time, thus reducing the amortized cost of evaluating Φ(η) to O(n). We are now in a position to introduce an exact line search which takes advantage of this scheme.

3.2.1. Exact Line Search

Differentiating (21) with respect to η and setting the gradient to zero shows that η* := argmin_η Φ(η) satisfies η* = (δ(η*)ᵀΔf/n − c w_tᵀp_t)/(c ‖p_t‖²). It is easy to verify that Φ(η) is piecewise quadratic, and differentiable everywhere except at the points η_i := (1 − f_i)/Δf_i, where it becomes subdifferentiable. At these points an element of the indicator function δ(η) (22) changes from 0 to 1 or vice versa; otherwise δ(η) remains constant. Thus for a smooth interval (η_a, η_b) between subdifferentiable points η_a and η_b (cf. Figure 3), if the candidate step

    η_{a,b} = (δ(η)ᵀΔf/n − c w_tᵀp_t)/(c ‖p_t‖²),  η ∈ (η_a, η_b),    (23)

lies in the interval, it is optimal: η_{a,b} ∈ [η_a, η_b] ⇒ η* = η_{a,b}. Otherwise, an interval boundary is optimal if its subdifferential contains zero: 0 ∈ ∂Φ(η_a) ⇒ η* = η_a. Sorting the subdifferentiable points η_i facilitates an efficient search over the intervals; see Algorithm 3 for details.

Figure 3. Nonsmooth convex function of step size η. Solid disks are subdifferentiable points; the optimal step η* falls either on such a point (left), or between two such points (right).

Algorithm 3 η = lineSearch(w_t, p_t, c, f, Δf)
input w_t, p_t, c, f, and Δf as in (21);
output step size η*;
 1: b = c w_tᵀp_t; h = c ‖p_t‖²;
 2: n = length(f); j := 1;
 3: η := [(1 − f)/Δf, 0];  (subdifferentiable points, plus sentinel 0)
 4: π = sort(η);  (index vector: η_{π_j} non-decreasing in j)
 5: while η_{π_j} ≤ 0 do
 6:   j := j + 1;
 7: end while
 8: η̃ := η_{π_j}/2;  (probe point inside the first interval)
 9: for i := 1 to n do
10:   δ_i := 1 if f_i + η̃ Δf_i < 1, 0 otherwise;
11: end for
12: ρ := δᵀΔf/n;
13: while j ≤ length(π) do
14:   η̃ := (ρ − b)/h;  (candidate step (23))
15:   if η̃ ∈ [η_{π_{j−1}}, η_{π_j}] then
16:     return η* := η̃;
17:   else if η̃ < η_{π_{j−1}} then
18:     return η* := η_{π_{j−1}};
19:   else
20:     repeat
21:       ρ := ρ − Δf_{π_j}/n if δ_{π_j} = 1, ρ := ρ + Δf_{π_j}/n otherwise;
22:       j := j + 1;
23:     until η_{π_j} ≠ η_{π_{j−1}};
24:   end if
25: end while

4. Related Work

Lukšan and Vlček (1999) propose an extension of BFGS to nonsmooth convex problems. Their algorithm samples gradients around non-differentiable points in order to obtain a descent direction. In many machine learning problems, however, evaluating the objective function and its gradient is very expensive. Therefore, our direction-finding algorithm (Algorithm 2) repeatedly samples subgradients from the set ∂J(w) via the oracle, which is computationally more efficient.

Recently, Andrew and Gao (2007) introduced a variant of nonsmooth BFGS, the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm, suitable for optimizing L1-regularized log-linear models:

    J(w) := c ‖w‖₁ + 1/n ∑_{i=1}^n ln(1 + e^{−z_i wᵀx_i}),    (24)

where the logistic loss is smooth, but the regularizer is only subdifferentiable at points where w has zero elements. From the optimization viewpoint this objective is very similar to the L2-regularized hinge loss; the direction-finding and line search methods that we discussed in Sections 3.1 and 3.2, respectively, can be applied to this problem with slight modifications. OWL-QN is based on the observation that the L1 regularizer is linear within any given orthant. It therefore maintains an approximation B^ow to the inverse Hessian of the logistic loss, and uses an efficient scheme to select orthants for optimization. In fact, its success greatly depends on its direction-finding subroutine, which demands a specially chosen subgradient g^ow (Andrew and Gao, 2007, Equation 4) to produce the quasi-Newton direction p^ow = π(p, g^ow), where p := −B^ow g^ow and the projection π returns a search direction by setting the ith element of p to zero whenever p_i g_i^ow > 0. As shown in Section 5, the direction-finding subroutine of OWL-QN can be replaced by Algorithm 2, which in turn makes the algorithm more robust to the choice of subgradients.

Many optimization techniques use past gradients to build a model of the objective function. Bundle method solvers like SVMStruct (Joachims, 2006) and BMRM (Teo et al., 2007) use them to lower-bound the objective by a piecewise linear function which is minimized to obtain the next iterate. This fundamentally differs from the BFGS approach of using past gradients to approximate the (inverse) Hessian, and hence build a quadratic model of the objective function.

Franc and Sonnenburg (2007) speed up the convergence of a bundle method solver for the L2-regularized binary hinge loss. Their main idea is to perform a line search along the line connecting two successive iterates of a bundle method solver. Although developed independently, their line search algorithm is very reminiscent of the method we describe in Section 3.2.1.

5. Experiments

We now evaluate the performance of our subLBFGS algorithm, and compare it to other state-of-the-art nonsmooth optimization methods on L2-regularized hinge loss minimization. We also compare a variant of OWL-QN that uses our direction-finding routine to the original on L1-regularized logistic loss minimization.

Our experiments used four datasets: the Covertype dataset of Blackard, Jock & Dean; CCAT from the Reuters RCV1 collection; the Astro-physics dataset of abstracts of scientific papers from the Physics ArXiv (Joachims, 2006); and the MNIST dataset of handwritten digits with two classes: even and odd digits. We used subLBFGS with a buffer of size m = 15 throughout. Table 1 summarizes our parameter settings, and reports the overall number of direction-finding iterations for all experiments. We followed the choices of Franc and Sonnenburg (2007) for the L2 regularization constants; for L1 they were chosen from the set 10^{−6,−5,...,−1} to achieve the lowest test error. On convex problems such as these every convergent optimizer will reach the same solution; comparing generalisation performance is therefore pointless. We combined training and test datasets to evaluate the convergence of each algorithm in terms of the objective function value vs. CPU seconds. All experiments were carried out on a Linux machine with dual 2.8 GHz Xeon processors and 4 GB RAM.

Table 1. Datasets, regularization constants c, direction-finding convergence tolerances ε, and the overall number k of direction-finding iterations for the L1-regularized logistic loss and L2-regularized hinge loss minimization tasks, respectively.

Dataset     Tr./Test data   Dimen.  Density  c_L1   ε_L1   k_L1  k_L1^rand  c_L2          ε_L2   k_L2
Covertype   522911/58101    54      22.22%   10⁻⁶   10⁻⁵   0     0          10⁻⁶          10⁻⁸   44
CCAT        781265/23149    47236   0.16%    10⁻⁶   10⁻⁵   356   467        10⁻⁴          10⁻⁸   66
Astro       29882/32487     99757   0.077%   10⁻⁵   10⁻³   1668  2840       5·10⁻⁵        10⁻⁸   17
MNIST       60000/10000     780     19.22%   10⁻⁴   10⁻⁵   60    102        1.4286·10⁻⁶   10⁻⁸   244

5.1. L2-Regularized Hinge Loss

For our first set of experiments, we applied subLBFGS together with our exact line search (Algorithm 3) to the task of L2-regularized hinge loss minimization. Our control methods are the bundle method solver BMRM (Teo et al., 2007) and an optimized cutting plane algorithm, OCAS version 0.6.0 (Franc and Sonnenburg, 2007), both of which demonstrated strong results on L2-regularized hinge loss minimization in their corresponding papers. Figure 4 shows that subLBFGS (solid) reaches the neighbourhood of the optimum (less than 10⁻³ away) noticeably (up to 7 times) faster than BMRM (dashed). As BMRM's approximation to the objective function improves over the course of optimization, it gradually catches up with subLBFGS, though ultimately subLBFGS still converges faster on 3 out of 4 datasets. The performance of subLBFGS and OCAS (dash-dotted) is very similar: OCAS converges slightly faster than subLBFGS on the Astro-physics dataset but is outperformed by subLBFGS on the MNIST dataset.

Figure 4. Objective function value vs. CPU seconds on the L2-regularized hinge loss minimization tasks (Covertype, CCAT, Astro-physics, and MNIST).

5.2. L1-Regularized Logistic Loss

To demonstrate the utility of our direction-finding routine (Algorithm 2) in its own right, we plugged it into the OWL-QN algorithm (Andrew and Gao, 2007) as an alternative direction-finding method, such that p^ow = descentDirection(g^ow, ε, k_max),¹ and compared this variant (denoted OWL-QN*) with the original on L1-regularized logistic loss minimization. Using the stopping criterion suggested by Andrew and Gao (2007), we run experiments until the averaged relative change in the objective value over the previous 5 iterations falls below 10⁻⁵. Figure 5 shows only minor differences in convergence between the two algorithms. To examine the algorithms' sensitivity to the choice of subgradients, we also ran them with random subgradients (as opposed to the specially chosen g^ow used before) fed to their corresponding direction-finding routines.

¹Note that for the objective (24) it is trivial to construct an oracle that supplies argsup_{g∈∂J(w_t)} gᵀp.
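For context, the orthant projection π on which OWL-QN's direction finding hinges (Section 4), and whose sensitivity to the choice of g^ow this experiment probes, can be sketched in a few lines (our own illustration; `owlqn_project` is our name for π):

```python
import numpy as np

def owlqn_project(p, g_ow):
    """OWL-QN's projection pi(p, g_ow): zero out every component of the
    quasi-Newton direction p with p_i * g_ow_i > 0, i.e. every component
    whose sign disagrees with the steepest-descent direction -g_ow.
    A badly chosen g_ow can thus zero out most of p, which is why the
    choice of subgradient matters so much to OWL-QN."""
    p = np.asarray(p, dtype=float)
    return np.where(p * np.asarray(g_ow) > 0, 0.0, p)
```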
OWL-QN relies heavily on its particular choice of subgradients, and hence breaks down completely under these conditions: The only dataset where we could even plot its (poor) performance was Covertype (dotted OWL-QN(2) line in Figure 5). Our direction-finding routine, by contrast, is self-correcting and thus not affected by this manipulation: The curves for OWL-QN*(2) (plotted for Covertype in Figure 5) lie virtually on top of those for OWL-QN*. Table 1 shows that in this case more direction-finding iterations are needed, i.e., k_L1^rand ≥ k_L1. This empirically confirms that as long as argsup_{g∈∂J(w_t)} gᵀp is given, Algorithm 2 can indeed be used as a canned quasi-Newton direction-finding routine.

Figure 5. Objective function value vs. CPU seconds on the L1-regularized logistic loss minimization tasks (Covertype, CCAT, Astro-physics, and MNIST).

6. Outlook and Discussion

We proposed an extension of BFGS suitable for handling nonsmooth problems often encountered in the machine learning context. As our experiments show, our algorithms are versatile and applicable to many problems, while their performance is comparable to, if not better than, that of their counterparts in custom-built solvers.

In some experiments we observe that subLBFGS initially makes rapid progress towards the solution but slows down closer to the optimum. We hypothesize that initially its quadratic model allows subLBFGS to make rapid progress, but closer to the optimum it is no longer an accurate model of an objective function dominated by the nonsmooth hinges. We are therefore contemplating hybrid solvers which seamlessly switch between sub(L)BFGS and bundle solvers.

In this paper we applied subLBFGS to L2-regularized risk minimization with binary hinge loss. It can also be extended to deal with generalizations of the hinge loss, such as multi-class, multi-category, and ordinal regression problems; this is part of our ongoing research.

Finally, to put our contributions in perspective, recall that we modified three aspects of the standard BFGS algorithm, namely the quadratic model (Section 2.1), the descent direction finding (Section 2.2), and the line search (Section 2.3). Each of these modifications is versatile enough to be used as a component in other nonsmooth optimization algorithms. This not only offers the promise of improving existing algorithms, but may also help clarify connections between them. We hope that this will focus attention on those core subroutines that need to be made more efficient in order to handle larger and larger datasets.

Acknowledgement

NICTA is funded by the Australian Government's Backing Australia's Ability and the Centre of Excellence programs. This work is also supported by the IST Program of the European Community, under the FP7 Network of Excellence, ICT-216886-NOE.

References

G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proc. Intl. Conf. Machine Learning, pages 33–40, New York, NY, USA, 2007. ACM.
A. Belloni. Introduction to bundle methods. Technical report, Operations Research Center, M.I.T., 2005.
V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. Technical Report 1, Fraunhofer Institute FIRST, December 2007. http://publica.fraunhofer.de/documents/N-66225.html
M. Haarala. Large-Scale Nonsmooth Optimization. PhD thesis, University of Jyväskylä, 2004.
J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, I and II, volumes 305 and 306. Springer-Verlag, 1993.
T. Joachims. Training linear SVMs in linear time. In Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD). ACM, 2006.
L. Lukšan and J. Vlček. Globally convergent variable metric method for convex nonsmooth unconstrained minimization. Journal of Optimization Theory and Applications, 102(3):593–613, 1999.
A. Nedich and D. P. Bertsekas. Convergence rate of incremental subgradient algorithms. In S. Uryasev and P. M. Pardalos, editors, Stochastic Optimization: Algorithms and Applications, pages 263–304. Kluwer Academic Publishers, 2000.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999.
C. Teo, Q. Le, A. Smola, and S. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD). ACM, 2007.
J. Yu, S. V. N. Vishwanathan, S. Günter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization. Technical Report arXiv:0804.3835, April 2008. http://arxiv.org/pdf/0804.3835