Bundle Methods for Machine Learning

Alexander J. Smola, S.V.N. Vishwanathan, Quoc Le
NICTA, ANU, Canberra, Australia
Alex.Smola@gmail.com, {SVN.Vishwanathan, Quoc.Le}@nicta.com

Abstract

We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our approach in experiments.

1 Introduction

In recent years optimization methods for convex models have seen significant progress. Starting from the active set methods described by Vapnik [16], increasingly sophisticated algorithms for solving regularized risk minimization problems have been developed. Some of the most exciting recent developments are SVMPerf [5] and the Pegasos gradient descent solver [12]. The former computes gradients of the current solution at every step and adds those to the optimization problem. Joachims [5] proves an O(1/ε²) rate of convergence. For Pegasos, Shalev-Shwartz et al. [12] prove an O(1/ε) rate of convergence, which suggests that Pegasos should be much more suitable for optimization.

In this paper we extend the ideas of SVMPerf to general convex optimization problems and a much wider class of regularizers. In addition, we present a formulation which does not require the solution of a quadratic program, whilst in practice enjoying the same rate of convergence as algorithms of the SVMPerf family. Our error analysis shows that the rates achieved by this algorithm are considerably better than what was previously known for SVMPerf: the algorithm enjoys O(1/ε) convergence in general and O(log(1/ε)) convergence whenever the loss is sufficiently smooth. An important feature of our algorithm is that it automatically takes advantage of smoothness in the problem.

Our work builds on [14], which describes the basic extension of SVMPerf to general convex problems. The current paper provides a) significantly improved performance bounds which match better what can be observed in practice and which apply to a wide range of regularization terms, b) a variant of the algorithm which does not require quadratic programming, yet enjoys the same fast rates of convergence, and c) experimental data comparing the speed of our solver to Pegasos and SVMPerf. Due to space constraints we relegate the proofs to an appendix which is attached to the submission.

2 Problem Setting

Denote by x ∈ X and y ∈ Y patterns and labels respectively, and let l(x, y, w) be a loss function which is convex in w ∈ W, where either W = R^d (linear classifier) or W is a Reproducing Kernel Hilbert Space (kernel methods). Given a set of m training patterns {x_i, y_i}_{i=1}^m, the regularized risk functional which many estimation methods strive to minimize can be written as

  J(w) := R_emp(w) + λΩ(w) where R_emp(w) := (1/m) Σ_{i=1}^m l(x_i, y_i, w).   (1)

Here Ω(w) is a smooth convex regularizer such as ½‖w‖², and λ > 0 is a regularization constant. Typically Ω is cheap to compute and to minimize, whereas the empirical risk term R_emp(w) is computationally expensive to deal with.
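As a concrete instance of (1), the following minimal sketch (ours, not part of the paper) evaluates J(w) for the binary hinge loss with the quadratic regularizer Ω(w) = ½‖w‖²:

```python
import numpy as np

def hinge_risk(X, y, w):
    # R_emp(w) = (1/m) sum_i max(0, 1 - y_i <w, x_i>), the binary hinge loss
    margins = y * (X @ w)
    return np.maximum(0.0, 1.0 - margins).mean()

def regularized_risk(X, y, w, lam):
    # J(w) = R_emp(w) + lam * Omega(w) with Omega(w) = 0.5 ||w||^2, cf. Eq. (1)
    return hinge_risk(X, y, w) + lam * 0.5 * (w @ w)
```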
In the case of intractable graphical models, for instance, evaluating R_emp(w) requires approximate inference methods such as sampling or semidefinite programming. To make matters worse, the number of training observations m may be huge. We assume that the empirical risk R_emp(w) is nonnegative.

If J is differentiable we can use standard quasi-Newton methods like LBFGS even for large values of m [8]. Unfortunately, it is not straightforward to extend these algorithms to optimize a non-smooth objective. In such cases one has to resort to bundle methods [3], which are based on the following elementary observation: for convex functions a first order Taylor approximation is a lower bound, and so is the maximum over a set of Taylor approximations. Furthermore, the Taylor approximation is exact at the point of expansion. The idea is to replace R_emp(w) by these lower bounds and to minimize the latter in conjunction with λΩ(w). Figure 1 gives geometric intuition. In the remainder of the paper we will show that 1) this extends a number of existing algorithms; 2) this method enjoys good rates of convergence; and 3) it works well in practice.

[Figure 1: A lower bound on the convex empirical risk R_emp(w) obtained by computing three tangents on the entire function.]

Note that there is no need for R_emp(w) to decompose into individual losses in an additive fashion. For instance, multivariate performance scores such as Precision@k [4], or SVM ranking scores, do not satisfy this property. Likewise, estimation problems which allow for an unregularized common constant offset, or adaptive margin settings using the ν-trick, fall into this category. The only difference is that in those cases the derivative of R_emp(w) with respect to w no longer decomposes trivially into a sum of gradients.

3 Bundle Methods

3.1 Subdifferential and Subgradient

Before we describe the bundle method, it is necessary to clarify a key technical point. The subgradient is a generalization of the gradient appropriate for convex functions, including those which are not necessarily smooth. Suppose w is a point where a convex function F is finite. Then a subgradient is the normal vector of any tangential supporting hyperplane of F at w. Formally, μ is called a subgradient of F at w if, and only if,

  F(w') ≥ F(w) + ⟨w' − w, μ⟩ for all w'.   (2)

The set of all subgradients at a point is called the subdifferential, and is denoted by ∂_w F(w). If this set is not empty, F is said to be subdifferentiable at w. If this set is a singleton, the function is said to be differentiable at w.

3.2 The Algorithm

Denote by w_t ∈ W the values of w obtained by successive steps of our method, let a_t ∈ W and b_t ∈ R, and set w_0 = 0, a_0 = 0, b_0 = 0. The Taylor expansion coefficients of R_emp at w_t can then be written as

  a_{t+1} := ∂_w R_emp(w_t) and b_{t+1} := R_emp(w_t) − ⟨a_{t+1}, w_t⟩.   (3)

Note that we do not require R_emp to be differentiable: if R_emp is not differentiable at w_t we simply choose any element of the subdifferential as a_{t+1}. Since each Taylor approximation is a lower bound, we may take their maximum to obtain R_emp(w) ≥ max_{t'} ⟨a_{t'}, w⟩ + b_{t'}. Moreover, by virtue of the fact that R_emp is a non-negative function, we can write the following lower bounds on R_emp and J respectively:

  R_t(w) := max_{t'≤t} ⟨a_{t'}, w⟩ + b_{t'} and J_t(w) := λΩ(w) + R_t(w).   (4)

By construction R_{t'} ≤ R_t ≤ R_emp and J_{t'} ≤ J_t ≤ J for all t' ≤ t.

Algorithm 1 Bundle Method(ε)
  Initialize t = 0, w_0 = 0, a_0 = 0, b_0 = 0, and J_0(w) = λΩ(w)
  repeat
    Find minimizer w_t := argmin_w J_t(w)
    Compute gradient a_{t+1} and offset b_{t+1}
    Increment t ← t + 1
  until ε_t ≤ ε
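The following Python sketch of Algorithm 1 is our illustration only: it assumes the quadratic regularizer Ω(w) = ½‖w‖² and uses scipy's derivative-free Powell method as a stand-in for the inner minimization, where the paper instead solves the dual QP of Section 3.3. The arguments remp and grad_remp are assumed to evaluate R_emp and one of its (sub)gradients.

```python
import numpy as np
from scipy.optimize import minimize

def bundle_method(remp, grad_remp, dim, lam, eps=1e-3, max_iter=100):
    A, B = [np.zeros(dim)], [0.0]                  # cutting planes; a_0 = 0, b_0 = 0
    R_t = lambda w: max(a @ w + b for a, b in zip(A, B))
    J_t = lambda w: lam * 0.5 * (w @ w) + R_t(w)   # lower bound J_t <= J, Eq. (4)
    J   = lambda w: lam * 0.5 * (w @ w) + remp(w)
    w = np.zeros(dim)
    for t in range(max_iter):
        w = minimize(J_t, w, method="Powell").x    # w_t := argmin_w J_t(w)
        gap = J(w) - J_t(w)                        # gamma_t, an upper bound on eps_t
        if gap <= eps:
            break
        a = grad_remp(w)                           # Taylor coefficients, Eq. (3)
        A.append(a)
        B.append(remp(w) - a @ w)
    return w
```

The stopping criterion monitors the gap γ_t = J(w_t) − J_t(w_t), which, as shown next, upper-bounds the quantity ε_t that controls the distance from optimality.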
Define w* := argmin_w J(w), w_t := argmin_w J_t(w), γ_t := J_{t+1}(w_t) − J_t(w_t), and ε_t := min_{t'≤t} J_{t'+1}(w_{t'}) − J_t(w_t). The following lemma establishes some useful properties of γ_t and ε_t.

Lemma 1 We have J_{t'}(w_{t'}) ≤ J_t(w_t) ≤ J(w*) ≤ J(w_{t'}) = J_{t'+1}(w_{t'}) for all t' ≤ t. Furthermore, ε_t is monotonically decreasing with ε_t − ε_{t+1} ≥ J_{t+1}(w_{t+1}) − J_t(w_t) ≥ 0. Also, ε_t upper bounds the distance from optimality via ε_t ≥ min_{t'≤t} J(w_{t'}) − J(w*).

3.3 Dual Problem

Optimization is often considerably easier in the dual space. In fact, we will show that we need not know Ω(w) at all; it is sufficient to work with its Fenchel-Legendre dual Ω*(μ) := sup_w ⟨w, μ⟩ − Ω(w). If Ω is a so-called Legendre function [e.g. 10], the w at which the supremum is attained can be written as w = ∂_μ Ω*(μ). In the sequel we will always assume that Ω* is twice differentiable and Legendre. Examples include Ω*(μ) = ½‖μ‖² and Ω*(μ) = log Σ_i exp([μ]_i).

Theorem 2 Let α ∈ R^t, denote by A = [a_1, . . . , a_t] the matrix whose columns are the (sub)gradients, and let b = [b_1, . . . , b_t]. The dual problem of

  minimize_w J_t(w) := max_{t'≤t} ⟨a_{t'}, w⟩ + b_{t'} + λΩ(w)   (5)

is

  maximize_α J_t*(α) := −λΩ*(−λ⁻¹Aα) + α⊤b subject to α ≥ 0 and ‖α‖₁ = 1.   (6)

Furthermore, the optimal w_t and α_t are related by the dual connection w_t = ∂Ω*(−λ⁻¹Aα_t).

Recall that for Ω(w) = ½‖w‖₂² the Fenchel-Legendre dual is given by Ω*(μ) = ½‖μ‖₂². This is the regularizer commonly used in SVMs and Gaussian Processes. The following corollary is immediate:

Corollary 3 Define Q := A⊤A, i.e. Q_{uv} := ⟨a_u, a_v⟩. For quadratic regularization, i.e. for

  minimize_w max(0, max_{t'≤t} ⟨a_{t'}, w⟩ + b_{t'}) + (λ/2)‖w‖₂²,

the dual becomes

  maximize_α −(1/2λ) α⊤Qα + α⊤b subject to α ≥ 0 and ‖α‖₁ ≤ 1.   (7)

This means that for quadratic regularization the dual optimization problem is a quadratic program whose number of variables equals the number of gradients computed previously. Since t is typically of the order of tens to hundreds, the resulting QP is very cheap to solve. In fact, we do not even need to know the gradients explicitly: all that is required to define the QP are the inner products ⟨a_u, a_v⟩ between gradient vectors. Later in this section we propose a variant which does away with the quadratic program altogether while preserving most of the appealing convergence properties of Algorithm 1.

3.4 Examples

Structured Estimation Many estimation problems [9, 13, 15] can be written in terms of a piecewise linear loss function

  l(x, y, w) = max_{y'∈Y} ⟨φ(x, y') − φ(x, y), w⟩ + Δ(y, y')   (8)

for some suitable joint feature map φ and loss function Δ(y, y'). It follows from Section 3.1 that a subgradient of (8) is given by

  ∂_w l(x, y, w) = φ(x, y*) − φ(x, y) where y* := argmax_{y'∈Y} ⟨φ(x, y') − φ(x, y), w⟩ + Δ(y, y').   (9)

Since R_emp is defined as a summation of loss terms, this allows us to apply Algorithm 1 directly for risk minimization: at every iteration t we find the maximal constraint violator for each (x_i, y_i) pair and compute the composite gradient vector. This vector is then added to the convex program we have so far.

Joachims [5] pointed out this idea for the special case of φ(x, y) = yφ(x) and y ∈ {±1}, that is, binary loss. Effectively, by defining a joint feature map as the sum over individual feature maps, and a joint loss as the sum over individual losses, SVMPerf performs exactly the same operations as described above. Hence, for losses of type (8) our algorithm is a direct extension of SVMPerf to structured estimation.
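As an illustration of (8) and (9), here is a sketch of ours that assumes a label set small enough to enumerate; real structured problems would replace the exhaustive argmax by problem-specific decoding.

```python
import numpy as np

def structured_loss_and_subgradient(phi, delta, x, y, labels, w):
    # Score of a candidate y': <phi(x,y') - phi(x,y), w> + delta(y, y')
    def score(y_prime):
        return (phi(x, y_prime) - phi(x, y)) @ w + delta(y, y_prime)
    y_star = max(labels, key=score)        # maximal violator y*, Eq. (9)
    # Loss value (8); it is nonnegative since y' = y contributes a score of 0
    return score(y_star), phi(x, y_star) - phi(x, y)
```

Averaging the returned subgradients over all training pairs (x_i, y_i) yields the composite gradient a_{t+1} of (3).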
Exponential Families One of the advantages of our setting is that it applies to any convex loss function, as long as there is an efficient way of computing the gradient. In particular, we can use it for conditional models of the form

  p(y|x; w) = exp(⟨φ(x, y), w⟩ − g(w|x)) where g(w|x) = log ∫_Y exp ⟨φ(x, y), w⟩ dy,   (10)

where g(w|x) is the conditional log-partition function. This type of loss includes settings such as Gaussian Process classification and Conditional Random Fields [1]. Such settings have been studied by Lee et al. [6] in conjunction with an l1 regularizer, Ω(w) = ‖w‖₁, for structure discovery in graphical models. Choosing l to be the negative log-likelihood, it follows that

  R_emp(w) = Σ_{i=1}^m g(w|x_i) − ⟨φ(x_i, y_i), w⟩ and ∂_w R_emp(w) = Σ_{i=1}^m E_{y∼p(y|x_i;w)}[φ(x_i, y)] − φ(x_i, y_i).

This means that column generation methods are directly applicable to Gaussian Process estimation, a problem for which large scale solvers were previously somewhat difficult to find. It also shows that adding a new model becomes a matter of defining a new loss function and its corresponding gradient, rather than having to build a full solver from scratch.

4 Convergence Analysis

While Algorithm 1 is intuitively plausible, it remains to be shown that it has good rates of convergence. In fact, past results, such as those by Tsochantaridis et al. [15], suggest an O(1/ε²) rate, which would make the application infeasible in practice. We use a duality argument similar to those put forward in [11, 15], both of which share key techniques with [17].

The crux of our proof lies in showing that ε_t − ε_{t+1} ≥ J_{t+1}(w_{t+1}) − J_t(w_t) (see Lemma 1) is sufficiently bounded away from 0. In other words, since ε_t bounds the distance from optimality, at every step Algorithm 1 makes sufficient progress towards the optimum. Towards this end, we first observe that by strong duality the values of the primal and dual problems (5) and (6) are equal at optimality. Hence, any progress in J_{t+1} can be computed in the dual.

Next, we observe that the solution of the dual problem (6) at iteration t, denoted by α_t, forms a feasible set of parameters for the dual problem (6) at iteration t + 1 by means of the parameterization (α_t, 0), i.e. by padding α_t with a 0. The value of the objective function in this case equals J_t(w_t).

To obtain a lower bound on the improvement due to J_{t+1}(w_{t+1}) we perform a line search along ((1 − η)α_t, η) in (6). The constraint η ∈ [0, 1] ensures dual feasibility. We will bound this improvement in terms of γ_t. Note that, in general, solving the dual problem (6) results in an increase which is larger than that obtained via the line search; the line search is employed in the analysis only for analytic tractability.

We aim to lower-bound ε_t − ε_{t+1} in terms of ε_t and to solve the resulting difference equation. Depending on J(w) we will be able to prove two different convergence results:

(a) For regularizers Ω(w) for which ‖∂²_μ Ω*(μ)‖ ≤ H* we first experience a regime of progress linear in ε_t, followed by a slowdown to improvements which are quadratic in ε_t.

(b) Under the above conditions, if furthermore ‖∂²_w J(w)‖ ≤ H, i.e. the Hessian of J is bounded, we have linear convergence throughout.

We first derive lower bounds on the improvement J_{t+1}(w_{t+1}) − J_t(w_t), then show that for (b) the bounds are better. Finally we prove the convergence rates by solving the difference equation in ε_t.
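To preview the behavior this produces, the toy iteration below (ours; the threshold and constant c are generic placeholders for the precise quantities derived in the next theorem) counts the steps needed when the error first halves and then obeys a quadratic-decrease recurrence:

```python
def steps_to_precision(eps0, target, threshold=1.0, c=0.1):
    # Halve eps while it is large (linear regime), then apply the
    # quadratic-regime recurrence eps <- eps - c * eps**2.
    eps, t = eps0, 0
    while eps > target:
        eps = eps / 2 if eps > threshold else eps - c * eps * eps
        t += 1
    return t

# The halving phase costs O(log eps0) steps; the quadratic phase needs about
# 1/(c * target) further steps, i.e. O(1/eps) overall.
print(steps_to_precision(1e3, 1e-3))   # roughly 10**4
```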
This reasoning leads to the following theorem (for details of the reasoning see the appendix):

Theorem 4 Assume that ‖∂_w R_emp(w)‖ ≤ G for all w ∈ W, where W is some domain of interest containing all w_{t'} for t' ≤ t. Also assume that Ω* has bounded curvature, i.e. ‖∂²_μ Ω*(μ)‖ ≤ H* for all μ ∈ {−λ⁻¹Āᾱ where ᾱ ≥ 0 and ‖ᾱ‖₁ ≤ 1}. In this case we have

  ε_t − ε_{t+1} ≥ (γ_t/2) min(1, λγ_t/(4G²H*)) ≥ (ε_t/2) min(1, λε_t/(4G²H*)).   (11)

Furthermore, if ‖∂²_w J(w)‖ ≤ H, then we have

  ε_t − ε_{t+1} ≥ ε_t/2 if ε_t > 4G²H*/λ,
  ε_t − ε_{t+1} ≥ λ/(8H*) if 4G²H*/λ ≥ ε_t ≥ H/2,
  ε_t − ε_{t+1} ≥ λε_t/(4H*H) otherwise.   (12)

Note that the error keeps halving initially and settles for a somewhat slower rate of convergence after that, whenever the Hessian of the overall risk is bounded from above. The reason for the difference in the convergence bounds for differentiable and non-differentiable losses is that in the former case the gradient of the risk converges to 0 as we approach optimality, whereas in the latter case no such guarantee holds (e.g. when minimizing |x| the (sub)gradient does not vanish at the optimum).

Two facts are worth noting: a) The duals of many regularizers, e.g. the squared norm, the squared l_p norm, and the entropic regularizer, have bounded second derivative; see e.g. [11] for a discussion and details. Thus our condition ‖∂²_μ Ω*(μ)‖ ≤ H* is not unreasonable. b) Since the improvements decrease with the size of ε_t, we may replace γ_t by ε_t in both bounds and conditions without any ill effect (the bound only gets worse).

Applying the previous result we obtain a convergence theorem for bundle methods.

Theorem 5 Assume that J(w) ≥ 0 for all w. Under the assumptions of Theorem 4 we can give the following convergence guarantee for Algorithm 1. For any ε < 4G²H*/λ the algorithm converges to the desired precision after

  n ≤ log₂ (λJ(0)/(G²H*)) + 8G²H*/(λε) − 4   (13)

steps. If furthermore the Hessian of J(w) is bounded, convergence to any ε ≤ H/2 takes at most the following number of steps:

  n ≤ log₂ (λJ(0)/(4G²H*)) + max(0, 4λ⁻²[8G²(H*)² − λH*H]) + (4H*H/λ) log(H/(2ε)).   (14)

Several observations are in order: firstly, the number of iterations only depends logarithmically on how far the initial value J(0) is away from the optimal solution. Compare this to the result of Tsochantaridis et al. [15], where the number of iterations is linear in J(0). Secondly, we have an O(1/ε) dependence of the number of iterations in the non-differentiable case. This matches the rate of Shalev-Shwartz et al. [12]. In addition, the convergence is O(log(1/ε)) for continuously differentiable problems.

Note that whenever R_emp(w) is the average over many piecewise linear functions, R_emp(w) behaves essentially like a function with bounded Hessian as long as we are taking steps large enough not to "notice" that the term is actually nonsmooth.

Remark 6 For Ω(w) = ½‖w‖₂² the dual Hessian is exactly H* = 1. Moreover, H ≥ λ, since ∂²_w J(w) = λ + ∂²_w R_emp(w) and R_emp is convex.

Effectively the rate of convergence of the algorithm is governed by upper bounds on the primal and dual curvature of the objective function. This acts like a condition number of the problem: for Ω(w) = ½ w⊤Qw the dual is Ω*(z) = ½ z⊤Q⁻¹z, hence the largest eigenvalues of Q and Q⁻¹ have a significant influence on convergence. In terms of λ, the number of iterations needed for convergence is O(λ⁻¹).
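As a quick numerical sanity check of this dual pairing (a standalone snippet of ours, with arbitrary test data):

```python
import numpy as np
from scipy.optimize import minimize

# Verify the Fenchel-Legendre pair Omega(w) = 0.5 w^T Q w and
# Omega*(z) = 0.5 z^T Q^{-1} z for a random positive definite Q.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
Q = M @ M.T + np.eye(3)
z = rng.standard_normal(3)

neg_obj = lambda w: -(w @ z - 0.5 * w @ (Q @ w))
sup_val = -minimize(neg_obj, np.zeros(3)).fun    # sup_w <w, z> - Omega(w)
closed_form = 0.5 * z @ np.linalg.solve(Q, z)    # 0.5 z^T Q^{-1} z
print(abs(sup_val - closed_form) < 1e-5)         # True, up to solver tolerance
```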
In practice the iteration count does increase with λ⁻¹, albeit not as badly as predicted. This is likely due to the fact that the empirical risk R_emp(w) is typically rather smooth and has a certain inherent curvature which acts as a natural regularizer, in addition to the regularization afforded by λΩ(w).

5 A Linesearch Variant

The convergence analysis in Theorem 4 relied on a one-dimensional line search. Algorithm 1, however, solves a more complex quadratic program at every step. Since even the simple line-search updates promise good rates of convergence, it is tempting to replace the corresponding step in the bundle update. This can lead to considerable savings, in particular for smaller problems, where the time spent in the quadratic programming solver is a substantial fraction of the total runtime.

To keep matters simple, we only consider quadratic regularization Ω(w) := ½‖w‖². Note that J*_{t+1}(η) := J*_{t+1}(((1 − η)α_t, η)) is a quadratic function in η, regardless of the choice of R_emp(w). Hence a line search only needs to determine the first and second derivatives, as done in the proof of Theorem 4. It follows immediately from Lemma 8 that ∂_η J*_{t+1}(0) = γ_t, and moreover from Lemma 9 that ∂²_η J*_{t+1}(0) = −λ⁻¹‖∂_w J(w_t)‖² = −λ⁻¹‖λw_t + a_{t+1}‖². Hence the optimal value of η is given by

  η = min(1, λγ_t/‖λw_t + a_{t+1}‖²).   (15)

This means that we may update w_{t+1} = (1 − η)w_t − (η/λ)a_{t+1}. In other words, we need not store past gradients for the update. To obtain γ_t, note that we compute R_emp(w_t) as part of the Taylor approximation step anyway, and that at the dual solution R_t(w_t) = α_t⊤(A⊤w_t + b) = −λ‖w_t‖² + α_t⊤b, since Aα_t = −λw_t. Hence the only quantity we need to cache is the scalar r_t := α_t⊤b, which satisfies the same convex update as α_t, and γ_t = R_emp(w_t) + λ‖w_t‖² − r_t can be computed efficiently. Experiments show that this simplified algorithm has essentially the same convergence properties.
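A compact sketch of this variant (ours, mirroring Algorithm 2 in the appendix; remp and grad_remp are assumed to evaluate R_emp and a (sub)gradient as before):

```python
import numpy as np

def bundle_linesearch(remp, grad_remp, dim, lam, eps=1e-3, max_iter=1000):
    # No QP and no stored gradients: only w and the scalar r_t = alpha^T b
    # are carried between iterations (Omega(w) = 0.5 ||w||^2 assumed).
    w, r = np.zeros(dim), 0.0
    for t in range(max_iter):
        a = grad_remp(w)                        # a_{t+1}, Eq. (3)
        risk = remp(w)
        b = risk - a @ w                        # b_{t+1}, Eq. (3)
        gamma = risk + lam * (w @ w) - r        # gamma_t, as in Algorithm 2
        if gamma <= eps:
            break
        g = lam * w + a                         # gradient of J at w_t
        eta = min(1.0, lam * gamma / (g @ g))   # optimal step size, Eq. (15)
        w = (1 - eta) * w - (eta / lam) * a     # w_{t+1}
        r = (1 - eta) * r + eta * b             # r_{t+1}
    return w
```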
6 Experiments

In this section we present experimental results that demonstrate the merits of our algorithm and its analysis. Due to space constraints, we report results of experiments on the Astro-Physics (astro-ph) and Reuters-CCAT (reuters-ccat) datasets [5, 12] in the main body of the paper and relegate results on other large datasets to the appendix. For a fair comparison with existing solvers we use the quadratic regularizer Ω(w) := ½‖w‖² and the binary hinge loss.

In our first experiment, we address the rate of convergence and its dependence on the value of λ. In Figure 2 we plot γ_t as a function of iterations for various values of λ, using the QP solver at every iteration to solve the dual problem (6) to optimality. Initially, we observe super-linear convergence; this is consistent with our analysis. Surprisingly, even though theory predicts a sub-linear speed of convergence for non-differentiable losses like the binary hinge loss (see (11)), our solver exhibits the linear rate of convergence predicted only for differentiable functions (see (12)). We conjecture that the average over many piecewise linear functions, R_emp(w), behaves essentially like a smooth function. As predicted, the convergence speed is inversely proportional to the value of λ.

[Figure 2: γ_t as a function of the number of iterations; note the logarithmic scale in γ_t. Left: astro-ph; Right: reuters-ccat.]

In our second experiment, we compare the convergence speed of two variants of the bundle method, namely with a QP solver in the inner loop (which essentially boils down to SVMPerf) and the line search variant described in Section 5. We contrast these solvers with Pegasos [12] in the batch setting. Following [5] we set λ = 10⁻⁴ for reuters-ccat and λ = 2·10⁻⁴ for astro-ph.

Figure 3 depicts the evolution of the primal objective function value as a function of both CPU time and the number of iterations. Following Shalev-Shwartz et al. [12] we investigate the time required by the various solvers to reduce the objective value to within 0.001 of the optimum; this is depicted as a black horizontal line in our plots. As can be seen, Pegasos converges to this region quickly. Nevertheless, both variants of the bundle method converge to this value even faster (line search is slightly slower than Pegasos on astro-ph, but this is not always the case on many of the other large datasets we tested). We also note that both line search and Pegasos converge to within 0.001 of the optimum value rather quickly, but require a large number of iterations to converge to the optimum solution.

[Figure 3: Top: objective function value as a function of time. Bottom: objective function value as a function of iterations. Left: astro-ph; Right: reuters-ccat. The black line indicates the final value of the objective function + 0.001.]

7 Related Research

Our work is closely related to Shalev-Shwartz and Singer [11], who prove mistake bounds for online algorithms by lower-bounding the progress in the dual. Although not stated explicitly, essentially the same technique of lower-bounding the dual improvement was used by Tsochantaridis et al. [15] to show polynomial time convergence of the SVMStruct algorithm. The main difference, however, is that Tsochantaridis et al. [15] only work with a quadratic objective function, while the framework proposed by Shalev-Shwartz and Singer [11] can handle arbitrary convex functions. In both cases, a weaker analysis led to O(1/ε²) rates of convergence for nonsmooth loss functions. In contrast, our results establish an O(1/ε) rate for nonsmooth loss functions and an O(log(1/ε)) rate for smooth loss functions under mild technical assumptions.

Another related work is SVMPerf [5], which solves the SVM estimation problem in linear time. SVMPerf finds a solution of accuracy ε in O(md/(λε²)) time for m training patterns x_i ∈ R^d. This bound was improved by Shalev-Shwartz et al. [12] to O(1/(λδε)) for obtaining an accuracy of ε with confidence 1 − δ. Their algorithm, Pegasos, essentially performs stochastic (sub)gradient descent, but projects the solution back onto the L2 ball of radius 1/√λ. As our experiments show, however, performing an exact line search in the dual leads to a faster decrease in the value of the primal objective. Note that Pegasos can also be used in an online setting. This, however, only applies whenever the empirical risk decomposes into individual loss terms (e.g. it is not applicable to multivariate performance scores).

The third related strand of research considers gradient descent in the primal with a line search to choose the optimal step size; see e.g. [2, Section 9.3.1]. Under assumptions of smoothness and strong convexity (that is, the objective function can be upper and lower bounded by quadratic functions) it can be shown that gradient descent with line search converges to an accuracy of ε in O(log(1/ε)) steps. The problem here is the line search in the primal: evaluating the regularized risk functional might be as expensive as computing its gradient, rendering a line search in the primal unattractive. On the other hand, the dual objective is relatively simple to evaluate, thus making the line search in the dual, as performed by our algorithm, computationally feasible.
Finally, we would like to point out connections to subgradient methods [7]. These algorithms are designed for nonsmooth functions and essentially choose an arbitrary element of the subgradient set to perform a gradient-descent-like update. Let ‖∂_w J(w)‖ ≤ G, and let B(w*, r) denote a ball of radius r centered at the minimizer of J(w). By applying the analysis of Nedich and Bertsekas [7] to the regularized risk minimization problem with Ω(w) := ½‖w‖², Ratliff et al. [9] showed that subgradient descent with a fixed, but sufficiently small, stepsize converges linearly to B(w*, G/λ).

References

[1] Y. Altun, A. J. Smola, and T. Hofmann. Exponential families for conditional random fields. In UAI, pages 2–9, 2004.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. Springer, 1993.
[4] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377–384, 2005.
[5] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.
[6] S.-I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using L1-regularization. In Advances in Neural Information Processing Systems, pages 817–824, 2007.
[7] A. Nedich and D. P. Bertsekas. Convergence rate of incremental subgradient algorithms. In S. Uryasev and P. M. Pardalos, editors, Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.
[8] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
[9] N. Ratliff, J. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In Proc. of AIStats, 2007.
[10] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[11] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006.
[12] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.
[13] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, pages 25–32, 2004.
[14] C. H. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In KDD, 2007.
[15] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
[16] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[17] T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Trans. Information Theory, 49(3):682–691, 2003.

A Proofs

Proof of Lemma 1 Since J_{t'}(w) ≤ J_t(w) ≤ J(w) for all t' ≤ t, the same ordering applies to their respective minima. Moreover, since w* minimizes J(w) we have J(w*) ≤ J(w_{t'}). Since Taylor expansions are exact at the point of expansion, J(w_{t'}) = J_{t'+1}(w_{t'}). The first inequality follows from the definition of ε_t and the fact that J_t(w_t) is monotonically increasing. Finally, since J_{t'+1}(w_{t'}) = J(w_{t'}), it is easy to see that ε_t = min_{t'≤t} J(w_{t'}) − J_t(w_t) ≥ min_{t'≤t} J(w_{t'}) − J(w*).

Proof of Theorem 2 We rewrite (5) as a constrained optimization problem:

  minimize_{w,ξ} λΩ(w) + ξ subject to ξ ≥ ⟨a_{t'}, w⟩ + b_{t'} for all t' ≤ t.

The corresponding Lagrange function can be written as

  L(w, ξ, α) = λΩ(w) + ξ − α⊤(ξe − A⊤w − b) with α ≥ 0,   (16)

where e is the vector of all ones and α ≥ 0 means that each component of α is non-negative. Taking derivatives with respect to ξ yields 1 − α⊤e = 0.
Moreover, minimization of L with respect to w implies λ∂_wΩ(w) = −Aα; that is, w is found by solving maximize_w ⟨w, −Aα⟩ − λΩ(w) = λΩ*(−λ⁻¹Aα). Plugging both terms back into (16) allows us to eliminate the primal variables ξ and w, which yields the dual problem (6).

Proof of Theorem 4 To show Theorem 4 we need several technical intermediate steps. The first allows us to bound the maximum value of a concave function provided that we know its first derivative and a bound on its second derivative.

Lemma 7 Denote by f : [0, 1] → R a concave function with f(0) = 0, f'(0) = l ≥ 0, and |f''(x)| ≤ H for all x ∈ [0, 1]. Then max_{x∈[0,1]} f(x) ≥ (l/2) min(1, l/H).

Proof We first observe that f(x) ≥ lx − (H/2)x². The unconstrained maximum of this lower bound is attained at x = l/H, leading to a maximum value of l²/(2H). If l > H we pick x = 1, which yields l − H/2, and the latter is bounded from below by l/2. Taking the minimum over both maxima proves the claim.

To apply the above result we need to compute the gradient and Hessian of J*_{t+1}(ᾱ) with respect to the line search direction ((1 − η)α_t, η). The following lemma takes care of the gradient:

Lemma 8 Denote by α_t the solution of (6) at time instance t. Moreover, denote by Ā = [A, a_{t+1}] and b̄ = [b, b_{t+1}] the extended matrix and vector needed to define the dual problem for step t + 1, and let ᾱ ∈ R^{t+1}. Then the following holds:

  ∂_ᾱ J*_{t+1}([α_t, 0]) = Ā⊤w_t + b̄   (17)

and

  [−α_t, 1]⊤(Ā⊤w_t + b̄) = J_{t+1}(w_t) − J_t(w_t) = γ_t.   (18)

Proof By the dual connection, ∂Ω*(−λ⁻¹Āᾱ) = w_t for ᾱ = [α_t, 0]. Hence ∂_ᾱ[−λΩ*(−λ⁻¹Āᾱ) + ᾱ⊤b̄] = Ā⊤w_t + b̄ at ᾱ = [α_t, 0], which proves the first claim. To see the second part, we eliminate ξ from the Lagrangian (16) and write the partial Lagrangian

  L(w, α) = λΩ(w) + α⊤(A⊤w + b) with α ≥ 0.   (19)

The result follows by noting that at optimality L(w_t, α_t) = J_t(w_t), while J_{t+1}(w_t) = λΩ(w_t) + ⟨a_{t+1}, w_t⟩ + b_{t+1}. Consequently

  J_{t+1}(w_t) − J_t(w_t) = λΩ(w_t) + ⟨a_{t+1}, w_t⟩ + b_{t+1} − λΩ(w_t) − α_t⊤(A⊤w_t + b).   (20)

Rearranging terms proves the claim.

To apply Lemma 7 we also need to bound the second derivative.

Lemma 9 Under the assumptions of Lemma 8 we have

  ∂²_ᾱ J*_{t+1}(ᾱ) = −λ⁻¹ Ā⊤ ∂²_μΩ*(−λ⁻¹Āᾱ) Ā;   (21)

moreover Ā[−α_t, 1] = ∂_w J(w_t).

Proof The first equality follows by the chain rule. Next note that λ∂_wΩ(w_t) = −Aα_t by the dual connection. Since a_{t+1} = ∂_w R_emp(w_t), the claim follows from ∂_w J(w) = ∂_w R_emp(w) + λ∂_wΩ(w).

This result allows us to express the second derivative of the dual objective function (6) in terms of the gradient of the risk functional. The idea is that as we approach optimality the second derivative will vanish. We will use this fact to argue that for continuously differentiable losses R_emp(w) we enjoy linear convergence throughout.

Proof [Theorem 4] We overload the notation for J*_{t+1} by defining the following one-dimensional concave function:

  J*_{t+1}(η) := J*_{t+1}(((1 − η)α_t, η)) = −λΩ*(−λ⁻¹Ā[(1 − η)α_t, η]) + [(1 − η)α_t, η]⊤b̄.   (22)

Clearly, J*_{t+1}(0) = J_t(w_t). Furthermore, by (17), (18), and (21) it follows that

  ∂_η J*_{t+1}(η)|_{η=0} = [−α_t, 1]⊤ ∂_ᾱ J*_{t+1}([α_t, 0]) = γ_t   (23)

and

  ∂²_η J*_{t+1}(η) = −λ⁻¹ [−α_t, 1]⊤ Ā⊤ ∂²_μΩ*(−λ⁻¹Ā[(1 − η)α_t, η]) Ā [−α_t, 1]   (24)
                 = −λ⁻¹ [∂_w J(w_t)]⊤ ∂²_μΩ*(−λ⁻¹Ā[(1 − η)α_t, η]) [∂_w J(w_t)] =: q.   (25)

By our assumption ‖∂²_μΩ*(μ)‖ ≤ H* we have |q| ≤ H*‖∂_w J(w_t)‖²/λ. Next we need to bound the gradient of J. For this purpose note that λ∂_wΩ(w_t) = −Aα_t and, moreover, that ‖α_t‖₁ = 1. This implies that λ∂_wΩ(w_t) lies in the convex hull of the past gradients a_{t'}. By our assumption ‖∂_w R_emp(w)‖ ≤ G it follows that ‖λ∂_wΩ(w_t)‖ ≤ G. We conclude that ‖∂_w J(w_t)‖² ≤ 4G² and |q| ≤ 4G²H*/λ. Invoking Lemma 7 on J*_{t+1}(η) − J_t(w_t) proves the first part of (11).
Since γ_t ≥ ε_t, the second inequality of (11) follows. For the second claim, note that (11) already yields the ε_t/2 decrease when ε_t > 4G²H*/λ. To show the other parts we need to show that the gradient of the regularized risk vanishes as we converge to the optimal solution. Towards this end, we apply Lemma 7 in the primal, which allows us to bound ‖∂_w J(w_t)‖ in terms of γ_t. Plugging the first and second derivatives of J(w_t) into Lemma 7 we obtain

  γ_t ≥ ½ ‖∂_w J(w_t)‖ min(1, ‖∂_w J(w_t)‖/H).

If ‖∂_w J(w_t)‖ > H, then γ_t ≥ ½‖∂_w J(w_t)‖, which in turn yields |q| ≤ 4γ_t²H*/λ. Plugging this into Lemma 7 yields a lower bound of λ/(8H*) on the improvement. Finally, for ‖∂_w J(w_t)‖ ≤ H we have γ_t ≥ ‖∂_w J(w_t)‖²/(2H), which implies |q| ≤ 2H*Hγ_t/λ. Plugging this into Lemma 7 yields an improvement of λγ_t/(4H*H). Since the two cases cover the remaining range of convergence, the minimum min(λ/(8H*), λγ_t/(4H*H)) provides a lower bound for the improvement; the crossover point between both terms occurs at γ_t = H/2. Rearranging the conditions leads to the (pessimistic) improvement guarantees of the second claim.

Proof of Theorem 5 For any ε_t > 4G²H*/λ it follows from (11) that ε_{t+1} ≤ ε_t/2. Moreover, ε_0 ≤ J(0), since we know that J is nonnegative. Hence we need at most log₂[λJ(0)/(4G²H*)] steps to achieve this level of precision. Subsequently we need to solve the difference equation

  ε_{t+1} − ε_t = −λε_t²/(8G²H*).   (26)

Since this is monotonically decreasing, we can upper bound the solution by solving the differential equation ε'(t) = −λε(t)²/(8G²H*) with the boundary condition ε(0) = 4G²H*/λ. This in turn yields ε(t) = 8G²H*/(λ(t + 2)), and hence t ≤ 8G²H*/(λε) − 2. For a given ε we will therefore need at most 8G²H*/(λε) − 2 more steps to converge. This proves the first claim.

To analyze convergence in the second case we need to study two additional phases. For ε_t ∈ [H/2, 4G²H*/λ] we see constant progress, hence it takes us at most 4λ⁻²[8G²(H*)² − λH*H] steps to cover this interval. Finally, in the third phase we have ε_{t+1} ≤ ε_t[1 − λ/(4H*H)]. Starting from ε_t = H/2 we need log₂[2ε/H]/log₂[1 − λ/(4H*H)] steps to converge to precision ε. Expanding the logarithm in the denominator close to 1 proves the claim.

B The Linesearch Algorithm

We describe the line search algorithm we used in Algorithm 2.

Algorithm 2 Line Search(ε)
  Initialize t = 0, w_0 = 0, r_0 = 0
  repeat
    Compute gradient a_{t+1} and offset b_{t+1}
    Compute γ_t = R_emp(w_t) + λ‖w_t‖² − r_t
    Compute η using Eq. (15)
    Update w_{t+1} ← (1 − η)w_t − (η/λ)a_{t+1}
    Update r_{t+1} ← (1 − η)r_t + ηb_{t+1}
    Increment t ← t + 1
  until γ_t ≤ ε

C Experiments

C.1 Datasets

We present results on the datasets listed in Table 1. Our choice is largely influenced by the choices in [4, 5] concerning classification tasks, for the purpose of reproducing the results reported in their work.

Table 1: Datasets and their properties.

  dataset            abbr.       #examples  dimension  density %
  Adult9             adult9          32561        123      11.28
  Real & Simulated   real-sim        57763      20958       0.25
  Autos & Aviation   aut-avn         56862      20707       0.25
  Web8               web8            33546        300       4.24
  KDDCup-99          kdd99         3398431        127      12.86
  Reuters CCAT       ccat           804414      47236       0.16
  Reuters C11        c11            804414      47236       0.16
  Arxiv astro-ph     astro-ph        62369      99757       0.08
  Covertype          covertype      581012         54      22.22

C.2 Convergence

[Figure 4: Convergence of the bundle method (semilog-y) with a QP inner loop; γ_t as a function of the number of iterations for various datasets. Note that there is a potential numerical error in the experiments reported for the news20 dataset.]

C.3 Comparison with Pegasos

[Figure 5: Objective function value with respect to time on adult9 with various values of λ.]
[Figure 6: Objective function value with respect to time on real-sim with various values of λ.]
[Figure 7: Objective function value with respect to time on aut-avn with various values of λ.]
[Figure 8: Objective function value with respect to time on web8 with various values of λ.]
[Figure 9: Objective function value with respect to time on kdd99 with various values of λ.]
[Figure 10: Objective function value with respect to time on covertype with various values of λ.]
[Figure 11: Objective function value with respect to number of iterations on adult9 with various values of λ.]
[Figure 12: Objective function value with respect to number of iterations on news20 with various values of λ.]
[Figure 13: Objective function value with respect to number of iterations on astro-ph with various values of λ.]