Proximal regularization for online and batch learning

Chuong B. Do (CHUONGDO@CS.STANFORD.EDU)
Quoc V. Le (QUOCLE@CS.STANFORD.EDU)
Computer Science Department, Stanford University, Stanford, CA 94305, USA

Chuan-Sheng Foo (CSFOO@CS.STANFORD.EDU)
Institute for Infocomm Research, Singapore 138632, Singapore

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Many learning algorithms rely on the curvature (in particular, the strong convexity) of regularized objective functions to provide good theoretical performance guarantees. In practice, the choice of regularization penalty that gives the best testing set performance may result in objective functions with little or even no curvature. In these cases, algorithms designed specifically for regularized objectives often either fail completely or require some modification that involves a substantial compromise in performance. We present new online and batch algorithms for training a variety of supervised learning models (such as SVMs, logistic regression, structured prediction models, and CRFs) under conditions where the optimal choice of regularization parameter results in functions with low curvature. We employ a technique called proximal regularization, in which we solve the original learning problem via a sequence of modified optimization tasks whose objectives are chosen to have greater curvature than the original problem. Theoretically, our algorithms achieve low regret bounds in the online setting and fast convergence in the batch setting. Experimentally, our algorithms improve upon state-of-the-art techniques, including Pegasos and bundle methods, on medium and large-scale SVM and structured learning tasks.

1. Introduction

Consider the task of training a linear SVM:

$$\min_{w \in \mathbb{R}^n} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max\bigl(0,\, 1 - y^{(i)} w^\top x^{(i)}\bigr). \qquad (1)$$

In this optimization problem, the L2 regularization penalty plays two important roles: not only does the quadratic term prevent overfitting to the empirical loss on the training data, but in fact, it also controls a measure of curvature of the objective function, known as its strong convexity.
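For concreteness, here is a minimal numpy sketch (ours, not part of the original paper) of evaluating the objective in (1); `X`, `y`, and `lam` are hypothetical names for the m-by-n data matrix, the +/-1 label vector, and the regularization parameter lambda:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized hinge loss of Eq. (1):
    (lam/2)*||w||^2 + (1/m) * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)                    # y^(i) w^T x^(i), one per example
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example hinge loss
    return 0.5 * lam * (w @ w) + hinge.mean()

# Example usage on random data (labels in {-1, +1}):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
print(svm_objective(np.zeros(5), X, y, lam=1e-4))  # w = 0 gives hinge loss 1.0
```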
In the past several years, a number of approaches have been proposed for training linear SVMs, ranging from batch methods such as the cutting plane algorithm (Joachims, 2006) to online methods such as the PEGASOS subgradient algorithm (Shalev-Shwartz et al., 2007). In essentially all of these algorithms (for which the relevant bounds are known), theory indicates that the number of iterations required to obtain an $\epsilon$-accurate solution is roughly $O(1/\lambda\epsilon)$. For example, cutting-plane methods require $O(1/\lambda\epsilon)$ passes through the training set (Smola et al., 2008), whereas PEGASOS must process $\tilde{O}(1/\lambda\epsilon)$ training examples to ensure $\epsilon$-accuracy in expectation. In both cases, the theoretical bounds depend largely on the chosen value of the regularization hyperparameter $\lambda$. For many real-world problems, however, the ideal choice of $\lambda$ can be quite small. When this is the case, state-of-the-art cutting plane and subgradient algorithms give unacceptably slow convergence, both in theory and in practice.

Recently, (Bartlett et al., 2008) described an adaptive online gradient descent algorithm based on the simple intuition that an objective function with low curvature can be stabilized by adding extra terms whose purpose is to increase curvature. In this paper, we extend these ideas to construct new online and batch algorithms suitable for training a wide variety of supervised learning models. Specifically, we design a sequence of optimization tasks, each of which is a variant of the original problem modified to include an extra proximal regularization term. We show how to choose these proximal terms in an adaptive fashion such that the resulting sequence of minimizers (or approximate minimizers) converges to the solution of the original optimization problem. Finally, we describe some simple heuristic modifications to these algorithms that retain all optimality guarantees while resulting in considerable performance improvements in practice. In the online setting, our analysis leads naturally to a stochastic subgradient-style algorithm along the lines of PEGASOS. In the batch setting, our analysis yields an improved cutting-plane/bundle method. Both in theory and in experiments, our methods exhibit performance comparable to existing methods for large $\lambda$ (high curvature) and dramatic improvements for small $\lambda$ (low curvature).[1]

[1] For an extended version of this paper with proofs, see http://ai.stanford.edu/~chuongdo/papers/proximal proofs.pdf

2. Preliminaries

Let $\|\cdot\|$ denote the Euclidean norm, $\|x\| := \sqrt{x^\top x}$. Given a point $x \in \mathbb{R}^n$ and a compact (i.e., closed, bounded) subset $S \subseteq \mathbb{R}^n$, let $\Pi_S[x] := \arg\min_{y \in S} \|x - y\|$ denote the Euclidean projection of $x$ onto $S$. For notational convenience, we use the shorthand $c_{a:b} := \sum_{i=a}^{b} c_i$ for any sequence of scalars $c_a, c_{a+1}, \ldots, c_{b-1}, c_b \in \mathbb{R}$. For a vector $x \in \mathbb{R}^n$ and $c \in \mathbb{R}$, let $[x; c] \in \mathbb{R}^{n+1}$ denote the concatenation of $c$ onto the end of $x$. For $x, y \in \mathbb{R}^n$, let $x \preceq y$ denote the component-wise inequalities $x_i \leq y_i$ for all $i$. Let $\mathbf{0}$ and $\mathbf{1}$ denote the vectors of all 0's and all 1's, respectively.

A function $f : \mathbb{R}^n \to \mathbb{R}$ is said to be $\sigma$-strongly convex if for any $x, y \in \mathbb{R}^n$ and any subgradient $g$ belonging to the subdifferential[2] $\partial f(x)$ of $f$ at $x$, $f(y) \geq f(x) + g^\top (y - x) + \frac{\sigma}{2}\|y - x\|^2$. Here, we consider learning problems associated with the optimization of $\sigma$-strongly convex functions in both the online and batch settings.

[2] The subdifferential $\partial f(x)$ of a convex function $f : \mathbb{R}^n \to \mathbb{R}$ at $x$ is the set of all vectors $g$ such that $f(y) \geq f(x) + g^\top(y - x)$ for all $y \in \mathbb{R}^n$; elements belonging to the subdifferential are known as subgradients.

In the online setting, we base our analyses on the concept of a convex repeated game. A convex repeated game is a two-player game consisting of $T$ rounds. During round $t$, the first player proposes a vector $w_t$ belonging to some compact convex set $S$, the second player responds by choosing a $\sigma_t$-strongly convex function of the form $f_t(w) := \frac{\sigma_t}{2}\|w\|^2 + \ell_t(w)$ for some convex function $\ell_t$, and then the first player suffers loss $f_t(w_t)$. We assume that $\sigma_t \geq 0$, and that the same set $S$ is used in each round; for simplicity, we assume throughout that $S$ is an origin-centered closed ball of radius $R$. Here, we seek an algorithm to minimize the first player's regret, $\sum_{t=1}^T f_t(w_t) - \min_{u \in S} \sum_{t=1}^T f_t(u)$, i.e., the excess loss suffered compared to the minimum loss possible for any fixed choice of $w \in S$.

In the batch setting, we are given a $\sigma$-strongly convex function of the form $f(w) = \frac{\sigma}{2}\|w\|^2 + \ell(w)$, where $\ell$ is again a convex function. Here, we will assume $\sigma > 0$ in order to ensure that the optimization problem is well-posed. If $w^\star = \arg\min_{w \in \mathbb{R}^n} f(w)$, then our goal will be to find an approximate minimizer $\hat{w}$ such that $f(\hat{w}) - f(w^\star) \leq \epsilon$.
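Because $S$ is assumed throughout to be an origin-centered ball of radius $R$, the projection $\Pi_S$ reduces to a rescaling; a minimal sketch (ours) that the later sketches reuse:

```python
import numpy as np

def project_onto_ball(x, R):
    """Euclidean projection Pi_S[x] for the special case where S is the
    origin-centered closed ball of radius R: points inside S are left
    unchanged, and points outside are rescaled back to the boundary."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x
```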
3. Online proximal learning

As a starting point, we recall the projected subgradient algorithm for strongly convex repeated games proposed by (Hazan et al., 2007) and later generalized by (Bartlett et al., 2008), as stated in Algorithm 1.

Algorithm 1: Projected subgradient descent
  Initialize $w_1 \leftarrow \mathbf{0}$.
  for $t \leftarrow 1, \ldots, T$ do
    Receive a $\sigma_t$-strongly convex function $f_t$.
    Choose $g_t \in \partial f_t(w_t)$.
    Set $\eta_t \leftarrow 1/\sigma_{1:t}$.
    Set $w_{t+1} \leftarrow \Pi_S[w_t - \eta_t g_t]$.
  end for
  return $w_{T+1}$.

In this algorithm, the first player updates his parameter vector in each round by taking a projected subgradient step, $w_{t+1} \leftarrow \Pi_S[w_t - \eta_t g_t]$. When the step size is $\eta_t = 1/\sigma_{1:t}$, we obtain the following regret bound (Bartlett et al., 2008):

Lemma 1. Suppose that $\sigma_t > 0$ and $\|g_t\| \leq G_t$ for $t = 1, \ldots, T$. Then, for any $u \in S$, Algorithm 1 satisfies

$$\sum_{t=1}^T \bigl(f_t(w_t) - f_t(u)\bigr) \leq \frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{\sigma_{1:t}}. \qquad (2)$$

When $\sigma_t = \sigma$ and $G_t = G$ in each round, the right-hand side of the inequality can be further upper-bounded by $\frac{G^2}{2\sigma}(1 + \log T)$. Algorithm 1, thus, is an example of an algorithm with logarithmic regret. When $\sigma$ is small, however, this guaranteed regret can still be large.
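A minimal sketch (ours) of Algorithm 1, assuming a caller-supplied oracle `subgrad(t, w)` that returns a subgradient of $f_t$ at $w$; all names here are our own:

```python
import numpy as np

def projected_subgradient_descent(subgrad, sigmas, R, n):
    """Sketch of Algorithm 1. sigmas[t] is the strong-convexity constant
    sigma_t of f_t (assumed > 0, as in Lemma 1). Uses the step size
    eta_t = 1/sigma_{1:t} and projects every iterate onto the radius-R
    ball S; returns w_{T+1}."""
    w = np.zeros(n)
    sigma_1t = 0.0
    for t, sigma_t in enumerate(sigmas):
        g = subgrad(t, w)
        sigma_1t += sigma_t            # running sum sigma_{1:t}
        w = w - g / sigma_1t           # eta_t = 1 / sigma_{1:t}
        norm = np.linalg.norm(w)
        if norm > R:                   # Pi_S: rescale back onto the ball
            w *= R / norm
    return w
```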
3.1. Proximal regret bound

Now, suppose we run Algorithm 1 on the sequence of modified functions

$$\tilde{f}_t(w) := f_t(w) + \frac{\lambda_t}{2}\|w - w_t\|^2 \qquad (3)$$

for some setting of constants $\lambda_1, \ldots, \lambda_T \in \mathbb{R}$. We refer to the additional quadratic term in each of our modified functions as a proximal regularization term. Whereas each $f_t$ is $\sigma_t$-strongly convex, each modified function $\tilde{f}_t$ is $(\sigma_t + \lambda_t)$-strongly convex. Also, since the gradient of the proximal regularization term is zero when evaluated at $w_t$, it follows immediately that $\partial \tilde{f}_t(w_t) = \partial f_t(w_t)$. Thus, the updates in the proximal regularization case differ from the nonproximal algorithm only in the choice of step sizes, since we can still use the same subgradients.

The idea of adding temporary regularization terms in order to achieve better bounds on the regret of a learning algorithm was first introduced in (Bartlett et al., 2008), who considered modified objective functions of the form

$$\tilde{f}_t(w) := f_t(w) + \frac{\lambda_t}{2}\|w\|^2. \qquad (4)$$

Unlike in the proximal case, $\partial \tilde{f}_t(w_t) \neq \partial f_t(w_t)$. In Section 5, we compare these two choices empirically.

To analyze the proximal regularization method, we apply Lemma 1 to the sequence of functions in (3) to obtain

Corollary 1. Define

$$R_T(\lambda_1, \ldots, \lambda_T) := \frac{1}{2} \sum_{t=1}^T \left( 4\lambda_t R^2 + \frac{G_t^2}{\sigma_{1:t} + \lambda_{1:t}} \right). \qquad (5)$$

For any fixed $\lambda_1, \ldots, \lambda_T \geq 0$, running Algorithm 1 on the sequence of functions $\tilde{f}_1, \ldots, \tilde{f}_T$ from (3) gives

$$\sum_{t=1}^T \bigl(f_t(w_t) - f_t(u)\bigr) \leq R_T(\lambda_1, \ldots, \lambda_T). \qquad (6)$$

Here, the proof depends on the fact that $\|w - w_t\| \leq 2R$ for any $w, w_t \in S$. The strength of the regret bound depends on the choice of constants $\lambda_1, \ldots, \lambda_T$. The key to the proximal regularization algorithm, then, is picking these constants so as to ensure that the regret is small.

3.2. Choosing proximal parameters

Suppose that the values $\sigma_t$ and $G_t$ for $t = 1, \ldots, T$ are determined independently of the choices made in the algorithm. We describe two approximate schemes for choosing the $\lambda_t$'s. The first scheme is a practical online balancing heuristic due to (Bartlett et al., 2008). The second scheme makes the additional assumption that the $\sigma_t$ and $G_t$ do not vary with $t$, but has the benefit of allowing us to choose the $\lambda_t$'s so that the regret bound is as tight as possible.

Strategy 1: Balancing heuristic. In the first approach, observe that the expression in (5) consists of two terms, one of which increases and one of which decreases as $\lambda_t$ increases. During the $t$th step of the algorithm, consider the choice of $\lambda_t \geq 0$ that ensures that the two terms are equal, i.e., $2\lambda_t R^2 = \frac{G_t^2}{2(\sigma_{1:t} + \lambda_{1:t})}$. This is a quadratic equation, with positive solution

$$\lambda_t = \frac{1}{2}\left( -\sigma_{1:t} - \lambda_{1:t-1} + \sqrt{(\sigma_{1:t} + \lambda_{1:t-1})^2 + \frac{G_t^2}{R^2}} \right).$$

In Algorithm 2, we provide pseudocode for the proximal projected subgradient descent algorithm using the balancing heuristic.

Algorithm 2: Proximal projected subgradient descent
  Initialize $w_1 \leftarrow \mathbf{0}$.
  for $t \leftarrow 1, \ldots, T$ do
    Receive a $\sigma_t$-strongly convex function $f_t$.
    Choose $g_t \in \partial f_t(w_t)$.
    Set $\lambda_t \leftarrow \frac{1}{2}\bigl( -\sigma_{1:t} - \lambda_{1:t-1} + \sqrt{(\sigma_{1:t} + \lambda_{1:t-1})^2 + G_t^2/R^2} \bigr)$.
    Set $\eta_t \leftarrow 1/(\sigma_{1:t} + \lambda_{1:t})$.
    Set $w_{t+1} \leftarrow \Pi_S[w_t - \eta_t g_t]$.
  end for
  return $w_{T+1}$.

Applying Lemma 3.1 from (Bartlett et al., 2008), we obtain the following bound:[3]

Theorem 1. The regret obtained by Algorithm 2 is at most twice that of the optimal offline choice of $\lambda_1, \ldots, \lambda_T$, i.e.,

$$\sum_{t=1}^T \bigl(f_t(w_t) - f_t(u)\bigr) \leq 2 \min_{\lambda_1, \ldots, \lambda_T} \frac{1}{2}\sum_{t=1}^T \left( 4\lambda_t R^2 + \frac{G_t^2}{\sigma_{1:t} + \lambda_{1:t}} \right).$$

[3] For comparison, (Bartlett et al., 2008) derived a bound of $\sum_{t=1}^T (f_t(w_t) - f_t(u)) \leq 2\min_{\lambda_1,\ldots,\lambda_T} \sum_{t=1}^T \bigl( \frac{3\lambda_t R^2}{2} + \frac{G_t^2}{\sigma_{1:t} + \lambda_{1:t}} \bigr)$ when using the modified functions in (4). These two expressions are not directly comparable, though as we show in Section 5, the proximal algorithm performs better in our experiments.

Strategy 2: Bound optimization. In the second approach, we bound the regret directly, via the following proposition:

Proposition 1. Let

$$(\lambda_1^\star, \ldots, \lambda_T^\star) = \arg\min_{\lambda_1, \ldots, \lambda_T \geq 0} R_T(\lambda_1, \ldots, \lambda_T). \qquad (7)$$

Then $\lambda_i^\star = 0$ for all $i \neq 1$.

The benefit of the above proposition is that it allows us to reduce an optimization over many variables to a much simpler convex optimization problem over just a single variable, $\lambda_1^\star$ (which we simply call $\lambda^\star$). If $\sigma_t = \sigma > 0$[4] and $G_t = G$, then we can upper bound the regret with a simple closed-form expression, parameterized by $\lambda^\star$:

Theorem 2. Under the above assumptions, let $R_T^\star$ denote the worst-case regret suffered by Algorithm 2. Then, for any $\lambda^\star > 0$, we have the upper bound $R_T^\star \leq B(\lambda^\star)$, where

$$B(\lambda^\star) := 4\lambda^\star R^2 + \frac{G^2}{\sigma}\left( \frac{1}{1 + \lambda^\star/\sigma} + \log\frac{T + \lambda^\star/\sigma}{1 + \lambda^\star/\sigma} \right).$$

[4] Note that if $\sigma = 0$, then the bound reduces to $2\lambda R^2 + \sum_{t=1}^T G_t^2/(2\lambda)$, whose minimum occurs at $\lambda = \sqrt{\sum_{t=1}^T G_t^2/(4R^2)}$.

Since the upper bound is a convex differentiable function of $\lambda^\star$ over the domain $\lambda^\star \geq 0$, one could optimize the bound directly using standard line search techniques. Alternatively, by substituting different values for $\lambda^\star$ into the expression above, we can obtain various upper bounds on the regret that Algorithm 2 will achieve. In particular,

Corollary 2. When $\lambda^\star = 0$, then $B(\lambda^\star) = \frac{G^2}{\sigma}(1 + \log T)$.

Corollary 3. When $\lambda^\star = \frac{G\sqrt{T}}{2R}$, then $\lim_{\sigma \to 0} B(\lambda^\star) = 4RG\sqrt{T}$.

The key intuition behind the efficiency of Algorithm 2 is that in some cases, one of these bounds may be better than the other. For example, when $RG$ is sufficiently small relative to $\frac{G^2}{\sigma}$, the seemingly inferior square-root bound can actually be better than the logarithmic regret bound for values of $T$ that are not too large. Regardless of the situation, Theorem 1 implies that Algorithm 2 achieves a total regret no worse than twice the best bound for any $\lambda^\star$.
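The positive root used by the balancing heuristic (Strategy 1) is a one-liner; a small helper (ours, with our own argument names) that the later sketches reuse:

```python
import math

def balancing_lambda(sigma_1t, lambda_1tm1, G_t, R):
    """Positive root of the balancing condition
    2*lam_t*R^2 = G_t^2 / (2*(sigma_{1:t} + lambda_{1:t-1} + lam_t)),
    i.e. the lam_t assigned in each round of Algorithm 2."""
    s = sigma_1t + lambda_1tm1
    return 0.5 * (-s + math.sqrt(s * s + (G_t / R) ** 2))
```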
3.3. Application: Linear SVMs

In this section, we consider the task of training a linear SVM. The approach we take here was inspired by the Pegasos algorithm (Shalev-Shwartz et al., 2007), currently regarded as one of the fastest methods for SVM training on large-scale datasets. At its core, the Pegasos algorithm is essentially a wrapper for Algorithm 1.

Given training inputs $\{x^{(i)}, y^{(i)}\}_{i=1}^m$, the Pegasos algorithm defines a sequence of functions $f_1, \ldots, f_T$. In the $t$th round, Pegasos randomly samples a subset $A_t$ of fixed size $k$ from $\{1, 2, \ldots, m\}$, defines $f_t(w)$ to be

$$f_t(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{|A_t|}\sum_{i \in A_t} \max\bigl(0,\, 1 - y^{(i)} w^\top x^{(i)}\bigr), \qquad (8)$$

and runs Algorithm 1 with $\sigma_t = \lambda$, $G_t = \sqrt{\lambda} + \max_i \|x^{(i)}\|$, $R = \frac{1}{\sqrt{\lambda}}$, and $S = \{w \in \mathbb{R}^n : \|w\| \leq \frac{1}{\sqrt{\lambda}}\}$.[5]

[5] As shown in (Shalev-Shwartz et al., 2007), one can guarantee using a strong duality argument that the optimal solution will always have norm at most $1/\sqrt{\lambda}$, so using $S$ as the feasible set does not impose any additional restrictions.

To characterize the relationship between the strongly convex game defined by Pegasos and the linear SVM training problem, we state the following theorem and its corollary, both of whose proofs closely mirror those of Theorems 2 and 3 from (Shalev-Shwartz et al., 2007):

Theorem 3. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a (strongly) convex function, let $S \subseteq \mathbb{R}^n$ be compact, and suppose $w^\star := \arg\min_{w \in S} f(w)$. Let $A$ be an algorithm for (strongly) convex repeated games with regret bound $R_T$. Now, suppose we run $A$ on a sequence of (strongly) convex functions $f_1, \ldots, f_T$ which satisfy, for all $t \in \{1, \ldots, T\}$: (1) $\mathbb{E}_{f_1,\ldots,f_t,w_1,\ldots,w_t}[f_t(w)] \leq f(w)$ for all $w \in S$; and (2) $\mathbb{E}_{f_1,\ldots,f_t,w_1,\ldots,w_{t-1}|w_t}[f_t(w_t)] = f(w_t)$. If $r$ is drawn uniformly at random from $\{1, \ldots, T\}$, then

$$\mathbb{E}_r\, \mathbb{E}_{f_1,\ldots,f_T,w_1,\ldots,w_T}\bigl[f(w_r) - f(w^\star)\bigr] \leq \frac{R_T}{T}. \qquad (9)$$

Informally, this result provides an estimate of the average suboptimality Pegasos obtains in terms of the existing regret bound for its underlying algorithm for convex games.[6] Using Markov's inequality, it turns out that convergence in expected suboptimality implies convergence to optimality with high probability in the following sense:

Proposition 2. Let $\delta \in (0, 1)$. Under the conditions above, with probability at least $1 - \delta$, $f(w_r) - f(w^\star) \leq \frac{R_T}{\delta T}$.

[6] Note that a version of the theorem replacing $w_r$ with $\bar{w} = \frac{1}{T}\sum_{t=1}^T w_t$ follows easily from Jensen's inequality. Though this leads to a potentially more stable version of Algorithm 3, the resulting algorithm in practice often converges less quickly and may be less computationally efficient to implement (for problems where the feature vectors $x^{(i)}$ are sparse).
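A sketch (ours) of how the sampled objective (8) and its subgradient might be realized; `rng` is a numpy `Generator`, and all names are our own:

```python
import numpy as np

def minibatch_subgradient(w, X, y, lam, k, rng):
    """Sample A_t of size k and return a subgradient g_t of f_t in
    Eq. (8): lam*w minus the average of y_i x_i over the sampled
    examples whose margin is below 1 (the only hinge terms with a
    nonzero subgradient)."""
    m = X.shape[0]
    idx = rng.choice(m, size=k, replace=False)    # the mini-batch A_t
    Xb, yb = X[idx], y[idx]
    active = yb * (Xb @ w) < 1.0                  # violated margins
    return lam * w - (yb[active, None] * Xb[active]).sum(axis=0) / k
```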
Now, we turn to the task of converting Algorithm 2 into an SVM solver, in the same manner as Pegasos. This time, we again assume that the functions $f_1, \ldots, f_T$ are sampled in the same manner as for the Pegasos algorithm, and for now, we assume the same settings of the constants $\sigma_t$ and $G_t$. We now analyze the efficiency of our optimization algorithm by characterizing the number of iterations needed to guarantee $\epsilon$-optimality with high probability. Using Corollaries 2 and 3 from Section 3.2, and applying Proposition 2, we have:

- With probability at least $1 - \delta$, $f(w_r) - f(w^\star) \leq \frac{G^2(1 + \log T)}{\delta\lambda T}$. To ensure that the right-hand side is no greater than $\epsilon$ requires $T \geq \tilde{O}\bigl(\frac{G^2}{\delta\lambda\epsilon}\bigr)$ iterations.

- With probability at least $1 - \delta$, $f(w_r) - f(w^\star) \leq \frac{4RG}{\delta\sqrt{T}}$. To ensure that the right-hand side is no greater than $\epsilon$ requires $T \geq \frac{16R^2G^2}{\delta^2\epsilon^2}$ iterations.

In the first bound, we recover the $\tilde{O}\bigl(\frac{G^2}{\delta\lambda\epsilon}\bigr)$ convergence rate of the Pegasos algorithm. In the second bound, we recover the $O\bigl(\frac{R^2G^2}{\delta^2\epsilon^2}\bigr)$ rate of (Zinkevich, 2003), which, at least at first, appears not to depend on $\lambda$, suggesting that perhaps the proximal algorithm ought to give improved convergence when $\lambda$ is small. On closer examination, however, the dependence on $\lambda$ is "hidden" inside the $R = \frac{1}{\sqrt{\lambda}}$ bound from the Pegasos algorithm. Making this dependence explicit, we achieve a rate of only $O\bigl(\frac{G^2}{\delta^2\lambda\epsilon^2}\bigr)$.[7] Here, the weak link in our analysis is the dependence of $R$ on $\lambda$. In practice, however, the bound $R = \frac{1}{\sqrt{\lambda}}$ is often quite loose. Knowing ahead of time the norm of $w^\star = \arg\min_w f(w)$ would help by allowing us to define a smaller feasible set $S$ and thus obtain tighter bounds.

[7] These results are not particularly surprising, given the recent minimax analysis of (Abernethy et al., 2008), who showed that under certain assumptions, the regret bound of the regular projected subgradient algorithm is worst-case optimal.

3.4. An optimistic strategy

With the above motivation in mind, we propose the adaptive strategy shown in Algorithm 3. In this method, we assume that we are initially given some desired level of suboptimality $\epsilon$. Optimization proceeds in several phases. At the beginning of each phase, we "hypothesize" a setting of $R$. During each phase, we run the proximal projected subgradient strategy until either (1) $\|w_t\|$ gets "close" to $R$, forcing us to increase $R$ by a factor of 2 and start a new phase; or (2) enough iterations pass without this occurring, allowing us to declare convergence. The algorithm is "optimistic" in the sense that it initially assumes $R$ to be small and only increases it as necessary. One can prove that:

Lemma 2. Suppose that some particular phase ends without any increase in $R$. Define $w^\star = \arg\min_{w \in S} f(w)$. Let $r$ be chosen uniformly at random from $\{1, \ldots, \hat{T}\}$. Then with probability at least $1 - \delta$, $w_r$ is $\epsilon$-optimal, i.e., $f(w_r) - f(w^\star) \leq \epsilon$.

Theorem 4. For $\epsilon < \frac{1}{2}$, Algorithm 3 terminates after processing at most $\tilde{O}\bigl(\frac{G^2}{\delta\lambda\epsilon}\bigr)$ examples; with probability at least $1 - \delta$, the resulting parameters $w_r$ will be $\epsilon$-optimal.

Our analysis thus shows that the modified algorithm, in the worst case, is asymptotically equivalent to Pegasos up to logarithmic factors. In practice, however, Algorithm 3 can be significantly faster when $\|w^\star\| \ll \frac{1}{\sqrt{\lambda}}$. In these cases, the algorithm will tend to operate in the regime of small $R$, and will achieve $O\bigl(\frac{R^2G^2}{\delta^2\epsilon^2}\bigr)$ regret, independent of $\lambda$.[8]

[8] As an anecdotal example, on the "combined" dataset in our experiments, the parameter norm bound $\frac{1}{\sqrt{\lambda}}$ corresponding to the $\lambda$ which gave the best test set performance was $3 \times 10^5$, whereas $\|w^\star\| = 12.94$. On this run, the proximal algorithm estimated an upper bound of $R = 16$.

Algorithm 3: Optimistic proximal SVM solver
  input: Training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$; regularization parameter $\lambda$; desired suboptimality $\epsilon$; allowed failure probability $\delta$; mini-batch size $k$.
  Define $S := \{w \in \mathbb{R}^n : \|w\| \leq 1/\sqrt{\lambda}\}$.
  Set $G \leftarrow \max_i \|x^{(i)}\| + \sqrt{\lambda}$.
  Set $\hat{\delta} \leftarrow \delta / (3 - \log_2 \lambda)$.
  Initialize $R \leftarrow \min(1, 1/\sqrt{\lambda})$.
  Initialize $w_1 \leftarrow \mathbf{0}$.
  repeat
    Set CONVERGED $\leftarrow$ true.
    Find the smallest $\hat{T}$ such that $\min\bigl( \frac{G^2(1 + \log \hat{T})}{\lambda \hat{T}}, \frac{4RG}{\sqrt{\hat{T}}} \bigr) \leq \hat{\delta}\epsilon$.
    for $t \leftarrow 1, \ldots, \hat{T}$ do
      Sample $A_t \subseteq \{1, \ldots, m\}$ such that $|A_t| = k$.
      Define $f_t(w)$ according to (8).
      Choose $g_t \in \partial f_t(w_t)$.
      Set $\lambda_t \leftarrow \frac{1}{2}\bigl( -\sigma_{1:t} - \lambda_{1:t-1} + \sqrt{(\sigma_{1:t} + \lambda_{1:t-1})^2 + G^2/R^2} \bigr)$.
      Set $\eta_t \leftarrow 1/(\sigma_{1:t} + \lambda_{1:t})$.
      Set $w_{t+1} \leftarrow \Pi_S[w_t - \eta_t g_t]$.
      if $\|w_{t+1}\| \geq R - \epsilon/2$ then
        Set $R \leftarrow 2R$.
        Set CONVERGED $\leftarrow$ false.
        break
      end if
    end for
  until CONVERGED
  Choose $r$ uniformly at random from $\{1, \ldots, \hat{T}\}$.
  return $w_r$.
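The phase structure is compact enough to sketch in code. The Python skeleton below (ours) follows the pseudocode above schematically; note that the per-phase failure probability $\hat{\delta}$ and the boundary test were recovered from damaged text, so in this sketch they are assumptions rather than verified details. `subgrad(t, w)` is a hypothetical callback that samples $A_t$ and returns a subgradient of (8); the sketch assumes eps > 0.

```python
import numpy as np

def optimistic_proximal_svm(subgrad, lam, G, eps, delta, n, max_phases=64):
    """Schematic rendering of Algorithm 3: hypothesize a small radius R,
    run proximal projected subgradient steps, and double R whenever the
    iterate drifts too close to the hypothesized boundary."""
    S_radius = 1.0 / np.sqrt(lam)
    delta_hat = delta / (3.0 - np.log2(lam))      # assumed form (see text)
    R = min(1.0, S_radius)
    w = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(max_phases):
        # Smallest T_hat (searched here by doubling) satisfying
        # min(G^2 (1+log T)/(lam T), 4RG/sqrt(T)) <= delta_hat * eps.
        T_hat = 1
        while min(G * G * (1 + np.log(T_hat)) / (lam * T_hat),
                  4 * R * G / np.sqrt(T_hat)) > delta_hat * eps:
            T_hat *= 2
        sigma_1t = lam_1t = 0.0
        converged, iterates = True, []
        for t in range(1, T_hat + 1):
            g = subgrad(t, w)
            sigma_1t += lam                              # sigma_t = lam
            s = sigma_1t + lam_1t
            lam_1t += 0.5 * (-s + np.sqrt(s * s + (G / R) ** 2))
            w = w - g / (sigma_1t + lam_1t)              # eta_t step
            norm = np.linalg.norm(w)
            if norm > S_radius:                          # project onto S
                w *= S_radius / norm
                norm = S_radius
            iterates.append(w.copy())
            if norm >= R - eps / 2:                      # too close to R
                R, converged = 2 * R, False
                break
        if converged:
            # Return a uniformly random iterate from the completed phase.
            return iterates[rng.integers(len(iterates))]
    return w
```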
4. Batch proximal learning

In the batch learning setting, we are no longer presented with a sequence of objective functions but rather are given a single $\sigma$-strongly convex objective function $f : \mathbb{R}^n \to \mathbb{R}$ that we would like to optimize. Batch algorithms are often appropriate when the training set is not particularly large, but the cost of inference with respect to any individual training example is high. This type of scenario occurs frequently in structured prediction problems, where inference may involve either a computationally expensive dynamic programming step, or even solving a combinatorial optimization problem as a subroutine.

The prototypical batch algorithm from which we start is the cutting plane optimization method of (Joachims, 2006), as reformulated and generalized in (Teo et al., 2007) and (Smola et al., 2008). In this method, the loss $\ell$ is assumed to be everywhere nonnegative, and one creates a sequence of lower-bound approximations to $f$ of the form

$$P_t(w) = \frac{\sigma}{2}\|w\|^2 + \max\Bigl(0,\; \max_{i \in \{1,\ldots,t\}} a_i^\top w + b_i\Bigr).$$

Initially, $w_1 = \mathbf{0}$. During each iteration $t \in \{1, \ldots, T\}$, $a_t$ and $b_t$ are chosen so that $a_t^\top w + b_t$ is the first-order Taylor expansion of $\ell(w)$ at $w_t$, and $w_{t+1}$ is chosen to be the minimizer of $P_t$. To date, the best convergence results known for bundle methods state that at most $O\bigl(\frac{1}{\sigma\epsilon}\bigr)$ iterations are needed to achieve $\epsilon$-optimality, as proved in (Teo et al., 2007) and (Smola et al., 2008). However, when $\sigma \approx 0$, the number of iterations needed can still be very large, just as in the online case.

To counter these problems, we propose a proximal bundle method, as shown in Algorithm 4. In particular, consider the sequence of primal and dual optimization problems

$$\min_{w \in \mathbb{R}^n} \tilde{P}_t(w) \quad \text{for } t = 1, 2, \ldots, T \qquad (10)$$
$$\max_{\alpha \in \mathbb{R}^t :\, \alpha \succeq 0,\; \alpha^\top \mathbf{1} \leq t} \tilde{D}_t(\alpha) \quad \text{for } t = 1, 2, \ldots, T \qquad (11)$$

where

$$\tilde{P}_t(w) := t \cdot P_t(w) + \sum_{i=1}^t \frac{\lambda_i}{2}\|w - w_i\|^2,$$
$$\tilde{D}_t(\alpha) := \sum_{i=1}^t \left( \alpha_i b_i + \frac{\lambda_i}{2}\|w_i\|^2 \right) - \frac{\bigl\|\sum_{i=1}^t (\lambda_i w_i - \alpha_i a_i)\bigr\|^2}{2(\sigma t + \lambda_{1:t})},$$

for some constants $\lambda_1, \ldots, \lambda_T$. As in the standard bundle algorithm, $w_{t+1} := \arg\min_{w \in \mathbb{R}^n} \tilde{P}_t(w)$. If $\alpha_t = \arg\max_{\alpha \in \mathbb{R}^t :\, \alpha \succeq 0,\, \alpha^\top \mathbf{1} \leq t} \tilde{D}_t(\alpha)$, then the two optima are related by

$$w_{t+1} = \frac{\sum_{i=1}^t (\lambda_i w_i - \alpha_{t,i} a_i)}{\sigma t + \lambda_{1:t}}$$

using strong duality.

Algorithm 4: Proximal bundle method
  Initialize $w_1 \leftarrow \mathbf{0}$.
  for $t \leftarrow 1, \ldots, T$ do
    Choose $a_t \in \partial\ell(w_t)$.
    Set $b_t \leftarrow \ell(w_t) - a_t^\top w_t$.
    Set $\lambda_t \leftarrow \frac{1}{2}\bigl( -\sigma t - \lambda_{1:t-1} + \sqrt{(\sigma t + \lambda_{1:t-1})^2 + (R + A_t)^2/R^2} \bigr)$.
    Compute $\alpha_t = \arg\max_{\alpha \in \mathbb{R}^t :\, \alpha \succeq 0,\, \alpha^\top \mathbf{1} \leq t} \tilde{D}_t(\alpha)$.
    Set $w_{t+1} \leftarrow \frac{\sum_{i=1}^t (\lambda_i w_i - \alpha_{t,i} a_i)}{\sigma t + \lambda_{1:t}}$.
  end for
  return $w_{T+1}$.

In most cutting plane analyses, convergence rates are established by lower-bounding the dual improvement in each iteration, and then arguing that only a limited number of iterations can occur before some termination criterion (e.g., the primal-dual gap) is satisfied.
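A sketch (ours, with hypothetical names) of the two ingredients of the standard cutting plane method: evaluating the lower-bound model $P_t$ and generating a new cut, assuming the loss value and a subgradient at $w_t$ are supplied by an external inference routine:

```python
import numpy as np

def cutting_plane_model(w, sigma, cuts):
    """Evaluate P_t(w) = (sigma/2)||w||^2 + max(0, max_i a_i^T w + b_i)
    given the stored cuts [(a_1, b_1), ..., (a_t, b_t)]."""
    best_cut = max(a @ w + b for a, b in cuts)
    return 0.5 * sigma * (w @ w) + max(0.0, best_cut)

def new_cut(w_t, loss_at_wt, a_t):
    """Build the cut a_t^T w + b_t tangent to the loss l at w_t:
    b_t = l(w_t) - a_t^T w_t, with a_t a subgradient of l at w_t,
    so that a_t^T w + b_t underestimates l(w) everywhere."""
    return a_t, loss_at_wt - a_t @ w_t
```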
Here, we again use the dual improvement argument, though we obtain somewhat different results, given that the dual objective function changes after each iteration due to the changing proximal regularization terms. Our analysis is closely related to the online learning framework of (Shalev-Shwartz & Kakade, 2009).

Lemma 3. Let $w_1, \ldots, w_{t-1} \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}^{t-1}$ be vectors such that $\alpha \succeq 0$ and $\alpha^\top \mathbf{1} \leq t - 1$. If we define

$$w_t := \frac{\sum_{i=1}^{t-1} (\lambda_i w_i - \alpha_i a_i)}{\sigma(t-1) + \lambda_{1:t-1}},$$

then

$$\tilde{D}_t([\alpha; 1]) - \tilde{D}_{t-1}(\alpha) = f(w_t) - \frac{\|\sigma w_t + a_t\|^2}{2(\sigma t + \lambda_{1:t})}. \qquad (12)$$

Using this lower bound, we can then bound the best suboptimality obtained by our algorithm after $T$ steps:

Proposition 3. Let $w^\star = \arg\min_{w \in \mathbb{R}^n} f(w)$. Suppose that $\|w_t\| \leq R$ and $\|a_t\| \leq A_t$ for $t = 1, \ldots, T$. Then,

$$\min_{t \in \{1,\ldots,T\}} f(w_t) - f(w^\star) \leq \frac{1}{T} \sum_{t=1}^T \left( 2\lambda_t R^2 + \frac{(R + A_t)^2}{2(\sigma t + \lambda_{1:t})} \right).$$

Remarkably, the suboptimality guarantees in the proposition above have essentially the same form as the regret bounds stated in Corollary 1. As a result, we can make use of the balancing heuristic for choosing the proximal constants $\lambda_1, \ldots, \lambda_T$. Furthermore, the optimistic strategy for bounding the optimal parameter norm, as described in Section 3.4, also carries over with little modification. For the sake of space, we show only the proximal bundle method using the balancing heuristic in Algorithm 4; we do not give pseudocode for the optimistic extension explicitly. Using Proposition 3 and the argument in Theorem 1, we have the following theorem:

Theorem 5. Suppose that $\|w_t\| \leq R$ and $\|a_t\| \leq A$ for $t = 1, \ldots, T$. Then, Algorithm 4 achieves

$$\min_{t \in \{1,\ldots,T\}} f(w_t) - f(w^\star) \leq \frac{(R + A)^2 (1 + \log T)}{\sigma T}.$$

Provided that $R + A = O(1)$, our analysis thus yields a worst-case convergence rate of $\tilde{O}\bigl(\frac{1}{\sigma T}\bigr)$, matching the convergence rate of our online algorithm, as well as the best known convergence rates for bundle methods.

We note that the idea of stabilizing standard bundle method algorithms to improve convergence has been suggested previously in the bundle method literature. Proximal bundle methods originated with (Kiwiel, 1983), and are closely related to trust region (Schramm & Zowe, 1992) and level set (Lemaréchal et al., 1995) techniques for bundle method improvement. In practice, each of these prior methods requires considerable parameter tuning on the part of the user. In contrast, our bundle algorithm is straightforward, with the curvature terms $\lambda_t$ automatically chosen in order to minimize a regret bound.
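Equation (12) was badly damaged in extraction; as a sanity check on the reconstruction above, the following self-contained script (ours) draws random data and verifies numerically that the stated definitions of $\tilde{D}_t$, $w_t$, and (12) are mutually consistent:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, sigma = 4, 3, 0.3
lams = rng.uniform(0.1, 1.0, t)           # proximal constants lam_1..lam_t
W = rng.normal(size=(t - 1, n))           # arbitrary iterates w_1..w_{t-1}
A = rng.normal(size=(t, n))               # subgradients a_1..a_t
ell = rng.uniform(0.5, 2.0, t)            # loss values l(w_i), arbitrary here
alpha = rng.uniform(0.0, 0.5, t - 1)      # alpha >= 0, 1^T alpha <= t-1

# w_t as defined in Lemma 3
c_prev = sigma * (t - 1) + lams[:t - 1].sum()
w_t = (lams[:t - 1, None] * W - alpha[:, None] * A[:t - 1]).sum(0) / c_prev
Wfull = np.vstack([W, w_t])
b = ell - (A * Wfull).sum(1)              # b_i = l(w_i) - a_i^T w_i

def D(alph, k):
    # Dual objective D~_k(alpha) over the first k cuts, as defined above.
    v = (lams[:k, None] * Wfull[:k] - alph[:, None] * A[:k]).sum(0)
    return (alph * b[:k] + 0.5 * lams[:k] * (Wfull[:k] ** 2).sum(1)).sum() \
        - (v @ v) / (2 * (sigma * k + lams[:k].sum()))

lhs = D(np.append(alpha, 1.0), t) - D(alpha, t - 1)
f_wt = 0.5 * sigma * (w_t @ w_t) + ell[t - 1]    # f(w_t), with l(w_t) = ell[t-1]
u = sigma * w_t + A[t - 1]
rhs = f_wt - (u @ u) / (2 * (sigma * t + lams.sum()))
print(np.isclose(lhs, rhs))               # True: Eq. (12) holds
```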
5. Experiments

We carried out two sets of experiments with proximal algorithms. For the first set of tasks, we tested online algorithms for large-scale binary classification. For the second set of tasks, we performed batch training of structured output SVMs for RNA folding and web ranking. In both the online and batch cases, we ran the optimistic version of our proximal algorithm, setting $\epsilon = 0$ and $\delta = 1$, stopping after a fixed number of iterations, and returning $w_{T+1}$ instead of $w_r$, as in (Shalev-Shwartz et al., 2007).

5.1. Online learning with binary classification

In this experiment, we tested the behavior of our algorithm on nine binary classification datasets.[9] For each of these datasets, we first determined the optimal setting $\lambda_{\text{best}}$ for ensuring good generalization performance using cross-validation. We then compared the Proximal online algorithm against Pegasos (Shalev-Shwartz et al., 2007) and Adaptive online gradient descent (Bartlett et al., 2008) by running each algorithm for 100 effective iterations[10] under a variety of regularization parameter settings.

[9] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. For each dataset where a binary classification version was not available, we reduced multiclass to a single class vs. rest problem. When separate testing sets were not available, we reserved 80% of the data set for training and 20% for testing.

[10] One pass through the entire dataset is an effective iteration.

In Table 1, we provide some statistics on the training and testing datasets used. We record the best objective value $\hat{f} = \min_{t \in \{1,\ldots,T\}} f(w_t)$ obtained by each algorithm over the 100 effective iterations. We also record the number of iterations needed to achieve an objective function reduction of $0.99(f(w_1) - \hat{f})$ for each algorithm. The Proximal algorithm achieves the best objective on 7 out of 9 datasets, while consistently requiring few effective iterations. Figures 1 and 2 show learning curve plots and test error plots for two of the datasets (combined and covtype). As shown, the proximal algorithm enjoys a comfortable advantage over the other methods, especially for small $\lambda_{\text{best}}$.

Table 1. Convergence of Pegasos, Adaptive, and Proximal on nine binary classification tasks. The second through fifth columns give the size of the training and testing sets, the number of features, and the optimal regularization parameter. The last two sets of three columns report the best SVM training loss, $\hat{f} = \min_{t \in \{1,\ldots,T\}} f(w_t)$, seen for each tested algorithm, and the number of iterations needed to reduce the initial objective function by $0.99(f(w_1) - \hat{f})$. "n/a" is reported for cases where the optimizer failed to find a better objective than the starting parameter set. (In the original typeset table, the best numbers in each group are shown in bold.)

Dataset   | m_train | m_test  | n      | lambda_best | Best training loss (Pegasos / Adaptive / Proximal) | Eff. iters to convergence (Pegasos / Adaptive / Proximal)
a9a       | 32,561  | 16,281  | 123    | 10^-4       | 0.3537 / 0.3531 / 0.3533                           | 28 / 19 / 18
combined  | 78,823  | 19,705  | 100    | 10^-9       | 0.5299 / 0.2760 / 0.2336                           | 100 / 99 / 8
connect-4 | 54,045  | 13,512  | 126    | 10^-7       | 6.8229 / 0.9698 / 0.5136                           | n/a / 99 / 63
covtype   | 464,808 | 116,204 | 54     | 10^-8       | 1.4852 / 0.7217 / 0.5830                           | n/a / 96 / 12
ijcnn1    | 35,000  | 91,701  | 22     | 10^-7       | 0.3582 / 0.2088 / 0.1857                           | 89 / 98 / 3
mnist     | 60,000  | 10,000  | 780    | 10^-5       | 0.1200 / 0.1033 / 0.1012                           | 75 / 28 / 3
rcv1      | 20,242  | 677,399 | 47,236 | 10^-7       | 0.0084 / 0.0035 / 0.0487                           | 53 / 10 / 83
real-sim  | 57,846  | 14,463  | 20,958 | 10^-5       | 0.0602 / 0.0602 / 0.0602                           | 6 / 5 / 7
w8a       | 49,749  | 14,951  | 300    | 10^-8       | 1.5146 / 0.1391 / 0.1292                           | n/a / 45 / 13

Figure 1 (plots not reproduced; y-axis: regularized training loss). Convergence of Pegasos, Adaptive, and Proximal for combined and covtype. Left column: combined; right column: covtype. Each row corresponds to a regularization parameter $\lambda$: bottom row $\lambda = \lambda_{\text{best}}$, middle row $\lambda = 10\lambda_{\text{best}}$, top row $\lambda = 100\lambda_{\text{best}}$. Effective iterations are shown on the x-axis.

Figure 2 (plots not reproduced; y-axis: classification error). Test errors of Pegasos, Adaptive, and Proximal for combined and covtype during the course of optimization. Each row corresponds to a regularization parameter $\lambda$: bottom row $\lambda = \lambda_{\text{best}}$, middle row $\lambda = 10\lambda_{\text{best}}$, top row $\lambda = 100\lambda_{\text{best}}$. Effective iterations are shown on the x-axis.
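The convergence metric reported in Table 1 can be stated precisely in a few lines; a sketch (ours), assuming a recorded per-iteration objective trace:

```python
import numpy as np

def iterations_to_convergence(f_trace):
    """Table 1's convergence metric, as we read it: the first (1-indexed)
    effective iteration t at which f(w_t) has fallen by at least
    0.99*(f(w_1) - f_hat), where f_hat = min_t f(w_t). Returns None
    ("n/a") when no iterate improves on the starting objective."""
    f = np.asarray(f_trace, dtype=float)
    f_hat = f.min()
    if not f_hat < f[0]:
        return None
    target = f[0] - 0.99 * (f[0] - f_hat)
    return int(np.argmax(f <= target)) + 1
```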
5.2. Batch learning with RNA folding and web ranking

In this experiment, we compared our batch proximal learning algorithm against standard bundle algorithms (Smola et al., 2008) for learning RNA folding and web search ranking models. Both of these problems can be formulated as nonsmooth structured SVMs (see (Chapelle et al., 2007) for ranking and (Do et al., 2006) for RNA folding). To date, the fastest approaches for dealing with this type of nonsmooth optimization problem are cutting plane/bundle methods (e.g., SVMPerf (Joachims, 2006) and BMRM (Teo et al., 2007)).

In the RNA folding experiment, the dataset contained RNA sequences taken from 151 separate RNA families (Do et al., 2006), and we used a model with approximately 350 distinct features based largely on existing thermodynamic scoring schemes for RNA folding. In the ranking experiment, the dataset contained 1000 queries for training and 1000 queries for validation, with an average of 50 documents per query. In both cases, we compared the performance of the proximal bundle method against the standard bundle method for various values of $\lambda$.

Figure 3 shows training loss curves depicting the best training loss obtained so far for a standard bundle method compared to our proximal variant. In both methods, many iterations pass before the algorithms are able to identify parameters which improve upon the initial parameter set; for the standard bundle method, this problem is especially pronounced for small regularization parameters. Again, the results show that the proximal variant significantly outperforms the standard algorithm, especially when $\lambda$ is small.

Figure 3 (plots not reproduced; y-axis: regularized training loss, x-axis: iterations). Comparison of a standard bundle method to our proximal bundle method for SVM structured learning with various choices of $\lambda$. Left column: RNA folding ($\lambda$ from 10 down to 0.01); right column: web ranking ($\lambda$ from $10^{-5}$ down to $10^{-8}$). Each row corresponds to a regularization parameter $\lambda$, decreasing from top to bottom.
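As an illustration of the kind of nonsmooth objective involved, here is a minimal pairwise ranking loss in the structured-SVM spirit (our sketch; not necessarily the exact ranking loss of (Chapelle et al., 2007)):

```python
import numpy as np

def pairwise_ranking_loss(w, X, rel):
    """Illustrative nonsmooth ranking objective: a hinge penalty is
    incurred whenever a more relevant document within a query is not
    scored at least a unit margin above a less relevant one."""
    s = X @ w                                   # one score per document
    loss = pairs = 0
    for i in range(len(s)):
        for j in range(len(s)):
            if rel[i] > rel[j]:                 # document i should outrank j
                loss += max(0.0, 1.0 - (s[i] - s[j]))
                pairs += 1
    return loss / max(pairs, 1)
```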
6. Discussion

Functions with low curvature are the Achilles' heel of optimization algorithms in machine learning. In this paper, we propose new online and batch learning algorithms which sequentially modify the objective functions used during optimization. By choosing these modified tasks carefully, our methods ensure that (1) the sequence of solutions given by these modified tasks will lead to a good approximate minimizer of the original optimization problem, and (2) the regret bounds obtained in the online setting and the convergence rates obtained in the batch setting will be improved due to the increased curvature.

The idea of adding curvature in order to improve regret bounds for online algorithms was introduced in (Bartlett et al., 2008), and the online algorithmic schemes proposed there have much in common with the basic online methods proposed here. We apply these ideas to the problem of training linear SVMs and structured prediction models, where we introduce a new adaptive strategy for optimistically bounding the norm of the optimal parameters. We also transfer these ideas to the batch setting, where we present improved bundle methods for structured learning.

Experimentally, we show that the problem of low curvature is not simply a matter of theoretical concern. Rather, for many real-world large-scale learning problems, the optimal regularization penalty (as determined by holdout cross-validation) is often very small. For problems where high regularization is appropriate (e.g., when the dimensionality of the data is large relative to the number of training examples), our algorithm performs as well as the best existing methods, such as Pegasos. When low regularization is needed, however, our algorithm offers dramatic improvements over state-of-the-art techniques, converging in a few passes through the dataset when other algorithms may fail to converge at all.

Acknowledgments

We thank Andrew Ng for his support of this project, and the anonymous reviewers for their comments on the manuscript. CBD was supported by an NSF Graduate Research Fellowship. CSF was supported by an A*STAR National Science Scholarship.

References

Abernethy, J., Bartlett, P. L., Rakhlin, A., & Tewari, A. (2008). Optimal strategies and minimax lower bounds for online convex games. Proceedings of the 21st Annual Conference on Computational Learning Theory.

Bartlett, P., Hazan, E., & Rakhlin, A. (2008). Adaptive online gradient descent. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, 65-72. MIT Press.

Chapelle, O., Le, Q. V., & Smola, A. J. (2007). Large margin optimization of ranking measures. NIPS Workshop: Machine Learning for Web Search.

Do, C. B., Woods, D. A., & Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22, e90-e98.

Hazan, E., Agarwal, A., & Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69, 169-192.

Joachims, T. (2006). Training linear SVMs in linear time. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217-226).

Kiwiel, K. C. (1983). Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 27, 320-341.

Lemaréchal, C., Nemirovskii, A., & Nesterov, Y. (1995). New variants of bundle methods. Mathematical Programming, 69, 111-147.

Schramm, H., & Zowe, J. (1992). A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results. SIAM Journal on Optimization, 2, 121-152.

Shalev-Shwartz, S., & Kakade, S. M. (2009). Mind the duality gap: Logarithmic regret algorithms for online optimization. In D. Koller, D. Schuurmans, Y. Bengio and L. Bottou (Eds.), Advances in Neural Information Processing Systems 21, 1457-1464. MIT Press.

Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Proceedings of the 24th Annual International Conference on Machine Learning (pp. 807-814).

Smola, A., Vishwanathan, S. V. N., & Le, Q. (2008). Bundle methods for machine learning. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, 1377-1384. MIT Press.

Teo, C. H., Smola, A., Vishwanathan, S. V., & Le, Q. V. (2007). A scalable modular convex solver for regularized risk minimization. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 727-736).

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the 20th Annual International Conference on Machine Learning.