Fast Rates for Regularized Objectives

Karthik Sridharan, Nathan Srebro, Shai Shalev-Shwartz
Toyota Technological Institute -- Chicago

Abstract

We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate $1/n$. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of $1/n$ on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain types of oracle inequalities for SVMs. The results extend also to approximate minimization, as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other $\ell_p$ norms.

1 Introduction

We consider the problem of (approximately) minimizing a stochastic objective

$$F(w) = \mathbb{E}_\theta\left[f(w;\theta)\right] \quad (1)$$

where the optimization is with respect to $w \in \mathcal{W}$, based on an i.i.d. sample $\theta_1, \ldots, \theta_n$. We focus on problems where $f(w;\theta)$ has a generalized linear form:

$$f(w;\theta) = \ell(\langle w, \phi(\theta)\rangle; \theta) + r(w). \quad (2)$$

The relevant special case is regularized linear prediction, where $\theta = (x, y)$, $\ell(\langle w, \phi(x)\rangle; y)$ is the loss of predicting $\langle w, \phi(x)\rangle$ when the true target is $y$, and $r(w)$ is a regularizer. It is well known that when the domain $\mathcal{W}$ and the mapping $\phi(\cdot)$ are bounded, and the function $\ell(z;\theta)$ is Lipschitz continuous in $z$, the empirical averages

$$\hat F(w) = \hat{\mathbb{E}}\left[f(w;\theta)\right] = \frac{1}{n}\sum_{i=1}^{n} f(w;\theta_i) \quad (3)$$

converge uniformly to their expectations $F(w)$ with rate $1/\sqrt{n}$. This justifies using the empirical minimizer

$$\hat w = \arg\min_{w \in \mathcal{W}} \hat F(w), \quad (4)$$

and we can then establish convergence of $F(\hat w)$ to the population optimum

$$F(w^*) = \min_{w \in \mathcal{W}} F(w) \quad (5)$$

with a rate of $1/\sqrt{n}$.

Recently, Hazan et al [1] studied an online analogue to this problem, and established that if $f(w;\theta)$ is strongly convex in $w$, the average online regret diminishes with a much faster rate, namely $(\log n)/n$. The function $f(w;\theta)$ becomes strongly convex when, for example, we have $r(w) = \frac{\lambda}{2}\|w\|^2$ as in SVMs and other regularized learning settings.

In this paper we present an analogous "fast rate" for empirical minimization of a strongly convex stochastic objective. In fact, we do not need to assume that we perform the empirical minimization exactly: we provide uniform (over all $w \in \mathcal{W}$) guarantees on the population sub-optimality $F(w) - F(w^*)$ in terms of the empirical sub-optimality $\hat F(w) - \hat F(\hat w)$, with a rate of $1/n$. This is a stronger type of result than what can be obtained with an online-to-batch conversion, as it applies to any possible solution $w$, and not only to some specific algorithmically defined solution. For example, it can be used to analyze the performance of approximate minimizers obtained through approximate optimization techniques.

Specifically, consider $f(w;\theta)$ as in (2), where $\ell(z;\theta)$ is convex and $L$-Lipschitz in $z$, the norm of $\phi(\theta)$ is bounded by $B$, and $r$ is $\lambda$-strongly convex. We show that for any $a > 0$ and $\delta > 0$, with probability at least $1-\delta$, for all $w$ (of arbitrary magnitude):

$$F(w) - F(w^*) \le (1+a)\big(\hat F(w) - \hat F(\hat w)\big) + O\!\left(\frac{(1+1/a)\, L^2 B^2 \log(1/\delta)}{\lambda n}\right). \quad (6)$$

We emphasize that here and throughout the paper the big-O notation hides only fixed numeric constants.
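To make the setup of (1)-(5) concrete, the following minimal sketch (our own illustration, not from the paper) minimizes a regularized empirical objective of the form (2) on synthetic data and tracks the population sub-optimality $F(\hat w) - F(w^*)$. We use the logistic loss, which is convex and 1-Lipschitz, so a standard smooth solver applies; the data model, solver, and sample sizes are all arbitrary assumptions.

```python
# Sketch of empirical minimization (4) for a regularized linear objective (2).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def make_data(n, d=20):
    # Hypothetical data distribution; any bounded distribution would do.
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))
    return X, y

def F_hat(w, X, y, lam):
    # f(w; theta) = l(<w, x>, y) + (lam/2)||w||^2, averaged over the sample.
    z = y * (X @ w)
    return np.logaddexp(0.0, -z).mean() + 0.5 * lam * (w @ w)

lam = 0.1
X_pop, y_pop = make_data(200_000)   # large sample standing in for F(w)
w_star = minimize(F_hat, np.zeros(20), args=(X_pop, y_pop, lam)).x

for n in (100, 1_000, 10_000):
    X, y = make_data(n)
    w_hat = minimize(F_hat, np.zeros(20), args=(X, y, lam)).x   # eq. (4)
    gap = F_hat(w_hat, X_pop, y_pop, lam) - F_hat(w_star, X_pop, y_pop, lam)
    print(n, gap)   # per (6), shrinks roughly like 1/(lam * n)
```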
It might not be surprising that requiring strong convexity yields a rate of $1/n$: the connection between strong convexity, variance bounds, and rates of $1/n$ is well known. However, it is interesting to note the generality of the result here, and the simplicity of the conditions. In particular, we do not require any "low noise" conditions, nor that the loss function is strongly convex (it need only be weakly convex). In particular, (6) applies, under no additional conditions, to the SVM objective. We therefore obtain convergence with a rate of $1/n$ for the SVM objective.

This $1/n$ rate on the SVM objective is always valid, and does not depend on any low-noise conditions or on specific properties of the kernel function. Such a "fast" rate might seem surprising at first glance to a reader familiar with the $1/\sqrt{n}$ rate on the expected loss of the SVM optimum. There is no contradiction here--what we establish is that although the loss might converge at a rate of $1/\sqrt{n}$, the SVM objective (regularized loss) always converges at a rate of $1/n$. In fact, in Section 3 we see how a rate of $1/n$ on the objective corresponds to a rate of $1/\sqrt{n}$ on the loss. Specifically, we perform an oracle analysis of the optimum of the SVM objective (rather than of empirical minimization subject to a norm constraint, as in other oracle analyses of regularized linear learning), based on the existence of some (unknown) low-norm, low-error predictor $w_o$.

Strong convexity is a concept that depends on a choice of norm. We state our results in a general form, for any choice of norm $\|\cdot\|$. Strong convexity of $r(w)$ must hold with respect to the chosen norm $\|\cdot\|$, and the data $\phi(\theta)$ must be bounded with respect to the dual norm $\|\cdot\|_*$, i.e. we must have $\|\phi(\theta)\|_* \le B$. This allows us to apply our results also to more general forms of regularizers, including squared $\ell_p$ norm regularizers, $r(w) = \frac{\lambda}{2}\|w\|_p^2$ for $1 < p \le 2$ (see Corollary 2). However, the reader may choose to read the paper always thinking of the norm $\|w\|$, and so also its dual norm $\|w\|_*$, as the standard 2-norm.

2 Main Result

We consider a generalized linear function $f : \mathcal{W} \times \Theta \to \mathbb{R}$ that can be written as in (2), defined over a closed convex subset $\mathcal{W}$ of a Banach space equipped with norm $\|\cdot\|$.

Lipschitz continuity and boundedness. We require that the mapping $\phi(\cdot)$ is bounded by $B$, i.e. $\|\phi(\theta)\|_* \le B$, and that the function $\ell(z;\theta)$ is $L$-Lipschitz in $z \in \mathbb{R}$ for every $\theta$.

Strong convexity. We require that $F(w)$ is $\lambda$-strongly convex w.r.t. the norm $\|w\|$. That is, for all $w_1, w_2 \in \mathcal{W}$ and $\alpha \in [0,1]$ we have:

$$F(\alpha w_1 + (1-\alpha) w_2) \le \alpha F(w_1) + (1-\alpha) F(w_2) - \frac{\lambda}{2}\,\alpha(1-\alpha)\,\|w_1 - w_2\|^2.$$

Recalling that $w^* = \arg\min_w F(w)$, this ensures (see for example [2, Lemma 13]):

$$F(w) \ge F(w^*) + \frac{\lambda}{2}\,\|w - w^*\|^2. \quad (7)$$

We require only that the expectation $F(w) = \mathbb{E}[f(w;\theta)]$ is strongly convex. Of course, requiring that $f(w;\theta)$ is $\lambda$-strongly convex for all $\theta$ (with respect to $w$) is enough to ensure the condition. In particular, for a generalized linear function of the form (2) it is enough to require that $\ell(z;y)$ is convex in $z$ and that $r(w)$ is $\lambda$-strongly convex (w.r.t. the norm $\|w\|$).

We now provide a faster convergence rate using the above conditions.

Theorem 1. Let $\mathcal{W}$ be a closed convex subset of a Banach space with norm $\|\cdot\|$ and dual norm $\|\cdot\|_*$, and consider $f(w;\theta) = \ell(\langle w, \phi(\theta)\rangle;\theta) + r(w)$ satisfying the Lipschitz continuity, boundedness, and strong convexity requirements with parameters $B$, $L$, and $\lambda$. Let $w^*$, $\hat w$, $F(w)$ and $\hat F(w)$ be as defined in (1)-(5). Then, for any $\delta > 0$ and any $a > 0$, with probability at least $1-\delta$ over a sample of size $n$, we have for all $w \in \mathcal{W}$ (where $[x]_+ = \max(x,0)$):

$$F(w) - F(w^*) \le (1+a)\big[\hat F(w) - \hat F(w^*)\big]_+ + \frac{8\big(1+\tfrac{1}{a}\big) L^2 B^2 \big(32 + \log(1/\delta)\big)}{\lambda n} \le (1+a)\big(\hat F(w) - \hat F(\hat w)\big) + \frac{8\big(1+\tfrac{1}{a}\big) L^2 B^2 \big(32 + \log(1/\delta)\big)}{\lambda n}.$$
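The stochastic term of Theorem 1 can be tabulated directly. The helper below is a plain plug-in of the bound as reconstructed above; the parameter names (`L`, `B`, `lam`, `a`, `delta`) follow the text, and the example values are arbitrary.

```python
# Plug-in of Theorem 1's excess-risk bound, to see how it scales with n.
import math

def theorem1_bound(emp_subopt, L, B, lam, n, a=1.0, delta=0.01):
    """Upper bound on F(w) - F(w*): (1+a)[emp_subopt]_+ plus the stochastic term."""
    stoch = 8 * (1 + 1 / a) * L**2 * B**2 * (32 + math.log(1 / delta)) / (lam * n)
    return (1 + a) * max(emp_subopt, 0.0) + stoch

# For a fixed lambda, the stochastic term decays like 1/n:
for n in (10**3, 10**4, 10**5):
    print(n, theorem1_bound(0.0, L=1.0, B=1.0, lam=0.1, n=n))
```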
It is particularly interesting to consider regularizers of the form $r(w) = \frac{\lambda}{2}\|w\|_p^2$, which are $\lambda(p-1)$-strongly convex w.r.t. the corresponding $\ell_p$-norm [2]. Applying Theorem 1 to this case yields the following bound:

Corollary 2. Consider an $\ell_p$ norm and its dual $\ell_q$, with $1 < p \le 2$, $\frac{1}{p} + \frac{1}{q} = 1$, and the objective $f(w;\theta) = \ell(\langle w, \phi(\theta)\rangle;\theta) + \frac{\lambda}{2}\|w\|_p^2$, where $\|\phi(\theta)\|_q \le B$ and $\ell(z;y)$ is convex and $L$-Lipschitz in $z$. The domain is the entire Banach space $\mathcal{W} = \ell_p$. Then, for any $\delta > 0$ and any $a > 0$, with probability at least $1-\delta$ over a sample of size $n$, we have for all $w \in \mathcal{W} = \ell_p$ (of any magnitude):

$$F(w) - F(w^*) \le (1+a)\big(\hat F(w) - \hat F(\hat w)\big) + O\!\left(\frac{(1+1/a)\, L^2 B^2 \log(1/\delta)}{(p-1)\,\lambda n}\right).$$

Corollary 2 allows us to analyze the rate of convergence of the regularized risk for $\ell_p$-regularized linear learning, that is, training by minimizing the empirical average of

$$f(w; x, y) = \ell(\langle w, x\rangle, y) + \frac{\lambda}{2}\|w\|_p^2 \quad (8)$$

where $\ell(z,y)$ is some convex loss function and $\|x\|_q \le B$. For example, in SVMs we use the $\ell_2$ norm, and so bound $\|x\|_2 \le B$, and the hinge loss $\ell(z,y) = [1 - yz]_+$, which is 1-Lipschitz. What we obtain is a bound on how quickly we can minimize the expectation $F(w) = \mathbb{E}[\ell(\langle w, x\rangle, y)] + \frac{\lambda}{2}\|w\|_p^2$, i.e. the regularized expected loss; in other words, how quickly we converge to the infinite-data optimum of the objective.

We see, then, that the SVM objective converges to its optimum value at a fast rate of $1/n$, without any special assumptions. This still doesn't mean that the expected loss $L(\hat w) = \mathbb{E}[\ell(\langle \hat w, x\rangle, y)]$ converges at this rate. This behavior is empirically demonstrated on the left plot of Figure 1. For each data set size we plot the excess expected loss $L(\hat w) - L(w^*)$ and the sub-optimality of the regularized expected loss $F(\hat w) - F(w^*)$ (recall that $F(w) = L(w) + \frac{\lambda}{2}\|w\|^2$). Although the regularized expected loss converges to its infinite-data limit, i.e. to the population minimizer, at a rate of roughly $1/n$, the expected loss $L(\hat w)$ converges at a slower rate of roughly $1/\sqrt{n}$.

[Figure 1 about here.]
Figure 1: Left: excess expected loss $L(\hat w) - L(w^*)$ and sub-optimality of the regularized expected loss $F(\hat w) - F(w^*)$ as a function of training set size, for a fixed $\lambda = 0.8$. Right: excess expected loss $L(\tilde w) - \min_w L(w)$, relative to the overall optimum $w_o = \arg\min_w L(w)$, with $\lambda(n) = 300/n$. Both plots are on a logarithmic scale and refer to a synthetic example with $x$ uniform over $[-1.5, 1.5]^{300}$, and $y = \mathrm{sign}(x_1)$ when $|x_1| > 1$ but uniform otherwise.

Studying the convergence rate of the SVM objective allows us to better understand and appreciate analyses of computational optimization approaches for this objective, as well as to obtain oracle inequalities on the generalization loss of $\hat w$, as we do in the following section.

Before moving on, we briefly provide an example of applying Theorem 1 with respect to the $\ell_1$-norm. The bound in Corollary 2 diverges as $p \to 1$, and the Corollary is not applicable for $\ell_1$ regularization. This is because $\|w\|_1^2$ is not strongly convex w.r.t. the $\ell_1$-norm. An example of a regularizer that is strongly convex with respect to the $\ell_1$ norm is the (unnormalized) entropy regularizer [3]: $r(w) = \sum_{i=1}^{d} |w_i| \log(|w_i|)$. This regularizer is $1/B_w^2$-strongly convex w.r.t. $\|w\|_1$, as long as $\|w\|_1 \le B_w$ (see [2]), yielding:

Corollary 3. Consider a function $f(w;\theta) = \ell(\langle w, \phi(\theta)\rangle;\theta) + \sum_{i=1}^{d} |w_i|\log(|w_i|)$, where $\|\phi(\theta)\|_\infty \le B$ and $\ell(z;y)$ is convex and $L$-Lipschitz in $z$. Take the domain to be the $\ell_1$ ball $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\|_1 \le B_w\}$. Then, for any $\delta > 0$ and any $a > 0$, with probability at least $1-\delta$ over a sample of size $n$, we have for all $w \in \mathcal{W}$:

$$F(w) - F(w^*) \le (1+a)\big(\hat F(w) - \hat F(\hat w)\big) + O\!\left(\frac{(1+1/a)\, L^2 B^2 B_w^2 \log(1/\delta)}{n}\right).$$
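To complement Figure 1, here is a sketch of the synthetic experiment described in its caption. A simple Pegasos-style subgradient solver stands in for the (unspecified) optimizer, and a large held-out sample stands in for the population quantities; the iteration counts and the proxy for $w^*$ are our own assumptions.

```python
# Sketch of the Figure 1 (left) experiment: objective gap vs. loss gap.
import numpy as np

rng = np.random.default_rng(1)
d, lam = 300, 0.8   # dimension and lambda as recovered from the caption

def sample(n):
    X = rng.uniform(-1.5, 1.5, size=(n, d))
    y = np.where(np.abs(X[:, 0]) > 1, np.sign(X[:, 0]),
                 rng.choice([-1.0, 1.0], size=n))
    return X, y

def hinge(w, X, y):
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def train(X, y, iters=2000):
    # Subgradient descent on hinge + (lam/2)||w||^2 with step 1/(lam*t).
    n, w = len(y), np.zeros(d)
    for t in range(1, iters + 1):
        m = y * (X @ w)
        g = -(y[:, None] * X)[m < 1].sum(axis=0) / n + lam * w
        w -= g / (lam * t)
    return w

X_pop, y_pop = sample(20_000)          # crude proxy for the population
w_star = train(X_pop, y_pop)           # proxy for w* = argmin F
F = lambda w: hinge(w, X_pop, y_pop) + 0.5 * lam * (w @ w)
for n in (1000, 2000, 4000, 8000):
    X, y = sample(n)
    w_hat = train(X, y)
    print(n, hinge(w_hat, X_pop, y_pop) - hinge(w_star, X_pop, y_pop),
          F(w_hat) - F(w_star))        # loss gap ~1/sqrt(n), objective gap ~1/n
```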
3 Oracle Inequalities for SVMs

In this section we apply the results of the previous section to obtain an oracle inequality on the expected loss $L(w) = \mathbb{E}[\ell(\langle w, x\rangle, y)]$ of an approximate minimizer of the SVM training objective $\hat F(w) = \hat{\mathbb{E}}[f(w)]$, where

$$f(w; x, y) = \ell(\langle w, x\rangle, y) + \frac{\lambda}{2}\|w\|^2, \quad (9)$$

and $\ell(z,y)$ is the hinge loss, or any other 1-Lipschitz loss function. As before we denote $B = \sup_x \|x\|$ (all norms in this section are $\ell_2$ norms).

We assume, as an oracle assumption, that there exists a good predictor $w_o$ with low norm $\|w_o\|$ which attains low expected loss $L(w_o)$. Consider an optimization algorithm for $\hat F(w)$ that is guaranteed to find $\tilde w$ such that $\hat F(\tilde w) \le \min_w \hat F(w) + \epsilon_{\mathrm{opt}}$. Using the results of Section 2, we can translate this approximate optimality of the empirical objective to an approximate optimality of the expected objective $F(w) = \mathbb{E}[f(w)]$. Specifically, applying Corollary 2 with $a = 1$ we have that with probability at least $1-\delta$:

$$F(\tilde w) - F(w^*) \le 2\,\epsilon_{\mathrm{opt}} + O\!\left(\frac{B^2 \log(1/\delta)}{\lambda n}\right). \quad (10)$$

Optimizing to within $\epsilon_{\mathrm{opt}} = O\!\big(\frac{B^2}{\lambda n}\big)$ is then enough to ensure

$$F(\tilde w) - F(w^*) = O\!\left(\frac{B^2 \log(1/\delta)}{\lambda n}\right). \quad (11)$$

In order to translate this to a bound on the expected loss $L(\tilde w)$ we consider the following decomposition:

$$L(\tilde w) = L(w_o) + \big(F(\tilde w) - F(w^*)\big) + \big(F(w^*) - F(w_o)\big) + \frac{\lambda}{2}\|w_o\|^2 - \frac{\lambda}{2}\|\tilde w\|^2 \le L(w_o) + O\!\left(\frac{B^2 \log(1/\delta)}{\lambda n}\right) + 0 + \frac{\lambda}{2}\|w_o\|^2, \quad (12)$$

where we used the bound (11) to bound the second term, the optimality of $w^*$ to ensure the third term is non-positive, and we also dropped the last, non-positive, term.

This might seem like a rate of $1/n$ on the generalization error, but we need to choose $\lambda$ so as to balance the second and third terms. The optimal choice for $\lambda$ is

$$\lambda(n) = c\,\frac{B}{\|w_o\|}\sqrt{\frac{\log(1/\delta)}{n}}, \quad (13)$$

for some constant $c$. We can now formally state our oracle inequality, which is obtained by substituting (13) into (12):

Corollary 4. Consider an SVM-type objective as in (9). For any $w_o$ and any $\delta > 0$, with probability at least $1-\delta$ over a sample of size $n$, we have that for all $\tilde w$ s.t. $\hat F_{\lambda(n)}(\tilde w) \le \min_w \hat F_{\lambda(n)}(w) + O\big(\frac{B^2}{\lambda(n)\,n}\big)$, where $\lambda(n)$ is chosen as in (13), the following holds:

$$L(\tilde w) \le L(w_o) + O\!\left(\sqrt{\frac{B^2 \|w_o\|^2 \log(1/\delta)}{n}}\right).$$

Corollary 4 is demonstrated empirically on the right plot of Figure 1.

The way we set $\lambda(n)$ in Corollary 4 depends on $\|w_o\|$. However, using

$$\lambda(n) = B\sqrt{\frac{\log(1/\delta)}{n}} \quad (14)$$

we obtain:

Corollary 5. Consider an SVM-type objective as in (9) with $\lambda(n)$ set as in (14). For any $\delta > 0$, with probability at least $1-\delta$ over a sample of size $n$, we have that for all $\tilde w$ s.t. $\hat F_{\lambda(n)}(\tilde w) \le \min_w \hat F_{\lambda(n)}(w) + O\big(\frac{B^2}{\lambda(n)\,n}\big)$, the following holds:

$$L(\tilde w) \le \inf_{w_o}\left[L(w_o) + O\!\left(\sqrt{\frac{B^2 \big(\|w_o\|^4 + 1\big) \log(1/\delta)}{n}}\right)\right].$$

The price we pay here is that the bound of Corollary 5 is larger by a factor of $\|w_o\|$ relative to the bound of Corollary 4. Nevertheless, this bound allows us to converge with a rate of $1/\sqrt{n}$ to the expected loss of any fixed predictor.
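The schedules (13)-(14) and the resulting bounds of Corollaries 4 and 5 are easy to compare numerically. The sketch below just evaluates the displayed expressions, dropping the unspecified big-O constants (so it shows scaling only); the constant `c` and `delta` are placeholders of our choosing.

```python
# Plug-in of the regularization schedules (13)-(14) and oracle bounds.
import math

def lam_13(n, B, norm_wo, delta=0.01, c=1.0):
    """Oracle-dependent schedule (13): requires knowing ||w_o||."""
    return c * B * math.sqrt(math.log(1 / delta) / n) / norm_wo

def lam_14(n, B, delta=0.01):
    """Oracle-independent schedule (14)."""
    return B * math.sqrt(math.log(1 / delta) / n)

def cor4_excess(n, B, norm_wo, delta=0.01):
    return math.sqrt(B**2 * norm_wo**2 * math.log(1 / delta) / n)

def cor5_excess(n, B, norm_wo, delta=0.01):
    return math.sqrt(B**2 * (norm_wo**4 + 1) * math.log(1 / delta) / n)

for n in (10**3, 10**4, 10**5):
    print(n, lam_13(n, 1.0, 5.0), lam_14(n, 1.0),
          cor4_excess(n, 1.0, 5.0), cor5_excess(n, 1.0, 5.0))
    # Corollary 5 pays an extra ||w_o|| factor but needs no oracle knowledge.
```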
It is interesting to repeat the analysis of this section using the more standard result:

$$F(w) - F(w^*) \le \hat F(w) - \hat F(w^*) + O\!\left(\frac{B_w B}{\sqrt{n}}\right) \quad (15)$$

for $\|w\| \le B_w$, where we ignore the dependence on $\delta$. Setting $B_w = \sqrt{2/\lambda}$, as this is a bound on the norm of both the empirical and population optimums, and using (15) instead of Corollary 2 in our analysis yields the oracle inequality:

$$L(\tilde w) \le L(w_o) + O\!\left(\left(\frac{B^2 \|w_o\|^2 \log(1/\delta)}{n}\right)^{1/3}\right). \quad (16)$$

The oracle analysis studied here is very simple--our oracle assumption involves only a single predictor $w_o$, and we make no assumptions about the kernel or the noise. We note that a more sophisticated analysis has been carried out by Steinwart et al [4], who showed that rates faster than $1/\sqrt{n}$ are possible under certain conditions on the noise and on the complexity of the kernel class. In Steinwart et al's analyses the estimation rates (i.e. the rates for the expected regularized risk) are given in terms of the approximation error quantity $\frac{\lambda}{2}\|w^*\|^2 + L(w^*) - L^*$, where $L^*$ is the Bayes risk. In our result we consider the estimation rate for the regularized objective independently of the approximation error.

4 Proof of Main Result

To prove Theorem 1 we use techniques of reweighting and peeling, following Bartlett et al [5]. For each $w$, we define $g_w(\theta) = f(w;\theta) - f(w^*;\theta)$, and so our goal is to bound the expectation of $g_w$ in terms of its empirical average. We denote $\mathcal{G} = \{g_w \mid w \in \mathcal{W}\}$. Since our desired bound is not exactly uniform, and we would like to pay different attention to functions depending on their expected sub-optimality, we will instead consider the following reweighted class. For any $r > 0$ define

$$\mathcal{G}_r = \left\{ g^r_w = \frac{g_w}{4^{k(w)}} \;:\; w \in \mathcal{W},\ k(w) = \min\{k \in \mathbb{Z}_+ : \mathbb{E}[g_w] \le r\,4^k\} \right\} \quad (17)$$

where $\mathbb{Z}_+$ is the set of non-negative integers. In other words, $g^r_w \in \mathcal{G}_r$ is just a scaled version of $g_w \in \mathcal{G}$, and the scaling factor ensures that $\mathbb{E}[g^r_w] \le r$.
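The reweighting in (17) is easy to sanity-check numerically: $k(w)$ is the smallest non-negative integer with $\mathbb{E}[g_w] \le r\,4^k$, so dividing by $4^{k(w)}$ always brings the expectation below $r$. A tiny illustration with arbitrary values:

```python
# Illustration of the scaling in (17): every scaled expectation lands <= r.
import math

def k_of(expected_g, r):
    """Smallest non-negative integer k with expected_g <= r * 4**k."""
    if expected_g <= r:
        return 0
    return math.ceil(math.log(expected_g / r, 4))

r = 0.5
for eg in (0.1, 0.5, 0.7, 3.0, 40.0):   # hypothetical values of E[g_w]
    k = k_of(eg, r)
    print(eg, k, eg / 4**k)             # last column is E[g_w^r] <= r
```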
We will begin by bounding the variation between expected and empirical average values of $g^r \in \mathcal{G}_r$. This is typically done in terms of the complexity of the class $\mathcal{G}_r$. However, we will instead use the complexity of a slightly different class of functions, which ignores the non-random (i.e. non-data-dependent) regularization terms $r(w)$. Define:

$$\mathcal{H}_r = \left\{ h^r_w = \frac{h_w}{4^{k(w)}} \;:\; w \in \mathcal{W},\ k(w) = \min\{k \in \mathbb{Z}_+ : \mathbb{E}[g_w] \le r\,4^k\} \right\} \quad (18)$$

where

$$h_w(\theta) = g_w(\theta) - \big(r(w) - r(w^*)\big) = \ell(\langle w, \phi(\theta)\rangle;\theta) - \ell(\langle w^*, \phi(\theta)\rangle;\theta). \quad (19)$$

That is, $h^r_w(\theta)$ is the data-dependent component of $g^r_w$, dropping the (scaled) regularization terms. With this definition we have $\mathbb{E}[g^r_w] - \hat{\mathbb{E}}[g^r_w] = \mathbb{E}[h^r_w] - \hat{\mathbb{E}}[h^r_w]$ (the regularization terms on the left-hand side cancel out), and so it is enough to bound the deviation of the empirical means in $\mathcal{H}_r$. This can be done in terms of the Rademacher complexity of the class, $\mathcal{R}(\mathcal{H}_r)$ [6, Theorem 5]: for any $\delta > 0$, with probability at least $1-\delta$,

$$\sup_{h^r \in \mathcal{H}_r} \mathbb{E}[h^r] - \hat{\mathbb{E}}[h^r] \le 2\,\mathcal{R}(\mathcal{H}_r) + \sup_{h^r \in \mathcal{H}_r,\,\theta} |h^r(\theta)|\,\sqrt{\frac{\log 1/\delta}{2n}}. \quad (20)$$

We will now proceed to bounding the two terms on the right-hand side.

Lemma 6. $\sup_{h^r \in \mathcal{H}_r,\,\theta} |h^r(\theta)| \le LB\sqrt{2r/\lambda}$.

Proof. From the definition of $h^r_w$ given in (18)-(19), the Lipschitz continuity of $\ell(\cdot;\theta)$, and the bound $\|\phi(\theta)\|_* \le B$, we have for all $w, \theta$:

$$|h^r_w(\theta)| = \frac{|h_w(\theta)|}{4^{k(w)}} \le \frac{LB\,\|w - w^*\|}{4^{k(w)}}. \quad (21)$$

We now use the strong convexity of $F(w)$, and in particular eq. (7), as well as the definitions of $g_w$ and $k(w)$, and finally note that $4^{k(w)} \ge 1$, to get:

$$\|w - w^*\|^2 \le \frac{2}{\lambda}\big(F(w) - F(w^*)\big) = \frac{2}{\lambda}\,\mathbb{E}[g_w] \le \frac{2}{\lambda}\,4^{k(w)} r \le \frac{2}{\lambda}\,16^{k(w)} r. \quad (22)$$

Substituting (22) in (21) yields the desired bound.

Lemma 7. $\mathcal{R}(\mathcal{H}_r) \le 2LB\sqrt{\frac{2r}{\lambda n}}$.

Proof. We will use the following generic bound on the Rademacher complexity of linear functionals [7, Theorem 1]: for any $t(w)$ which is $\mu$-strongly convex (w.r.t. a norm with dual norm $\|\cdot\|_*$),

$$\mathcal{R}\big(\{\langle w, \phi(\theta)\rangle \mid t(w) \le a\}\big) \le \big(\sup_\theta \|\phi(\theta)\|_*\big)\sqrt{\frac{2a}{\mu n}}. \quad (23)$$

For each $a > 0$, define $\mathcal{H}(a) = \{h_w : w \in \mathcal{W},\ \mathbb{E}[g_w] \le a\}$. First note that $\mathbb{E}[g_w] = F(w) - F(w^*)$ is $\lambda$-strongly convex. Using (23) and the Lipschitz composition property we therefore have $\mathcal{R}(\mathcal{H}(a)) \le LB\sqrt{\frac{2a}{\lambda n}}$. Now:

$$\mathcal{R}(\mathcal{H}_r) \le \mathcal{R}\Big(\bigcup_{j \ge 0} 4^{-j}\,\mathcal{H}(r\,4^j)\Big) \le \sum_{j=0}^{\infty} 4^{-j}\,\mathcal{R}\big(\mathcal{H}(r\,4^j)\big) \le LB\sqrt{\frac{2r}{\lambda n}}\sum_{j=0}^{\infty} 4^{-j/2} = 2LB\sqrt{\frac{2r}{\lambda n}}.$$

We now proceed to bound $\mathbb{E}[g_w] = F(w) - F(w^*)$, and thus prove Theorem 1. For any $r > 0$, with probability at least $1-\delta$ we have, for all $w$:

$$\mathbb{E}[g_w] - \hat{\mathbb{E}}[g_w] = 4^{k(w)}\big(\mathbb{E}[g^r_w] - \hat{\mathbb{E}}[g^r_w]\big) = 4^{k(w)}\big(\mathbb{E}[h^r_w] - \hat{\mathbb{E}}[h^r_w]\big) \le 4^{k(w)}\sqrt{r}\,D \quad (24)$$

where $D = \frac{LB}{\sqrt{\lambda n}}\big(4\sqrt{2} + \sqrt{\log(1/\delta)}\big) \le \sqrt{2}\,LB\sqrt{\frac{32+\log(1/\delta)}{\lambda n}}$ is obtained by substituting Lemmas 6 and 7 into (20). We now consider two possible cases: $k(w) = 0$ and $k(w) > 0$.

The case $k(w) = 0$ corresponds to functions with an expected value close to optimal: $\mathbb{E}[g_w] \le r$, i.e. $F(w) \le F(w^*) + r$. In this case (24) becomes:

$$\mathbb{E}[g_w] \le \hat{\mathbb{E}}[g_w] + \sqrt{r}\,D. \quad (25)$$

We now turn to functions for which $k(w) > 0$, i.e. with expected values further away from optimal. In this case, the definition of $k(w)$ ensures $4^{k(w)-1} r < \mathbb{E}[g_w]$, and substituting this into (24) we have $\mathbb{E}[g_w] - \hat{\mathbb{E}}[g_w] \le 4\,\frac{\mathbb{E}[g_w]}{r}\,\sqrt{r}\,D = \frac{4D}{\sqrt{r}}\,\mathbb{E}[g_w]$. Rearranging terms yields:

$$\mathbb{E}[g_w] \le \frac{1}{1 - 4D/\sqrt{r}}\,\hat{\mathbb{E}}[g_w]. \quad (26)$$

Combining the two cases (25) and (26) (and requiring $r \ge (4D)^2$ so that $\frac{1}{1 - 4D/\sqrt{r}} \ge 1$), we always have:

$$\mathbb{E}[g_w] \le \frac{1}{1 - 4D/\sqrt{r}}\,\big[\hat{\mathbb{E}}[g_w]\big]_+ + \sqrt{r}\,D. \quad (27)$$

Setting $r = \big(1 + \tfrac{1}{a}\big)^2 (4D)^2$ yields the bound in Theorem 1.
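The generic bound (23) used in Lemma 7 can be checked by Monte Carlo in the Euclidean case, where the supremum over a norm ball has a closed form: with $t(w) = \frac{1}{2}\|w\|_2^2$ (1-strongly convex), $\{t(w) \le a\}$ is the ball of radius $\sqrt{2a}$ and $\sup_{\|w\| \le W} \frac{1}{n}\sum_i \sigma_i \langle w, x_i\rangle = W\,\|\frac{1}{n}\sum_i \sigma_i x_i\|_2$. This is only an illustration of (23) under arbitrary data of our choosing, not part of the proof.

```python
# Monte Carlo sanity check of the Rademacher bound (23), Euclidean case.
import numpy as np

rng = np.random.default_rng(2)
n, d, a = 500, 30, 2.0
X = rng.uniform(-1.0, 1.0, size=(n, d))
B = np.linalg.norm(X, axis=1).max()   # sup ||phi(theta)||_* over the sample
radius = np.sqrt(2 * a)               # {t(w) <= a} for t(w) = ||w||^2 / 2

estimates = []
for _ in range(2000):                 # average over Rademacher sign draws
    s = rng.choice([-1.0, 1.0], size=n)
    estimates.append(radius * np.linalg.norm((s[:, None] * X).mean(axis=0)))

print("empirical Rademacher complexity:", np.mean(estimates))
print("bound (23):                     ", B * np.sqrt(2 * a / n))
```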
5 Comparison with Previous "Fast Rate" Guarantees

Rates faster than $1/\sqrt{n}$ for estimation have been previously explored under various conditions, where strong convexity has played a significant role. Lee et al [8] showed faster rates for the squared loss, exploiting the strong convexity of this loss function, but only under a finite pseudo-dimension assumption, which does not hold in SVM-like settings. Bousquet [9] provided similar guarantees when the spectrum of the kernel matrix (covariance of the data) is exponentially decaying. Tsybakov [10] introduced a margin condition under which rates faster than $1/\sqrt{n}$ are shown possible. It is also possible to ensure rates of $1/n$ by relying on low-noise conditions [9, 11], but here we make no such assumption.

Most methods for deriving fast rates first bound the variance of the functions in the class by some monotone function of their expectations. Then, using methods as in Bartlett et al [5], one can get bounds that have a localized complexity term and additional terms of order faster than $1/\sqrt{n}$. However, it is important to note that the localized complexity term typically dominates the rate and still needs to be controlled. For example, Bartlett et al [12] show that strict convexity of the loss function implies a variance bound, and provide a general result that can enable obtaining faster rates as long as the complexity term is low. For instance, for classes with finite VC dimension $V$, the resulting rate is $n^{-(V+2)/(2V+2)}$, which indeed is better than $1/\sqrt{n}$ but is not quite $1/n$. Thus we see that even for a strictly convex loss function, such as the squared loss, additional conditions are necessary in order to obtain "fast" rates. In this work we show that strong convexity not only implies a variance bound but in fact can be used to bound the localized complexity.

An important distinction is that we require strong convexity of the function $F(w)$ with respect to the norm $\|w\|$. This is rather different than requiring the loss function $z \mapsto \ell(z, y)$ to be strongly convex on the reals. In particular, the loss of a linear predictor, $w \mapsto \ell(\langle w, x\rangle, y)$, can never be strongly convex in a multi-dimensional space, even if $\ell$ is strongly convex, since it is flat in directions orthogonal to $x$. As mentioned, $f(w; x, y) = \ell(\langle w, x\rangle, y)$ can never be strongly convex in a high-dimensional space. However, we actually only require the strong convexity of the expected loss $F(w)$. If the loss function $\ell(z, y)$ is strongly convex in $z$, and the eigenvalues of the covariance of $x$ are bounded away from zero, strong convexity of $F(w)$ can be ensured: $F(w)$ would be $c$-strongly convex, where $c$ is the minimal eigenvalue of $\mathrm{Cov}[x]$ scaled by the strong convexity parameter of $\ell$. This enables us to use Theorem 1 to obtain rates of $1/n$ on the expected loss itself. However, we cannot expect the eigenvalues to be bounded away from zero in very high dimensional spaces, limiting the applicability of this result to low-dimensional spaces where, as discussed above, other results also apply.

An interesting observation about our proof technique is that the only concentration inequality we invoked was McDiarmid's inequality (in [6, Theorem 5], to obtain (20)--a bound on the deviations in terms of the Rademacher complexity). This was possible because we could make a localization argument for the norm of the functions in our function class in terms of their expectation.

6 Summary

We believe this is the first demonstration that, without any additional requirements, the SVM objective converges to its infinite data limit with a rate of $O(1/n)$. This improves on previous results that considered the SVM objective only under special additional conditions. The results extend also to other regularized objectives.

Although the quantity that is ultimately of interest to us is the expected loss, and not the regularized expected loss, it is still important to understand the statistical behavior of the regularized expected loss. This is the quantity that we actually optimize, track, and often provide bounds on (e.g. in approximate or stochastic optimization approaches). A better understanding of its behavior allows us to theoretically explore the behavior of regularized learning methods, to better understand empirical behavior observed in practice, and to appreciate guarantees of stochastic optimization approaches for such regularized objectives. As we saw in Section 3, deriving such fast rates is also essential for obtaining simple and general oracle inequalities, which also help guide our choice of regularization parameters.

References

[1] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

[2] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.

[3] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.

[4] I. Steinwart, D. Hush, and C. Scovel. A new concentration result for regularized risk minimizers. High-Dimensional Probability IV, IMS Lecture Notes, 51:260-275, 2006.

[5] P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities. In COLT '02: Proceedings of the 15th Annual Conference on Computational Learning Theory, pages 44-58, London, UK, 2002. Springer-Verlag.
[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures in Machine Learning, pages 169-207. Springer, 2004.

[7] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, 2008.

[8] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. In Computational Learning Theory, pages 140-146, 1996.

[9] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.

[10] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135-166, 2004.

[11] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35:575, 2007.

[12] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, March 2006.