On the Equivalence of Weak Learnability and Linear Separability: New Relaxations and Efficient Boosting Algorithms

Shai Shalev-Shwartz, Toyota Technological Institute, Chicago, USA. SHAI@TTI-C.ORG
Yoram Singer, Google, Mountain View, USA. SINGER@GOOGLE.COM

Abstract

Boosting algorithms build highly accurate prediction mechanisms from a collection of low-accuracy predictors. To do so, they employ the notion of weak-learnability. The starting point of this paper is a proof which shows that weak learnability is equivalent to linear separability with $\ell_1$ margin. While this equivalence is a direct consequence of von Neumann's minimax theorem, we derive the equivalence directly using Fenchel duality. We then use our derivation to describe a family of relaxations of the weak-learnability assumption that readily translates to a family of relaxations of linear separability with margin. This alternative perspective sheds new light on known soft-margin boosting algorithms and also enables us to derive several new relaxations of the notion of linear separability. Last, we describe and analyze an efficient boosting framework that can be used for minimizing the loss functions derived from our family of relaxations. In particular, we obtain efficient boosting algorithms for maximizing hard and soft versions of the $\ell_1$ margin.

1 Introduction

Boosting is a popular and successful method for building highly accurate predictors from a set of low-accuracy base predictors. For an overview see for example [FS99, Sch03, MR03]. The first boosting algorithm was used for showing the equivalence between weak learnability and strong learnability [Sch90]. Weak learnability means that for any distribution over a set of examples there exists a single feature, also referred to as a weak hypothesis, that performs slightly better than random guessing. Schapire [Sch90] was the first to show that if the weak learnability assumption holds then it is possible to construct a highly accurate classifier, to the point that it perfectly classifies all the examples in the training set. This highly accurate classifier is obtained by taking the sign of a weighted combination of weak hypotheses. Put another way, [Sch90] showed that if the weak learnability assumption holds then the set of examples is linearly separable. Studying the generalization properties of the AdaBoost algorithm, Schapire et al. [SFBL97] showed that AdaBoost in fact finds a linear separator with a large margin. However, AdaBoost does not converge to the max-margin solution [RW05, RSD07]. Interestingly, the equivalence between weak learnability and linear separability is not only qualitative but also quantitative: weak learnability with edge $\gamma$ is equivalent to linear separability with an $\ell_1$ margin of $\gamma$. We give a precise statement and a simple proof of the equivalence in Thm. 4. We note that the equivalence can also be derived from von Neumann's minimax theorem [vN28]. Nevertheless, our proof is instructive and serves as a building block for the derivation of our main results.

Since the weak learnability assumption is equivalent to linear separability, the weak-learnability assumption is unrealistic due to its high sensitivity to even small amounts of label noise. For example, assume that we have a dataset that is perfectly separable with a large margin with the exception of two examples. These two examples share the same instance but attain opposite labels. Since such a dataset is non-separable, the weak learnability assumption fails to hold as well. To cope with this problem, we must somehow relax the weak learnability assumption, which is equivalent to relaxing the linear separability assumption. In this paper we propose a family of relaxations of the linear separability assumption, which stems from the equivalence of weak-learnability and linear-separability. The guiding tool is to first define a natural family of relaxations of the weak learnability assumption, and then analyze its implication on the separability assumption.

In addition to our analysis and the relaxations outlined above, we also propose and analyze an algorithmic framework for boosting that efficiently solves the problems derived from our family of relaxations. The algorithm finds an $\epsilon$-accurate solution after performing at most $O(\log(m)/\epsilon^2)$ iterations, where $m$ is the number of training examples. The number of iterations upper bounds the number of different weak hypotheses constituting the solution. Therefore, we cast a natural trade-off between the desired accuracy level, $\epsilon$, of the (possibly relaxed) margin attained by the weight vector learned by the boosting algorithm, and the sparseness of the resulting predictor. In particular, we obtain new algorithms for maximizing the hard and soft $\ell_1$ margin. We also provide an $O(m\log(m))$ procedure for entropic projections onto $\ell_\infty$ balls. Combined with this procedure, the total complexity of each iteration of our algorithm for minimizing the soft $\ell_1$ margin is almost the same as the complexity of each iteration of AdaBoost, assuming that each activation of the weak learning algorithm requires $O(m)$ time.

Related Work

As mentioned above, the equivalence between weak learnability and linear separability with $\ell_1$ margin is a direct consequence of von Neumann's minimax theorem in game theory [vN28]. Freund and Schapire [FS96] were the first to use von Neumann's result to draw a connection between weak learnability and separability. They showed that if the weak learnability assumption holds then the data is linearly separable. The exact quantification of the weak learnability parameter and the $\ell_1$ margin parameter was spelled out later in [RW05]. Schapire et al. [SFBL97] showed that the AdaBoost algorithm finds a large margin solution. However, as pointed out by [RW05, RSD07], AdaBoost does not converge to the max-margin solution. Rätsch and Warmuth [RW05] suggested an algorithm called AdaBoost* which converges to the maximal margin solution in $O(\log(m)/\epsilon^2)$ iterations. The family of algorithms we propose in this paper entertains the same convergence properties. Rudin et al. [RSD07] provided a more accurate analysis of the margin attained by AdaBoost and also presented algorithms for achieving the max-margin solution. However, their algorithm may take $O(1/\epsilon^3)$ iterations to find an $\epsilon$-accurate predictor.

The above algorithms are effective when the data is linearly separable. Over the years, many boosting algorithms were suggested for non-separable datasets. We list here a few examples. The LogLossBoost algorithm [CSS02] tries to minimize the cumulative logistic loss, which is less sensitive to noise. MadaBoost [DW00] is another example of an algorithm that copes with non-separability. It does so by capping from above the importance weights produced by the boosting algorithm. MadaBoost shares similarities with some of the relaxations presented in this paper. However, MadaBoost does not exploit the aforementioned equivalence and has a convergence rate that seems to be inferior to the rate obtained by the relaxations we consider in this paper. Another notable example of a boosting algorithm that works well in the non-separable case and is noise tolerant is the BrownBoost algorithm [Fre01]. BrownBoost uses the error function (erf) as a margin-based loss function. The error function reaches an asymptote when its input (the margin, in the context of BrownBoost) tends to $-\infty$. It thus constitutes a robust alternative to convex loss functions, including the LogLoss function. Since the error function is non-convex, the results presented in this paper are not applicable to BrownBoost.

In the support vector machine literature, the common relaxation of the separability assumption is obtained by using the hinge loss (see for example [CST00]). Warmuth, Glocer and Rätsch [WGR07] recently proposed the SoftBoost algorithm that directly minimizes the hinge-loss function. The relaxation described in [WGR07] is a special case of the family of relaxations we present in this paper. The SoftBoost algorithm also builds on the idea of relaxing the weak learnability assumption by capping the maximal weight of a single example. A similar idea was also used by the SmoothBoost algorithm [Ser03]. Our presentation leads to an interesting perspective on this relaxation, showing that maximizing the margin while minimizing the hinge loss is equivalent to maximizing the average margin of the $k$ examples with the worst margin. This equivalence is also implied by the work presented in [WGR07]. More importantly, in this paper we present a much simpler algorithm which does not employ a convex optimization procedure on each round of boosting. Our approach stands in contrast to the algorithm of [WGR07], which requires "totally corrective" updates (see also [WLR06]) and needs to solve a rather complex optimization problem on each iteration.

The family of boosting algorithms we derive is reminiscent of the boosting algorithm proposed by Zhang [Zha03]. However, our analysis is different and allows us to: (i) provide an analytic solution for the step size; (ii) tackle complicated loss functions, including cases where the loss function does not take an explicit form. Our analysis stems from the primal-dual view of online convex programming [SSS06a, SSS07, SS07] and also borrows ideas from the analysis given in [SVL07]. The main difference between our analysis and that of [SVL07, Zha03] is that we do not impose any assumption on the second order derivatives of the objective function. Instead, we rely on a duality argument and require a strong convexity assumption on the Fenchel conjugate of the loss function. As we show, in many interesting cases it is simple to verify that our assumption holds, while it is very complex to analyze the second order derivatives of the loss function at hand.

Throughout this paper, we focus on the analysis of the empirical loss over the training set. There has been extensive work on obtaining generalization bounds for boosting algorithms and for margin-based hypotheses. We refer the reader for example to [SFBL97, MBB98, KPL01]. A complementary question, left out of the scope of this paper, is whether the equivalence between weak learnability and linear separability with margin can be exploited for obtaining improved generalization bounds.
2 Notation and basic definitions

Let $(x_1, y_1), \ldots, (x_m, y_m)$ be a sequence of $m$ examples, where for all $i$, $x_i \in \mathcal{X}$ and $y_i \in \{+1, -1\}$. Let $\mathcal{H}$ be a set of base hypotheses; namely, each $h \in \mathcal{H}$ is a function from $\mathcal{X}$ into $[+1, -1]$. For simplicity, we assume that $\mathcal{H}$ is finite and thus $\mathcal{H} = \{h_1, \ldots, h_n\}$. Let $A$ be a matrix of size $m \times n$ over $[+1, -1]$ whose $(i, j)$ entry is $A_{i,j} = y_i h_j(x_i)$. We note that boosting algorithms solely use the matrix $A$ and do not work directly with the set of examples. Therefore, throughout the rest of the paper we focus on the properties of the matrix $A$.

We denote column vectors with bold face letters, e.g. $d$ and $w$, and use the notation $d^\top$, $w^\top$ for their corresponding row vectors. The inner product between vectors is denoted by $\langle d, w \rangle = d^\top w$. We denote by $A^\top$ the transpose of the matrix $A$. The vector obtained by multiplying a matrix $A$ with a vector $d$ is designated $A d$ and its $i$th element $(A d)_i$. The set of non-negative real numbers is denoted by $\mathbb{R}_+$ and the set of integers $\{1, \ldots, n\}$ by $[n]$. The $m$ dimensional probability simplex is denoted by $S^m = \{d \in \mathbb{R}_+^m : \|d\|_1 = 1\}$. We denote the $m$ dimensional $\ell_1$ ball of radius $r$ by $B_1^m(r) = \{w \in \mathbb{R}^m : \|w\|_1 \le r\}$; for the unit $\ell_1$ ball we often omit $r$ and use the shorthand $B_1^m$. Similarly, we denote the $m$ dimensional $\ell_p$ ball by $B_p^m(r) = \{w \in \mathbb{R}^m : \|w\|_p \le r\}$ and again omit $r$ whenever it equals 1.

Definition 1 (separability with $\ell_1$ margin $\gamma$) A matrix $A$ is linearly separable with $\ell_1$ margin $\gamma$ if there exists $w \in B_1^n$ such that $\min_{i \in [m]} (A w)_i \ge \gamma$, and $\gamma$ is the largest scalar that satisfies the above inequality, namely,
$$\gamma \;=\; \max_{w \in B_1^n} \; \min_{i \in [m]} (A w)_i \, .$$

Definition 2 ($\gamma$-weak-learnability) A matrix $A$ is $\gamma$-weak-learnable if for all $d \in S^m$ there exists $j \in [n]$ such that $|(d^\top A)_j| \ge \gamma$, and $\gamma$ is the largest scalar that satisfies the above. Namely,
$$\gamma \;=\; \min_{d \in S^m} \; \max_{j \in [n]} |(d^\top A)_j| \, .$$

We next give a few basic definitions from convex analysis. A set $S \subseteq \mathbb{R}^n$ is convex if for any two vectors $d_1, d_2$ in $S$, the whole line segment between $d_1$ and $d_2$ is also in $S$, that is, $\{\alpha d_1 + (1-\alpha) d_2 : \alpha \in [0,1]\} \subseteq S$. A function $f : S \to \mathbb{R}$ is closed and convex if for any scalar $r$, the level set $\{d : f(d) \le r\}$ is closed and convex. We allow functions to output $+\infty$ and denote by $\mathrm{dom}(f)$ the set $\{d : f(d) < +\infty\}$. The core of a set $C \subseteq \mathbb{R}^n$, denoted $\mathrm{core}(C)$, is the set of all points $x \in C$ such that for all $d \in \mathbb{R}^n$ there exists $\tau' > 0$ for which for all $\tau \in [0, \tau']$ we have $x + \tau d \in C$. The Fenchel conjugate of a function $f : S \to \mathbb{R}$ is defined as
$$f^\star(\theta) \;=\; \max_{d \in S} \; \langle d, \theta \rangle - f(d) \, .$$
If $f$ is closed and convex then $f^{\star\star} = f$. Our derivation makes extensive use of the following theorem.

Theorem 3 (Fenchel duality; Theorem 3.3.5 in [BL06]) Let $f : \mathbb{R}^m \to \mathbb{R} \cup \{\infty\}$ and $g : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ be two closed and convex functions and let $A$ be a matrix of dimension $m \times n$. Then,
$$\max_{w} \; -f^\star(-A w) - g^\star(w) \;\le\; \min_{d} \; f(d) + g(d^\top A) \, .$$
The above holds with equality if, in addition, we have $0 \in \mathrm{core}\big(\mathrm{dom}(g) - A^\top \mathrm{dom}(f)\big)$.

We denote an arbitrary norm by $\|\cdot\|$ and its dual norm by
$$\|w\|_\star \;=\; \max_{d : \|d\| \le 1} \langle w, d \rangle \, .$$
Two dual norms that we use extensively are $\|w\|_1 = \sum_i |w_i|$ and $\|w\|_\infty = \max_i |w_i|$. For a set $C$, we denote by $I_C(d)$ the indicator function of $C$, that is, $I_C(d) = 0$ if $d \in C$ and $I_C(d) = \infty$ otherwise. The definition of $\|\cdot\|_\star$ implies that the Fenchel conjugate of $I_C(d)$, for $C = \{d : \|d\| \le 1\}$, is the function $\|\cdot\|_\star$. To conclude this section, we point the reader to Table 1, which summarizes our notation.

Table 1: Summary of notation.
  $x$, $x^\top$ : column vector and its transpose
  $\langle x, v \rangle$ : inner product ($= x^\top v$)
  $A$ : matrix of size $m \times n$
  $S^m$ : $m$ dimensional probability simplex
  $B_p^m(\nu)$ : $\ell_p$ ball $\{w \in \mathbb{R}^m : \|w\|_p \le \nu\}$
  $I_C(d)$ : indicator function ($= 0$ if $d \in C$ and $= \infty$ otherwise)
  $[x]_+$ : vector whose $i$th element equals $\max\{0, x_i\}$
  $\|\cdot\|$, $\|\cdot\|_\star$ : norm and its dual norm
  $f$, $f^\star$ : function and its Fenchel conjugate
  $e_i$ : all zeros vector except 1 in the $i$th position
  $[m]$ : the set $\{1, \ldots, m\}$

3 Weak-learnability and linear-separability

In this section we establish the equivalence between weak learnability and linear separability with $\ell_1$ margin. As mentioned before, this result can also be derived from von Neumann's minimax theorem. The purpose of the proof below is to underscore the duality between weak learnability and separability, which becomes useful in the next sections.

Theorem 4 A matrix $A$ is $\gamma$-weak-learnable if and only if it is linearly separable with $\ell_1$ margin of $\gamma$.

Proof: We prove the theorem using Fenchel duality (Thm. 3). For convenience, we refer to the optimization problem on the right (left) hand side of Thm. 3 as the primal (dual) optimization problem. Let $f$ be the indicator function of the $m$-dimensional simplex, i.e. $f(d) = 0$ if $d \in S^m$ and $f(d) = \infty$ otherwise, and let $g(w) = \|w\|_\infty$. Then, the primal problem is
$$P \;=\; \min_d \; f(d) + g(d^\top A) \;=\; \min_{d \in S^m} \|d^\top A\|_\infty \, . \qquad (1)$$
The definition of $\gamma$-weak-learnability conveys that $A$ is $P$-weak-learnable. Next, we turn to the dual problem. The Fenchel conjugate of $g$ is the indicator function of the set $B_1^n$ (see Sec. 2) and the Fenchel conjugate of $f$ is
$$f^\star(\theta) \;=\; \max_{d \in \mathbb{R}^m} \; \langle \theta, d \rangle - f(d) \;=\; \max_{d \in S^m} \langle \theta, d \rangle \;=\; \max_{i \in [m]} \theta_i \, .$$
Therefore,
$$D \;=\; \max_{w \in \mathbb{R}^n} \; -f^\star(-A w) - g^\star(w) \;=\; \max_{w \in B_1^n} \; \min_{i \in [m]} (A w)_i \, .$$
Definition 1 implies that $A$ is separable with $\ell_1$ margin of $D$. To conclude our proof, it is left to show that $P = D$. First, we note that for $w = 0$ the value of $D$ is zero, and thus $D \ge 0$. Therefore, if $P = 0$ then $0 = P \ge D \ge 0$, so in this case we clearly have $P = D$. Assume now that $P = \gamma > 0$. Based on Thm. 3 and the definition of the core operator, it suffices to show that for any vector $v$ there exists $\tau' > 0$ such that for all $\tau \in [0, \tau']$ we have $\tau v \notin \{A^\top d : d \in S^m\}$. This property holds true since for any $d \in S^m$ we have $\|A^\top d\|_\infty \ge P$, while for a sufficiently small $\tau'$ we must have $\|\tau v\|_\infty < P$ for all $\tau \in [0, \tau']$.
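The equivalence of Thm. 4 can be sanity-checked numerically on small instances, since both sides are optima of linear programs. The sketch below is ours (not part of the paper); it uses scipy.optimize.linprog and a random $\pm 1$ matrix as an illustrative example.

```python
import numpy as np
from scipy.optimize import linprog

def weak_learnability_gamma(A):
    """gamma = min_{d in simplex} max_j |(d^T A)_j|, written as an LP."""
    m, n = A.shape
    # variables: d (m entries) and s (1 entry); minimize s s.t. |A^T d| <= s, d in simplex
    c = np.r_[np.zeros(m), 1.0]
    A_ub = np.block([[A.T, -np.ones((n, 1))],
                     [-A.T, -np.ones((n, 1))]])
    b_ub = np.zeros(2 * n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)], method="highs")
    return res.fun

def l1_margin_gamma(A):
    """gamma = max_{||w||_1 <= 1} min_i (Aw)_i, via the split w = u - v, u, v >= 0."""
    m, n = A.shape
    # variables: u (n), v (n), t (1); maximize t, i.e. minimize -t
    c = np.r_[np.zeros(2 * n), -1.0]
    A_ub = np.block([[-A, A, np.ones((m, 1))],                    # t - (Aw)_i <= 0
                     [np.ones((1, 2 * n)), np.zeros((1, 1))]])    # sum(u) + sum(v) <= 1
    b_ub = np.r_[np.zeros(m), 1.0]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * n) + [(None, None)], method="highs")
    return -res.fun

rng = np.random.default_rng(0)
A = np.sign(rng.standard_normal((20, 6)))
print(weak_learnability_gamma(A), l1_margin_gamma(A))  # the two values coincide
```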
4 A family of relaxations

In the previous section we showed that weak learnability is equivalent to separability. The separability assumption is problematic since even a perturbation of a single example can break it. In this section we propose a family of relaxations of the separability assumption. The motivation for these relaxations stems from the equivalence between weak-learnability and separability. The main idea is to first define a natural family of relaxations of the weak learnability assumption, and then analyze the implication on the separability assumption. To simplify the presentation, we start with a particular relaxation that was studied in [Ser03, WLR06]. We then generalize the example and describe the full family of relaxations.

4.1 A first relaxation: capped probabilities and soft margin

To motivate the first simple relaxation, consider a matrix $A$ whose $i$th row equals the negation of its $j$th row. That is, our training set contains an instance which appears twice, each time with a different label. Clearly, this training set is not separable, even though the rest of the training set may be perfectly separable with a large margin. The equivalence between weak learnability and linear separability implies that $A$ is also not weak-learnable. To derive this property directly, construct the distribution $d$ with $d_i = d_j = \tfrac{1}{2}$ (and $d_r = 0$ for $r \notin \{i, j\}$) and note that $d^\top A = 0$.

In the above example, the weak learnability assumption fails because we place excessive weight on the problematic examples $i, j$. Indeed, it was observed that AdaBoost over-weighs examples, which partially explains its poor performance on noisy data. To overcome this problem, it was suggested (see for instance [Ser03, WLR06]) to restrict the set of admissible distributions by capping the maximum importance weight of each example. That is, the weak learner should return a weak hypothesis only when its input distribution satisfies $\|d\|_\infty \le 1/k$, for a predefined integer $k \in [m]$. Plugging this restriction on $d$ into Definition 2 we obtain the following relaxed weak learnability value,
$$\gamma \;=\; \min_{d \in S^m : \|d\|_\infty \le 1/k} \; \max_{j \in [n]} |(d^\top A)_j| \, . \qquad (2)$$
Assume that a matrix $A$ satisfies the above with $\gamma > 0$. The immediate question that surfaces is: what is the implication on the separability properties of $A$? To answer this question, we need to refine the duality argument given in the proof of Thm. 4. Let $f(d)$ be the indicator function of $S^m \cap B_\infty^m(1/k)$ and let $g(w) = \|w\|_\infty$. The optimization problem given in Eq. (2) can be rewritten as $\min_d f(d) + g(d^\top A)$. To derive the dual optimization problem, we find the Fenchel conjugate of $f$,
$$f^\star(\theta) \;=\; \max_{d \in S^m : \|d\|_\infty \le 1/k} \langle d, \theta \rangle \, .$$
To maximize the inner product $\langle d, \theta \rangle$ we should allocate the largest admissible weight to the largest element of $\theta$, allocate the largest of the remaining weights to the second largest element of $\theta$, and so on and so forth. For each $i \in [m]$, let $s_i(\theta)$ be the $i$th largest element of $\theta$, that is, $s_1(\theta) \ge s_2(\theta) \ge \cdots$. Then, the above argument yields
$$f^\star(\theta) \;=\; \frac{1}{k} \sum_{j=1}^{k} s_j(\theta) \, .$$
Combining the form of $f^\star$ with Thm. 3 we obtain that the dual problem of Eq. (2) is
$$\max_{w \in B_1^n} \; \frac{1}{k} \sum_{j=0}^{k-1} s_{m-j}(A w) \, . \qquad (3)$$
Using the same technique as in the proof of Thm. 4 it is easy to verify that strong duality holds as well. We therefore obtain the following corollary.

Corollary 5 Let $A$ be a matrix and let $k \in [m]$. For a vector $\theta$, let $\mathrm{AvgMin}_k(\theta)$ be the average of the $k$ smallest elements of $\theta$. Let $\gamma$ be as defined in Eq. (2). Then,
$$\max_{w \in B_1^n} \mathrm{AvgMin}_k(A w) \;=\; \gamma \, .$$

Let us now discuss the role of the parameter $k$. First, if $k = 1$ then the function $\mathrm{AvgMin}_k$ reduces to the minimum over the vector provided as its argument, and therefore we revert back to the traditional definition of margin. When $k = m$, the only admissible distribution is the uniform distribution. In this case, it is easy to verify that the optimal weight vector associates $w_j = 1$ with the feature that maximizes $|(d^\top A)_j|$ (with $d$ being the uniform distribution) and $w_j = 0$ with the rest of the features. That is, the performance of the optimal strong hypothesis equals the performance of the best single weak hypothesis, and no boosting process takes place. The interesting regime is when $k$ is proportional to $m$, for example $k = 0.1\,m$. In this case, if $\gamma > 0$, then we are guaranteed that 90% of the examples can be separated by a margin of at least $\gamma$.

It is also possible to set $k$ based on knowledge of the number of noisy examples in the training set and the separability level of the rest of the examples. For example, assume that all but $\ell$ of the examples are separable with margin $\gamma$ by some vector $w^\star$. Then, the objective value attained by $w^\star$ satisfies $k \cdot \mathrm{AvgMin}_k(A w^\star) \ge -\ell + (k - \ell)\gamma$. Constraining the right hand side of this inequality to be at least $k\gamma/2$ and solving for $k$ yields that for $k \ge 2\ell(\gamma + 1)/\gamma$ at least $m - k$ examples attain a margin value of at least $\gamma/2$.

4.2 A general relaxation scheme

We now generalize the above relaxation and present our general relaxation scheme. To do so, we first rewrite Eq. (2) as follows. Denote $C = B_\infty^m(1/k)$ and recall that $I_C(d)$ is the indicator function of the set $C$. We can now rewrite Eq. (2) as
$$\gamma \;=\; \min_{d \in S^m} \Big( \max_{j \in [n]} |(d^\top A)_j| + I_C(d) \Big) \, . \qquad (4)$$
The general relaxation scheme is obtained by replacing $I_C$ with a large family of functions. Before specifying the properties of allowed functions, let us first define the following generalized notion of weak learnability.

Definition 6 (($\gamma$, $f$)-weak-learnability) Let $f$ be an arbitrary function. A matrix $A$ is $(\gamma, f)$-weak-learnable if
$$\gamma \;=\; \min_{d \in S^m} \Big( \max_{j \in [n]} |(d^\top A)_j| + f(d) \Big) \, .$$

Intuitively, we can think of $\gamma$ as the minimum of the maximal edge plus a regularization term $f(d)$. In the case of capped importance weights, the regularization function is a barrier function that does not penalize distributions inside $B_\infty^m(1/k)$ and places an infinite penalty on the rest of the distributions. The following theorem shows how the fact that a matrix $A$ is $(\gamma, f)$-weak-learnable affects its separability properties. To remind the reader, we denote by $e_i$ the vector whose $i$th element is 1 and whose remaining elements are zero, and by $[x]_+$ the vector whose $i$th element is $\max\{0, x_i\}$.

Theorem 7 Let $f$ be a convex function, $\gamma$ be a scalar, and $A$ be a $(\gamma, f)$-weak-learnable matrix. If the following assumptions hold: (i) $\min_d f(d) = 0$, (ii) $0 \in \mathrm{core}(\mathrm{dom}(f^\star))$, (iii) for all $\theta \in \mathbb{R}^m$, $i \in [m]$, $\alpha \in [0, 1]$, the Fenchel conjugate of $f$ satisfies $f^\star(\theta) \ge f^\star(\theta - \alpha\, \theta_i e_i)$, then
$$\max_{w \in B_1^n,\, \rho \in \mathbb{R}} \Big( \rho - f^\star\big([\rho - A w]_+\big) \Big) \;=\; \gamma \, ,$$
where $\rho - A w$ denotes the vector whose $i$th element is $\rho - (A w)_i$.

The proof of the theorem is again based on the Fenchel duality theorem. The vector $[\rho - A w]_+$ appearing in the dual problem is the vector of hinge-losses. Before diving into the details of the proof, let us give two concrete families of functions that satisfy the requirements given in the theorem.

Example 1 Let $f$ be the indicator function of a ball of radius $\nu$, $\{d : \|d\| \le \nu\}$, where $\|\cdot\|$ is an arbitrary norm and $\nu$ is a scalar such that the intersection of this ball with the simplex is non-empty. Then $f^\star(w) = \nu \|w\|_\star$ and the conditions given in the theorem clearly hold. In this case, we obtain that
$$\gamma \;=\; \min_{d \in S^m : \|d\| \le \nu} \|d^\top A\|_\infty \qquad \text{and} \qquad \max_{w \in B_1^n,\, \rho \in \mathbb{R}} \Big( \rho - \nu\, \big\| [\rho - A w]_+ \big\|_\star \Big) \;=\; \gamma \, .$$
In particular, if $\|\cdot\|$ is the $\ell_\infty$ norm we obtain again the example of capped sample weights. Since the $\ell_1$ norm and the $\ell_\infty$ norm are dual, we get that in the dual problem we are maximizing the margin parameter $\rho$ while minimizing the cumulative hinge-loss. Combining this fact with Corollary 5 we get that
$$\mathrm{AvgMin}_k(A w) \;=\; \max_{\rho \in \mathbb{R}} \Big( \rho - \frac{1}{k} \sum_{i=1}^{m} \big[\rho - (A w)_i\big]_+ \Big) \, .$$
The right hand side of the above is usually called the "soft margin". The above equality tells us that the soft margin is equivalent to the average margin of the $k$ worst examples (see also [WLR06, SSWB98]).

Example 2 Let $f(d) = \nu \|d\|$, where $\|\cdot\|$ is an arbitrary norm and $\nu$ is a scalar. Then $f^\star(w)$ is the indicator function of the ball of radius $\nu$ with respect to the dual norm, $\{w : \|w\|_\star \le \nu\}$. The conditions given in the theorem clearly hold here as well and we obtain the dual problem
$$\max_{w \in B_1^n,\, \rho \in \mathbb{R}} \; \rho \quad \text{s.t.} \quad \big\| [\rho - A w]_+ \big\|_\star \le \nu \, .$$
That is, we are now maximizing the margin subject to a constraint on the vector of hinge-losses.

We now turn to proving Thm. 7. First, we need the following lemma, which characterizes the Fenchel conjugate of $f + I_{S^m}$.

Lemma 8 Assume that $f$ satisfies the conditions given in Thm. 7 and denote $\tilde{f}(d) = f(d) + I_{S^m}(d)$. Then,
$$\tilde{f}^\star(\theta) \;=\; -\max_{\rho \in \mathbb{R}} \Big( \rho - f^\star\big([\theta + \rho]_+\big) \Big) \, .$$

Proof: We first rewrite $\tilde{f}^\star$ as
$$-\tilde{f}^\star(\theta) \;=\; -\max_d \Big( \langle d, \theta \rangle - f(d) - I_{S^m}(d) \Big) \;=\; \min_d \Big( f(d) + \big( I_{S^m}(d) - \langle \theta, d \rangle \big) \Big) \, .$$
Denote $g(d) = I_{S^m}(d) - \langle \theta, d \rangle$. It is easy to verify that $g^\star(x) = \max_i (\theta_i + x_i)$. Next, note that $0 \in \mathrm{core}(\mathrm{dom}(f^\star))$ by assumption and that $\mathrm{dom}(g) = S^m$. Therefore, strong duality holds and we can use Thm. 3, which yields
$$-\tilde{f}^\star(\theta) \;=\; \max_x \Big( -f^\star(x) - g^\star(-x) \Big) \;=\; \max_x \Big( -f^\star(x) - \max_i (\theta_i - x_i) \Big) \, .$$
Writing $\rho$ for $-\max_i (\theta_i - x_i)$, the maximization over $x$ can be carried out by first fixing $\rho$ and then minimizing $f^\star(x)$ over the set $C_\rho = \{x : \forall i,\; x_i \ge \theta_i + \rho\}$. We show in the sequel that for any $\rho$, the vector $[\theta + \rho]_+$ is a minimizer of $f^\star(x)$ over $x \in C_\rho$. Combining this with the above expression for $-\tilde{f}^\star(\theta)$ we get that
$$-\tilde{f}^\star(\theta) \;=\; \max_{\rho \in \mathbb{R}} \Big( \rho - f^\star\big([\theta + \rho]_+\big) \Big) \, ,$$
as required. Therefore, it is left to show that the vector $[\theta + \rho]_+$ is indeed a minimizer of $f^\star(x)$ over $C_\rho$. Clearly, $[\theta + \rho]_+ \in C_\rho$. In addition, for any $x \in C_\rho$ we can make a sequence of modifications to $x$ until $x = [\theta + \rho]_+$ as follows. Take some element $i$. If $x_i > [\theta_i + \rho]_+$ then, based on assumption (iii) of Thm. 7 applied with $\alpha = (x_i - [\theta_i + \rho]_+)/x_i$, we know that
$$f^\star\Big( x - \frac{x_i - [\theta_i + \rho]_+}{x_i}\, x_i e_i \Big) \;\le\; f^\star(x) \, .$$
If $x_i < [\theta_i + \rho]_+$ we must have $[\theta_i + \rho]_+ = 0$, since we assume that $x \in C_\rho$ and thus $x_i \ge \theta_i + \rho$. Thus $x_i < 0$, and using assumption (iii) of Thm. 7 again we obtain that $f^\star(x - x_i e_i) \le f^\star(x)$. Repeating this for every $i \in [m]$ makes $x$ equal to $[\theta + \rho]_+$ while the value of $f^\star(x)$ is non-increasing along this process. We therefore conclude that $[\theta + \rho]_+$ is a minimizer of $f^\star(x)$ over $x \in C_\rho$, and our proof is concluded.

Based on the above lemma, the proof of Thm. 7 is easily derived.

Proof (of Thm. 7): The proof uses once more the Fenchel duality theorem. Define the function $\tilde{f}(d) = f(d) + I_{S^m}(d)$. Thm. 3 tells us that the dual of the problem $\min_d \tilde{f}(d) + \|d^\top A\|_\infty$ is the problem $\max_{w \in B_1^n} -\tilde{f}^\star(-A w)$. Using Lemma 8 we obtain that the dual of the problem given in Definition 6 is the same maximization problem as stated in the theorem. To conclude the proof, it is left to show that strong duality also holds here. First, using the assumption $\min_d f(d) = 0$ we get that $f^\star(0) = 0$. By setting $w = 0$ and $\rho = 0$ we get that the dual problem is bounded below by zero. Thus, if $\gamma = 0$ then strong duality holds. If $\gamma > 0$ then we can use the fact that $\mathrm{dom}(\tilde{f}^\star) \supseteq \mathrm{dom}(f^\star)$, and therefore the same arguments as in the end of the proof of Thm. 4 hold here as well.
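As a quick numerical illustration of the soft-margin identity above (our own snippet, not from the paper): for a fixed $w$, the average of the $k$ smallest margins coincides with $\max_\rho \big(\rho - \tfrac{1}{k}\sum_i [\rho - (Aw)_i]_+\big)$, and a maximizing $\rho$ can be taken to be the $k$th smallest margin.

```python
import numpy as np

def avg_min_k(margins, k):
    # average of the k smallest margins
    return np.sort(margins)[:k].mean()

def soft_margin(margins, k, rho):
    return rho - np.maximum(0.0, rho - margins).sum() / k

rng = np.random.default_rng(0)
A = np.sign(rng.standard_normal((30, 5)))   # entries in {-1, +1}
w = rng.dirichlet(np.ones(5))               # some w with ||w||_1 = 1
margins = A @ w
k = 5
lhs = avg_min_k(margins, k)
rhs = soft_margin(margins, k, np.sort(margins)[k - 1])   # rho = k-th smallest margin
rhs_grid = max(soft_margin(margins, k, r)                # brute-force check over a grid
               for r in np.linspace(margins.min() - 1, margins.max() + 1, 10001))
print(lhs, rhs, rhs_grid)                   # all three values agree (up to grid resolution)
```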
5 Boosting algorithms

In this section we derive a boosting algorithm for solving the max-relaxed-margin problem described in the previous section, namely,
$$\max_{w \in B_1^n} \; \max_{\rho \in \mathbb{R}} \Big( \rho - f^\star\big([\rho - A w]_+\big) \Big) \, . \qquad (5)$$
The function $f$ should satisfy the conditions stated in Thm. 7. In particular, if $f$ is the indicator function of $B_\infty^m(\nu)$, so that $f^\star(x) = \nu \|x\|_1$, we obtain the soft margin problem
$$\max_{w \in B_1^n} \; \max_{\rho \in \mathbb{R}} \Big( \rho - \nu \sum_{i=1}^{m} \big[\rho - (A w)_i\big]_+ \Big) \, , \qquad (6)$$
while if $f^\star(x) = \max_i x_i$ then we obtain the non-relaxed max-margin problem
$$\max_{w \in B_1^n} \; \min_{i \in [m]} (A w)_i \, .$$

INPUT: matrix $A \in [+1, -1]^{m \times n}$; relaxation function $f$; desired accuracy $\epsilon$
DEFINE: $h(d) = \sum_{i=1}^{m} d_i \log(d_i) + \log(m)$; $f^\star$ = Fenchel conjugate of $f$
INITIALIZE: $w_1 = 0$, $\beta = \epsilon / (2 \log(m))$
FOR $t = 1, 2, \ldots, T$
  $d_t = \mathrm{argmin}_{d \in S^m} \; \langle A w_t, d \rangle + \beta\big(f(d) + h(d)\big)$
  $j_t \in \mathrm{argmax}_j |(d_t^\top A)_j|$  (w.l.o.g. assume $\mathrm{sign}\big((d_t^\top A)_{j_t}\big) = 1$)
  $\eta_t = \max\Big\{ 0, \min\Big\{ 1, \dfrac{\beta\, \langle d_t, A(e_{j_t} - w_t) \rangle}{\|A(e_{j_t} - w_t)\|_\infty^2} \Big\} \Big\}$
  $w_{t+1} = (1 - \eta_t) w_t + \eta_t e_{j_t}$
OUTPUT: $w_{T+1}$

Figure 1: A boosting algorithm for maximizing the relaxed margin given in Eq. (5).

The boosting algorithm for solving Eq. (5) is described in Fig. 1. To simplify the presentation, let us first describe the algorithm for the non-relaxed max-margin problem, that is, $f^\star(x) = \max_i x_i$. As we have shown in the proof of Thm. 4, the corresponding function $f(d)$ is the indicator function of $S^m$. The algorithm initializes the weight vector to be the zero vector, $w_1 = 0$. On round $t$, we define a distribution over the examples,
$$d_t \;=\; \mathrm{argmin}_{d \in S^m} \Big( \langle A w_t, d \rangle + \beta\big(f(d) + h(d)\big) \Big) \;=\; \mathrm{argmax}_{d \in S^m} \Big( \langle -A w_t, d \rangle - \beta\big(f(d) + h(d)\big) \Big) \, ,$$
where $h(d)$ is the relative entropy function. Since we are now dealing with the case $f(d) = I_{S^m}(d)$, we can use Lemma 18 in the appendix and get that $d_t$ is the gradient of the Fenchel conjugate of the function $\beta h(d)$, evaluated at $-A w_t$. In the appendix we list several Fenchel conjugate pairs. In particular, the Fenchel conjugate of the relative entropy is the soft-max function
$$h^\star(\theta) \;=\; \log\Big( \frac{1}{m} \sum_{i=1}^{m} e^{\theta_i} \Big) \, .$$
Using the property $(\beta h)^\star(\theta) = \beta\, h^\star(\theta/\beta)$ we obtain that
$$d_{t,i} \;\propto\; e^{-(A w_t)_i / \beta} \, .$$
That is, the log of the probability assigned to the $i$th example is negatively proportional to the margin of the example according to the current weight vector $w_t$. Therefore, the algorithm allocates larger importance weights to the erroneous examples, in a similar fashion to the weighting scheme of many other boosting algorithms, such as AdaBoost. Next, we perform a step analogous to calling a weak learner by finding a single column of $A$ with the best edge. We note that it is possible to extend the algorithm so that the weak learner may find a column whose edge is only approximately optimal; for simplicity, we confine the description to weak learners that return the column with the largest edge. Finally, we set $w_{t+1}$ to be a convex combination of $w_t$ and the new hypothesis. The coefficient of the convex combination, denoted $\eta_t$, is calculated analytically based on our analysis. Note that the update form guarantees that $\|w_t\|_1 \le 1$ for all $t$.

The sole modification of the algorithm when running with other relaxation functions concerns the definition of $d_t$. In Sec. 5.2 we further elaborate on how to solve the optimization problem which appears in the definition of $d_t$. We provide a few general tools and also present an efficient procedure for the case where $f$ is the indicator function of $B_\infty^m(\nu)$. The following theorem provides an analysis of the rate of convergence of the algorithm.

Theorem 9 Assume that the algorithm given in Fig. 1 is run for $T = \Omega(\log(m)/\epsilon^2)$ iterations. Then, the algorithm outputs an $\epsilon$-accurate solution,
$$\max_{\rho \in \mathbb{R}} \Big( \rho - f^\star\big([\rho - A w_{T+1}]_+\big) \Big) \;\ge\; \gamma - \epsilon \, ,$$
where $\gamma$ is the optimal value of the solution as defined in Thm. 7.

Before turning to the proof of Thm. 9, let us first discuss its implications. First, we note that the number of iterations of the algorithm upper bounds the number of non-zero elements of the solution. Therefore, we have a trade-off between the desired accuracy level, $\epsilon$, and the level of sparsity of the solution, $w_{T+1}$. The algorithm can be used for maximizing the hard margin using $O(\log(m)/\epsilon^2)$ iterations. In this case, the algorithm shares the simplicity of the popular AdaBoost approach. The rate of convergence we obtain matches the rate of the AdaBoost* algorithm described by Rätsch and Warmuth [RW05] and is better than the rate obtained in Rudin et al. [RSD07]. We note also that if $A$ is $\gamma$-separable and we set $\epsilon = \gamma/2$ then we would find a solution with half the optimal margin in $O(\log(m)/\gamma^2)$ iterations. AdaBoost seemingly attains an exponentially fast decay of the empirical error of $e^{-\gamma^2 T}$; thus, $T$ should be at least $1/\gamma^2$. A more careful examination also reveals a factor of $\log(m)$ in the convergence rate of AdaBoost. Therefore, our algorithm attains the same rate of convergence as AdaBoost, while both algorithms obtain a margin which is half of the optimal margin. (See also the margin analysis of AdaBoost described in Rudin et al. [RSD07].)

We can also use the algorithm for maximizing the soft margin given in Eq. (6). In Sec. 5.2 we show how to calculate $d_t$ in $\tilde{O}(m)$ time. Therefore, the complexity of the resulting algorithm is roughly the same as the complexity of AdaBoost. The bound on the number of iterations that we obtain matches the bound of the SoftBoost algorithm, recently proposed by Warmuth et al. [WLR06]. However, our algorithm is simpler to implement, and the time complexity of each iteration of our algorithm is substantially lower than that of [WLR06].
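The following is a minimal Python sketch of the algorithm of Fig. 1 for the non-relaxed max-margin case, in which $d_t$ reduces to the soft-max distribution $d_{t,i} \propto e^{-(Aw_t)_i/\beta}$. The function name and the toy data are ours; the sign of the selected column is folded into $e_{j_t}$, mirroring the paper's "w.l.o.g." assumption on the sign of the edge.

```python
import numpy as np

def boost_hard_margin(A, eps, T):
    """Sketch of the Fig. 1 algorithm with f = indicator of the simplex,
    assuming A has entries in [-1, +1]."""
    m, n = A.shape
    beta = eps / (2.0 * np.log(m))
    w = np.zeros(n)
    for _ in range(T):
        # d_t: soft-max weights, d_{t,i} proportional to exp(-(A w_t)_i / beta)
        z = -(A @ w) / beta
        d = np.exp(z - z.max())          # subtract the max for numerical stability
        d /= d.sum()
        # weak-learner step: column of A with the largest absolute edge
        edges = d @ A
        j = int(np.argmax(np.abs(edges)))
        e_j = np.zeros(n)
        e_j[j] = np.sign(edges[j]) if edges[j] != 0 else 1.0
        # analytic step size, clipped to [0, 1]
        diff = A @ (e_j - w)
        denom = np.max(np.abs(diff)) ** 2
        eta = 0.0 if denom == 0 else np.clip(beta * (d @ diff) / denom, 0.0, 1.0)
        w = (1 - eta) * w + eta * e_j    # convex combination keeps ||w||_1 <= 1
    return w

# toy usage on a random +/-1 matrix
rng = np.random.default_rng(0)
A = np.sign(rng.standard_normal((50, 10)))
w = boost_hard_margin(A, eps=0.1, T=200)
print("l1 norm:", np.abs(w).sum(), "min margin:", (A @ w).min())
```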
5.1 Proof of convergence rate

To motivate our proof technique, let us focus first on the max-margin case without any relaxation. As we showed before, the AdaBoost algorithm approximates the max operator, $\max_i \theta_i$, with a soft-max operator, $\log(\tfrac{1}{m}\sum_i e^{\theta_i})$, also known as the exp-loss. We can think of this approximation as another form of relaxation of the max margin. To distinguish this type of relaxation from the family of relaxations described in the previous section, we refer to it as an "algorithmic" relaxation, since it is driven by algorithmic factors and not directly by the concept of relaxing the margin. The algorithmic relaxation of AdaBoost encapsulates the following relaxation of weak learnability: replace the indicator function of the simplex with the relative entropy function over the simplex, which we denote by $h(d)$ (see also the definition in Fig. 1). The advantage of endowing the simplex with the relative entropy stems from the fact that the relative entropy is strongly convex with respect to the $\ell_1$ norm, as we formally define now.

Definition 10 A continuous function $f$ is $\sigma$-strongly convex over a convex set $S$ with respect to a norm $\|\cdot\|$ if $S$ is contained in the domain of $f$ and for all $v, u \in S$ and $\alpha \in [0, 1]$ we have
$$f\big(\alpha v + (1 - \alpha) u\big) \;\le\; \alpha f(v) + (1 - \alpha) f(u) - \frac{\sigma}{2}\, \alpha (1 - \alpha)\, \|v - u\|^2 \, .$$

In the above definition, if $\sigma = 0$ we revert back to the standard definition of convexity. Strong convexity quantifies the difference between the value of the function at the convex combination and the convex combination of the values of the function. The relative entropy is 1-strongly convex with respect to the $\ell_1$ norm over the probability simplex (see Lemma 16 in [SS07]). A few important properties of strongly convex functions are summarized in Lemma 18 (in the appendix). We use these properties in our proofs below.

Continuing with our motivating discussion, we view the algorithmic relaxation of AdaBoost as a replacement of the convex function $I_{S^m}(d)$ by the strongly convex function $h(d)$. More generally, recall the definition $\tilde{f}(d) = f(d) + I_{S^m}(d)$ from Sec. 4 and that solving Eq. (5) is equivalent to maximizing $-\tilde{f}^\star(-A w)$ over $w \in B_1^n$. As in the algorithmic relaxation of AdaBoost, we replace $\tilde{f}(d)$ by the function
$$\hat{f}(d) \;=\; \tilde{f}(d) + \beta\, h(d) \, ,$$
where $\beta \in (0, 1)$. Since for all $d \in S^m$ we have $0 \le h(d) \le \log(m)$, by setting $\beta = \epsilon/(2\log(m))$ we obtain that
$$\forall d \in S^m, \quad \hat{f}(d) - \epsilon/2 \;\le\; \tilde{f}(d) \;\le\; \hat{f}(d) \, .$$
Using Lemma 19 in the appendix we obtain that
$$\forall \theta, \quad \hat{f}^\star(\theta) \;\le\; \tilde{f}^\star(\theta) \;\le\; \hat{f}^\star(\theta) + \epsilon/2 \, . \qquad (7)$$
The above implies that maximizing $-\hat{f}^\star(-A w)$ gives an $\epsilon/2$-accurate solution to the problem of maximizing $-\tilde{f}^\star(-A w)$. This argument holds for the entire family of functions discussed in Sec. 4. An appealing property of strong convexity that we exploit is that adding a convex function to a strongly convex function retains at least the same strong convexity level. Therefore, for all the functions $\tilde{f}(d)$ discussed in Sec. 4, the corresponding $\hat{f}(d)$ retains the strong convexity of the relative entropy.

The algorithm in Fig. 1 is designed for maximizing $-\hat{f}^\star(-A w)$ over $B_1^n$. Based on the above discussion, this maximization translates to an approximate maximization of $-\tilde{f}^\star(-A w)$. Using again Thm. 3 we obtain that
$$\max_{w \in B_1^n} -\hat{f}^\star(-A w) \;\le\; \min_d \; \hat{f}(d) + \|d^\top A\|_\infty \, .$$
Denote by $D(w)$ and $P(d)$ the dual and primal objective values of the above equation. We also denote by $\epsilon_t$ the sub-optimality value attained at iteration $t$ of the algorithm, namely,
$$\epsilon_t \;=\; \max_{w \in B_1^n} D(w) - D(w_t) \, .$$
The following key lemma lower bounds the improvement of the algorithm in terms of its current sub-optimality.

Lemma 11 Let $\epsilon_t$ be the sub-optimality value of the algorithm in Fig. 1 at iteration $t$ and assume that $\epsilon_t \le 1$. Then, $\epsilon_t - \epsilon_{t+1} \ge \beta\, \epsilon_t^2 / 8$.

Proof: Denote $\Delta_t = \epsilon_t - \epsilon_{t+1}$ and note that, based on the definition of $\epsilon_t$, we clearly have $\Delta_t = D(w_{t+1}) - D(w_t)$. To simplify our notation, we use the shorthand $j$ for $j_t$ and $\eta$ for $\eta_t$. Since $w_{t+1} = (1 - \eta) w_t + \eta e_j$ we get that
$$\Delta_t \;=\; D\big(w_t + \eta (e_j - w_t)\big) - D(w_t) \, .$$
Using the definition of $D$ we further rewrite $\Delta_t$ as
$$\Delta_t \;=\; \hat{f}^\star(-A w_t) - \hat{f}^\star\big(-A w_t - \eta A (e_j - w_t)\big) \, . \qquad (8)$$
The key property that we use is that $\hat{f}^\star$ is the Fenchel conjugate of a $\beta$-strongly convex function over the simplex with respect to the $\ell_1$ norm. Therefore, using Lemma 18 in the appendix, we know that for any $\theta_1$ and $\theta_2$,
$$\hat{f}^\star(\theta_1 + \theta_2) - \hat{f}^\star(\theta_1) \;\le\; \langle \lambda, \theta_2 \rangle + \frac{\|\theta_2\|_\infty^2}{2\beta} \, ,$$
where $\lambda = \mathrm{argmax}_d \big( \langle \theta_1, d \rangle - \hat{f}(d) \big)$. Combining this property with Eq. (8) and using the definition of $d_t$ we obtain that
$$\Delta_t \;\ge\; \eta\, \langle d_t, A (e_j - w_t) \rangle - \frac{\eta^2\, \|A (e_j - w_t)\|_\infty^2}{2\beta} \, . \qquad (9)$$
Using the assumption $A \in [+1, -1]^{m \times n}$, the fact that $w_t \in B_1^n$, and the triangle inequality, we get that $\|A (e_j - w_t)\|_\infty \le 2$ and thus
$$\Delta_t \;\ge\; \eta\, \langle d_t, A (e_j - w_t) \rangle - 2 \eta^2 / \beta \, . \qquad (10)$$
Next, we show that $\langle d_t, A (e_j - w_t) \rangle = P(d_t) - D(w_t)$. To do so, we first use Lemma 17 to get that $\langle d_t, -A w_t \rangle = \hat{f}(d_t) + \hat{f}^\star(-A w_t)$, and second we use the definition of $j$ to get that $\langle d_t, A e_j \rangle = \|d_t^\top A\|_\infty$. Combining this with Eq. (10) yields
$$\Delta_t \;\ge\; \eta\, \big( P(d_t) - D(w_t) \big) - 2 \eta^2 / \beta \, . \qquad (11)$$
The weak duality property tells us that $P(d_t) \ge \max_{w \in B_1^n} D(w)$ and therefore $\Delta_t \ge \eta\, \epsilon_t - 2 \eta^2 / \beta$. Denote $\eta' = \beta\, \epsilon_t / 4$ and note that $\eta' \in [0, 1]$. Had we set $\eta_t = \eta'$ we would have obtained $\Delta_t \ge \beta\, \epsilon_t^2 / 8$, as required. Since we set $\eta_t$ to be the maximizer of the expression in Eq. (9) over $[0, 1]$, we get an even larger value for $\Delta_t$. This concludes our proof.

Based on Lemma 11, the proof of Thm. 9 easily follows.

Proof (of Thm. 9): We first show that $\epsilon_1 \le 1$. To see this, we use weak duality to get that $\epsilon_1 \le P(d_1) - D(w_1)$. Next, we recall that in the proof of Lemma 11 we have shown that for all $t$, $P(d_t) - D(w_t) = \langle d_t, A(e_{j_t} - w_t) \rangle$. Since $w_1 = 0$ we get that $\epsilon_1 \le \langle d_1, A e_{j_1} \rangle = \|d_1^\top A\|_\infty \le 1$. We can now apply Lemma 11 for $t = 1$ and get that $\epsilon_2 \le \epsilon_1 \le 1$. By induction, we obtain that Lemma 11 holds for all $t$. Applying Lemma 20 (given in the appendix) we get that
$$\epsilon_t \;\le\; \frac{8}{\beta\, (t + 1)} \, .$$
Plugging the definition of $\beta = \epsilon/(2\log(m))$ into the upper bound on $\epsilon_{T+1}$ we get $\epsilon_{T+1} \le \frac{16 \log(m)}{\epsilon\, (T + 2)}$. Therefore, if $T + 2 \ge 32 \log(m)/\epsilon^2$ we get that $\epsilon_{T+1} \le \epsilon/2$. Finally, let $\tilde{\epsilon}$ be the error of $w_{T+1}$ on the original $\tilde{f}$; then using Eq. (7) we obtain that $\tilde{\epsilon} \le \epsilon_{T+1} + \epsilon/2 \le \epsilon$.

5.2 Efficient implementation for soft margin

In this section we provide an efficient procedure for calculating the distribution $d_t$ described in Fig. 1 when $f(d)$ is the indicator function of $\{d : \|d\|_\infty \le \nu\}$. As we showed above, this case corresponds to the maximization of the soft margin. We first present a lemma that provides us with an alternative method for finding $d_t$, which is based on Bregman divergences. The Bregman divergence with respect to a convex function $h$ between two vectors $d$ and $d_0$ is defined as
$$B_h(d\,\|\,d_0) \;=\; h(d) - h(d_0) - \langle \nabla h(d_0), d - d_0 \rangle \, .$$
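For concreteness, here is a tiny sketch (ours, not the paper's) of the Bregman divergence induced by the entropy-based $h$ of Fig. 1; on the simplex it reduces to the relative entropy, which is the divergence used in the projection problem below.

```python
import numpy as np

def bregman_entropy(d, d0):
    """B_h(d || d0) for h(d) = sum_i d_i log d_i + log m.
    For d, d0 on the simplex this equals sum_i d_i log(d_i / d0_i)."""
    d, d0 = np.asarray(d, float), np.asarray(d0, float)
    grad_h = np.log(d0) + 1.0
    return float(np.sum(d * np.log(d)) - np.sum(d0 * np.log(d0)) - grad_h @ (d - d0))

d  = np.array([0.5, 0.3, 0.2])
d0 = np.array([1/3, 1/3, 1/3])
print(bregman_entropy(d, d0), np.sum(d * np.log(d / d0)))  # the two values agree
```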
See [CZ97] for a rigorous definition of the Bregman divergence.

Lemma 12 Let $h : S \to \mathbb{R}$ be a strongly convex and differentiable function, let $f$ be a convex function, and denote $\hat{f} = h + f$. Let $\theta$ be a vector and denote $d_0 = \nabla h^\star(\theta)$, where $h^\star$ is the Fenchel conjugate of $h$. Then,
$$\nabla \hat{f}^\star(\theta) \;=\; \mathrm{argmin}_d \Big( B_h(d\,\|\,d_0) + f(d) \Big) \, .$$

Proof: Since $h$ is strongly convex and differentiable we have that $\nabla h(d_0) = \theta$. Therefore,
$$\nabla \hat{f}^\star(\theta) \;=\; \mathrm{argmax}_d \big( \langle d, \theta \rangle - \hat{f}(d) \big) \;=\; \mathrm{argmin}_d \big( h(d) - \langle d, \theta \rangle + f(d) \big) \;=\; \mathrm{argmin}_d \big( h(d) - \langle d, \nabla h(d_0) \rangle + f(d) \big) \;=\; \mathrm{argmin}_d \big( B_h(d\,\|\,d_0) + f(d) \big) \, .$$

Applying the above lemma with $f = I_C$ for some convex set $C$ we obtain the following corollary.

Corollary 13 Assume that the conditions stated in Lemma 12 hold and that $f(d) = I_C(d)$ for some convex set $C$. Then, $\nabla (h + f)^\star(\theta) = \mathrm{argmin}_{d \in C}\, B_h(d\,\|\,\nabla h^\star(\theta))$.

We now get back to the problem of finding $d_t$ when $f(d) = I_C(d)$ for $C = \{d : \|d\|_\infty \le \nu\}$. Based on Corollary 13, we can first define the distribution vector $d_0$ such that $d_{0,i} \propto \exp(-(A w_t)_i / \beta)$ and then set
$$d_t \;=\; \mathrm{argmin}_{d \in S^m : \|d\|_\infty \le \nu} \; B_h(d\,\|\,d_0) \, . \qquad (12)$$
We are therefore left with the problem of solving the entropic projection problem given in Eq. (12). A similar problem was tackled by Herbster and Warmuth [HW01], who provided $O(m\log(m))$ and $O(m)$ algorithms for performing entropic projections. For completeness, in the rest of this section we outline the simpler $O(m\log(m))$ algorithm. To do so, we first show that the entropic projection preserves the relative order of the components of the projected vector.

Lemma 14 Let $d_t$ be the solution of Eq. (12) and let $i, j$ be two indices such that $d_{0,i} > d_{0,j}$. Then, $d_{t,i} \ge d_{t,j}$.

Proof: Assume by contradiction that $d_{t,i} < d_{t,j}$. We now construct a vector $\tilde{d}$ which resides in $S^m$ and whose components do not exceed $\nu$. We set all the components of $\tilde{d}$, except for the $i$th and $j$th components, to be equal to the corresponding components of $d_t$. Next, we set $\tilde{d}_i = d_{t,j}$ and $\tilde{d}_j = d_{t,i}$. Clearly, $\tilde{d}$ constitutes a feasible solution. Taking the difference between the Bregman divergences of the two vectors to $d_0$ we get
$$B_h(d_t\,\|\,d_0) - B_h(\tilde{d}\,\|\,d_0) \;=\; (d_{t,j} - d_{t,i}) \log(d_{0,i}/d_{0,j}) \;>\; 0 \, ,$$
which contradicts the fact that $d_t$ is the vector attaining the smallest Bregman divergence to $d_0$.

Without loss of generality, assume that $d_0$ is sorted in non-increasing order. Therefore, using Lemma 14, we know that $d_t$ has the form $(\nu, \ldots, \nu, d_{t,i}, \ldots, d_{t,j}, 0, \ldots, 0)$, where for each $r \in \{i, \ldots, j\}$ we have $d_{t,r} \in (0, \nu)$. Moreover, the following lemma provides us with a simple way to find all the remaining elements of $d_t$.

Lemma 15 Assume that $d_0$ is sorted in non-increasing order and that $d_t = (\nu, \ldots, \nu, d_{t,i}, \ldots, d_{t,j}, 0, \ldots, 0)$. Then, for all $r \in \{i, \ldots, j\}$ we have $d_{t,r} = \lambda\, d_{0,r}$, where
$$\lambda \;=\; \frac{1 - \nu\,(i - 1)}{\sum_{r=i}^{j} d_{0,r}} \, .$$

Proof: Let $v$ denote the gradient of $B_h(d\,\|\,d_0)$ with respect to $d$ at $d_t$, namely, $v_i = \log(d_{t,i}) + 1 - \log(d_{0,i})$. Let $I = \{i, \ldots, j\}$. Note that for the elements in $I$ the optimization problem has a single linear equality constraint and the solution is in the interior of the set $(0, \nu)^{|I|}$. Therefore, using Corollary 2.1.3 in [BL06], we obtain that there exists a constant $c$ such that for all $i \in I$, $v_i = c$, or equivalently
$$\forall i \in I, \quad d_{t,i} \;=\; d_{0,i}\, e^{c - 1} \, .$$
Let us denote $\lambda = e^{c-1}$. Using this form in the equation $\sum_i d_{t,i} = 1$ we get that
$$1 \;=\; \sum_{r=1}^{m} d_{t,r} \;=\; \nu\,(i - 1) + \lambda \sum_{r=i}^{j} d_{0,r} \, ,$$
which immediately yields that $\lambda$ attains the value stated in the lemma.

We are left with the problem of finding the indices $i$ and $j$. The next lemma tells us that not a single element of the optimal vector attains a value of zero.

Lemma 16 Assume that the vector $d_0$ is provided in non-increasing order of elements and that all of its elements are positive. Then, the optimal solution of Eq. (12) is of the form $(\nu, \ldots, \nu, d_{t,i}, \ldots, d_{t,m})$ with $d_{t,m} > 0$.

Proof: Plugging the value of $\lambda$ from the previous lemma into the objective function and performing simple algebraic manipulations we obtain the following objective value,
$$B_h(d_t\,\|\,d_0) \;=\; \sum_{r=1}^{i-1} \nu \log\Big(\frac{\nu}{d_{0,r}}\Big) + \big(1 - \nu\,(i - 1)\big) \log(\lambda) \, .$$
Therefore, the objective is monotonically increasing in $\lambda$. This in turn implies that we should set $\lambda$ to be as small as possible in order to find the minimal Bregman divergence. Next, note that the value of $\lambda$ as defined in Lemma 15 is decreasing as a function of $j$. The optimal solution is therefore obtained for $j = m$.

Finally, we are left with the task of finding the index $i$. Once it is found we readily obtain $\lambda$, which immediately translates into a closed form solution for $d_t$. Lemma 14, in conjunction with a property presented in the sequel, implies that the first index for which $d_{t,i} < \nu$, with $d_t$ as defined by Lemma 15 with $j = m$, constitutes the optimal index $i$. The pseudocode describing the resulting efficient procedure for solving the problem in Eq. (12) is given in Fig. 2. The algorithm starts by sorting the vector $d_0$. Then, it checks each possible index $i$ of the sorted vector as the position to stop capping the weights. More formally, given an index $i$ the algorithm checks whether $d_t$ can take the form $(\nu, \ldots, \nu, d_{t,i}, \ldots, d_{t,m})$ with $d_{t,i} < \nu$. To check each index $i$ the algorithm calculates $\lambda$ as given by Lemma 15. The same lemma also implies that $d_{t,i} = \lambda\, d_{0,i}$. Thus, if the assumption on the index $i$ is correct, the following inequality must hold: $\nu > d_{t,i} = \lambda\, d_{0,i}$. In case the index $i$ under examination indeed satisfies the inequality, the algorithm breaks out of the loop. Therefore, the algorithm outputs the feasible solution with the smallest number of weights at the bound $\nu$. It remains to verify that the feasible solution with the smallest number of capped weights is indeed optimal. This property follows from a fairly straightforward yet tedious lemma which generalizes Lemma 3 from [SSS06b] and is thus omitted. Note also that the time complexity of the resulting algorithm is $O(m\log(m))$, which renders it applicable to boosting-based applications with large datasets.

INPUT: A vector $d_0 \in S^m$ and a scalar $\nu \in (0, 1)$
Sort $d_0$ in non-increasing order $\to u$
INITIALIZE: $Z = \sum_{r=1}^{m} u_r$
FOR $i = 1, \ldots, m$
  $\lambda = \big(1 - \nu\,(i - 1)\big) / Z$
  IF $\lambda\, u_i < \nu$ THEN BREAK
  ENDIF
  $Z \leftarrow Z - u_i$
ENDFOR
OUTPUT: $d_t$ s.t. $d_{t,r} = \min\{\nu,\; \lambda\, d_{0,r}\}$

Figure 2: An $O(m\log(m))$ procedure for solving the entropic projection problem defined by Eq. (12).
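A minimal Python rendering of the Fig. 2 procedure may help; the function name and the usage example are ours. It assumes $\nu m \ge 1$, so that the feasible set in Eq. (12) is non-empty.

```python
import numpy as np

def entropic_projection(d0, nu):
    """Project a positive probability vector d0 onto
    {d in simplex : d_i <= nu for all i} in the relative-entropy sense,
    following the O(m log m) procedure of Fig. 2 (our rendering)."""
    d0 = np.asarray(d0, dtype=float)
    order = np.argsort(-d0)            # sort d0 in non-increasing order
    u = d0[order]
    Z = u.sum()
    lam = 1.0                          # scaling factor at the break point
    for i in range(len(u)):            # i = 0, 1, ... matches i = 1, 2, ... in Fig. 2
        lam = (1.0 - nu * i) / Z
        if lam * u[i] < nu:            # first weight that stays strictly below the cap
            break
        Z -= u[i]
    return np.minimum(nu, lam * d0)

# toy usage: cap the soft-max weights at nu = 0.2
rng = np.random.default_rng(1)
margins = rng.standard_normal(8)
d0 = np.exp(-margins); d0 /= d0.sum()
d = entropic_projection(d0, nu=0.2)
print(d, d.sum(), d.max())             # sums to 1 (up to rounding), max <= 0.2
```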
6 Discussion

The starting point of this paper was an alternative view of the equivalence of weak-learnability and linear-separability. This view led us to derive new relaxations of the notion of margin, which are useful in the noisy, non-separable case. In turn, the new relaxations of the margin motivated us to derive new boosting algorithms which maintain distributions over the examples that are restricted to a subset of the simplex. There are a few future research directions we plan to pursue. First, we would like to further explore additional constraints on the distribution $d_t$, such as adding $\ell_2$ constraints.
We also would like to replace the relative entropy penalty for the distribution $d_t$ with binary entropies of each of the components of $d_t$ with respect to the two dimensional vector $(\tfrac{1}{2}, \tfrac{1}{2})$. The result is a boosting-based apparatus for the log-loss. Last, we would like to explore alternative formalisms for the primal problem that also modify the definition of the function $g(d) = \|d^\top A\|_\infty$, which may lead to a regularization term over the vector $w$ rather than the domain constraint we currently have.

A Technical lemmas

The first lemma states a sufficient condition under which the Fenchel-Young inequality holds with equality. Its proof can be found in [BL06], Proposition 3.3.4.

Lemma 17 Let $f$ be a closed and convex function and let $\partial f(w)$ be its differential set at $w$. Then, for all $\theta \in \partial f(w)$ we have $f(w) + f^\star(\theta) = \langle \theta, w \rangle$.

The next lemma underscores the importance of strongly convex functions. The proof of this lemma follows from Lemma 18 in [SS07].

Lemma 18 Let $f$ be a closed and $\sigma$-strongly convex function over $S$ with respect to a norm $\|\cdot\|$. Let $f^\star$ be the Fenchel conjugate of $f$. Then, $f^\star$ is differentiable and its gradient satisfies $\nabla f^\star(\theta) = \mathrm{argmax}_{w \in S} \langle w, \theta \rangle - f(w)$. Furthermore, for all $\theta_1, \theta_2 \in \mathbb{R}^n$ we have
$$f^\star(\theta_1 + \theta_2) - f^\star(\theta_1) \;\le\; \langle \nabla f^\star(\theta_1), \theta_2 \rangle + \frac{1}{2\sigma} \|\theta_2\|_\star^2 \, .$$

Lemma 19 Let $f, g$ be two functions and assume that for all $w \in S$ we have $g(w) \ge f(w) \ge g(w) - c$ for some constant $c$. Then, $g^\star(\theta) \le f^\star(\theta) \le g^\star(\theta) + c$.

Proof: There exists some $w'$ such that
$$g^\star(\theta) \;=\; \langle w', \theta \rangle - g(w') \;\le\; \langle w', \theta \rangle - f(w') \;\le\; \max_w \big( \langle w, \theta \rangle - f(w) \big) \;=\; f^\star(\theta) \, .$$
This proves the first inequality. The second inequality follows from the fact that the conjugate of $g(w) - c$ is $g^\star(\theta) + c$.

Lemma 20 Let $1 \ge \epsilon_1 \ge \epsilon_2 \ge \cdots$ be a sequence such that for all $t \ge 1$ we have $\epsilon_t - \epsilon_{t+1} \ge r\, \epsilon_t^2$ for some constant $r \in (0, 1/2)$. Then, for all $t$ we have $\epsilon_t \le \frac{1}{r (t + 1)}$.

Proof: We prove the lemma by induction. First, for $t = 1$ we have $\frac{1}{r(t+1)} = \frac{1}{2r} \ge 1$ and the claim clearly holds. Assume that the claim holds for some $t$. Then,
$$\epsilon_{t+1} \;\le\; \epsilon_t - r\, \epsilon_t^2 \;\le\; \frac{1}{r(t+1)} - \frac{1}{r(t+1)^2} \, , \qquad (13)$$
where we used the fact that the function $x - r x^2$ is monotonically increasing in $[0, 1/(2r)]$ along with the inductive assumption. We can rewrite the right-hand side of Eq. (13) as
$$\frac{1}{r(t+1)} - \frac{1}{r(t+1)^2} \;=\; \frac{(t+1) - 1}{r(t+1)^2} \;=\; \frac{1}{r(t+2)} \cdot \frac{\big((t+1) + 1\big)\big((t+1) - 1\big)}{(t+1)^2} \;=\; \frac{1}{r(t+2)} \cdot \frac{(t+1)^2 - 1}{(t+1)^2} \, .$$
The term $\frac{(t+1)^2 - 1}{(t+1)^2}$ is smaller than 1 and thus $\epsilon_{t+1} \le \frac{1}{r(t+2)}$, which concludes our proof.

B Fenchel conjugate pairs

We now list a few useful Fenchel-conjugate pairs. Proofs can be found in [BV04], Section 3.3; [BL06], Section 3.3; and [SS07], Section A.3.

  $f(d) = I_C(d)$ for $C = \{d : \|d\| \le \nu\}$  :  $f^\star(\theta) = \nu \|\theta\|_\star$
  $f(d) = I_{S^m}(d)$  :  $f^\star(\theta) = \max_i \theta_i$
  $f(d) = I_{S^m}(d) + \sum_{i=1}^m d_i \log\big(\frac{d_i}{1/m}\big)$  :  $f^\star(\theta) = \log\big(\frac{1}{m} \sum_{i=1}^m e^{\theta_i}\big)$
  $f(d) = \frac{1}{2} \|d\|^2$  :  $f^\star(\theta) = \frac{1}{2} \|\theta\|_\star^2$
  $c\, f(d)$ for $c > 0$  :  $c\, f^\star(\theta / c)$
  $f(c\, d)$ for $c \ne 0$  :  $f^\star(\theta / c)$
  $f(d + d_0)$  :  $f^\star(\theta) - \langle \theta, d_0 \rangle$

References

[BL06] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[CSS02] M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253-285, 2002.
[CST00] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[CZ97] Y. Censor and S.A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, New York, NY, USA, 1997.
[DW00] C. Domingo and O. Watanabe. MadaBoost: A modification of AdaBoost. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.
[Fre01] Y. Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293-318, 2001.
[FS96] Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325-332, 1996.
[FS99] Y. Freund and R.E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
[HW01] M. Herbster and M. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281-309, 2001.
[KPL01] V. Koltchinskii, D. Panchenko, and F. Lozano. Some new bounds on the generalization error of combined classifiers. In Advances in Neural Information Processing Systems 14, 2001.
[MBB98] L. Mason, P. Bartlett, and J. Baxter. Direct optimization of margins improves generalization in combined classifiers. Technical report, Department of Systems Engineering, Australian National University, 1998.
[MR03] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In S. Mendelson and A. Smola, editors, Advanced Lectures on Machine Learning, pages 119-184. Springer, 2003.
[RSD07] C. Rudin, R.E. Schapire, and I. Daubechies. Analysis of boosting algorithms using the smooth margin function. Annals of Statistics, 2007.
[RW05] G. Rätsch and M. Warmuth. Efficient margin maximizing with boosting. Journal of Machine Learning Research, pages 2153-2175, 2005.
[Sch90] R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.
[Sch03] R.E. Schapire. The boosting approach to machine learning: An overview. In D.D. Denison, M.H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
[Ser03] R.A. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:633-648, 2003.
[SFBL97] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322-330, 1997. To appear, The Annals of Statistics.
[SS07] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.
[SSS06a] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 20, 2006.
[SSS06b] S. Shalev-Shwartz and Y. Singer. Efficient learning of label ranking by soft projections onto polyhedra. Journal of Machine Learning Research, 7(July):1567-1599, 2006.
[SSS07] S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algorithms. Machine Learning Journal, 2007.
[SSWB98] B. Schölkopf, A. Smola, R. Williamson, and P. Bartlett. New support vector algorithms. Technical Report NC2-TR-1998-053, NeuroColt2, 1998.
[SVL07] A. Smola, S.V.N. Vishwanathan, and Q. Le. Bundle methods for machine learning. In Advances in Neural Information Processing Systems 21, 2007.
[vN28] J. von Neumann. Zur Theorie der Gesellschaftsspiele (On the theory of parlor games). Math. Ann., 100:295-320, 1928.
[WGR07] M. Warmuth, K. Glocer, and G. Rätsch. Boosting algorithms for maximizing the soft margin. In Advances in Neural Information Processing Systems 21, 2007.
[WLR06] M. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[Zha03] T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49:682-691, 2003.