On the Margin Explanation of Boosting Algorithms

Liwei Wang^1, Masashi Sugiyama^2, Cheng Yang^1, Zhi-Hua Zhou^3, and Jufu Feng^1

^1 Key Laboratory of Machine Perception, MOE, School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, P.R. China. {wanglw,yangch,fjf}@cis.pku.edu.cn
^2 Department of Computer Science, Tokyo Institute of Technology, 2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan. sugi@cs.titech.ac.jp
^3 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, P.R. China. zhouzh@nju.edu.cn

This work was supported by NSFC (60775005, 60635030, 60721002) and the Global COE Program of Tokyo Institute of Technology.

Abstract

Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, Breiman raised important questions about the margin explanation by developing a boosting algorithm, arc-gv, that provably generates a larger minimum margin than AdaBoost. He also gave a sharper bound in terms of the minimum margin, and argued that the minimum margin governs the generalization. In experiments, however, arc-gv usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we try to give a complete answer to Breiman's critique by proving a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. This result suggests that the minimum margin is not crucial for the generalization error. We also show that a large Emargin implies good generalization. Experimental results on benchmark datasets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than arc-gv, which agrees well with our theory.

1 Introduction

The AdaBoost algorithm [FS96, FS97] has achieved great success in the past ten years. It has demonstrated excellent experimental performance both on benchmark datasets and in real applications [BK99, Die00, VJ01]. It is observed in experiments that the test error of a combined voting classifier usually keeps decreasing as its size becomes very large, even after the training error reaches zero [Bre98, Qui96]. At first sight, this fact obviously violates Occam's razor.

Schapire et al. [SFBL98] tried to explain this phenomenon in terms of the margins of the training examples. Roughly speaking, the margin of an example with respect to a classifier is a measure of the confidence of the classification result. Schapire et al. [SFBL98] proved an upper bound for the generalization error of a voting classifier that does not depend on how many classifiers were combined, but only on the margin distribution over the training set, the number of training examples and the size (the VC dimension, for example) of the set of base classifiers. They also demonstrated that AdaBoost has the ability to produce a good margin distribution. This theory indicates that producing a good margin distribution is the key to the success of AdaBoost, and it explains well the surprising phenomenon observed in experiments. Soon afterwards, however, Breiman [Bre99] cast serious doubt on this margin explanation.
He developed a boosting-type algorithm called arc-gv, which provably generates a larger minimum margin than AdaBoost[1] (the minimum margin is the smallest margin over all the training examples; see Section 2 for the formal definition). He then gave an upper bound for the generalization error of a voting classifier in terms of the minimum margin, as well as the number of training examples and the size of the set of base classifiers. This bound is sharper than the bound based on the margin distribution given by Schapire et al. Breiman argued that if the bound of Schapire et al. implied that the margin distribution is the key to the generalization error, his bound implied even more strongly that the minimum margin is the key, and that arc-gv should achieve the best performance among all boosting-type algorithms. In experiments, however, even though arc-gv always produces larger minimum margins than AdaBoost, its test error is consistently higher. Breiman also investigated the margin distributions generated by AdaBoost and arc-gv, and found that arc-gv actually produced uniformly better margin distributions than AdaBoost. He thus concluded that neither the minimum margin nor the margin distribution determines the generalization error, and that a new theoretical explanation is needed.

[1] Actually, the minimum margin of arc-gv converges to the largest possible value among all voting classifiers.

Breiman's argument seems convincing and put the margin explanation into serious doubt. Recently, however, Reyzin and Schapire [RS06] made an important discovery after a careful study of Breiman's arc-gv algorithm. Note first that the bounds of both Breiman and Schapire et al. state that the generalization error depends on the complexity of the set of base classifiers as well as on the minimum margin or the margin distribution. To investigate how the margin affects the generalization error, one has to keep the complexity of the base classifiers fixed. In Breiman's experiments, he tried to control this by always using CART trees [BFOS84] with a fixed number of leaves as the base classifiers. Reyzin and Schapire re-conducted Breiman's experiments and found that the trees produced by arc-gv were much deeper than those produced by AdaBoost. Since deeper trees are more complex even when the number of leaves is the same, arc-gv used base classifiers of higher complexity than AdaBoost in Breiman's experiments, so the comparison was not fair. In order to study the margin explanation in a fair manner, a more controlled setting is needed. Reyzin and Schapire therefore compared arc-gv and AdaBoost using decision stumps, whose complexity is fixed, as the base classifiers. Experiments showed that arc-gv produced larger minimum margins yet still a higher error rate. This time, however, the margin distribution generated by arc-gv was not as "good" as the one AdaBoost generated (see Fig. 7 in [RS06]). They thus argued that, according to the bound of Schapire et al. in terms of the margin distribution, the empirical observation, i.e., the inferior performance of arc-gv, could be explained.

From a more critical point of view, however, Breiman's doubt has not been fully answered by the above results. First of all, Breiman backed up his argument with a sharper bound in terms of the minimum margin. In Reyzin and Schapire's experiment with decision stumps, arc-gv still produced a larger minimum margin and had worse performance.
Even though AdaBoost generates a "better" margin distribution than arc-gv, this does not disprove Breiman's critique unless we can show a bound that is in terms of the margin distribution and is uniformly sharper than Breiman's minimum margin bound. Another problem is how to measure the "goodness" of a margin distribution. The statement that AdaBoost generates "better" margin distributions than arc-gv is vague. Reyzin and Schapire used the average margin as a measure to compare the margin distributions produced by AdaBoost and arc-gv. But the average margin does not explicitly appear in the bound of Schapire et al., so a larger average margin does not necessarily imply a smaller generalization error in theory.

In this paper, we try to give a complete answer to Breiman's doubt by solving the two problems mentioned above. We first propose a novel upper bound for the generalization error of voting classifiers. This bound is uniformly sharper than Breiman's bound. The key factor in this bound is a new margin notion which we refer to as the Equilibrium margin (Emargin). The Emargin can be viewed as a measure of how good a margin distribution is. In fact, the Emargin depends, in a complicated way, on the margin distribution, and has little relation to the minimum margin. Experimental results show that AdaBoost usually produces a larger Emargin than arc-gv when the complexity of the base classifiers is well controlled. Our results thus explain the inferior performance of arc-gv and give Breiman's doubt a negative answer.

The rest of this paper is organized as follows: In Section 2 we briefly describe the margin theory of Schapire et al. and Breiman's argument. Our main results are given in Section 3. We provide further explanation of the main bound in Section 4. All the proofs can be found in Section 5. We provide experimental justification in Section 6 and conclude in Section 7.

2 Background and Related Work

In this section we briefly review the existing margin bounds and the two boosting algorithms.

Consider binary classification problems. Examples are drawn independently according to an underlying distribution $D$ over $X \times \{-1, +1\}$, where $X$ is an instance space. Let $H$ denote the space from which the base hypotheses are chosen. A base hypothesis $h \in H$ is a mapping from $X$ to $\{-1, +1\}$. A voting classifier $f(x)$ is of the form
$$f(x) = \sum_i \alpha_i h_i(x), \qquad \sum_i \alpha_i = 1, \quad \alpha_i \geq 0.$$
An error occurs on an example $(x, y)$ if and only if $y f(x) \leq 0$. We use $P_D(A(x, y))$ to denote the probability of the event $A$ when an example $(x, y)$ is chosen randomly according to the distribution $D$. Therefore, $P_D(yf(x) \leq 0)$ is the generalization error which we want to bound. We also use $P_S(A(x, y))$ to denote the probability with respect to choosing an example $(x, y)$ uniformly at random from the training set $S$.

For an example $(x, y)$, the value of $yf(x)$ reflects the confidence of the prediction. Since each base classifier outputs $-1$ or $+1$, one has
$$yf(x) = \sum_{i:\, y = h_i(x)} \alpha_i \;-\; \sum_{i:\, y \neq h_i(x)} \alpha_i.$$
Hence $yf(x)$ is the difference between the weights assigned to those base classifiers that correctly classify $(x, y)$ and the weights assigned to those that misclassify the example. $yf(x)$ is called the margin for $(x, y)$ with respect to $f$. If we consider the margins over the whole set of training examples, we can regard $P_S(yf(x) \leq \theta)$ as a distribution over $\theta \in [-1, 1]$, since $P_S(yf(x) \leq \theta)$ is the fraction of training examples whose margin is at most $\theta$. This distribution is referred to as the margin distribution.
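To make these definitions concrete, the following sketch (NumPy-based; the helper names are ours, not from the paper) computes the margins $yf(x)$, the empirical margin distribution $P_S(yf(x) \leq \theta)$ and the minimum margin for a voting classifier given by base classifiers $h_i$ and normalized weights $\alpha_i$:

```python
import numpy as np

def margins(X, y, base_classifiers, alphas):
    """Margins y*f(x) of the voting classifier f(x) = sum_i alpha_i h_i(x).

    `base_classifiers` is a list of callables h_i mapping a sample matrix to
    {-1, +1} predictions; `alphas` are nonnegative weights summing to one.
    """
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()                   # enforce sum_i alpha_i = 1
    H = np.array([h(X) for h in base_classifiers])   # shape (T, n), entries in {-1, +1}
    return y * (alphas @ H)                          # margins lie in [-1, 1]

def margin_distribution(m, theta):
    """Empirical margin distribution P_S(y f(x) <= theta)."""
    return float(np.mean(m <= theta))

def minimum_margin(m):
    """Smallest margin over the training set."""
    return float(np.min(m))
```

Evaluating margin_distribution over a grid of $\theta$ values produces the cumulative margin curves that the margin theory reasons about.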
The minimum margin of $f$, which is the smallest margin over the training examples, can then be equivalently represented as the maximum value of $\theta$ such that $P_S(yf(x) < \theta) = 0$.

A unified description of AdaBoost and arc-gv is shown in Algorithm 1.

Algorithm 1: A unified description of AdaBoost and arc-gv.

Input: $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in X$, $y_i \in \{-1, 1\}$.
Initialization: $D_1(i) = 1/n$.
for $t = 1$ to $T$ do
  1. Train the base learner using distribution $D_t$.
  2. Get a base classifier $h_t: X \to \{-1, 1\}$.
  3. Choose $\alpha_t$.
  4. Update: $D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i)) / Z_t$, where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ is a distribution.
end
Output: the final classifier $H(x) = \mathrm{sgn}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.

The only difference between the two algorithms is the choice of $\alpha_t$. AdaBoost sets $\alpha_t$ as
$$\alpha_t = \frac{1}{2}\log\frac{1+\gamma_t}{1-\gamma_t},$$
where $\gamma_t$ is the edge of the base classifier $h_t$, defined as
$$\gamma_t = \sum_{i=1}^n D_t(i)\, y_i h_t(x_i).$$
The edge $\gamma_t$ is an affine transformation of the error rate of $h_t$ with respect to the distribution $D_t$. Arc-gv chooses $\alpha_t$ in a different way: it takes into consideration the minimum margin of the composite classifier up to the current round. Denote by $\rho_t$ the minimum margin of the voting classifier of round $t-1$, that is,
$$\rho_t = \min_i \frac{y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}{\sum_{s=1}^{t-1} \alpha_s}.$$
Let
$$\beta_t = \frac{1}{2}\log\frac{1+\gamma_t}{1-\gamma_t} - \frac{1}{2}\log\frac{1+\rho_t}{1-\rho_t}.$$
Arc-gv sets $\alpha_t$ as [Bre99]
$$\alpha_t = \begin{cases} 1 & \beta_t > 1, \\ \beta_t & 0 \leq \beta_t \leq 1, \\ 0 & \beta_t < 0. \end{cases}$$
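The sketch below is a minimal implementation of the unified loop of Algorithm 1 covering both choices of $\alpha_t$; the base learner is passed in as a callable, and the clipping constants used for numerical stability are ours (the paper does not specify such details).

```python
import numpy as np

def boost(X, y, base_learner, T, rule="adaboost"):
    """Unified AdaBoost / arc-gv loop of Algorithm 1 (illustrative sketch).

    `base_learner(X, y, D)` is assumed to return a function h: X -> {-1, +1}
    trained to (approximately) maximize the edge under the weights D.
    """
    n = len(y)
    D = np.full(n, 1.0 / n)
    hs, alphas = [], []
    for t in range(T):
        h = base_learner(X, y, D)
        pred = h(X)
        gamma = np.clip(np.sum(D * y * pred), -1 + 1e-12, 1 - 1e-12)  # edge of h_t
        if rule == "adaboost":
            alpha = 0.5 * np.log((1 + gamma) / (1 - gamma))
        else:  # arc-gv
            if alphas and np.sum(alphas) > 0:
                agg = np.sum([a * g(X) for a, g in zip(alphas, hs)], axis=0)
                rho = np.min(y * agg) / np.sum(alphas)   # minimum margin of rounds 1..t-1
            else:
                rho = 0.0
            rho = np.clip(rho, -1 + 1e-12, 1 - 1e-12)
            beta = 0.5 * np.log((1 + gamma) / (1 - gamma)) \
                 - 0.5 * np.log((1 + rho) / (1 - rho))
            alpha = float(np.clip(beta, 0.0, 1.0))       # arc-gv clips beta_t to [0, 1]
        # D_{t+1}(i) = D_t(i) exp(-alpha * y_i * h_t(x_i)) / Z_t
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()
        hs.append(h)
        alphas.append(alpha)

    def final(X_new):
        """H(x) = sgn(sum_t alpha_t h_t(x))."""
        return np.sign(np.sum([a * g(X_new) for a, g in zip(alphas, hs)], axis=0))

    return final, hs, np.array(alphas)
```

With rule="adaboost" the update reduces to the usual AdaBoost step, while with rule="arc-gv" the minimum-margin correction is subtracted before $\beta_t$ is clipped to $[0, 1]$.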
The first margin explanation of the AdaBoost algorithm [SFBL98] is to upper bound the generalization error of voting classifiers in terms of the margin distribution, the number of training examples and the complexity of the set from which the base classifiers are chosen. The theory contains two bounds: one applies to the case that the base classifier set $H$ is finite, and the other applies to the general case that $H$ has a finite VC dimension.

Theorem 1 [SFBL98] For any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bounds:
$$P_D(yf(x) \leq 0) \leq \inf_{\theta \in (0,1]} \left[ P_S(yf(x) \leq \theta) + O\left( \frac{1}{\sqrt{n}} \left( \frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta} \right)^{1/2} \right) \right], \quad \text{if } |H| < \infty,$$
and
$$P_D(yf(x) \leq 0) \leq \inf_{\theta \in (0,1]} \left[ P_S(yf(x) \leq \theta) + O\left( \frac{1}{\sqrt{n}} \left( \frac{d \log^2(n/d)}{\theta^2} + \log\frac{1}{\delta} \right)^{1/2} \right) \right],$$
where $d$ is the VC dimension of $H$.

The theorem states that if the voting classifier generates a good margin distribution, that is, most training examples have large margins so that $P_S(yf(x) \leq \theta)$ is small for a $\theta$ that is not too small, then the generalization error is also small. In [SFBL98] it has also been shown that for the AdaBoost algorithm, $P_S(yf(x) \leq \theta)$ decreases to zero exponentially fast with respect to the number of boosting iterations if $\theta$ is not too large. These results imply that the excellent performance of AdaBoost is due to its good margin distribution.

Breiman's doubts on the margin explanation came from the arc-gv algorithm. It can be shown that the minimum margin generated by arc-gv converges to the largest possible value among all voting classifiers. In practice, arc-gv has larger minimum margins than AdaBoost in most cases for a finite number of boosting iterations. Breiman also proved an upper bound for the generalization error of voting classifiers. This bound depends only on the minimum margin, not on the entire margin distribution.

Theorem 2 [Bre99] Let $\theta_0$ be the minimum margin, defined as
$$\theta_0 = \min\{ yf(x) : (x, y) \in S \}, \qquad (1)$$
where $S$ is the training set. If $|H| < \infty$,
$$\theta_0 > 4\sqrt{\frac{2}{|H|}}, \qquad R = \frac{32 \log(2|H|)}{n\theta_0^2} \leq 2n,$$
then for any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bound:
$$P_D(yf(x) \leq 0) \leq R\left( \log(2n) + \log\frac{1}{R} + 1 \right) + \frac{1}{n}\log\frac{|H|}{\delta}. \qquad (2)$$

Breiman pointed out that his bound is sharper than the margin distribution bound of Schapire et al. If $\theta$ in Theorem 1 is taken to be the minimum margin $\theta_0$, the bound in Theorem 2 is about the square of the bound in terms of the margin distribution, since the bound in Theorem 2 is $O(\log n / n)$ while the bound in Theorem 1 is $O(\sqrt{\log n / n})$. Breiman then argued that, compared to the margin distribution explanation, his bound implied more strongly that the minimum margin governs the generalization error. However, arc-gv performs almost consistently worse than AdaBoost in experiments.[2] These empirical results contradict what the margin theory predicts and therefore put the margin explanation into serious doubt.

[2] Actually, the inferior performance has also been observed when using other voting classifiers that maximize the minimum margin (see also [GS98, RW02]).

Much effort has been devoted in recent years to providing better explanations of the boosting algorithms [MBG02, KP02, KP05, AKLL02]. Koltchinskii and Panchenko [KP02, KP05] proved a number of bounds in terms of the margin distribution which are sharper than Theorem 1. However, it is difficult to compare the minimum margin bound to these bounds since they contain unspecified constants. Nevertheless, these results imply that the margin distribution might be more important than the minimum margin for the generalization error of voting classifiers.

3 Main Results

In this section we propose upper bounds in terms of the Emargin. The bounds are uniformly sharper than Breiman's minimum margin bound. First let us introduce some notions. Consider the Bernoulli relative entropy function $D(q\|p)$ defined as
$$D(q\|p) = q \log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}, \qquad 0 \leq p, q \leq 1.$$
For a fixed $q$, $D(q\|p)$ is a monotone increasing function of $p$ for $q \leq p \leq 1$. It is easy to check that $D(q\|p) = 0$ when $p = q$, and $D(q\|p) \to \infty$ as $p \to 1$. Thus one can define the inverse function of $D(q\|p)$ for fixed $q$ as $D^{-1}(q, u)$, such that $D(q\|D^{-1}(q, u)) = u$ for all $u \geq 0$ and $D^{-1}(q, u) \geq q$. See also [Lan05].

The next theorem is our main result: the Emargin bound. Here we consider the case that the base classifier set $H$ is finite. For the case that $H$ is infinite but has a finite VC dimension, the bound is more complicated and will be given in Theorem 8. All the proofs can be found in Section 5.

Theorem 3 If $|H| < \infty$, then for any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bound:
$$P_D(yf(x) \leq 0) \leq \frac{\log |H|}{n} + \inf_{q \in \{0, \frac{1}{n}, \frac{2}{n}, \ldots, 1\}} D^{-1}\left(q, u[\hat\theta(q)]\right), \qquad (3)$$
where
$$u[\hat\theta(q)] = \frac{1}{n}\left( \frac{8}{\hat\theta^2(q)} \log\left(\frac{2n^2}{\log |H|}\right) \log |H| + \log |H| + \log\frac{n}{\delta} \right),$$
and $\hat\theta(q)$ is given by
$$\hat\theta(q) = \sup\left\{ \theta \in \left(\sqrt{8/|H|},\ 1\right] : P_S(yf(x) \leq \theta) \leq q \right\}. \qquad (4)$$

Clearly the key factors in this bound are the optimal $q$ and the corresponding $\hat\theta(q)$.

Definition 4 Let $q^*$ be the optimal $q$ in Eq. (3), and denote $\theta^* = \hat\theta(q^*)$. We call $\theta^*$ the Equilibrium margin (Emargin).

The name equilibrium is due to the following fact.

Proposition 5 $q^*$ is the empirical error at the Emargin $\theta^*$:
$$P_S(yf(x) < \theta^*) = q^*. \qquad (5)$$

With Definition 4, the Emargin bound (3) can be simply written as
$$P_D(yf(x) \leq 0) \leq \frac{\log |H|}{n} + D^{-1}\left(q^*, u[\theta^*]\right). \qquad (6)$$
Theorem 3 then states that the generalization error of a voting classifier depends on its Emargin and the empirical error at the Emargin. Our Emargin bound has a similar flavor to Theorem 1. Note that the Emargin depends, in a complicated way, on the whole margin distribution. Roughly, if most training examples have large margins, then $\theta^*$ is large and $q^*$ is small. The minimum margin is only a special case of the Emargin: from Eq. (4) one can see that $\hat\theta(0)$ is the minimum margin. Hence the Emargin is equal to the minimum margin if and only if the optimal $q^*$ is zero.
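As an illustration of how the Emargin bound can be evaluated on a sample of margins, the sketch below computes $D(q\|p)$, its partial inverse $D^{-1}(q, u)$ by bisection, $\hat\theta(q)$, and the minimizing pair $(q^*, \theta^*)$. The function names and the bisection tolerance are ours, and the expression for $u[\hat\theta(q)]$ mirrors the statement of Theorem 3.

```python
import numpy as np

def bernoulli_kl(q, p, eps=1e-12):
    """Bernoulli relative entropy D(q||p)."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q, u, tol=1e-10):
    """Partial inverse D^{-1}(q, u): the p in [q, 1] with D(q||p) = u (bisection)."""
    lo, hi = q, 1.0 - 1e-12
    if bernoulli_kl(q, hi) <= u:
        return 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(q, mid) <= u:
            lo = mid
        else:
            hi = mid
    return hi

def emargin_bound(margins, H_size, delta=0.05):
    """Evaluate the right-hand side of the Emargin bound (3) on a margin sample.

    Returns (bound, Emargin theta*, Emargin error q*).  H_size plays the role
    of |H| and should exceed 8 so that sqrt(8/|H|) < 1.
    """
    m = np.sort(np.asarray(margins, dtype=float))
    n = len(m)
    theta_low = np.sqrt(8.0 / H_size)
    best = (np.inf, None, None)
    for k in range(n + 1):                      # q runs over {0, 1/n, ..., 1}
        q = k / n
        # theta_hat(q) = sup{theta <= 1 : P_S(yf(x) <= theta) <= q}:
        # the (k+1)-th smallest margin, capped at 1.
        theta = 1.0 if k == n else min(1.0, m[k])
        if theta <= theta_low:
            continue                            # Eq. (4) requires theta > sqrt(8/|H|)
        u = (8.0 / theta**2 * np.log(2 * n**2 / np.log(H_size)) * np.log(H_size)
             + np.log(H_size) + np.log(n / delta)) / n
        value = np.log(H_size) / n + kl_inverse(q, u)
        if value < best[0]:
            best = (value, theta, q)
    return best
```

Running emargin_bound on the margins produced by AdaBoost and by arc-gv over the same base classifier set gives the kind of Emargin and Emargin-error comparison reported in Section 6.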
We next compare our Emargin bound to Breiman's minimum margin bound, and show that the Emargin bound is uniformly sharper.

Theorem 6 The bound given in Theorem 3 is uniformly sharper than the minimum margin bound in Theorem 2. That is,
$$\frac{\log |H|}{n} + D^{-1}\left(q^*, u[\theta^*]\right) \leq R\left(\log(2n) + \log\frac{1}{R} + 1\right) + \frac{1}{n}\log\frac{|H|}{\delta},$$
where
$$R = \frac{32 \log(2|H|)}{n\theta_0^2} \leq 2n$$
as in Theorem 2.

According to this theorem, the minimum margin is not crucial for the generalization error; that is, a larger minimum margin does not necessarily imply a smaller test error. Thus arc-gv does not necessarily perform better than AdaBoost. Our new bound implies that it is the Emargin $\theta^*$ and the empirical error $q^*$ at $\theta^*$ that govern the performance of the classifier. The following theorem describes how the Emargin and the Emargin error $q^*$ affect the generalization ability. It states that a larger Emargin and a smaller Emargin error result in a lower generalization error bound.

Theorem 7 Let $f_1$, $f_2$ be two voting classifiers. Denote by $\theta^*_1$, $\theta^*_2$ their Emargins and by $q_1$, $q_2$ the empirical errors at $\theta^*_1$, $\theta^*_2$ respectively; that is,
$$q_i = P_S\left(y f_i(x) < \theta^*_i\right), \qquad i = 1, 2.$$
Also denote by $B_1$, $B_2$ the Emargin upper bounds on the generalization error of $f_1$, $f_2$ (i.e., the right-hand side of Eq. (3)). Then
$$B_1 \leq B_2 \qquad \text{if } \theta^*_1 \geq \theta^*_2 \text{ and } q_1 \leq q_2.$$

Theorem 7 suggests that the Emargin and the Emargin error can be used as measures of the goodness of a margin distribution. A large Emargin and a small Emargin error indicate a good margin distribution. Experimental results in Section 6 show that AdaBoost usually has larger Emargins and smaller Emargin errors than arc-gv.

The last theorem of this section is the Emargin bound for the case that the set of base classifiers has a finite VC dimension.

Theorem 8 Suppose the set of base classifiers $H$ has VC dimension $d$. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bound:
$$P_D(yf(x) \leq 0) \leq \frac{d^2 + 1}{n} + \frac{n}{n-1} \cdot \inf_{q \in \{0, \frac{1}{n}, \frac{2}{n}, \ldots, 1\}} D^{-1}\left(q, u[\hat\theta(q)]\right),$$
where
$$u[\hat\theta(q)] = \frac{1}{n}\left( \frac{16 d}{\hat\theta^2(q)} \log\frac{e n^2}{d} \log\left(\frac{16 n}{\hat\theta^2(q)\, d} + 1\right) + 3\log\frac{2n}{\delta} \right),$$
and $\hat\theta(q)$ is
$$\hat\theta(q) = \sup\left\{ \theta \in (0, 1] : P_S(yf(x) \leq \theta) \leq q \right\}. \qquad (7)$$

4 Explanation of the Emargin Bound

In Theorem 3, we adopt the partial inverse of the relative entropy to upper bound the generalization error. The key term in the Emargin bound is $\inf_q D^{-1}(q, u[\hat\theta(q)])$. To better understand the bound, we make use of three different upper bounds of $\inf_q D^{-1}(q, u[\hat\theta(q)])$ to obtain simpler forms of the Emargin bound. We list these upper bounds in the following lemma.

Lemma 9 The following bounds hold.

1. $\inf_q D^{-1}\left(q, u[\hat\theta(q)]\right) \leq D^{-1}\left(0, u[\hat\theta(0)]\right) \leq u[\hat\theta(0)]$.

2. $\inf_q D^{-1}\left(q, u[\hat\theta(q)]\right) \leq \inf_q \left( q + \left( u[\hat\theta(q)] / 2 \right)^{1/2} \right)$.

3. $\inf_q D^{-1}\left(q, u[\hat\theta(q)]\right) \leq \inf_{q \leq C u[\hat\theta(q)]} D^{-1}\left(q, u[\hat\theta(q)]\right) \leq \inf_{q \leq C u[\hat\theta(q)]} C' u[\hat\theta(q)]$, where $C > 0$ is any constant and $C' = \max(2C, 8)$.
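The pointwise facts behind the first two inequalities of Lemma 9, namely $D^{-1}(0, u) \leq u$ and $D^{-1}(q, u) \leq q + \sqrt{u/2}$, can be spot-checked numerically with the same bisection-based inverse as above (a throwaway script, not part of the paper):

```python
import numpy as np

def bernoulli_kl(q, p, eps=1e-12):
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q, u, tol=1e-10):
    lo, hi = q, 1.0 - 1e-12
    if bernoulli_kl(q, hi) <= u:
        return 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if bernoulli_kl(q, mid) <= u else (lo, mid)
    return hi

# Check D^{-1}(0, u) <= u and D^{-1}(q, u) <= q + sqrt(u/2) on a coarse grid.
for q in np.linspace(0.0, 0.9, 10):
    for u in np.linspace(0.01, 1.0, 10):
        p = kl_inverse(q, u)
        if q == 0.0:
            assert p <= u + 1e-6
        assert p <= q + np.sqrt(u / 2) + 1e-6
print("Lemma 9 (items 1 and 2) spot-check passed")
```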
Note from Theorem 3 that
$$u[\hat\theta(q)] = O\left( \frac{1}{n}\left( \frac{\log n \log |H|}{\hat\theta^2(q)} + \log\frac{1}{\delta} \right) \right) \qquad \text{and} \qquad q = P_S\left(yf(x) \leq \hat\theta(q)\right).$$
Thus we can derive the following three bounds from the Emargin bound by using the three inequalities in Lemma 9 respectively.

Corollary 10 If $|H| < \infty$, then for any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bounds:

1. $$P_D(yf(x) \leq 0) \leq O\left( \frac{1}{n}\left( \frac{\log n \log |H|}{\theta_0^2} + \log\frac{1}{\delta} \right) \right),$$ where $\theta_0$ is the minimum margin.

2. $$P_D(yf(x) \leq 0) \leq \inf_{\theta \in (0,1]} \left[ P_S(yf(x) \leq \theta) + O\left( \frac{1}{\sqrt{n}} \left( \frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta} \right)^{1/2} \right) \right].$$

3. $$P_D(yf(x) \leq 0) \leq O\left( \frac{1}{n}\left( \frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta} \right) \right)$$ for all $\theta$ such that $$P_S(yf(x) \leq \theta) \leq O\left( \frac{1}{n}\left( \frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta} \right) \right).$$

The first bound in the Corollary has the same order of magnitude as the minimum margin bound. The second bound is the same as Theorem 1. So essentially, previous bounds can be derived from the Emargin bound. The third bound in the Corollary is new. It states that the generalization error is $O\left(\frac{\log n \log |H|}{n\theta^2}\right)$ even in the non-zero error case, provided the margin error $P_S(yf(x) \leq \theta)$ is small enough.

5 Proofs

In this section, we give the proofs of the theorems, lemmas and corollaries.

5.1 Proof of Theorem 3

The proof uses the tools developed in [SFBL98]. The difference is that we do not bound the deviation of the generalization error from the empirical margin error directly; instead we consider the difference between the generalization error and a zero-one function of a certain empirical measure. This allows us to unify the zero-error and nonzero-error cases, and it results in a sharper bound.

For the sake of convenience, we follow the convention in [SFBL98]. Let $C(H)$ denote the convex hull of $H$. Also let $C_N(H)$ denote the set of unweighted averages over $N$ elements from the base classifier set $H$. Formally,
$$C_N(H) = \left\{ g : g = \frac{1}{N} \sum_{j=1}^N h_j,\ h_j \in H \right\}.$$
Any voting classifier $f = \sum_i \alpha_i h_i \in C(H)$, where $\sum_i \alpha_i = 1$ and $\alpha_i \geq 0$, can be associated with a distribution over $H$ given by the coefficients $\{\alpha_i\}$. We denote this distribution by $\tilde{Q}(f)$. By choosing $N$ elements independently and randomly from $H$ according to $\tilde{Q}(f)$, we can generate a classifier $g \in C_N(H)$. The distribution of $g$ is denoted by $Q(f)$. For any fixed $\theta$ ($0 < \theta < 1$),
$$P_D(yf(x) \leq 0) \leq P_{D,\, g \sim Q(f)}(yg(x) \leq \theta) + P_{D,\, g \sim Q(f)}(yg(x) > \theta,\ yf(x) \leq 0) \leq P_{D,\, g \sim Q(f)}(yg(x) \leq \theta) + \exp\left(-\frac{N\theta^2}{2}\right). \qquad (8)$$

We next bound the first term on the right-hand side of the inequality. For any fixed $g \in C_N(H)$, and for any positive number $\epsilon$ and nonnegative integer $k$ such that $k \leq \epsilon n$, we consider the probability (over the random draw of the $n$ training examples) that the training error of $g$ at margin $\theta$ is at most $k/n$ while the true error of $g$ at margin $\theta$ is larger than $\epsilon$. A compact representation of this probability is
$$\Pr_{S \sim D^n}\left( P_D(yg(x) \leq \theta) > I\left[ P_S(yg(x) \leq \theta) > \frac{k}{n} \right] + \epsilon \right),$$
where $\Pr_{S \sim D^n}$ denotes the probability over $n$ training samples chosen independently at random according to $D$, and $I$ is the indicator function. Note that
$$\Pr_{S \sim D^n}\left( P_D(yg(x) \leq \theta) > I\left[ P_S(yg(x) \leq \theta) > \frac{k}{n} \right] + \epsilon \right) \leq \Pr_{S \sim D^n}\left( P_S(yg(x) \leq \theta) \leq \frac{k}{n},\ P_D(yg(x) \leq \theta) > \epsilon \right) \leq \sum_{r=0}^{k} \binom{n}{r} \epsilon^r (1-\epsilon)^{n-r}.$$
Then, applying the relative entropy Chernoff bound to the Bernoulli trials, we further have
$$\sum_{r=0}^{k} \binom{n}{r} \epsilon^r (1-\epsilon)^{n-r} \leq \exp\left( -n D\left( \frac{k}{n} \Big\| \epsilon \right) \right).$$
We thus obtain
$$\Pr_{S \sim D^n}\left( P_D(yg(x) \leq \theta) > I\left[ P_S(yg(x) \leq \theta) > \frac{k}{n} \right] + \epsilon \right) \leq \exp\left( -n D\left( \frac{k}{n} \Big\| \epsilon \right) \right). \qquad (9)$$
We only consider $\theta$ at the values in the set
$$U = \left\{ \frac{1}{|H|}, \frac{2}{|H|}, \ldots, 1 \right\}.$$
There are no more than $|H|^N$ elements in $C_N(H)$.
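As an aside, the key device in this proof, approximating $f$ by an average $g \in C_N(H)$ of $N$ base classifiers sampled according to $\tilde{Q}(f)$, is easy to visualize numerically. The toy simulation below (random base classifiers and weights; all names and sizes are ours) shows that $yg(x)$ stays within a few times $1/\sqrt{N}$ of $yf(x)$, which is what drives the exponential terms in Eqs. (8) and (12).

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy voting classifier over |H| = 50 random "base classifiers" on 200 points.
n, H_size, N = 200, 50, 400
H = rng.choice([-1, 1], size=(H_size, n))            # h_i(x_j) for all base classifiers
alpha = rng.dirichlet(np.ones(H_size))               # convex weights, sum to 1
y = rng.choice([-1, 1], size=n)

f_margins = y * (alpha @ H)                           # y f(x) for every example

# Sample g in C_N(H): average of N base classifiers drawn i.i.d. from Q~(f) = alpha.
idx = rng.choice(H_size, size=N, p=alpha)
g_margins = y * H[idx].mean(axis=0)                   # y g(x)

# By Hoeffding, y g(x) concentrates around y f(x) at rate O(1/sqrt(N)) per point.
print(np.max(np.abs(g_margins - f_margins)))          # typically a few times 1/sqrt(N)
```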
Using the union bound, we get
$$\Pr_{S\sim D^n}\left( \exists g \in C_N(H),\ \exists\theta \in U:\ P_D(yg(x)\leq\theta) > I\left[P_S(yg(x)\leq\theta) > \frac{k}{n}\right] + \epsilon \right) \leq |H|^{N+1}\exp\left(-nD\left(\frac{k}{n}\Big\|\epsilon\right)\right).$$
Note that
$$E_{g\sim Q(f)}\, P_D(yg(x)\leq\theta) = P_{D,\, g\sim Q(f)}(yg(x)\leq\theta), \qquad E_{g\sim Q(f)}\, I\left[P_S(yg(x)\leq\theta) > \frac{k}{n}\right] = P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right).$$
We therefore have
$$\Pr_{S\sim D^n}\left( \exists f \in C(H),\ \exists\theta \in U:\ P_{D,\, g\sim Q(f)}(yg(x)\leq\theta) > P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right) + \epsilon \right) \leq |H|^{N+1}\exp\left(-nD\left(\frac{k}{n}\Big\|\epsilon\right)\right).$$
Let $|H|^{N+1}\exp\left(-nD\left(\frac{k}{n}\|\epsilon\right)\right) = \delta$; then
$$\epsilon = D^{-1}\left(\frac{k}{n},\ \frac{1}{n}\left((N+1)\log|H| + \log\frac{1}{\delta}\right)\right).$$
We obtain that with probability at least $1-\delta$ over the draw of the training samples, for all $f \in C(H)$ and all $\theta \in U$,
$$P_{D,\, g\sim Q(f)}(yg(x)\leq\theta) \leq P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right) + D^{-1}\left(\frac{k}{n},\ \frac{1}{n}\left((N+1)\log|H| + \log\frac{1}{\delta}\right)\right).$$
Using the union bound over $k = 0, 1, \ldots, n$, with probability at least $1-\delta$ over the draw of the training samples, for all $f \in C(H)$, all $\theta \in U$, and all $k$,
$$P_{D,\, g\sim Q(f)}(yg(x)\leq\theta) \leq P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right) + D^{-1}\left(\frac{k}{n},\ \frac{1}{n}\left((N+1)\log|H| + \log\frac{n}{\delta}\right)\right). \qquad (10)$$
We next bound the first term on the right-hand side of Eq. (10). Using the same argument as for deriving Eq. (8), we have for any $\theta' > \theta$
$$P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right) \leq I\left[P_S(yf(x)\leq\theta') > \frac{k}{n}\right] + P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n},\ P_S(yf(x)\leq\theta') \leq \frac{k}{n}\right). \qquad (11)$$
Note that the last term in Eq. (11) can be further bounded by
$$P_{g\sim Q(f)}\left(\exists (x_i, y_i) \in S:\ y_i g(x_i) \leq \theta \ \text{and}\ y_i f(x_i) > \theta'\right) \leq n\exp\left(-\frac{N(\theta'-\theta)^2}{2}\right). \qquad (12)$$
Combining (8), (10), (11) and (12), we have that with probability at least $1-\delta_N$ over the draw of the training examples, for all $f \in C(H)$, all $\theta \in U$, all $\theta' > \theta$ and all $k$, but fixed $N$,
$$P_D(yf(x)\leq 0) \leq \exp\left(-\frac{N\theta^2}{2}\right) + n\exp\left(-\frac{N(\theta'-\theta)^2}{2}\right) + I\left[P_S(yf(x)\leq\theta') > \frac{k}{n}\right] + D^{-1}\left(\frac{k}{n},\ \frac{1}{n}\left((N+1)\log|H| + \log\frac{n}{\delta_N}\right)\right).$$
Let $\theta = \frac{\theta'}{2} - \frac{\zeta}{|H|} \in U$, where $0 \leq \zeta < 1$. It is easy to check that the sum of the first two terms on the right-hand side of the above inequality can be bounded by
$$2\max\left(n,\ \exp\left(\frac{N\theta'}{2|H|}\right)\right)\exp\left(-\frac{N\theta'^2}{8}\right).$$
Letting $\delta_N = \delta\cdot 2^{-N}$, we can take a union bound over all $N$. Put
$$N = \frac{8}{\theta'^2}\log\frac{2n^2}{\log|H|},$$
and note that if $\theta' > \sqrt{8/|H|}$, then $\exp\left(\frac{N\theta'}{2|H|}\right) \leq n$, so that the sum of the first two terms is at most $\frac{\log|H|}{n}$. We obtain
$$P_D(yf(x)\leq 0) \leq \frac{\log|H|}{n} + \inf_{0\leq k\leq n}\left\{ I\left[P_S(yf(x)\leq\theta') > \frac{k}{n}\right] + D^{-1}\left(\frac{k}{n},\ u[\theta']\right)\right\},$$
where $u[\theta']$ is as defined in the statement of the theorem. Taking $q = k/n$ and $\theta' = \hat\theta(q)$, so that the indicator term vanishes, yields the bound stated in the theorem.

Proof of Theorem 6. The right-hand side of the Emargin bound (3) is the minimum over all $q$, and is therefore no larger than its value at $q = 0$, which by Lemma 9 is at most $\frac{\log|H|}{n} + u[\hat\theta(0)]$, where $\hat\theta(0) = \theta_0$ is the minimum margin. A direct calculation shows that this quantity is bounded by the right-hand side of (2) whenever $\theta_0 > 4\sqrt{2/|H|}$ and $R \leq 2n$, which proves the theorem.

The proof of Theorem 8 relies on the following uniform convergence lemma.

Lemma 12 Let $\mathcal{A}$ be a class of subsets of $X\times\{-1, 1\}$ with growth function $s(\mathcal{A}, n)$, and let $k \leq n$ be a nonnegative integer. Then for any $\epsilon > 0$,
$$\Pr_{S\sim D^n}\left( \exists A \in \mathcal{A}:\ P_D(A) > I\left[P_S(A) > \frac{k}{n}\right] + \epsilon \right) \leq 2\, s(\mathcal{A}, n^2)\exp\left(-nD\left(\frac{k}{n}\Big\|\xi\epsilon\right)\right),$$
where $\xi = 1 - \frac{1}{n-1}$.

Proof of Lemma 12. The proof is a standard symmetrization argument. We first show that for any $0 < \nu < 1$, any $\epsilon > 0$ and any integer $n' \geq n$,
$$\Pr_{S\sim D^n}\left( \exists A\in\mathcal{A}: P_D(A) > I\left[P_S(A) > \frac{k}{n}\right] + \epsilon \right) \leq \frac{1}{1 - e^{-2n'\nu^2\epsilon^2}}\ \Pr_{S\sim D^n,\, S'\sim D^{n'}}\left( \exists A\in\mathcal{A}: P_{S'}(A) > I\left[P_S(A) > \frac{k}{n}\right] + (1-\nu)\epsilon \right). \qquad (16)$$
Let $V$ denote the event
$$\sup_{A\in\mathcal{A}}\left( P_D(A) - I\left[P_S(A) > \frac{k}{n}\right]\right) > \epsilon,$$
and let $A^*$ be (one of) the optimal $A$ at which
$$\sup_{A\in\mathcal{A}}\left( P_D(A) - I\left[P_S(A) > \frac{k}{n}\right]\right)$$
achieves the maximum. Note that the following two events
$$P_D(A^*) - I\left[P_S(A^*) > \frac{k}{n}\right] > \epsilon \qquad \text{and} \qquad P_{S'}(A^*) \geq P_D(A^*) - \nu\epsilon$$
imply that
$$P_{S'}(A^*) - I\left[P_S(A^*) > \frac{k}{n}\right] > (1-\nu)\epsilon.$$
Hence
$$\Pr_{S\sim D^n,\, S'\sim D^{n'}}\left( \sup_{A\in\mathcal{A}}\left( P_{S'}(A) - I\left[P_S(A) > \frac{k}{n}\right]\right) > (1-\nu)\epsilon \right) \geq \int_V \Pr_{S'\sim D^{n'}}\left( P_{S'}(A^*) \geq P_D(A^*) - \nu\epsilon \right)\, dP \geq \left(1 - e^{-2n'\nu^2\epsilon^2}\right)\Pr_{S\sim D^n}(V),$$
where the last inequality follows from Hoeffding's inequality. This completes the proof of (16). Take $n' = n^2 - n$ and $\nu = \frac{1}{n-1}$; we have
$$\Pr_{S\sim D^n}\left( \exists A\in\mathcal{A}: P_D(A) > I\left[P_S(A) > \frac{k}{n}\right] + \epsilon \right) \leq 2\,\Pr_{S\sim D^n,\, S'\sim D^{n^2-n}}\left( \exists A\in\mathcal{A}: P_{S'}(A) > I\left[P_S(A) > \frac{k}{n}\right] + \left(1 - \frac{1}{n-1}\right)\epsilon \right).$$
Proceeding as in [Dev82] and using the relative entropy Hoeffding inequality, the lemma follows.

Proof of Theorem 8. The proof is the same as that of Theorem 3 until we have Eq. (9). Letting $\theta' = \theta/2$, we need to bound
$$\Pr_{S\sim D^n}\left( \exists g \in C_N(H),\ \exists\theta > 0:\ P_D\left(yg(x) \leq \frac{\theta}{2}\right) > I\left[P_S\left(yg(x) \leq \frac{\theta}{2}\right) > \frac{k}{n}\right] + \epsilon \right).$$
Note that we only need to consider $\theta$ at the values $0, \frac{2}{N}, \frac{4}{N}, \ldots, 1$. Let
$$A(g) = \left\{ (x, y) \in X\times\{-1, 1\}:\ yg(x) \leq \frac{\theta}{2} \right\}$$
and $\mathcal{A} = \{A(g): g \in C_N(H)\}$. By Sauer's lemma [Sau72] it is easy to see that
$$s(\mathcal{A}, n) \leq \left(\frac{en}{d}\right)^{Nd},$$
where $d$ is the VC dimension of $H$. By Lemma 12, we have
$$\Pr_{S\sim D^n}\left( \exists g \in C_N(H),\ \exists\theta > 0:\ P_D\left(yg(x)\leq\frac{\theta}{2}\right) > I\left[P_S\left(yg(x)\leq\frac{\theta}{2}\right) > \frac{k}{n}\right] + \epsilon \right) \leq 2(N+1)\left(\frac{en^2}{d}\right)^{Nd}\exp\left(-nD\left(\frac{k}{n}\Big\|\xi\epsilon\right)\right),$$
where $\xi = 1 - \frac{1}{n-1}$. Using the same argument as in the proof of Theorem 3, the theorem follows.

Proof of Lemma 9. The first inequality has already been proved in Lemma 11. For the second inequality, we only need to show $D^{-1}(q, u) \leq q + \sqrt{u/2}$, or equivalently $D(q \,\|\, q + \sqrt{u/2}) \geq u$, since $D$ is an increasing function in its second parameter. But this is immediate from a well-known result [Hoe63]: $D(q \,\|\, q + \epsilon) \geq 2\epsilon^2$. For the third inequality, we first show that for all $0 < q \leq 1$,
$$D\left(\frac{q}{2} \Big\| q\right) \geq \frac{q}{8}. \qquad (17)$$
Indeed, using $\log(1+x) \geq \frac{x}{1+x}$, we have
$$D\left(\frac{q}{2} \Big\| q\right) = \frac{q}{2}\log\frac{1}{2} + \left(1-\frac{q}{2}\right)\log\frac{1-q/2}{1-q} \geq -\frac{q}{2}\log 2 + \frac{q}{2} = \frac{q}{2}(1 - \log 2) \geq \frac{q}{8}.$$
This completes the proof of Eq. (17). Now if $q \leq C u[\hat\theta(q)]$, recall that $C' = \max(2C, 8)$, and note that $D^{-1}$ is an increasing function in its first and second parameters respectively. We have
$$D^{-1}\left(q, u[\hat\theta(q)]\right) \leq D^{-1}\left(\frac{C'}{2} u[\hat\theta(q)],\ u[\hat\theta(q)]\right) \leq D^{-1}\left(\frac{C'}{2} u[\hat\theta(q)],\ \frac{C'}{8} u[\hat\theta(q)]\right) \leq C'\, u[\hat\theta(q)].$$
The lemma then follows.

6 Experiments