On the Margin Explanation of Boosting Algorithms

Liwei Wang^1, Masashi Sugiyama^2, Cheng Yang^1, Zhi-Hua Zhou^3, and Jufu Feng^1

^1 Key Laboratory of Machine Perception, MOE, School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, P.R. China. {wanglw,yangch,fjf}@cis.pku.edu.cn
^2 Department of Computer Science, Tokyo Institute of Technology, 2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan. sugi@cs.titech.ac.jp
^3 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, P.R. China. zhouzh@nju.edu.cn

This work was supported by NSFC (60775005, 60635030, 60721002) and the Global COE Program of Tokyo Institute of Technology.

Abstract

Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, Breiman raised important questions about the margin explanation by developing a boosting algorithm, arc-gv, that provably generates a larger minimum margin than AdaBoost. He also gave a sharper bound in terms of the minimum margin, and argued that the minimum margin governs the generalization. In experiments, however, arc-gv usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we try to give a complete answer to Breiman's critique by proving a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. This result suggests that the minimum margin is not crucial for the generalization error. We also show that a large Emargin implies good generalization. Experimental results on benchmark datasets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than arc-gv, which agrees well with our theory.

1 Introduction

The AdaBoost algorithm [FS96, FS97] has achieved great success in the past ten years. It has demonstrated excellent experimental performance both on benchmark datasets and in real applications [BK99, Die00, VJ01]. It is observed in experiments that the test error of a combined voting classifier usually keeps decreasing as its size becomes very large, even after the training error reaches zero [Bre98, Qui96]. At first sight, this fact obviously violates Occam's razor.

Schapire et al. [SFBL98] tried to explain this phenomenon in terms of the margins of the training examples. Roughly speaking, the margin of an example with respect to a classifier is a measure of the confidence of the classification result. Schapire et al. [SFBL98] proved an upper bound for the generalization error of a voting classifier that does not depend on how many classifiers were combined, but only on the margin distribution over the training set, the number of training examples and the size (the VC dimension, for example) of the set of base classifiers. They also demonstrated that AdaBoost has the ability to produce a good margin distribution. This theory indicates that producing a good margin distribution is the key to the success of AdaBoost, and it explains well the surprising phenomenon observed in experiments. Soon afterwards, however, Breiman [Bre99] cast serious doubt on this margin explanation.
He developed a boosting-type algorithm called arc-gv, which provably generates a larger minimum margin than AdaBoost[1] (the minimum margin is the smallest margin over all the training examples; see Section 2 for the formal definition). He then gave an upper bound for the generalization error of a voting classifier in terms of the minimum margin, as well as the number of training examples and the size of the set of base classifiers. This bound is sharper than the bound based on the margin distribution given by Schapire et al. Breiman argued that if the bound of Schapire et al. implied that the margin distribution is the key to the generalization error, his bound implied even more strongly that the minimum margin is the key, and that arc-gv should achieve the best performance among all boosting-type algorithms. In experiments, however, even though arc-gv always produces larger minimum margins than AdaBoost, its test error is consistently higher. Breiman also investigated the margin distributions generated by AdaBoost and arc-gv, and found that arc-gv actually produced uniformly better margin distributions than AdaBoost. He thus concluded that neither the minimum margin nor the margin distribution determines the generalization error, and that a new theoretical explanation is needed.

[1] Actually, the minimum margin of arc-gv converges to the largest possible value among all voting classifiers.

Breiman's argument seems convincing and put the margin explanation into serious doubt. Recently, however, Reyzin and Schapire [RS06] made an important discovery after a careful study of Breiman's arc-gv algorithm. Note first that the bounds of both Breiman and Schapire et al. state that the generalization error depends on the complexity of the set of base classifiers as well as on the minimum margin or the margin distribution. To investigate how the margin affects the generalization error, one has to keep the complexity of the base classifiers fixed. In Breiman's experiments, he tried to control this by always using CART trees [BFOS84] with a fixed number of leaves as the base classifiers. Reyzin and Schapire re-conducted Breiman's experiments and found that the trees produced by arc-gv were much deeper than those produced by AdaBoost. Since deeper trees are more complex even when the number of leaves is the same, arc-gv used base classifiers of higher complexity than AdaBoost in Breiman's experiments, so the comparison was not fair. In order to study the margin explanation in a fair manner, a more controlled setting is needed. Reyzin and Schapire therefore compared arc-gv and AdaBoost using decision stumps, whose complexity is fixed, as the base classifiers. Experiments showed that arc-gv produced larger minimum margins yet still a higher error rate. This time, however, the margin distribution generated by arc-gv was not as "good" as the one AdaBoost generated (see Fig. 7 in [RS06]). They thus argued that, according to the bound of Schapire et al. in terms of the margin distribution, the empirical observation, i.e., the inferior performance of arc-gv, could be explained.

From a more critical point of view, however, Breiman's doubt has not been fully answered by the above results. First of all, Breiman backed up his argument with a sharper bound in terms of the minimum margin. In Reyzin and Schapire's experiment with decision stumps, arc-gv still produced a larger minimum margin and had worse performance.
Even though AdaBoost generates a "better" margin distribution than arc-gv, this does not disprove Breiman's critique unless we can show a bound that is in terms of the margin distribution and is uniformly sharper than Breiman's minimum margin bound. Another problem is how to measure the "goodness" of a margin distribution. The statement that AdaBoost generates "better" margin distributions than arc-gv is vague. Reyzin and Schapire used the average margin as a measure to compare the margin distributions produced by AdaBoost and arc-gv. But the average margin does not explicitly appear in the bound of Schapire et al., so a larger average margin does not necessarily imply a smaller generalization error in theory.

In this paper, we try to give a complete answer to Breiman's doubt by solving the two problems mentioned above. We first propose a novel upper bound for the generalization error of voting classifiers. This bound is uniformly sharper than Breiman's bound. The key factor in this bound is a new margin notion which we refer to as the Equilibrium margin (Emargin). The Emargin can be viewed as a measure of how good a margin distribution is. In fact, the Emargin depends, in a complicated way, on the margin distribution, and has little relation to the minimum margin. Experimental results show that AdaBoost usually produces a larger Emargin than arc-gv when the complexity of the base classifiers is well controlled. Our results thus explain the inferior performance of arc-gv and give Breiman's doubt a negative answer.

The rest of this paper is organized as follows: In Section 2 we briefly describe the margin theory of Schapire et al. and Breiman's argument. Our main results are given in Section 3. We provide further explanation of the main bound in Section 4. All the proofs can be found in Section 5. We provide experimental justification in Section 6 and conclude in Section 7.

2 Background and Related Work

In this section we briefly review the existing margin bounds and the two boosting algorithms.

Consider binary classification problems. Examples are drawn independently according to an underlying distribution $D$ over $X \times \{-1, +1\}$, where $X$ is an instance space. Let $H$ denote the space from which the base hypotheses are chosen. A base hypothesis $h \in H$ is a mapping from $X$ to $\{-1, +1\}$. A voting classifier $f(x)$ is of the form
$$f(x) = \sum_i \alpha_i h_i(x), \qquad \sum_i \alpha_i = 1, \quad \alpha_i \geq 0.$$
An error occurs on an example $(x, y)$ if and only if $y f(x) \leq 0$. We use $P_D(A(x, y))$ to denote the probability of the event $A$ when an example $(x, y)$ is chosen randomly according to the distribution $D$. Therefore, $P_D(yf(x) \leq 0)$ is the generalization error which we want to bound. We also use $P_S(A(x, y))$ to denote the probability with respect to choosing an example $(x, y)$ uniformly at random from the training set $S$.

For an example $(x, y)$, the value of $yf(x)$ reflects the confidence of the prediction. Since each base classifier outputs $-1$ or $+1$, one has
$$yf(x) = \sum_{i:\, y = h_i(x)} \alpha_i \;-\; \sum_{i:\, y \neq h_i(x)} \alpha_i.$$
Hence $yf(x)$ is the difference between the weights assigned to those base classifiers that correctly classify $(x, y)$ and the weights assigned to those that misclassify the example. $yf(x)$ is called the margin for $(x, y)$ with respect to $f$. If we consider the margins over the whole set of training examples, we can regard $P_S(yf(x) \leq \theta)$ as a distribution over $\theta \in [-1, 1]$, since $P_S(yf(x) \leq \theta)$ is the fraction of training examples whose margin is at most $\theta$. This distribution is referred to as the margin distribution.
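To make these definitions concrete, the following sketch (NumPy-based; the helper names are ours, not from the paper) computes the margins $yf(x)$, the empirical margin distribution $P_S(yf(x) \leq \theta)$ and the minimum margin for a voting classifier given by base classifiers $h_i$ and normalized weights $\alpha_i$:

```python
import numpy as np

def margins(X, y, base_classifiers, alphas):
    """Margins y*f(x) of the voting classifier f(x) = sum_i alpha_i h_i(x).

    `base_classifiers` is a list of callables h_i mapping a sample matrix to
    {-1, +1} predictions; `alphas` are nonnegative weights summing to one.
    """
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()                   # enforce sum_i alpha_i = 1
    H = np.array([h(X) for h in base_classifiers])   # shape (T, n), entries in {-1, +1}
    return y * (alphas @ H)                          # margins lie in [-1, 1]

def margin_distribution(m, theta):
    """Empirical margin distribution P_S(y f(x) <= theta)."""
    return float(np.mean(m <= theta))

def minimum_margin(m):
    """Smallest margin over the training set."""
    return float(np.min(m))
```

Evaluating margin_distribution over a grid of $\theta$ values produces the cumulative margin curves that the margin theory reasons about.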
The minimum margin of $f$, which is the smallest margin over the training examples, can then be equivalently represented as the maximum value of $\theta$ such that $P_S(yf(x) < \theta) = 0$.

A unified description of AdaBoost and arc-gv is shown in Algorithm 1.

Algorithm 1: A unified description of AdaBoost and arc-gv.

Input: $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in X$, $y_i \in \{-1, 1\}$.
Initialization: $D_1(i) = 1/n$.
for $t = 1$ to $T$ do
  1. Train the base learner using distribution $D_t$.
  2. Get a base classifier $h_t: X \to \{-1, 1\}$.
  3. Choose $\alpha_t$.
  4. Update: $D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i)) / Z_t$, where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ is a distribution.
end
Output: the final classifier $H(x) = \mathrm{sgn}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.

The only difference between the two algorithms is the choice of $\alpha_t$. AdaBoost sets $\alpha_t$ as
$$\alpha_t = \frac{1}{2}\log\frac{1+\gamma_t}{1-\gamma_t},$$
where $\gamma_t$ is the edge of the base classifier $h_t$, defined as
$$\gamma_t = \sum_{i=1}^n D_t(i)\, y_i h_t(x_i).$$
The edge $\gamma_t$ is an affine transformation of the error rate of $h_t$ with respect to the distribution $D_t$. Arc-gv chooses $\alpha_t$ in a different way: it takes into consideration the minimum margin of the composite classifier up to the current round. Denote by $\rho_t$ the minimum margin of the voting classifier of round $t-1$, that is,
$$\rho_t = \min_i \frac{y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}{\sum_{s=1}^{t-1} \alpha_s}.$$
Let
$$\beta_t = \frac{1}{2}\log\frac{1+\gamma_t}{1-\gamma_t} - \frac{1}{2}\log\frac{1+\rho_t}{1-\rho_t}.$$
Arc-gv sets $\alpha_t$ as [Bre99]
$$\alpha_t = \begin{cases} 1 & \beta_t > 1, \\ \beta_t & 0 \leq \beta_t \leq 1, \\ 0 & \beta_t < 0. \end{cases}$$
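The sketch below is a minimal implementation of the unified loop of Algorithm 1 covering both choices of $\alpha_t$; the base learner is passed in as a callable, and the clipping constants used for numerical stability are ours (the paper does not specify such details).

```python
import numpy as np

def boost(X, y, base_learner, T, rule="adaboost"):
    """Unified AdaBoost / arc-gv loop of Algorithm 1 (illustrative sketch).

    `base_learner(X, y, D)` is assumed to return a function h: X -> {-1, +1}
    trained to (approximately) maximize the edge under the weights D.
    """
    n = len(y)
    D = np.full(n, 1.0 / n)
    hs, alphas = [], []
    for t in range(T):
        h = base_learner(X, y, D)
        pred = h(X)
        gamma = np.clip(np.sum(D * y * pred), -1 + 1e-12, 1 - 1e-12)  # edge of h_t
        if rule == "adaboost":
            alpha = 0.5 * np.log((1 + gamma) / (1 - gamma))
        else:  # arc-gv
            if alphas and np.sum(alphas) > 0:
                agg = np.sum([a * g(X) for a, g in zip(alphas, hs)], axis=0)
                rho = np.min(y * agg) / np.sum(alphas)   # minimum margin of rounds 1..t-1
            else:
                rho = 0.0
            rho = np.clip(rho, -1 + 1e-12, 1 - 1e-12)
            beta = 0.5 * np.log((1 + gamma) / (1 - gamma)) \
                 - 0.5 * np.log((1 + rho) / (1 - rho))
            alpha = float(np.clip(beta, 0.0, 1.0))       # arc-gv clips beta_t to [0, 1]
        # D_{t+1}(i) = D_t(i) exp(-alpha * y_i * h_t(x_i)) / Z_t
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()
        hs.append(h)
        alphas.append(alpha)

    def final(X_new):
        """H(x) = sgn(sum_t alpha_t h_t(x))."""
        return np.sign(np.sum([a * g(X_new) for a, g in zip(alphas, hs)], axis=0))

    return final, hs, np.array(alphas)
```

With rule="adaboost" the update reduces to the usual AdaBoost step, while with rule="arc-gv" the minimum-margin correction is subtracted before $\beta_t$ is clipped to $[0, 1]$.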
The first margin explanation of the AdaBoost algorithm [SFBL98] is to upper bound the generalization error of voting classifiers in terms of the margin distribution, the number of training examples and the complexity of the set from which the base classifiers are chosen. The theory contains two bounds: one applies to the case that the base classifier set $H$ is finite, and the other applies to the general case that $H$ has a finite VC dimension.

Theorem 1 [SFBL98] For any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bounds:
$$P_D(yf(x) \leq 0) \leq \inf_{\theta \in (0,1]} \left[ P_S(yf(x) \leq \theta) + O\left( \frac{1}{\sqrt{n}} \left( \frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta} \right)^{1/2} \right) \right], \quad \text{if } |H| < \infty,$$
and
$$P_D(yf(x) \leq 0) \leq \inf_{\theta \in (0,1]} \left[ P_S(yf(x) \leq \theta) + O\left( \frac{1}{\sqrt{n}} \left( \frac{d \log^2(n/d)}{\theta^2} + \log\frac{1}{\delta} \right)^{1/2} \right) \right],$$
where $d$ is the VC dimension of $H$.

The theorem states that if the voting classifier generates a good margin distribution, that is, most training examples have large margins so that $P_S(yf(x) \leq \theta)$ is small for a $\theta$ that is not too small, then the generalization error is also small. In [SFBL98] it has also been shown that for the AdaBoost algorithm, $P_S(yf(x) \leq \theta)$ decreases to zero exponentially fast with respect to the number of boosting iterations if $\theta$ is not too large. These results imply that the excellent performance of AdaBoost is due to its good margin distribution.

Breiman's doubts on the margin explanation came from the arc-gv algorithm. It can be shown that the minimum margin generated by arc-gv converges to the largest possible value among all voting classifiers. In practice, arc-gv has larger minimum margins than AdaBoost in most cases for a finite number of boosting iterations. Breiman also proved an upper bound for the generalization error of voting classifiers. This bound depends only on the minimum margin, not on the entire margin distribution.

Theorem 2 [Bre99] Let $\theta_0$ be the minimum margin, defined as
$$\theta_0 = \min\{ yf(x) : (x, y) \in S \}, \qquad (1)$$
where $S$ is the training set. If $|H| < \infty$,
$$\theta_0 > 4\sqrt{\frac{2}{|H|}}, \qquad R = \frac{32 \log(2|H|)}{n\theta_0^2} \leq 2n,$$
then for any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bound:
$$P_D(yf(x) \leq 0) \leq R\left( \log(2n) + \log\frac{1}{R} + 1 \right) + \frac{1}{n}\log\frac{|H|}{\delta}. \qquad (2)$$

Breiman pointed out that his bound is sharper than the margin distribution bound of Schapire et al. If $\theta$ in Theorem 1 is taken to be the minimum margin $\theta_0$, the bound in Theorem 2 is about the square of the bound in terms of the margin distribution, since the bound in Theorem 2 is $O(\log n / n)$ while the bound in Theorem 1 is $O(\sqrt{\log n / n})$. Breiman then argued that, compared to the margin distribution explanation, his bound implied more strongly that the minimum margin governs the generalization error. However, arc-gv performs almost consistently worse than AdaBoost in experiments.[2] These empirical results contradict what the margin theory predicts and therefore put the margin explanation into serious doubt.

[2] Actually, the inferior performance has also been observed when using other voting classifiers that maximize the minimum margin (see also [GS98, RW02]).

Much effort has been devoted in recent years to providing better explanations of the boosting algorithms [MBG02, KP02, KP05, AKLL02]. Koltchinskii and Panchenko [KP02, KP05] proved a number of bounds in terms of the margin distribution which are sharper than Theorem 1. However, it is difficult to compare the minimum margin bound to these bounds since they contain unspecified constants. Nevertheless, these results imply that the margin distribution might be more important than the minimum margin for the generalization error of voting classifiers.

3 Main Results

In this section we propose upper bounds in terms of the Emargin. The bounds are uniformly sharper than Breiman's minimum margin bound. First let us introduce some notions. Consider the Bernoulli relative entropy function $D(q\|p)$ defined as
$$D(q\|p) = q \log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}, \qquad 0 \leq p, q \leq 1.$$
For a fixed $q$, $D(q\|p)$ is a monotone increasing function of $p$ for $q \leq p \leq 1$. It is easy to check that $D(q\|p) = 0$ when $p = q$, and $D(q\|p) \to \infty$ as $p \to 1$. Thus one can define the inverse function of $D(q\|p)$ for fixed $q$ as $D^{-1}(q, u)$, such that $D(q\|D^{-1}(q, u)) = u$ for all $u \geq 0$ and $D^{-1}(q, u) \geq q$. See also [Lan05].

The next theorem is our main result: the Emargin bound. Here we consider the case that the base classifier set $H$ is finite. For the case that $H$ is infinite but has a finite VC dimension, the bound is more complicated and will be given in Theorem 8. All the proofs can be found in Section 5.

Theorem 3 If $|H| < \infty$, then for any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bound:
$$P_D(yf(x) \leq 0) \leq \frac{\log |H|}{n} + \inf_{q \in \{0, \frac{1}{n}, \frac{2}{n}, \ldots, 1\}} D^{-1}\left(q, u[\hat\theta(q)]\right), \qquad (3)$$
where
$$u[\hat\theta(q)] = \frac{1}{n}\left( \frac{8}{\hat\theta^2(q)} \log\left(\frac{2n^2}{\log |H|}\right) \log |H| + \log |H| + \log\frac{n}{\delta} \right),$$
and $\hat\theta(q)$ is given by
$$\hat\theta(q) = \sup\left\{ \theta \in \left(\sqrt{8/|H|},\ 1\right] : P_S(yf(x) \leq \theta) \leq q \right\}. \qquad (4)$$

Clearly the key factors in this bound are the optimal $q$ and the corresponding $\hat\theta(q)$.

Definition 4 Let $q^*$ be the optimal $q$ in Eq. (3), and denote $\theta^* = \hat\theta(q^*)$. We call $\theta^*$ the Equilibrium margin (Emargin).

The name equilibrium is due to the following fact.

Proposition 5 $q^*$ is the empirical error at the Emargin $\theta^*$:
$$P_S(yf(x) < \theta^*) = q^*. \qquad (5)$$

With Definition 4, the Emargin bound (3) can be simply written as
$$P_D(yf(x) \leq 0) \leq \frac{\log |H|}{n} + D^{-1}\left(q^*, u[\theta^*]\right). \qquad (6)$$
Theorem 3 then states that the generalization error of a voting classifier depends on its Emargin and the empirical error at the Emargin. Our Emargin bound has a similar flavor to Theorem 1. Note that the Emargin depends, in a complicated way, on the whole margin distribution. Roughly, if most training examples have large margins, then $\theta^*$ is large and $q^*$ is small. The minimum margin is only a special case of the Emargin: from Eq. (4) one can see that $\hat\theta(0)$ is the minimum margin. Hence the Emargin is equal to the minimum margin if and only if the optimal $q^*$ is zero.
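As an illustration of how the Emargin bound can be evaluated on a sample of margins, the sketch below computes $D(q\|p)$, its partial inverse $D^{-1}(q, u)$ by bisection, $\hat\theta(q)$, and the minimizing pair $(q^*, \theta^*)$. The function names and the bisection tolerance are ours, and the expression for $u[\hat\theta(q)]$ mirrors the statement of Theorem 3.

```python
import numpy as np

def bernoulli_kl(q, p, eps=1e-12):
    """Bernoulli relative entropy D(q||p)."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q, u, tol=1e-10):
    """Partial inverse D^{-1}(q, u): the p in [q, 1] with D(q||p) = u (bisection)."""
    lo, hi = q, 1.0 - 1e-12
    if bernoulli_kl(q, hi) <= u:
        return 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(q, mid) <= u:
            lo = mid
        else:
            hi = mid
    return hi

def emargin_bound(margins, H_size, delta=0.05):
    """Evaluate the right-hand side of the Emargin bound (3) on a margin sample.

    Returns (bound, Emargin theta*, Emargin error q*).  H_size plays the role
    of |H| and should exceed 8 so that sqrt(8/|H|) < 1.
    """
    m = np.sort(np.asarray(margins, dtype=float))
    n = len(m)
    theta_low = np.sqrt(8.0 / H_size)
    best = (np.inf, None, None)
    for k in range(n + 1):                      # q runs over {0, 1/n, ..., 1}
        q = k / n
        # theta_hat(q) = sup{theta <= 1 : P_S(yf(x) <= theta) <= q}:
        # the (k+1)-th smallest margin, capped at 1.
        theta = 1.0 if k == n else min(1.0, m[k])
        if theta <= theta_low:
            continue                            # Eq. (4) requires theta > sqrt(8/|H|)
        u = (8.0 / theta**2 * np.log(2 * n**2 / np.log(H_size)) * np.log(H_size)
             + np.log(H_size) + np.log(n / delta)) / n
        value = np.log(H_size) / n + kl_inverse(q, u)
        if value < best[0]:
            best = (value, theta, q)
    return best
```

Running emargin_bound on the margins produced by AdaBoost and by arc-gv over the same base classifier set gives the kind of Emargin and Emargin-error comparison reported in Section 6.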
We next compare our Emargin bound to Breiman's minimum margin bound, and show that the Emargin bound is uniformly sharper.

Theorem 6 The bound given in Theorem 3 is uniformly sharper than the minimum margin bound in Theorem 2. That is,
$$\frac{\log |H|}{n} + D^{-1}\left(q^*, u[\theta^*]\right) \leq R\left(\log(2n) + \log\frac{1}{R} + 1\right) + \frac{1}{n}\log\frac{|H|}{\delta},$$
where
$$R = \frac{32 \log(2|H|)}{n\theta_0^2} \leq 2n$$
as in Theorem 2.

According to this theorem, the minimum margin is not crucial for the generalization error; that is, a larger minimum margin does not necessarily imply a smaller test error. Thus arc-gv does not necessarily perform better than AdaBoost. Our new bound implies that it is the Emargin $\theta^*$ and the empirical error $q^*$ at $\theta^*$ that govern the performance of the classifier. The following theorem describes how the Emargin and the Emargin error $q^*$ affect the generalization ability. It states that a larger Emargin and a smaller Emargin error result in a lower generalization error bound.

Theorem 7 Let $f_1$, $f_2$ be two voting classifiers. Denote by $\theta^*_1$, $\theta^*_2$ their Emargins and by $q_1$, $q_2$ the empirical errors at $\theta^*_1$, $\theta^*_2$ respectively; that is,
$$q_i = P_S\left(y f_i(x) < \theta^*_i\right), \qquad i = 1, 2.$$
Also denote by $B_1$, $B_2$ the Emargin upper bounds on the generalization error of $f_1$, $f_2$ (i.e., the right-hand side of Eq. (3)). Then
$$B_1 \leq B_2 \qquad \text{if } \theta^*_1 \geq \theta^*_2 \text{ and } q_1 \leq q_2.$$

Theorem 7 suggests that the Emargin and the Emargin error can be used as measures of the goodness of a margin distribution. A large Emargin and a small Emargin error indicate a good margin distribution. Experimental results in Section 6 show that AdaBoost usually has larger Emargins and smaller Emargin errors than arc-gv.

The last theorem of this section is the Emargin bound for the case that the set of base classifiers has a finite VC dimension.

Theorem 8 Suppose the set of base classifiers $H$ has VC dimension $d$. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bound:
$$P_D(yf(x) \leq 0) \leq \frac{d^2 + 1}{n} + \frac{n}{n-1} \cdot \inf_{q \in \{0, \frac{1}{n}, \frac{2}{n}, \ldots, 1\}} D^{-1}\left(q, u[\hat\theta(q)]\right),$$
where
$$u[\hat\theta(q)] = \frac{1}{n}\left( \frac{16 d}{\hat\theta^2(q)} \log\frac{e n^2}{d} \log\left(\frac{16 n}{\hat\theta^2(q)\, d} + 1\right) + 3\log\frac{2n}{\delta} \right),$$
and $\hat\theta(q)$ is
$$\hat\theta(q) = \sup\left\{ \theta \in (0, 1] : P_S(yf(x) \leq \theta) \leq q \right\}. \qquad (7)$$

4 Explanation of the Emargin Bound

In Theorem 3, we adopt the partial inverse of the relative entropy to upper bound the generalization error. The key term in the Emargin bound is $\inf_q D^{-1}(q, u[\hat\theta(q)])$. To better understand the bound, we make use of three different upper bounds of $\inf_q D^{-1}(q, u[\hat\theta(q)])$ to obtain simpler forms of the Emargin bound. We list these upper bounds in the following lemma.

Lemma 9 The following bounds hold.

1. $\inf_q D^{-1}\left(q, u[\hat\theta(q)]\right) \leq D^{-1}\left(0, u[\hat\theta(0)]\right) \leq u[\hat\theta(0)]$.

2. $\inf_q D^{-1}\left(q, u[\hat\theta(q)]\right) \leq \inf_q \left( q + \left( u[\hat\theta(q)] / 2 \right)^{1/2} \right)$.

3. $\inf_q D^{-1}\left(q, u[\hat\theta(q)]\right) \leq \inf_{q \leq C u[\hat\theta(q)]} D^{-1}\left(q, u[\hat\theta(q)]\right) \leq \inf_{q \leq C u[\hat\theta(q)]} C' u[\hat\theta(q)]$, where $C > 0$ is any constant and $C' = \max(2C, 8)$.
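The pointwise facts behind the first two inequalities of Lemma 9, namely $D^{-1}(0, u) \leq u$ and $D^{-1}(q, u) \leq q + \sqrt{u/2}$, can be spot-checked numerically with the same bisection-based inverse as above (a throwaway script, not part of the paper):

```python
import numpy as np

def bernoulli_kl(q, p, eps=1e-12):
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q, u, tol=1e-10):
    lo, hi = q, 1.0 - 1e-12
    if bernoulli_kl(q, hi) <= u:
        return 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if bernoulli_kl(q, mid) <= u else (lo, mid)
    return hi

# Check D^{-1}(0, u) <= u and D^{-1}(q, u) <= q + sqrt(u/2) on a coarse grid.
for q in np.linspace(0.0, 0.9, 10):
    for u in np.linspace(0.01, 1.0, 10):
        p = kl_inverse(q, u)
        if q == 0.0:
            assert p <= u + 1e-6
        assert p <= q + np.sqrt(u / 2) + 1e-6
print("Lemma 9 (items 1 and 2) spot-check passed")
```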
Note from Theorem 3 that
$$u[\hat\theta(q)] = O\left( \frac{1}{n}\left( \frac{\log n \log |H|}{\hat\theta^2(q)} + \log\frac{1}{\delta} \right) \right) \qquad \text{and} \qquad q = P_S\left(yf(x) \leq \hat\theta(q)\right).$$
Thus we can derive the following three bounds from the Emargin bound by using the three inequalities in Lemma 9 respectively.

Corollary 10 If $|H| < \infty$, then for any $\delta > 0$, with probability at least $1-\delta$ over the random choice of the training set $S$ of $n$ examples, every voting classifier $f$ satisfies the following bounds:

1. $$P_D(yf(x) \leq 0) \leq O\left( \frac{1}{n}\left( \frac{\log n \log |H|}{\theta_0^2} + \log\frac{1}{\delta} \right) \right),$$ where $\theta_0$ is the minimum margin.

2. $$P_D(yf(x) \leq 0) \leq \inf_{\theta \in (0,1]} \left[ P_S(yf(x) \leq \theta) + O\left( \frac{1}{\sqrt{n}} \left( \frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta} \right)^{1/2} \right) \right].$$

3. $$P_D(yf(x) \leq 0) \leq O\left( \frac{1}{n}\left( \frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta} \right) \right)$$ for all $\theta$ such that $$P_S(yf(x) \leq \theta) \leq O\left( \frac{1}{n}\left( \frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta} \right) \right).$$

The first bound in the Corollary has the same order of magnitude as the minimum margin bound. The second bound is the same as Theorem 1. So essentially, previous bounds can be derived from the Emargin bound. The third bound in the Corollary is new. It states that the generalization error is $O\left(\frac{\log n \log |H|}{n\theta^2}\right)$ even in the non-zero error case, provided the margin error $P_S(yf(x) \leq \theta)$ is small enough.

5 Proofs

In this section, we give the proofs of the theorems, lemmas and corollaries.

5.1 Proof of Theorem 3

The proof uses the tools developed in [SFBL98]. The difference is that we do not bound the deviation of the generalization error from the empirical margin error directly; instead we consider the difference between the generalization error and a zero-one function of a certain empirical measure. This allows us to unify the zero-error and nonzero-error cases, and it results in a sharper bound.

For the sake of convenience, we follow the convention in [SFBL98]. Let $C(H)$ denote the convex hull of $H$. Also let $C_N(H)$ denote the set of unweighted averages over $N$ elements from the base classifier set $H$. Formally,
$$C_N(H) = \left\{ g : g = \frac{1}{N} \sum_{j=1}^N h_j,\ h_j \in H \right\}.$$
Any voting classifier $f = \sum_i \alpha_i h_i \in C(H)$, where $\sum_i \alpha_i = 1$ and $\alpha_i \geq 0$, can be associated with a distribution over $H$ given by the coefficients $\{\alpha_i\}$. We denote this distribution by $\tilde{Q}(f)$. By choosing $N$ elements independently and randomly from $H$ according to $\tilde{Q}(f)$, we can generate a classifier $g \in C_N(H)$. The distribution of $g$ is denoted by $Q(f)$. For any fixed $\theta$ ($0 < \theta < 1$),
$$P_D(yf(x) \leq 0) \leq P_{D,\, g \sim Q(f)}(yg(x) \leq \theta) + P_{D,\, g \sim Q(f)}(yg(x) > \theta,\ yf(x) \leq 0) \leq P_{D,\, g \sim Q(f)}(yg(x) \leq \theta) + \exp\left(-\frac{N\theta^2}{2}\right). \qquad (8)$$

We next bound the first term on the right-hand side of the inequality. For any fixed $g \in C_N(H)$, and for any positive number $\epsilon$ and nonnegative integer $k$ such that $k \leq \epsilon n$, we consider the probability (over the random draw of the $n$ training examples) that the training error of $g$ at margin $\theta$ is at most $k/n$ while the true error of $g$ at margin $\theta$ is larger than $\epsilon$. A compact representation of this probability is
$$\Pr_{S \sim D^n}\left( P_D(yg(x) \leq \theta) > I\left[ P_S(yg(x) \leq \theta) > \frac{k}{n} \right] + \epsilon \right),$$
where $\Pr_{S \sim D^n}$ denotes the probability over $n$ training samples chosen independently at random according to $D$, and $I$ is the indicator function. Note that
$$\Pr_{S \sim D^n}\left( P_D(yg(x) \leq \theta) > I\left[ P_S(yg(x) \leq \theta) > \frac{k}{n} \right] + \epsilon \right) \leq \Pr_{S \sim D^n}\left( P_S(yg(x) \leq \theta) \leq \frac{k}{n},\ P_D(yg(x) \leq \theta) > \epsilon \right) \leq \sum_{r=0}^{k} \binom{n}{r} \epsilon^r (1-\epsilon)^{n-r}.$$
Then, applying the relative entropy Chernoff bound to the Bernoulli trials, we further have
$$\sum_{r=0}^{k} \binom{n}{r} \epsilon^r (1-\epsilon)^{n-r} \leq \exp\left( -n D\left( \frac{k}{n} \Big\| \epsilon \right) \right).$$
We thus obtain
$$\Pr_{S \sim D^n}\left( P_D(yg(x) \leq \theta) > I\left[ P_S(yg(x) \leq \theta) > \frac{k}{n} \right] + \epsilon \right) \leq \exp\left( -n D\left( \frac{k}{n} \Big\| \epsilon \right) \right). \qquad (9)$$
We only consider $\theta$ at the values in the set
$$U = \left\{ \frac{1}{|H|}, \frac{2}{|H|}, \ldots, 1 \right\}.$$
There are no more than $|H|^N$ elements in $C_N(H)$.
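As an aside, the key device in this proof, approximating $f$ by an average $g \in C_N(H)$ of $N$ base classifiers sampled according to $\tilde{Q}(f)$, is easy to visualize numerically. The toy simulation below (random base classifiers and weights; all names and sizes are ours) shows that $yg(x)$ stays within a few times $1/\sqrt{N}$ of $yf(x)$, which is what drives the exponential terms in Eqs. (8) and (12).

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy voting classifier over |H| = 50 random "base classifiers" on 200 points.
n, H_size, N = 200, 50, 400
H = rng.choice([-1, 1], size=(H_size, n))            # h_i(x_j) for all base classifiers
alpha = rng.dirichlet(np.ones(H_size))               # convex weights, sum to 1
y = rng.choice([-1, 1], size=n)

f_margins = y * (alpha @ H)                           # y f(x) for every example

# Sample g in C_N(H): average of N base classifiers drawn i.i.d. from Q~(f) = alpha.
idx = rng.choice(H_size, size=N, p=alpha)
g_margins = y * H[idx].mean(axis=0)                   # y g(x)

# By Hoeffding, y g(x) concentrates around y f(x) at rate O(1/sqrt(N)) per point.
print(np.max(np.abs(g_margins - f_margins)))          # typically a few times 1/sqrt(N)
```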
Using the union bound, we get
$$\Pr_{S\sim D^n}\left( \exists g \in C_N(H),\ \exists\theta \in U:\ P_D(yg(x)\leq\theta) > I\left[P_S(yg(x)\leq\theta) > \frac{k}{n}\right] + \epsilon \right) \leq |H|^{N+1}\exp\left(-nD\left(\frac{k}{n}\Big\|\epsilon\right)\right).$$
Note that
$$E_{g\sim Q(f)}\, P_D(yg(x)\leq\theta) = P_{D,\, g\sim Q(f)}(yg(x)\leq\theta), \qquad E_{g\sim Q(f)}\, I\left[P_S(yg(x)\leq\theta) > \frac{k}{n}\right] = P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right).$$
We therefore have
$$\Pr_{S\sim D^n}\left( \exists f \in C(H),\ \exists\theta \in U:\ P_{D,\, g\sim Q(f)}(yg(x)\leq\theta) > P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right) + \epsilon \right) \leq |H|^{N+1}\exp\left(-nD\left(\frac{k}{n}\Big\|\epsilon\right)\right).$$
Let $|H|^{N+1}\exp\left(-nD\left(\frac{k}{n}\|\epsilon\right)\right) = \delta$; then
$$\epsilon = D^{-1}\left(\frac{k}{n},\ \frac{1}{n}\left((N+1)\log|H| + \log\frac{1}{\delta}\right)\right).$$
We obtain that with probability at least $1-\delta$ over the draw of the training samples, for all $f \in C(H)$ and all $\theta \in U$,
$$P_{D,\, g\sim Q(f)}(yg(x)\leq\theta) \leq P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right) + D^{-1}\left(\frac{k}{n},\ \frac{1}{n}\left((N+1)\log|H| + \log\frac{1}{\delta}\right)\right).$$
Using the union bound over $k = 0, 1, \ldots, n$, with probability at least $1-\delta$ over the draw of the training samples, for all $f \in C(H)$, all $\theta \in U$, and all $k$,
$$P_{D,\, g\sim Q(f)}(yg(x)\leq\theta) \leq P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right) + D^{-1}\left(\frac{k}{n},\ \frac{1}{n}\left((N+1)\log|H| + \log\frac{n}{\delta}\right)\right). \qquad (10)$$
We next bound the first term on the right-hand side of Eq. (10). Using the same argument as for deriving Eq. (8), we have for any $\theta' > \theta$
$$P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n}\right) \leq I\left[P_S(yf(x)\leq\theta') > \frac{k}{n}\right] + P_{g\sim Q(f)}\left(P_S(yg(x)\leq\theta) > \frac{k}{n},\ P_S(yf(x)\leq\theta') \leq \frac{k}{n}\right). \qquad (11)$$
Note that the last term in Eq. (11) can be further bounded by
$$P_{g\sim Q(f)}\left(\exists (x_i, y_i) \in S:\ y_i g(x_i) \leq \theta \ \text{and}\ y_i f(x_i) > \theta'\right) \leq n\exp\left(-\frac{N(\theta'-\theta)^2}{2}\right). \qquad (12)$$
Combining (8), (10), (11) and (12), we have that with probability at least $1-\delta_N$ over the draw of the training examples, for all $f \in C(H)$, all $\theta \in U$, all $\theta' > \theta$ and all $k$, but fixed $N$,
$$P_D(yf(x)\leq 0) \leq \exp\left(-\frac{N\theta^2}{2}\right) + n\exp\left(-\frac{N(\theta'-\theta)^2}{2}\right) + I\left[P_S(yf(x)\leq\theta') > \frac{k}{n}\right] + D^{-1}\left(\frac{k}{n},\ \frac{1}{n}\left((N+1)\log|H| + \log\frac{n}{\delta_N}\right)\right).$$
Let $\theta = \frac{\theta'}{2} - \frac{\zeta}{|H|} \in U$, where $0 \leq \zeta < 1$. It is easy to check that the sum of the first two terms on the right-hand side of the above inequality can be bounded by
$$2\max\left(n,\ \exp\left(\frac{N\theta'}{2|H|}\right)\right)\exp\left(-\frac{N\theta'^2}{8}\right).$$
Letting $\delta_N = \delta\cdot 2^{-N}$, we can take a union bound over all $N$. Put
$$N = \frac{8}{\theta'^2}\log\frac{2n^2}{\log|H|},$$
and note that if $\theta' > \sqrt{8/|H|}$, then $\exp\left(\frac{N\theta'}{2|H|}\right) \leq n$, so that the sum of the first two terms is at most $\frac{\log|H|}{n}$. We obtain
$$P_D(yf(x)\leq 0) \leq \frac{\log|H|}{n} + \inf_{0\leq k\leq n}\left\{ I\left[P_S(yf(x)\leq\theta') > \frac{k}{n}\right] + D^{-1}\left(\frac{k}{n},\ u[\theta']\right)\right\},$$
where $u[\theta']$ is as defined in the statement of the theorem. Taking $q = k/n$ and $\theta' = \hat\theta(q)$, so that the indicator term vanishes, yields the bound stated in the theorem.

Proof of Theorem 6. The right-hand side of the Emargin bound (3) is the minimum over all $q$, and is therefore no larger than its value at $q = 0$, which by Lemma 9 is at most $\frac{\log|H|}{n} + u[\hat\theta(0)]$, where $\hat\theta(0) = \theta_0$ is the minimum margin. A direct calculation shows that this quantity is bounded by the right-hand side of (2) whenever $\theta_0 > 4\sqrt{2/|H|}$ and $R \leq 2n$, which proves the theorem.

The proof of Theorem 8 relies on the following uniform convergence lemma.

Lemma 12 Let $\mathcal{A}$ be a class of subsets of $X\times\{-1, 1\}$ with growth function $s(\mathcal{A}, n)$, and let $k \leq n$ be a nonnegative integer. Then for any $\epsilon > 0$,
$$\Pr_{S\sim D^n}\left( \exists A \in \mathcal{A}:\ P_D(A) > I\left[P_S(A) > \frac{k}{n}\right] + \epsilon \right) \leq 2\, s(\mathcal{A}, n^2)\exp\left(-nD\left(\frac{k}{n}\Big\|\xi\epsilon\right)\right),$$
where $\xi = 1 - \frac{1}{n-1}$.

Proof of Lemma 12. The proof is a standard symmetrization argument. We first show that for any $0 < \nu < 1$, any $\epsilon > 0$ and any integer $n' \geq n$,
$$\Pr_{S\sim D^n}\left( \exists A\in\mathcal{A}: P_D(A) > I\left[P_S(A) > \frac{k}{n}\right] + \epsilon \right) \leq \frac{1}{1 - e^{-2n'\nu^2\epsilon^2}}\ \Pr_{S\sim D^n,\, S'\sim D^{n'}}\left( \exists A\in\mathcal{A}: P_{S'}(A) > I\left[P_S(A) > \frac{k}{n}\right] + (1-\nu)\epsilon \right). \qquad (16)$$
Let $V$ denote the event
$$\sup_{A\in\mathcal{A}}\left( P_D(A) - I\left[P_S(A) > \frac{k}{n}\right]\right) > \epsilon,$$
and let $A^*$ be (one of) the optimal $A$ at which
$$\sup_{A\in\mathcal{A}}\left( P_D(A) - I\left[P_S(A) > \frac{k}{n}\right]\right)$$
achieves the maximum. Note that the following two events
$$P_D(A^*) - I\left[P_S(A^*) > \frac{k}{n}\right] > \epsilon \qquad \text{and} \qquad P_{S'}(A^*) \geq P_D(A^*) - \nu\epsilon$$
imply that
$$P_{S'}(A^*) - I\left[P_S(A^*) > \frac{k}{n}\right] > (1-\nu)\epsilon.$$
Hence
$$\Pr_{S\sim D^n,\, S'\sim D^{n'}}\left( \sup_{A\in\mathcal{A}}\left( P_{S'}(A) - I\left[P_S(A) > \frac{k}{n}\right]\right) > (1-\nu)\epsilon \right) \geq \int_V \Pr_{S'\sim D^{n'}}\left( P_{S'}(A^*) \geq P_D(A^*) - \nu\epsilon \right)\, dP \geq \left(1 - e^{-2n'\nu^2\epsilon^2}\right)\Pr_{S\sim D^n}(V),$$
where the last inequality follows from Hoeffding's inequality. This completes the proof of (16). Take $n' = n^2 - n$ and $\nu = \frac{1}{n-1}$; we have
$$\Pr_{S\sim D^n}\left( \exists A\in\mathcal{A}: P_D(A) > I\left[P_S(A) > \frac{k}{n}\right] + \epsilon \right) \leq 2\,\Pr_{S\sim D^n,\, S'\sim D^{n^2-n}}\left( \exists A\in\mathcal{A}: P_{S'}(A) > I\left[P_S(A) > \frac{k}{n}\right] + \left(1 - \frac{1}{n-1}\right)\epsilon \right).$$
Proceeding as in [Dev82] and using the relative entropy Hoeffding inequality, the lemma follows.

Proof of Theorem 8. The proof is the same as that of Theorem 3 until we have Eq. (9). Letting $\theta' = \theta/2$, we need to bound
$$\Pr_{S\sim D^n}\left( \exists g \in C_N(H),\ \exists\theta > 0:\ P_D\left(yg(x) \leq \frac{\theta}{2}\right) > I\left[P_S\left(yg(x) \leq \frac{\theta}{2}\right) > \frac{k}{n}\right] + \epsilon \right).$$
Note that we only need to consider $\theta$ at the values $0, \frac{2}{N}, \frac{4}{N}, \ldots, 1$. Let
$$A(g) = \left\{ (x, y) \in X\times\{-1, 1\}:\ yg(x) \leq \frac{\theta}{2} \right\}$$
and $\mathcal{A} = \{A(g): g \in C_N(H)\}$. By Sauer's lemma [Sau72] it is easy to see that
$$s(\mathcal{A}, n) \leq \left(\frac{en}{d}\right)^{Nd},$$
where $d$ is the VC dimension of $H$. By Lemma 12, we have
$$\Pr_{S\sim D^n}\left( \exists g \in C_N(H),\ \exists\theta > 0:\ P_D\left(yg(x)\leq\frac{\theta}{2}\right) > I\left[P_S\left(yg(x)\leq\frac{\theta}{2}\right) > \frac{k}{n}\right] + \epsilon \right) \leq 2(N+1)\left(\frac{en^2}{d}\right)^{Nd}\exp\left(-nD\left(\frac{k}{n}\Big\|\xi\epsilon\right)\right),$$
where $\xi = 1 - \frac{1}{n-1}$. Using the same argument as in the proof of Theorem 3, the theorem follows.

Proof of Lemma 9. The first inequality has already been proved in Lemma 11. For the second inequality, we only need to show $D^{-1}(q, u) \leq q + \sqrt{u/2}$, or equivalently $D(q \,\|\, q + \sqrt{u/2}) \geq u$, since $D$ is an increasing function in its second parameter. But this is immediate from a well-known result [Hoe63]: $D(q \,\|\, q + \epsilon) \geq 2\epsilon^2$. For the third inequality, we first show that for all $0 < q \leq 1$,
$$D\left(\frac{q}{2} \Big\| q\right) \geq \frac{q}{8}. \qquad (17)$$
Indeed, using $\log(1+x) \geq \frac{x}{1+x}$, we have
$$D\left(\frac{q}{2} \Big\| q\right) = \frac{q}{2}\log\frac{1}{2} + \left(1-\frac{q}{2}\right)\log\frac{1-q/2}{1-q} \geq -\frac{q}{2}\log 2 + \frac{q}{2} = \frac{q}{2}(1 - \log 2) \geq \frac{q}{8}.$$
This completes the proof of Eq. (17). Now if $q \leq C u[\hat\theta(q)]$, recall that $C' = \max(2C, 8)$, and note that $D^{-1}$ is an increasing function in its first and second parameters respectively. We have
$$D^{-1}\left(q, u[\hat\theta(q)]\right) \leq D^{-1}\left(\frac{C'}{2} u[\hat\theta(q)],\ u[\hat\theta(q)]\right) \leq D^{-1}\left(\frac{C'}{2} u[\hat\theta(q)],\ \frac{C'}{8} u[\hat\theta(q)]\right) \leq C'\, u[\hat\theta(q)].$$
The lemma then follows.

6 Experiments