Risk Bounds for Randomized Sample Compressed Classifiers

Mohak Shah
Centre for Intelligent Machines, McGill University
Montreal, QC, Canada, H3A 2A7
mohak@cim.mcgill.ca

Abstract

We derive risk bounds for randomized classifiers in the sample compression setting, where the classifier specification utilizes two sources of information: the compression set and the message string. By extending the recently proposed Occam's Hammer principle to data-dependent settings, we derive point-wise versions of the bounds on stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to existing results.

1 Introduction

The sample compression framework [Littlestone and Warmuth, 1986, Floyd and Warmuth, 1995] has resulted in an important class of learning algorithms known as sample compression algorithms. These algorithms have been shown to be competitive in practice with state-of-the-art algorithms such as the SVM [Marchand and Shawe-Taylor, 2002, Laviolette et al., 2005]. Moreover, the approach has also yielded practical, realizable bounds and has shown significant promise for using these bounds in model selection. On another learning-theoretic front, the PAC-Bayes approach [McAllester, 1999] has shown that stochastic classifier selection can be more powerful than outputting a deterministic classifier. For the sample compression setting, this was further confirmed in the case of the sample compressed Gibbs classifier by Laviolette and Marchand [2007]. However, the specific classifier output by the algorithm (according to a selected posterior) is generally of immediate interest, since this is the classifier whose future performance is of relevance in practice. Expressing such guarantees in terms of the expectation of the risk over the posterior on the classifier space, although it gives tighter risk bounds, results in averaged statements about the expected true error. A significant result for obtaining guarantees on the specific randomized classifier has appeared in the form of Occam's Hammer [Blanchard and Fleuret, 2007]. It deals with bounding the performance of algorithms that return a set-valued output when given training data. With respect to classifiers, this results in a bound on the true risk of the randomized classifier output by the algorithm in accordance with a posterior over the classifier space learned from the training data. Blanchard and Fleuret [2007] also present a PAC-Bayes bound for the data-independent setting (when the classifier space is defined independently of the training data). Motivated by this result, we derive risk bounds for randomized sample compressed classifiers. Note that the classifier space in the sample compression setting, unlike other settings, is data-dependent in the sense that it is defined upon the specification of the training data.¹

¹ Note that the classifier space depends on the amount of the training data, as we will see further, and not on the training data themselves. Hence, a data-independent prior over the classifier space can still be obtained in this setting (e.g., in the PAC-Bayes case), owing to the independence of the classifier space definition from the content of the training data.

The rest of the paper is organized as follows: Section 2 provides background on sample compressed classifiers and establishes the context; Section 3 then states Occam's Hammer for the data-independent setting.
We then derive bounds for the randomized sample compressed classifier in Section 4, and then show how bounds for the sample compressed Gibbs case (the classical PAC-Bayes bound for sample compressed classifiers) can be recovered in Section 5. We conclude in Section 6.

2 Sample Compressed (SC) Classifiers

We consider binary classification problems where the input space $\mathcal X$ consists of an arbitrary subset of $\mathbb R^n$ and the output space $\mathcal Y \stackrel{\mathrm{def}}{=} \{-1,+1\}$. An example $z=(x,y)$ is an input-output pair where $x\in\mathcal X$ and $y\in\mathcal Y$. Sample compression learning algorithms are characterized as follows. Given a training set $S=\{z_1,\dots,z_m\}$ of $m$ examples, the classifier $A(S)$ returned by algorithm $A$ is described entirely by two complementary sources of information: a subset $z_{\mathbf i}$ of $S$, called the compression set, and a message string $\sigma$, which represents the additional information needed to obtain a classifier from the compression set $z_{\mathbf i}$. Given a training set $S$, the compression set $z_{\mathbf i}$ is defined by a vector $\mathbf i$ of indices
$$\mathbf i \stackrel{\mathrm{def}}{=} (i_1, i_2, \dots, i_{|\mathbf i|}) \qquad \text{with } i_j\in\{1,\dots,m\}\ \forall j \ \text{ and } \ i_1<i_2<\dots<i_{|\mathbf i|},$$
where $|\mathbf i|$ denotes the number of indices present in $\mathbf i$. Hence, $z_i$ denotes the $i$-th example of $S$ whereas $z_{\mathbf i}$ denotes the subset of examples of $S$ that are pointed to by the vector of indices $\mathbf i$ defined above. We will use $\bar{\mathbf i}$ to denote the set of indices not present in $\mathbf i$. Hence, we have $S = z_{\mathbf i}\cup z_{\bar{\mathbf i}}$ for any vector $\mathbf i\in\mathcal I$, where $\mathcal I$ denotes the set of the $2^m$ possible realizations of $\mathbf i$. Finally, a learning algorithm $A$ is a sample compression learning algorithm (i.e., one whose output is identified solely by a compression set $z_{\mathbf i}$ and a message string $\sigma$) iff there exists a reconstruction function $\mathcal R : (\mathcal X\times\mathcal Y)^{|\mathbf i|}\times\mathcal K\to\mathcal H$ associated with $A$. Here, $\mathcal H$ is the (data-dependent) classifier space and $\mathcal K\subseteq\mathcal I\times\mathcal M$ with $\mathcal M = \bigcup_{\mathbf i\in\mathcal I}\mathcal M(\mathbf i)$. That is, $\mathcal R$ outputs a classifier $\mathcal R(\sigma, z_{\mathbf i})$ when given an arbitrary compression set $z_{\mathbf i}\subseteq S$ and a message string $\sigma$ chosen from the set $\mathcal M(z_{\mathbf i})$ of all distinct messages that can be supplied to $\mathcal R$ together with the compression set $z_{\mathbf i}$.

We seek a tight risk bound for arbitrary reconstruction functions that holds uniformly for all compression sets and message strings. For this, we adopt the PAC setting where each example $z$ is drawn according to a fixed, but unknown, probability distribution $D$ on $\mathcal X\times\mathcal Y$. The true risk $R(f)$ of any classifier $f$ is defined as the probability that it misclassifies an example drawn according to $D$:
$$R(f) \stackrel{\mathrm{def}}{=} \Pr_{(x,y)\sim D}\big(f(x)\ne y\big) = \mathbb E_{(x,y)\sim D}\, I\big(f(x)\ne y\big)$$
where $I(a)=1$ if predicate $a$ is true and $0$ otherwise. Given a training set $S=\{z_1,\dots,z_m\}$ of $m$ examples, the empirical risk $R_S(f)$ on $S$ of any classifier $f$ is defined according to:
$$R_S(f) \stackrel{\mathrm{def}}{=} \frac1m\sum_{i=1}^{m} I\big(f(x_i)\ne y_i\big) \stackrel{\mathrm{def}}{=} \mathbb E_{(x,y)\sim S}\, I\big(f(x)\ne y\big)$$
Let $Z^m$ denote the collection of $m$ random variables whose instantiation gives a training sample $S=z^m=\{z_1,\dots,z_m\}$. To obtain the tightest possible risk bound, we will fully exploit the fact that the distribution of classification errors is binomial. We now discuss the generic Occam's Hammer principle (w.r.t. the classification scenario) and then go on to show how it can be applied to the sample compression setting.
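To make the notation concrete, here is a minimal illustrative sketch (not from the paper; the nearest-neighbour reconstruction rule, the data, and all function names are hypothetical choices) of a compression set $z_{\mathbf i}$, a toy reconstruction function $\mathcal R$, and the empirical risk computed on the examples $z_{\bar{\mathbf i}}$ outside the compression set:

```python
import numpy as np

def reconstruct(compression_set, message=None):
    """Toy reconstruction function R(sigma, z_i): a 1-nearest-neighbour rule
    built from the compression set; the message string is unused here."""
    xs = np.array([x for x, _ in compression_set])
    ys = np.array([y for _, y in compression_set])
    return lambda x: ys[np.argmin(np.abs(xs - x))]

def empirical_risk(classifier, examples):
    """Fraction of misclassified examples, i.e. the empirical risk on `examples`."""
    return float(np.mean([classifier(x) != y for x, y in examples]))

# A toy sample S of m = 6 one-dimensional examples.
S = [(-2.0, -1), (-1.5, -1), (-0.3, -1), (0.4, 1), (1.1, 1), (2.2, 1)]
i = (2, 3)                                             # index vector i (0-based here)
z_i = [S[j] for j in i]                                # compression set z_i
z_ibar = [S[j] for j in range(len(S)) if j not in i]   # examples outside z_i

h = reconstruct(z_i)
print("empirical risk on z_ibar:", empirical_risk(h, z_ibar))
```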
3 Occam's Hammer for the data-independent setting

In this section, we briefly describe Occam's Hammer [Blanchard and Fleuret, 2007] for the data-independent setting. For the sake of simplicity, we retain the key notations of Blanchard and Fleuret [2007]. Occam's Hammer works by bounding the probability of a bad event, defined as follows. For every classifier $h\in\mathcal H$ and confidence parameter $\delta\in[0,1]$, the bad event $B(h,\delta)$ is defined as the region where the desired property of the classifier $h$ does not hold, and it has probability at most $\delta$:
$$\Pr_{S\sim D^m}\big[S\in B(h,\delta)\big] \;\le\; \delta.$$
Further, it is assumed that this region is non-decreasing in $\delta$. Intuitively, this means that with decreasing $\delta$ the bound on the true error of the classifier $h$ becomes tighter. With the above assumption satisfied, let $P$ be a non-negative reference measure on the classifier space $\mathcal H$, known as the volumic measure. Let $\pi$ be a probability distribution on $\mathcal H$, absolutely continuous w.r.t. $P$, with density $\frac{d\pi}{dP}$. Let $\gamma$ be a probability distribution on $(0,+\infty)$ (the inverse density prior). Then Occam's Hammer [Blanchard and Fleuret, 2007] states:

Theorem 1 [Blanchard and Fleuret, 2007] Given the above assumption and $P$, $\pi$, $\gamma$ defined as above, define the level function $\Delta(h,u)=\min(\pi(h)\Gamma(u),1)$, where $\Gamma(x)=\int_0^x u\,d\gamma(u)$ for $x\in(0,+\infty)$. Then for any algorithm $S\mapsto\theta_S$ returning a probability density $\theta_S$ over $\mathcal H$ with respect to $P$, and such that $(S,h)\mapsto\theta_S(h)$ is jointly measurable in its two variables, it holds that
$$\Pr_{S\sim D^m,\,h\sim Q}\Big[\,S\in B\big(h,\;\delta\,\Delta(h,\theta_S(h)^{-1})\big)\,\Big] \;\le\; \delta,$$
where $Q$ is the distribution on $\mathcal H$ such that $\frac{dQ}{dP}=\theta_S$.

Note that $Q$ above is the (data-dependent) posterior distribution on $\mathcal H$ after observing the data sample $S$, while $P$ is the data-independent prior on $\mathcal H$; the subscript $S$ in $\theta_S$ denotes this. Moreover, the distribution $\pi$ on the space of classifiers may or may not be data-dependent. As we will see later, in the case of the sample compression learning setting we will consider priors over the space of classifiers defined without reference to the data (as in the PAC-Bayes case). To this end, we can either opt for a prior $\pi$ independent of the data or make it the same as the volumic measure $P$, which establishes a distribution on the classifier space without reference to the data.

4 Bounds for Randomized SC Classifiers

We work in the sample compression setting and, as mentioned before, each classifier in this setting is specified in terms of a compression set and a message string. A reconstruction function then uses these two information sources to reconstruct the classifier. This essentially means that we deal with a data-dependent hypothesis space, in contrast with other notions of hypothesis class complexity such as the VC dimension. The hypothesis space is defined, in our case, based on the size of the data sample (and not the actual contents of the sample). Hence, we consider priors built on the size of the possible compression sets and the associated message strings. More precisely, we consider a prior distribution $P$ with probability density $P(z_{\mathbf i},\sigma)$ that factorizes into a compression-set-dependent component and a message-string component (conditioned on a given compression set), such that:
$$P(z_{\mathbf i},\sigma) \;=\; P_I(\mathbf i)\,P_{M(z_{\mathbf i})}(\sigma) \qquad (1)$$
with $P_I(\mathbf i)=\binom{m}{|\mathbf i|}^{-1}p(|\mathbf i|)$ such that $\sum_{d=0}^{m}p(d)=1$. The above choice of the form for $P_I(\mathbf i)$ is appropriate since we do not have any a priori information to distinguish one compression set from another. However, as we will see later, we should choose $p(d)$ so as to give more weight to smaller compression sets. Let $\mathcal P_K$ be the set of all distributions $P$ on $\mathcal K$ satisfying the above equation. We are then interested in algorithms that output a posterior $Q\in\mathcal P_K$ over the space of classifiers with probability density $Q(z_{\mathbf i},\sigma)$ factorizable as $Q_I(\mathbf i)\,Q_{M(z_{\mathbf i})}(\sigma)$. A sample compressed classifier is then defined by choosing a classifier $(z_{\mathbf i},\sigma)$ according to the posterior $Q(z_{\mathbf i},\sigma)$.
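As a concrete illustration (a sketch under the above definitions; the particular choice of $p(d)$ and all function names are our own, hypothetical), the prior $P_I(\mathbf i)=\binom{m}{|\mathbf i|}^{-1}p(|\mathbf i|)$ can be coded and sanity-checked, and the last lines show how a classifier specification $(\mathbf i,\sigma)$ might be drawn from a factorized distribution:

```python
import itertools, math, random

def prior_PI(i, m, p):
    """P_I(i) = p(|i|) / C(m, |i|) for an index vector i over {1, ..., m};
    p is a distribution over compression-set sizes d = 0, ..., m."""
    d = len(i)
    return p[d] / math.comb(m, d)

m = 5
# One possible p(d): geometric-like weights favouring small compression sets.
raw = [2.0 ** -d for d in range(m + 1)]
p = [w / sum(raw) for w in raw]

# Sanity check: summing P_I over all index vectors of every size gives 1.
total = sum(prior_PI(i, m, p)
            for d in range(m + 1)
            for i in itertools.combinations(range(1, m + 1), d))
print(round(total, 10))  # -> 1.0

# Drawing a classifier specification (i, sigma) from a factorized distribution:
# first a size, then a uniformly random index vector of that size, then sigma.
d = random.choices(range(m + 1), weights=p)[0]
i = tuple(sorted(random.sample(range(1, m + 1), d)))
sigma = None  # message strings are algorithm-specific; omitted in this sketch
print(i, sigma)
```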
Such a stochastic choice of classifier is basically the Gibbs classifier defined in the PAC-Bayes setting, where the idea is to bound the true risk of this Gibbs classifier, defined as $R(G_Q)=\mathbb E_{(z_{\mathbf i},\sigma)\sim Q}\,R(z_{\mathbf i},\sigma)$. We, on the other hand, are interested in bounding the true risk of the specific classifier $(z_{\mathbf i},\sigma)$ output according to $Q$. As shown in [Laviolette and Marchand, 2007], a rescaled posterior $\bar Q$ of the following form can provide tighter guarantees while maintaining Occam's principle of parsimony.

Definition 2 Given a distribution $Q\in\mathcal P_K$, we denote by $\bar Q$ the distribution:
$$\bar Q(z_{\mathbf i},\sigma) \;\stackrel{\mathrm{def}}{=}\; \frac{Q(z_{\mathbf i},\sigma)}{|\bar{\mathbf i}|\;\mathbb E_{(z_{\mathbf i},\sigma)\sim Q}\frac{1}{|\bar{\mathbf i}|}} \;=\; \frac{Q_I(\mathbf i)\,Q_{M(z_{\mathbf i})}(\sigma)}{|\bar{\mathbf i}|\;\mathbb E_{(z_{\mathbf i},\sigma)\sim Q}\frac{1}{|\bar{\mathbf i}|}} \qquad \forall (z_{\mathbf i},\sigma)\in\mathcal K$$

Note that the posterior is effectively rescaled only in its compression-set part; drawing a classifier $(z_{\mathbf i},\sigma)\sim\bar Q$ amounts to drawing $\mathbf i\sim\bar Q_I$ and $\sigma\sim\bar Q_{M(z_{\mathbf i})}$. Further, if we denote by $d_{\bar Q}$ the expected value of the compression set size over the choice of parameters according to the scaled posterior, $d_{\bar Q}\stackrel{\mathrm{def}}{=}\mathbb E_{\mathbf i\sim\bar Q_I,\sigma\sim\bar Q_{M(z_{\mathbf i})}}|\mathbf i|$, then
$$\mathbb E_{(z_{\mathbf i},\sigma)\sim Q}\,\frac{1}{|\bar{\mathbf i}|} \;=\; \frac{1}{\mathbb E_{\mathbf i\sim\bar Q_I,\sigma\sim\bar Q_{M(z_{\mathbf i})}}|\bar{\mathbf i}|} \;=\; \frac{1}{m-d_{\bar Q}}.$$
Now, we proceed to derive the bounds for the randomized sample compressed classifiers, starting with a PAC-Bayes bound.

4.1 A PAC-Bayes Bound for the randomized SC classifier

We exploit the fact that the distribution of errors is binomial and define the following error quantities (for a given $\mathbf i$ and $\sigma$, and hence $z_{\mathbf i}$, over the draw of $z_{\bar{\mathbf i}}$):

Definition 3 Let $S\sim D^m$ with $D$ a distribution on $\mathcal X\times\mathcal Y$, and $(z_{\mathbf i},\sigma)\in\mathcal K$. We denote by $\mathrm{Bin}_S(\mathbf i,\sigma)$ the probability that the classifier $\mathcal R(\sigma,z_{\mathbf i})$, of (true) risk $R(z_{\mathbf i},\sigma)$, makes $|\bar{\mathbf i}|R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)$ or fewer errors on $z_{\bar{\mathbf i}}\sim D^{|\bar{\mathbf i}|}$. That is,
$$\mathrm{Bin}_S(\mathbf i,\sigma) \;=\; \sum_{j=0}^{|\bar{\mathbf i}|R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)} \binom{|\bar{\mathbf i}|}{j}\big(R(z_{\mathbf i},\sigma)\big)^j\big(1-R(z_{\mathbf i},\sigma)\big)^{|\bar{\mathbf i}|-j}$$
and by $B_S(\mathbf i,\sigma)$ the probability that this classifier makes exactly $|\bar{\mathbf i}|R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)$ errors on $z_{\bar{\mathbf i}}\sim D^{|\bar{\mathbf i}|}$. That is,
$$B_S(\mathbf i,\sigma) \;=\; \binom{|\bar{\mathbf i}|}{|\bar{\mathbf i}|R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)}\big(R(z_{\mathbf i},\sigma)\big)^{|\bar{\mathbf i}|R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)}\big(1-R(z_{\mathbf i},\sigma)\big)^{|\bar{\mathbf i}|-|\bar{\mathbf i}|R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)}$$

Now, approximating the binomial by the relative entropy Chernoff bound [Langford, 2005], we have, for a classifier $f$:
$$\sum_{j=0}^{mR_S(f)}\binom{m}{j}\big(R(f)\big)^j\big(1-R(f)\big)^{m-j} \;\le\; \exp\!\big(-m\cdot kl(R_S(f)\,\|\,R(f))\big)$$
for all $R_S(f)\le R(f)$, where $kl(q\,\|\,p)\stackrel{\mathrm{def}}{=}q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p}$. As also shown in [Laviolette and Marchand, 2007], since $\binom{m}{j}=\binom{m}{m-j}$ and $kl(R_S(f)\,\|\,R(f))=kl(1-R_S(f)\,\|\,1-R(f))$, the above inequality holds for each term inside the sum on the left-hand side. Consequently, in the case of a sample compressed classifier, $\forall(z_{\mathbf i},\sigma)\in\mathcal K$ and $\forall S\in(\mathcal X\times\mathcal Y)^m$:
$$B_S(\mathbf i,\sigma) \;\le\; \exp\!\Big(-|\bar{\mathbf i}|\cdot kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)\Big) \qquad (2)$$
Bounding this by $\delta$ yields:
$$\Pr_{S\sim D^m}\!\left[\,kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)\le\frac{1}{|\bar{\mathbf i}|}\ln\frac1\delta\,\right] \;\ge\; 1-\delta \qquad (3)$$
Now, treat the failure of the event in Equation 3, i.e. $kl(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma))>\frac{1}{|\bar{\mathbf i}|}\ln\frac1\delta$, as the bad event over classifiers defined by a compression set $\mathbf i$ and an associated message string $\sigma$. Let $\theta_{z^m}(\mathbf i,\sigma)$ be the probability density of the rescaled data-dependent posterior distribution $\bar Q$ over the classifier space with respect to the volumic measure $P$. We can now replace $\delta$ for this bad event by the level $\delta\Delta(h,\theta_{z^m}(\mathbf i,\sigma)^{-1})$ of Occam's Hammer, for a choice of $\gamma$ with $\Gamma(u)=\min\big((k+1)^{-1}u^{\frac{k+1}{k}},1\big)$, so that
$$\ln\frac{1}{\delta\,\min\!\big(\pi(h_S)\,\Gamma(\theta_{z^m}(\mathbf i,\sigma)^{-1}),\,1\big)} \;=\; \ln^+\!\left(\frac{1}{\delta\cdot\pi(h)}\cdot\frac{1}{\min\!\big((k+1)^{-1}\theta_{z^m}(\mathbf i,\sigma)^{-\frac{k+1}{k}},\,1\big)}\right)$$
$$=\; \ln^+\!\left(\frac{1}{\delta\cdot\pi(h)}\cdot\max\!\big((k+1)\,\theta_{z^m}(\mathbf i,\sigma)^{\frac{k+1}{k}},\,1\big)\right) \;\le\; \ln^+\!\left(\frac{k+1}{\delta\cdot\pi(h)}\cdot\max\!\big(\theta_{z^m}(\mathbf i,\sigma)^{\frac{k+1}{k}},\,1\big)\right)$$
$$\le\; \ln\frac{k+1}{\delta\cdot\pi(h)} \;+\; \ln^+\theta_{z^m}(\mathbf i,\sigma)^{\frac{k+1}{k}}$$
where $\ln^+$ denotes $\max(0,\ln(\cdot))$, the positive part of the logarithm. However, note that we are interested in data-independent priors over the space of classifiers,² and hence we consider our prior $\pi$ to be the same as the volumic measure $P$ over the classifier space, yielding $\pi(h)$ equal to unity. That is, our prior gives a distribution over the classifier space without any regard to the data.

² Hence the missing $S$ in the subscript of $\pi(h)$ on the right-hand side above.
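As a quick numerical side check (illustrative code only, not part of the derivation), the bound used in (2) — the probability of exactly $k$ errors among $n$ points, for a classifier of true risk $R$, is at most $\exp(-n\,kl(k/n\,\|\,R))$ — can be verified directly:

```python
import math

def kl(q, p):
    """Binary KL divergence kl(q || p), with the convention 0 log 0 = 0."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total

def binom_pmf(n, k, p):
    """Probability of exactly k errors out of n at error rate p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, R = 50, 0.3
for k in range(int(n * R) + 1):            # empirical risk k/n <= true risk R
    exact = binom_pmf(n, k, R)             # B_S-style probability of exactly k errors
    bound = math.exp(-n * kl(k / n, R))    # relative-entropy Chernoff bound of (2)
    assert exact <= bound + 1e-12
print("the bound in (2) holds for every k/n <= R in this example")
```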
Substituting for $\theta_{z^m}(\mathbf i,\sigma)$ in the above (the ratio of the respective densities, i.e. the Radon–Nikodym derivative $\frac{d\bar Q}{dP}$),³ we obtain the following result:

³ Alternatively, let $P(z_{\mathbf i},\sigma)$ and $\bar Q(z_{\mathbf i},\sigma)$ denote the probability densities of the prior distribution $P$ and the rescaled posterior distribution $\bar Q$ over classifiers, such that $d\bar Q=\bar Q(z_{\mathbf i},\sigma)\,d\mu$ and $dP=P(z_{\mathbf i},\sigma)\,d\mu$ w.r.t. some measure $\mu$. This too yields $\frac{d\bar Q}{dP}=\frac{\bar Q(z_{\mathbf i},\sigma)}{P(z_{\mathbf i},\sigma)}$. Note that the final expression is independent of the underlying measure $\mu$.

Theorem 4 For any reconstruction function $\mathcal R: D^m\times\mathcal K\to\mathcal H$ and for any prior distribution $P$ over compression sets and message strings, if the sample compression algorithm $A(S)$ returns a posterior distribution $Q$ (with $\bar Q$ its rescaled version of Definition 2), then, for $\delta\in(0,1]$ and $k>0$, we have:
$$\Pr_{S\sim D^m,\,\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\!\left[\,kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)\;\le\;\frac{1}{m-|\mathbf i|}\left(\ln\frac{k+1}{\delta}+\Big(1+\frac1k\Big)\ln^+\frac{\bar Q(z_{\mathbf i},\sigma)}{P(z_{\mathbf i},\sigma)}\right)\right] \;\ge\; 1-\delta$$
where $R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)$ is the empirical risk of the classifier reconstructed from $(z_{\mathbf i},\sigma)$ on the training examples not in the compression set and $R(z_{\mathbf i},\sigma)$ is the corresponding true risk.

Note that we do not encounter the $\frac{1}{m-d_{\bar Q}}$ factor in the bound; instead we have $\frac{1}{m-|\mathbf i|}$, unlike the bound of Laviolette and Marchand [2007]. This is because the PAC-Bayes bound of Laviolette and Marchand [2007] bounds the expectation, over the classifiers chosen according to $\bar Q$, of the kl-divergence between the empirical and true risks; the rescaling of $Q$ in preference of smaller compression sets is therefore reflected in that bound. On the other hand, the bound of Theorem 4 is a point-wise version bounding the true error of the specific classifier chosen according to $\bar Q$, and hence it concerns the specific compression set utilized by this classifier.

4.2 A Binomial Tail Inversion Bound for the randomized SC classifier

A tighter condition can be imposed on the true risk of the classifier by considering the binomial tail inversion over the distribution of errors. The binomial tail inversion $\overline{\mathrm{Bin}}\big(\frac km,\delta\big)$ is defined as the largest risk value that a classifier can have while still having a probability of at least $\delta$ of observing at most $k$ errors out of $m$ examples:
$$\overline{\mathrm{Bin}}\Big(\frac km,\delta\Big) \;\stackrel{\mathrm{def}}{=}\; \sup\Big\{r \,:\, \mathrm{Bin}\Big(\frac km,r\Big)\ge\delta\Big\} \qquad\text{where}\qquad \mathrm{Bin}\Big(\frac km,r\Big)\;\stackrel{\mathrm{def}}{=}\;\sum_{j=0}^{k}\binom{m}{j}r^j(1-r)^{m-j}$$
From this definition, it follows that $\overline{\mathrm{Bin}}(R_S(f),\delta)$ is the smallest upper bound, holding with probability at least $1-\delta$, on the true risk of any classifier $f$ with an observed empirical risk $R_S(f)$ on a test set of $m$ examples (test set bound):
$$\Pr_{Z^m}\Big(R(f)\le\overline{\mathrm{Bin}}\big(R_{Z^m}(f),\delta\big)\Big) \;\ge\; 1-\delta \qquad \forall f \qquad (4)$$
This bound can be converted to a training set bound in a standard manner by considering a measure over the classifier space (see for instance [Langford, 2005, Theorem 4.1]). Moreover, in the sample compression case, we are interested in the empirical risk of the classifier on the examples not in the compression set (the consistent compression set assumption).
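For illustration, the binomial tail inversion can be computed numerically; the sketch below (our own function names, not the paper's) inverts $\mathrm{Bin}(k/m,r)$ by bisection, using the fact that it is non-increasing in $r$:

```python
import math

def bin_tail(k, m, r):
    """Bin(k/m, r): probability of at most k errors out of m at true risk r."""
    return sum(math.comb(m, j) * r**j * (1 - r)**(m - j) for j in range(k + 1))

def bin_tail_inversion(k, m, delta, tol=1e-10):
    """sup{ r : Bin(k/m, r) >= delta }, found by bisection over r in [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bin_tail(k, m, mid) >= delta:
            lo = mid          # mid is still feasible, move the lower end up
        else:
            hi = mid
    return lo

# Example: 3 errors observed on 100 points, confidence parameter delta = 0.05;
# the returned value is the test-set-style upper bound of (4) on the true risk.
print(round(bin_tail_inversion(3, 100, 0.05), 4))
```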
Now, let $\delta_r$ be a $\pi$-weighted measure on the classifier space, defined for each $\mathbf i$ and $\sigma$. Then, for the compression sets and associated message strings, consider the following bad event, with the empirical risk of the classifier $(z_{\mathbf i},\sigma)$, $\mathbf i\sim\bar Q_I$, $\sigma\sim\bar Q_{M(z_{\mathbf i})}$, measured on the examples $z_{\bar{\mathbf i}}$ outside the compression set:
$$B(h,\delta_r) \;=\; \Big\{\, R(z_{\mathbf i},\sigma) \;>\; \overline{\mathrm{Bin}}\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma),\,\delta_r\big) \,\Big\}$$
Now, we replace $\delta_r$ with the level function of Occam's Hammer (with the same assumption $\pi=P$, i.e. $\pi(h)=1$):
$$\delta_r \;=\; \delta\,\min\!\big(\pi(h_S)\,\Gamma(\theta_{z^m}(\mathbf i,\sigma)^{-1}),\,1\big) \;=\; \delta\,\min\!\big((k+1)^{-1}\theta_{z^m}(\mathbf i,\sigma)^{-\frac{k+1}{k}},\,1\big) \;=\; \frac{\delta}{\max\!\big((k+1)\,\theta_{z^m}(\mathbf i,\sigma)^{\frac{k+1}{k}},\,1\big)}$$
$$\ge\; \frac{\delta}{(k+1)\,\max\!\big(\theta_{z^m}(\mathbf i,\sigma)^{\frac{k+1}{k}},\,1\big)} \;\ge\; \frac{\delta}{(k+1)\,\theta_{z^m}(\mathbf i,\sigma)^{\frac{k+1}{k}}}$$
Hence, we have proved the following:

Theorem 5 For any reconstruction function $\mathcal R: D^m\times\mathcal K\to\mathcal H$ and for any prior distribution $P$ over compression sets and message strings, if the sample compression algorithm $A(S)$ returns a posterior distribution $Q$ (with rescaled version $\bar Q$), then, for $\delta\in(0,1]$ and $k>0$, we have:
$$\Pr_{S\sim D^m,\,\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\!\left[\, R(z_{\mathbf i},\sigma) \;\le\; \overline{\mathrm{Bin}}\!\left(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma),\ \frac{\delta}{(k+1)\Big(\frac{\bar Q(z_{\mathbf i},\sigma)}{P(z_{\mathbf i},\sigma)}\Big)^{\frac{k+1}{k}}}\right)\right] \;\ge\; 1-\delta$$

We can obtain a looser bound by approximating the binomial tail inversion using [Laviolette et al., 2005, Lemma 1]:

Corollary 6 Given all our previous definitions, the following holds with probability at least $1-\delta$ over the joint draw of $S\sim D^m$ and $\mathbf i\sim\bar Q_I$, $\sigma\sim\bar Q_{M(z_{\mathbf i})}$:
$$R(z_{\mathbf i},\sigma) \;\le\; 1-\exp\!\left(\frac{-1}{\,m-|\mathbf i|-|\bar{\mathbf i}|R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,}\left[\ln\binom{m-|\mathbf i|}{|\bar{\mathbf i}|R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)}+\ln\frac{k+1}{\delta}+\Big(1+\frac1k\Big)\ln\frac{\bar Q(z_{\mathbf i},\sigma)}{P(z_{\mathbf i},\sigma)}\right]\right)$$

5 Recovering the PAC-Bayes bound for the SC Gibbs Classifier

Let us now see how a bound can be obtained for the Gibbs setting. We follow the general line of argument of Blanchard and Fleuret [2007] to recover the PAC-Bayes bound for the sample compressed Gibbs classifier. Note, however, that we do this here for the data-dependent setting, and that we also utilize the rescaled posterior over the space of sample compressed classifiers. The PAC-Bayes bound of Theorem 4 basically states that
$$\Pr_{S\sim D^m,\,\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\Big[\,kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)>\epsilon(\delta)\,\Big] \;\le\; \delta
\qquad\text{where}\qquad
\epsilon(\delta)=\frac{1}{m-|\mathbf i|}\left(\ln\frac{k+1}{\delta}+\Big(1+\frac1k\Big)\ln^+\frac{\bar Q(z_{\mathbf i},\sigma)}{P(z_{\mathbf i},\sigma)}\right)$$
Consequently,
$$\mathbb E_{S\sim D^m}\Big[\Pr_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\big[kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)>\epsilon(\delta)\big]\Big] \;\le\; \delta$$
Now, bounding the argument of the expectation above using Markov's inequality, we get, for any $\delta'\in(0,1]$:
$$\Pr_{S\sim D^m}\Big[\Pr_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\big[kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)>\epsilon(\delta\delta')\big]>\delta'\Big] \;\le\; \delta$$
Now, discretizing the argument over $(\delta_i,\delta'_i)=(\delta 2^{-i},2^{-i})$, we obtain
$$\Pr_{S\sim D^m}\Big[\Pr_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\big[kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)>\epsilon(\delta_i\,\delta'_i)\big]>\delta'_i\Big] \;\le\; \delta_i$$
Taking the union bound over $\delta_i$, $i\ge 1$, now yields (the case $i=0$ holding trivially):
$$\Pr_{S\sim D^m}\Big[\,\forall i\ge 0:\ \Pr_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\big[kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)>\epsilon(\delta 2^{-2i})\big]\le 2^{-i}\,\Big] \;\ge\; 1-\delta$$
Now, let us consider the argument of the above statement for a fixed sample $S$. Then, for all $i\ge 0$, the following holds with probability $1-\delta$:
$$\Pr_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\!\left[\,kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)>\frac{1}{m-|\mathbf i|}\left(\ln\frac{k+1}{\delta}+2i\ln 2+\Big(1+\frac1k\Big)\ln^+\frac{\bar Q(z_{\mathbf i},\sigma)}{P(z_{\mathbf i},\sigma)}\right)\right] \;\le\; 2^{-i}$$
and hence:
$$\Pr_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\big[\,\Psi_S(z_{\mathbf i},\sigma)>2i\ln 2\,\big] \;\le\; 2^{-i}$$
where
$$\Psi_S(z_{\mathbf i},\sigma) \;=\; (m-|\mathbf i|)\,kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)-\ln\frac{k+1}{\delta}-\Big(1+\frac1k\Big)\ln^+\frac{\bar Q(z_{\mathbf i},\sigma)}{P(z_{\mathbf i},\sigma)}$$
We wish to bound, for the Gibbs classifier, $\mathbb E_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\,\Psi_S(z_{\mathbf i},\sigma)$:
$$\mathbb E_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\big[\Psi_S(z_{\mathbf i},\sigma)\big] \;\le\; \sum_{i\ge 0}\Pr_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\big[\Psi_S(z_{\mathbf i},\sigma)\ge 2i\ln 2\big]\,d(2i\ln 2)$$
Now, using $\Pr\big[\Psi_S(z_{\mathbf i},\sigma)\ge 2i\ln 2\big]\le 2^{-i}$ and $d(2i\ln 2)=2\ln 2$, we have:
$$2\ln 2\,\sum_{i\ge 0}\Pr_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\big[\Psi_S(z_{\mathbf i},\sigma)\ge 2i\ln 2\big] \;\le\; 3 \qquad (5)$$
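A quick numerical check (illustrative only) of the constant used in (5): bounding each probability by $2^{-i}$ gives $2\ln 2\sum_{i\ge 0}2^{-i}=4\ln 2\approx 2.77$, which is indeed below 3:

```python
import math

# Bound each Pr[Psi_S >= 2 i ln 2] by 2**-i (trivially 1 at i = 0) and evaluate
# the (truncated) sum appearing on the left-hand side of (5).
upper = 2 * math.log(2) * sum(2.0 ** -i for i in range(200))
print(upper)   # ~ 2.7726 = 4 ln 2 < 3
```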
Lemma 7 [Laviolette and Marchand, 2007] For any $f:\mathcal K\to\mathbb R^+$, and for any $Q,\bar Q\in\mathcal P_K$ related by
$$\bar Q(z_{\mathbf i},\sigma)\,f(z_{\mathbf i},\sigma) \;=\; \frac{1}{\mathbb E_{(z_{\mathbf i},\sigma)\sim Q}\frac{1}{f(z_{\mathbf i},\sigma)}}\;Q(z_{\mathbf i},\sigma),$$
we have:
$$kl\big(R_S(G_Q)\,\|\,R(G_Q)\big) \;\le\; \mathbb E_{(z_{\mathbf i},\sigma)\sim Q}\!\left[\frac{1}{f(z_{\mathbf i},\sigma)}\right]\,\mathbb E_{(z_{\mathbf i},\sigma)\sim\bar Q}\!\Big[f(z_{\mathbf i},\sigma)\,kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)\Big]$$
where $R_S(G_Q)$ and $R(G_Q)$ denote the empirical and true risk of the Gibbs classifier with posterior $Q$, respectively.

Hence, with $\bar Q$ the rescaled posterior of Definition 2 and $f(z_{\mathbf i},\sigma)=|\bar{\mathbf i}|$, Lemma 7 yields:
$$kl\big(R_S(G_Q)\,\|\,R(G_Q)\big) \;\le\; \frac{1}{m-d_{\bar Q}}\,\mathbb E_{(z_{\mathbf i},\sigma)\sim\bar Q}\Big[\,|\bar{\mathbf i}|\,kl\big(R_{z_{\bar{\mathbf i}}}(z_{\mathbf i},\sigma)\,\|\,R(z_{\mathbf i},\sigma)\big)\Big] \qquad (6)$$
Further,
$$\mathbb E_{\mathbf i\sim\bar Q_I,\,\sigma\sim\bar Q_{M(z_{\mathbf i})}}\!\left[\ln^+\frac{\bar Q(z_{\mathbf i},\sigma)}{P(z_{\mathbf i},\sigma)}\right] \;=\; \mathbb E_{(z_{\mathbf i},\sigma)\sim P}\!\left[\frac{\bar Q(z_{\mathbf i},\sigma)}{P_I(\mathbf i)P_{M(z_{\mathbf i})}(\sigma)}\cdot\ln^+\frac{\bar Q(z_{\mathbf i},\sigma)}{P_I(\mathbf i)P_{M(z_{\mathbf i})}(\sigma)}\right]$$
$$\le\; \mathbb E_{(z_{\mathbf i},\sigma)\sim P}\!\left[\frac{\bar Q(z_{\mathbf i},\sigma)}{P_I(\mathbf i)P_{M(z_{\mathbf i})}(\sigma)}\cdot\ln\frac{\bar Q(z_{\mathbf i},\sigma)}{P_I(\mathbf i)P_{M(z_{\mathbf i})}(\sigma)}\right] + \max_{0\le x<1}\big(-x\ln x\big) \;\le\; KL(\bar Q\,\|\,P)+0.5 \qquad (7)$$
Equations 6 and 7, together with Equation 5 and the substitution $k=m-1$, yield the final result:

Theorem 8 For any reconstruction function $\mathcal R: D^m\times\mathcal K\to\mathcal H$ and for any prior distribution $P$ over compression sets and message strings, for $\delta\in(0,1]$, we have:
$$\Pr_{S\sim D^m}\!\left[\,\forall Q\in\mathcal P_K:\ kl\big(R_S(G_Q)\,\|\,R(G_Q)\big)\;\le\;\frac{1}{m-d_{\bar Q}}\left(\Big(1+\frac{1}{m-1}\Big)KL(\bar Q\,\|\,P)+\frac{1}{2(m-1)}+\ln\frac{m}{\delta}+3.5\right)\right] \;\ge\; 1-\delta$$

Theorem 8 recovers almost exactly the PAC-Bayes bound for sample compressed classifiers of Laviolette and Marchand [2007]. The key differences are an additional $\frac{1}{(m-d_{\bar Q})(m-1)}$-weighted KL-divergence term, $\ln\frac{m}{\delta}$ instead of $\ln\frac{m+1}{\delta}$, and the additional trailing terms, which are bounded by $\frac{4}{m-d_{\bar Q}}$. Note that the bound of Theorem 8 is derived in a relatively more straightforward manner with the Occam's Hammer criterion.

6 Conclusion

It has been shown, through the PAC-Bayes principle, that stochastic classifier selection is preferable to deterministic selection, resulting in tighter risk bounds on the averaged risk of classifiers drawn according to the learned posterior. This observation further led to tight bounds in the case of stochastic sample compressed classifiers [Laviolette and Marchand, 2007], also showing, via the rescaled posterior, that sparsity considerations remain important in this scenario. However, of immediate relevance are the guarantees for the specific classifier output by such algorithms according to the learned posterior, and hence a point-wise version of these bounds is needed. We have derived bounds for such randomized sample compressed classifiers by adapting the Occam's Hammer principle to the data-dependent sample compression setting. This results in bounds on the specific classifier output by a sample compression learning algorithm according to the learned data-dependent posterior, which is more relevant in practice. Further, we also showed how the classical PAC-Bayes bound for the sample compressed Gibbs classifier can be recovered in a more direct manner, and that it compares favorably to the existing result of Laviolette and Marchand [2007].

References

Gilles Blanchard and François Fleuret. Occam's hammer. In Proceedings of the 20th Annual Conference on Learning Theory (COLT-2007), volume 4539 of Lecture Notes in Computer Science, pages 112–126, 2007.

Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.

John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 3:273–306, 2005.

François Laviolette and Mario Marchand. PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. Journal of Machine Learning Research, 8:1461–1487, 2007.

François Laviolette, Mario Marchand, and Mohak Shah. Margin-sparsity trade-off for the set covering machine.
In Proceedings of the 16th European Conference on Machine Learning (ECML 2005), volume 3720 of Lecture Notes in Artificial Intelligence, pages 206–217. Springer, 2005.

N. Littlestone and M. Warmuth. Relating data compression and learnability. Technical report, University of California Santa Cruz, Santa Cruz, CA, 1986.

Mario Marchand and John Shawe-Taylor. The Set Covering Machine. Journal of Machine Learning Research, 3:723–746, 2002.

David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355–363, 1999.