PAC-Bayesian Learning of Linear Classifiers

Pascal Germain  Pascal.Germain.1@ulaval.ca
Alexandre Lacasse  Alexandre.Lacasse@ift.ulaval.ca
François Laviolette  Francois.Laviolette@ift.ulaval.ca
Mario Marchand  Mario.Marchand@ift.ulaval.ca
Département d'informatique et de génie logiciel, Université Laval, Québec, Canada, G1V-0A6

Abstract

We present a general PAC-Bayes theorem from which all known PAC-Bayes risk bounds are obtained as particular cases. We also propose different learning algorithms for finding linear classifiers that minimize these bounds. These learning algorithms are generally competitive with both AdaBoost and the SVM.

1. Introduction

For the classification problem, we are given a training set of examples, each generated according to the same (but unknown) distribution D, and the goal is to find a classifier that minimizes the true risk (i.e., the generalization error or the expected loss). Since the true risk is defined only with respect to the unknown distribution D, we are automatically confronted with the problem of specifying exactly what we should optimize on the training data to find a classifier having the smallest possible true risk. Many different specifications (of what should be optimized on the training data) have been provided by using different inductive principles, but the final guarantee on the true risk always comes with a so-called risk bound that holds uniformly over a set of classifiers. Hence, the formal justification of a learning strategy has always come a posteriori, via a risk bound. Since a risk bound can be computed from what a classifier achieves on the training data, it automatically suggests the following optimization problem for learning algorithms: given a risk (upper) bound, find a classifier that minimizes it.

Despite the enormous impact they have had on our understanding of learning, the VC bounds are generally very loose; their data-dependence comes only through the training error of the classifiers. The fact that there also exist VC lower bounds that are asymptotically identical to the corresponding upper bounds suggests that significantly tighter bounds can only come through extra data-dependent properties, such as the distribution of margins achieved by a classifier on the training data. Among the data-dependent bounds that have been proposed recently, the PAC-Bayes bounds (McAllester, 2003; Seeger, 2002; Langford, 2005; Catoni, 2007) seem to be especially tight. These bounds thus appear to be a good starting point for the design of a bound-minimizing algorithm.

In this paper, we present a general PAC-Bayes theorem and show that all known PAC-Bayes bounds are corollaries of this general theorem. When spherical Gaussians over the space of linear classifiers are used for priors and posteriors, we show that the Gibbs classifier that minimizes any of the above-mentioned PAC-Bayes risk bounds is obtained from the linear classifier that minimizes a non-convex objective function. We also propose two different learning algorithms for finding linear classifiers that minimize PAC-Bayes risk bounds, and a third algorithm that uses cross-validation to determine the value of a parameter which is present in the risk bound of Catoni (2007).

The first algorithm uses a non-informative prior to construct a classifier from all the training data. The second algorithm uses a fraction of the training set to construct an informative prior that is used to learn the final linear classifier on the remaining fraction of the training data. The third algorithm is, like the first one, based on a non-informative prior, but uses the cross-validation methodology to choose one of the bound's parameters. The idea of using a fraction of the training data to construct a prior has been proposed by Ambroladze et al. (2006) for the problem of choosing the hyperparameter values of the SVM. In contrast, the priors are used here to directly minimize a PAC-Bayes bound.
Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Our extensive experiments indicate that the second and third algorithms are competitive with both AdaBoost and the SVM, and are generally much more effective than the first algorithm in their ability at producing classifiers with small true risk.

2. Simplified PAC-Bayesian Theory

We consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space Y = {-1, +1}. An example is an input-output pair (x, y) where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, distribution D on X × Y.

The risk R(h) of any classifier h: X → Y is defined as the probability that h misclassifies an example drawn according to D. Given a training set S of m examples, the empirical risk R_S(h) of any classifier h is defined by the frequency of training errors of h on S. Hence

  R(h) ≝ E_{(x,y)∼D} I(h(x) ≠ y) ,    R_S(h) ≝ (1/m) ∑_{i=1}^m I(h(x_i) ≠ y_i) ,

where I(a) = 1 if predicate a is true and 0 otherwise.

After observing the training set S, the task of the learner is to choose a posterior distribution Q over a space H of classifiers such that the Q-weighted majority vote classifier B_Q will have the smallest possible risk. On any input example x, the output B_Q(x) of the majority vote classifier B_Q (sometimes called the Bayes classifier) is given by

  B_Q(x) ≝ sgn( E_{h∼Q} h(x) ) ,

where sgn(s) = +1 if s > 0 and sgn(s) = -1 otherwise. The output of the deterministic majority vote classifier B_Q is closely related to the output of a stochastic classifier called the Gibbs classifier G_Q. To classify an input example x, the Gibbs classifier G_Q chooses randomly a (deterministic) classifier h according to Q to classify x. The true risk R(G_Q) and the empirical risk R_S(G_Q) of the Gibbs classifier are thus given by

  R(G_Q) = E_{h∼Q} R(h) ;    R_S(G_Q) = E_{h∼Q} R_S(h) .

Any bound for R(G_Q) can straightforwardly be turned into a bound for the risk of the majority vote R(B_Q). Indeed, whenever B_Q misclassifies x, at least half of the classifiers (under measure Q) misclassify x. It follows that the error rate of G_Q is at least half of the error rate of B_Q. Hence R(B_Q) ≤ 2R(G_Q). As shown in Langford and Shawe-Taylor (2003), this factor of 2 can sometimes be reduced to (1 + ε).

The following theorem gives both an upper and a lower bound on R(G_Q) by upper-bounding D(R_S(G_Q), R(G_Q)) for any convex function D: [0,1] × [0,1] → R.

Theorem 2.1. For any distribution D, for any set H of classifiers, for any prior distribution P of support H, for any δ ∈ (0, 1], and for any convex function D: [0,1] × [0,1] → R, we have

  Pr_{S∼D^m} ( ∀Q on H:  D(R_S(G_Q), R(G_Q)) ≤ (1/m) [ KL(Q‖P) + ln( (1/δ) E_{S∼D^m} E_{h∼P} e^{m·D(R_S(h), R(h))} ) ] ) ≥ 1 - δ ,

where KL(Q‖P) ≝ E_{h∼Q} ln( Q(h)/P(h) ) is the Kullback-Leibler divergence between Q and P.

Proof. Since E_{h∼P} e^{m·D(R_S(h), R(h))} is a non-negative random variable, Markov's inequality gives

  Pr_{S∼D^m} ( E_{h∼P} e^{m·D(R_S(h), R(h))} ≤ (1/δ) E_{S∼D^m} E_{h∼P} e^{m·D(R_S(h), R(h))} ) ≥ 1 - δ .

Hence, by taking the logarithm on each side of the innermost inequality, and by transforming the expectation over P into an expectation over Q, we obtain

  Pr_{S∼D^m} ( ∀Q:  ln E_{h∼Q} [ (P(h)/Q(h)) e^{m·D(R_S(h), R(h))} ] ≤ ln( (1/δ) E_{S∼D^m} E_{h∼P} e^{m·D(R_S(h), R(h))} ) ) ≥ 1 - δ .

The theorem then follows from two applications of Jensen's inequality: one exploiting the concavity of ln(x), and the second the convexity of D.

Theorem 2.1 provides a tool to derive PAC-Bayesian risk bounds. Each such bound is obtained by using a particular convex function D: [0,1] × [0,1] → R and by upper-bounding E_{S∼D^m} E_{h∼P} e^{m·D(R_S(h), R(h))}.
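As a quick numerical illustration (ours, not part of the paper) of the factor-of-2 relation above: for a finite hypothesis set with posterior weights Q, the majority-vote risk never exceeds twice the Gibbs risk.

```python
import numpy as np

# Toy check of R(B_Q) <= 2 R(G_Q) on a finite hypothesis set.
# votes[i, j] in {-1, +1} is the prediction of classifier h_i on example x_j.
rng = np.random.default_rng(0)
votes = rng.choice([-1, 1], size=(5, 200))   # 5 classifiers, 200 examples
y = rng.choice([-1, 1], size=200)            # true labels
Q = rng.dirichlet(np.ones(5))                # posterior weights over the 5 classifiers

gibbs_risk = float(np.mean(Q @ (votes != y)))         # R(G_Q): Q-average of per-h risks
bayes_risk = float(np.mean(np.sign(Q @ votes) != y))  # R(B_Q): risk of the majority vote

assert bayes_risk <= 2 * gibbs_risk
```

The assertion holds deterministically: on every example misclassified by B_Q, the Q-mass of erring classifiers is at least 1/2, so averaging over examples gives R(B_Q) ≤ 2R(G_Q).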
For example, a slightly tighter PAC-Bayes bound than the one derived by Seeger (2002) and Langford (2005) can be obtained from Theorem 2.1 by using D(q, p) = kl(q, p), where

  kl(q, p) ≝ q ln(q/p) + (1 - q) ln( (1-q)/(1-p) ) .

Corollary 2.1. For any distribution D, for any set H of classifiers, for any prior distribution P of support H, and for any δ ∈ (0, 1], we have

  Pr_{S∼D^m} ( ∀Q on H:  kl(R_S(G_Q), R(G_Q)) ≤ (1/m) [ KL(Q‖P) + ln( ξ(m)/δ ) ] ) ≥ 1 - δ ,

where ξ(m) ≝ ∑_{k=0}^m (m choose k) (k/m)^k (1 - k/m)^{m-k} .

Proof. The corollary immediately follows from Theorem 2.1 by choosing D(q, p) = kl(q, p). Indeed, in that case we have

  E_{S∼D^m} E_{h∼P} e^{m·kl(R_S(h), R(h))}
    = E_{h∼P} E_{S∼D^m} ( R_S(h)/R(h) )^{m·R_S(h)} ( (1-R_S(h))/(1-R(h)) )^{m(1-R_S(h))}
    = E_{h∼P} ∑_{k=0}^m Pr_{S∼D^m}( R_S(h) = k/m ) ( (k/m)/R(h) )^k ( (1-k/m)/(1-R(h)) )^{m-k}
    = ∑_{k=0}^m (m choose k) (k/m)^k (1-k/m)^{m-k} = ξ(m) ,

where the last equality arises from the fact that m·R_S(h) is a binomial random variable of mean m·R(h). See Banerjee (2006) for a very similar proof. Note also that we retrieve the exact formulation of the PAC-Bayes bound of Langford (2005) if we upper-bound ξ(m) by m + 1. However, ξ(m) ∈ Θ(√m).

The PAC-Bayes bound of McAllester (2003) can be obtained by using D(q, p) = 2(q - p)².

Let us now consider functions that are linear in the empirical risk, i.e., functions of the form D(q, p) = F(p) - C·q for convex F. As the next corollary shows, this choice for D gives a PAC-Bayes bound whose minimum is obtained for Gibbs classifiers minimizing a simple linear combination of R_S(G_Q) and KL(Q‖P). The next corollary has also been found by Catoni (2007)[Thm. 1.2.1].

Corollary 2.2. For any distribution D, any set H of classifiers, any prior distribution P of support H, any δ ∈ (0, 1], and any positive real number C, we have

  Pr_{S∼D^m} ( ∀Q on H:  R(G_Q) ≤ (1/(1 - e^{-C})) [ 1 - exp( -( C·R_S(G_Q) + (1/m)[ KL(Q‖P) + ln(1/δ) ] ) ) ] ) ≥ 1 - δ .

Proof. Put D(q, p) = F(p) - C·q for some function F to be defined. Then

  E_{S∼D^m} E_{h∼P} e^{m·D(R_S(h), R(h))}
    = E_{S∼D^m} E_{h∼P} e^{m·F(R(h)) - C·m·R_S(h)}
    = E_{h∼P} e^{m·F(R(h))} ∑_{k=0}^m Pr_{S∼D^m}( R_S(h) = k/m ) e^{-Ck}
    = E_{h∼P} e^{m·F(R(h))} ∑_{k=0}^m (m choose k) R(h)^k (1 - R(h))^{m-k} e^{-Ck}
    = E_{h∼P} e^{m·F(R(h))} ( R(h)·e^{-C} + (1 - R(h)) )^m ,

and the result follows easily from Theorem 2.1 when F is the convex function F(p) = ln( 1/(1 - p[1 - e^{-C}]) ).

It is interesting to compare the bounds of Corollaries 2.1 and 2.2. A nice property of the bound of Corollary 2.2 is the fact that its minimization is obtained from the Gibbs classifier G_Q that minimizes C·m·R_S(G_Q) + KL(Q‖P). As we will see, this minimization problem is closely related to the one solved by the SVM when Q is an isotropic Gaussian over the space of linear classifiers. Minimizing the bound given by Corollary 2.1 does not appear to be as simple, because the upper bound on R(G_Q) is not an explicit function of R_S(G_Q) and KL(Q‖P). However, this upper bound does not depend on an arbitrary constant such as C in Corollary 2.2, which gives a computational advantage to Corollary 2.1, since several bound minimizations (one for each value of C) would be needed in the case of Corollary 2.2. The tightness of these bounds can be compared with the following proposition.

Proposition 2.1. For any 0 ≤ R_S ≤ R < 1, we have

  max_{C ≥ 0} ( -ln( 1 - R[1 - e^{-C}] ) - C·R_S ) = kl(R_S, R) .

Consequently, by omitting ln(ξ(m)), Corollary 2.1 always gives a bound which is tighter than or equal to the one given by Corollary 2.2. On the other hand, there always exist values of C for which Corollary 2.2 gives a tighter bound than Corollary 2.1.

The next lemma shows that the bound of Corollary 2.2 has the interesting property of having an analytical expression for the optimal posterior Q* for every prior P.

Lemma 2.1. For any set H of classifiers, any prior P of support H, and any positive real number C, the posterior Q* that minimizes the upper bound on R(G_Q) of Corollary 2.2 has a density which is given by the following Boltzmann distribution:

  Q*(h) = (1/Z) P(h) e^{-C·m·R_S(h)} ,

where m denotes the number of training examples in S and Z is a normalizing constant.

Proof. We present here a proof for the case where H is countable, but the lemma also holds in the continuous case. For any fixed C, δ, and P, the distribution Q minimizing the bound of Corollary 2.2 is the same as the one minimizing B(Q), where

  B(Q) ≝ C · ∑_{h∈H} Q(h) R_S(h) + (1/m) KL(Q‖P) ,

under the constraint ∑_{h∈H} Q(h) = 1. At optimality, Q must satisfy the Lagrange conditions, namely that there exists λ ∈ R such that for any h ∈ H, we have

  λ = ∂B/∂Q(h) = C·R_S(h) + (1/m) ( 1 + ln( Q(h)/P(h) ) ) .

Consequently, Q(h) = P(h) e^{m(λ - C·R_S(h)) - 1} = (1/Z) P(h) e^{-C·m·R_S(h)}, where Z is a normalizing constant.

It is well known that Bayes classifiers resulting from a Boltzmann distribution can only be expressed via integral formulations. Such integrals can be approximated by Markov chain Monte Carlo sampling but, since the mixing time is unknown, we have no real control on the precision of the approximation. For this reason, we restrict ourselves here to the case where the posterior Q is chosen from a parameterized set of distributions. Building on the previous work of Langford and Shawe-Taylor (2003) and Langford (2005), we will focus on isotropic Gaussian distributions of linear classifiers since, in this case, we have an exact analytical expression for B_Q, G_Q, R_S(B_Q), R_S(G_Q), and KL(Q‖P) in terms of the parameters of the posterior Q. These analytic expressions will enable us to perform our computations without any Monte Carlo sampling.

3. Specialization to Linear Classifiers

Let us apply Corollaries 2.1 and 2.2 to linear classifiers that are defined over a space of features. Here we suppose that each x ∈ X is mapped to a feature vector φ(x) = (φ_1(x), φ_2(x), ...), where each φ_i is given explicitly as a real-valued function or given implicitly by using a Mercer kernel k: X × X → R. In the latter case, we have k(x, x') = φ(x)·φ(x') for all (x, x') ∈ X × X. Each linear classifier h_w is identified by a real-valued weight vector w. The output h_w(x) of h_w on any x ∈ X is given by

  h_w(x) = sgn( w·φ(x) ) .

The task of the learner is to produce a posterior Q over the set of all possible weight vectors. If each feature vector has N components, the set of all possible weight vectors is R^N. Let Q(v) denote the posterior density evaluated at weight vector v. We restrict ourselves to the case where the learner produces a posterior Q_w, parameterized by a chosen weight vector w, such that for any weight vectors v and u we have Q_w(v) = Q_w(u) whenever v - w = -(u - w). Posteriors Q_w satisfying this property are said to be symmetric about w. It can easily be shown that for any Q_w symmetric about w and for any feature vector φ:

  sgn( E_{v∼Q_w} sgn(v·φ) ) = sgn( w·φ ) .    (1)

In other words, for any input example, the output of the majority vote classifier B_{Q_w} (given by the left-hand side of Equation 1) is the same as the one given by the linear classifier h_w whenever Q_w is symmetric about w. Consequently, R(h_w) = R(B_{Q_w}) ≤ 2R(G_{Q_w}), so that Corollaries 2.1 and 2.2 provide upper bounds on R(h_w) for these posteriors.

Building on the previous work of Langford and Shawe-Taylor (2003) and Langford (2005), we choose both the prior P_{w_p} and the posterior Q_w to be spherical Gaussians with identity covariance matrix, respectively centered on w_p and on w. Hence, for any weight vector v ∈ R^N:

  Q_w(v) = (1/√(2π))^N exp( -(1/2) ‖v - w‖² ) .

Thus, the posterior is parameterized by a weight vector w that will be chosen by the learner based on the values of R_S(G_{Q_w}) and KL(Q_w‖P_{w_p}). Here, the weight vector w_p that parameterizes the prior P_{w_p} represents prior knowledge that the learner might have about the classification task (i.e., about good directions for linear separators).
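As a numerical illustration (ours, not from the paper), both risk bounds can be evaluated from R_S(G_Q), KL(Q‖P), m and δ. Corollary 2.1 requires inverting kl(q, ·), here by bisection, with ξ(m) upper-bounded by m + 1 as in Langford (2005); Corollary 2.2 is explicit. The function names below are our own.

```python
import math

def kl(q, p):
    """Binary KL divergence kl(q||p) used in Corollary 2.1."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def bound_cor21(emp_risk, kl_qp, m, delta):
    """Corollary 2.1: largest r with kl(emp_risk, r) <= rhs, found by bisection.
    xi(m) is upper-bounded by m + 1, which recovers Langford's (2005) bound."""
    rhs = (kl_qp + math.log((m + 1) / delta)) / m
    lo, hi = emp_risk, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kl(emp_risk, mid) > rhs:   # kl(emp_risk, .) increases on [emp_risk, 1)
            hi = mid
        else:
            lo = mid
    return hi

def bound_cor22(emp_risk, kl_qp, m, delta, C):
    """Corollary 2.2 (Catoni): explicit upper bound on R(G_Q) for a fixed C."""
    expo = C * emp_risk + (kl_qp + math.log(1.0 / delta)) / m
    return (1.0 - math.exp(-expo)) / (1.0 - math.exp(-C))
```

For instance, with R_S(G_Q) = 0.1, KL(Q‖P) = 5, m = 1000 and δ = 0.05, the two bounds are comparable once C is optimized on a grid for Corollary 2.2, as Proposition 2.1 suggests.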
We therefore have w_p = 0 in the absence of prior knowledge, so that P_0 is the non-informative prior. Alternatively, we might set aside a subset of the training data S and choose w_p such that the risk of the Gibbs classifier G_{P_{w_p}} on that subset is small. By performing simple Gaussian integrals, as in Langford (2005), we find

  KL(Q_w‖P_{w_p}) = (1/2) ‖w - w_p‖² ,
  R(G_{Q_w}) = E_{(x,y)∼D} Φ( ‖w‖ Γ_w(x, y) ) ,
  R_S(G_{Q_w}) = (1/m) ∑_{i=1}^m Φ( ‖w‖ Γ_w(x_i, y_i) ) ,

where Γ_w(x, y) denotes the normalized margin of w on (x, y), i.e.,

  Γ_w(x, y) ≝ y·w·φ(x) / ( ‖w‖·‖φ(x)‖ ) ,

and where Φ(a) denotes the probability that X > a when X is a N(0,1) random variable, i.e.,

  Φ(a) ≝ (1/√(2π)) ∫_a^∞ exp(-x²/2) dx .    (2)

3.1. Two objective functions to minimize

By using the above expressions for R_S(G_{Q_w}) and KL(Q_w‖P_{w_p}), Corollaries 2.1 and 2.2 both provide upper bounds on R(G_{Q_w}) and on R(h_w) (since R(h_w) ≤ 2R(G_{Q_w})). Each bound depends on the same quantities: the empirical risk measure R_S(G_{Q_w}), and KL(Q_w‖P_{w_p}), which acts as a regularizer.

Minimizing the upper bound given by Corollary 2.1, in the case of linear classifiers, amounts to finding w that minimizes the following objective function:

  B(S, w, δ) ≝ sup { B : kl( R_S(G_{Q_w}), B ) ≤ (1/m) [ KL(Q_w‖P_{w_p}) + ln( ξ(m)/δ ) ] } ,    (F2.1)

for a fixed value of the confidence parameter δ (say δ = 0.05). Consequently, our problem is to find the weight vector w that minimizes B subject to the constraints

  kl( R_S(G_{Q_w}), B ) = (1/m) [ KL(Q_w‖P_{w_p}) + ln( ξ(m)/δ ) ]    (3)
  B > R_S(G_{Q_w}) .    (4)

Minimizing the bound of Corollary 2.2, in the case of linear classifiers, amounts to finding w that minimizes the simple objective function

  C·m·R_S(G_{Q_w}) + KL(Q_w‖P_{w_p}) = C ∑_{i=1}^m Φ( y_i·w·φ(x_i) / ‖φ(x_i)‖ ) + (1/2) ‖w - w_p‖² ,    (F2.2)

for some fixed choice of C and w_p. In the absence of prior knowledge, w_p = 0 and the regularizer becomes identical to the one used by the SVM. Indeed, the learning strategy used by the soft-margin SVM consists in finding the w that minimizes

  C ∑_{i=1}^m max( 0, 1 - y_i·w·φ(x_i) ) + (1/2) ‖w‖² ,

for some fixed choice of C. Thus, for w_p = 0, both learning strategies are identical except for the fact that the convex SVM hinge loss, max(0, ·), is replaced by the non-convex probit loss, Φ(·). Hence, the objective function minimized by the soft-margin SVM is a convex relaxation of objective function F2.2. Each learning strategy has its potential drawback: the single local minimum of the soft-margin SVM solution might be suboptimal, whereas the non-convex PAC-Bayes bound might present several local minima.

Observe that B, the objective function F2.1, is defined only implicitly in terms of w via the constraints given by Equations (3) and (4). This optimization problem appears to be more involved than the (unconstrained) optimization of objective function F2.2 that arises from Corollary 2.2. However, it also appears to be more relevant since, according to Proposition 2.1, the upper bound given by Corollary 2.1 is somewhat tighter than the one given by Corollary 2.2 (apart from the presence of the ln ξ(m) term). The optimization of the objective function F2.1 also has the advantage of not depending on any constant C like the one present in the objective function F2.2.

4. Gradient Descent of the PAC-Bayes Bound

We are now concerned with the problem of minimizing the (non-convex) objective functions F2.1 (for fixed w_p) and F2.2 (for fixed C and w_p). As a first approach, it makes sense to minimize these objective functions by gradient descent. More specifically, we have used the Polak-Ribière conjugate gradient descent algorithm implemented in the GNU Scientific Library (GSL).

The gradient (with respect to w) of objective function F2.1 is obtained by computing the partial derivative of both sides of Equation (3) with respect to w_j (the j-th component of w). After solving for ∂B/∂w_j, we find that the gradient is given by

  ∇B = (1/m) · ( B(1-B) / (B - R_S) ) · [ (w - w_p) + ln( B(1-R_S) / (R_S(1-B)) ) ∑_{i=1}^m Φ'( y_i·w·φ(x_i)/‖φ(x_i)‖ ) · y_i·φ(x_i)/‖φ(x_i)‖ ] ,    (5)

where R_S abbreviates R_S(G_{Q_w}), B abbreviates B(S, w, δ), and Φ'(t) denotes the first derivative of Φ evaluated at t. We have observed that objective function F2.1 tends to have only one local minimum, even though it is not convex. We have therefore used a single gradient-descent run to minimize F2.1.

The gradient of objective function F2.2 is

  C ∑_{i=1}^m Φ'( y_i·w·φ(x_i)/‖φ(x_i)‖ ) · y_i·φ(x_i)/‖φ(x_i)‖ + (w - w_p) .    (6)

Since this objective function might have several local minima, especially for large values of C, each minimization of F2.2 consisted of k different gradient-descent runs, where each run was initiated from a new, randomly chosen, starting position. In the results presented here, we have used k = 10 for C ≤ 10 and k = 100 for C > 10.

4.1. Proposed learning algorithms

We propose three algorithms that can be used either with the primal variables (i.e., the components of w) or with the dual variables {α_1, ..., α_m} that appear in the linear expansion w = ∑_{i=1}^m α_i y_i φ(x_i). In the latter case, the features are implicitly given by a Mercer kernel k(x, x') = φ(x)·φ(x') for all (x, x') ∈ X × X. The objective functions F2.1 and F2.2, with their gradients (Eqs. (5) and (6)), can then straightforwardly be expressed in terms of k(·, ·) and the dual variables.²

The first algorithm, called PBGD1, uses the prior P_0 (i.e., with w_p = 0) to learn a posterior Q_w by minimizing the bound value of Corollary 2.1 (objective function F2.1). In this paper, every bound computation has been performed with δ = 0.05.

The second algorithm, called PBGD2, was studied to investigate whether it is worthwhile to use a fraction x of the training data to construct an informative prior P_{w_p}, for some w_p ≠ 0, that will be used to learn a posterior Q_w on the remaining 1 - x fraction of the training data.
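The core quantities manipulated by all three PBGD variants, the probit loss Φ of Equation (2), the objective function F2.2, and its gradient from Equation (6), can be sketched directly. The following fragment is our own illustration (variable names are ours), with a finite-difference check of the gradient on random data:

```python
import math
import numpy as np

def Phi(t):
    """Pr(X > t) for X ~ N(0,1): the Gaussian tail of Eq. (2)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def Phi_prime(t):
    """First derivative of Phi: -exp(-t^2/2) / sqrt(2*pi)."""
    return -math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def F22(w, wp, X, y, C):
    """Objective F2.2: C * m * R_S(G_Qw) + KL(Qw || Pwp)."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)   # y_i w.phi(x_i)/||phi(x_i)||
    probit_losses = np.array([Phi(t) for t in margins])
    return C * probit_losses.sum() + 0.5 * np.dot(w - wp, w - wp)

def grad_F22(w, wp, X, y, C):
    """Gradient of F2.2 with respect to w (Eq. 6)."""
    norms = np.linalg.norm(X, axis=1)
    margins = y * (X @ w) / norms
    dPhi = np.array([Phi_prime(t) for t in margins])
    return C * (X.T @ (dPhi * y / norms)) + (w - wp)

# Finite-difference sanity check on random data.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(20, 3)), rng.choice([-1.0, 1.0], size=20)
w, wp, C = rng.normal(size=3), np.zeros(3), 2.0
g = grad_F22(w, wp, X, y, C)
eps = 1e-6
for j in range(3):
    e = np.zeros(3); e[j] = eps
    num = (F22(w + e, wp, X, y, C) - F22(w - e, wp, X, y, C)) / (2 * eps)
    assert abs(num - g[j]) < 1e-4
```

This is a single-gradient-evaluation sketch; the paper plugs such gradients into the Polak-Ribière conjugate gradient routine of the GSL rather than using plain gradient steps.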
In its first stage, PBGD2 minimizes the objective function F2.2 by using a fraction x of the training data to construct one posterior for each value of C ∈ {10^k : k = 0, ..., 6}. Note that a large value for C attempts to generate a w for which the training error of G_w is small. Each posterior is constructed with the same non-informative prior used for PBGD1 (i.e., with w_p = 0). Then, each of these seven posteriors is used as a prior P_{w_p}, with w_p ≠ 0, for learning a posterior Q_w by minimizing the objective function F2.1 on the remaining fraction 1 - x of the training data. From the union-bound argument, the δ term in Corollary 2.1 needs to be replaced by δ/7 to get a bound that uniformly holds for these seven priors. Empirically, we have observed that the best fraction x for constructing the prior was 1/2. Hence, we report here only the results for x = 1/2.

For the third algorithm, called PBGD3, we always used the prior P_0 to minimize the objective function F2.2. But, instead of using the solution obtained for the value of C that gave the smallest bound of Corollary 2.2, we performed 10-fold cross-validation on the training set to find the "best" value for C, and then used that value of C to find the classifier that minimizes objective function F2.2. Hence PBGD3 follows the same cross-validation learning methodology normally employed with the SVM, but uses the probit loss instead of the hinge loss.

To compute the risk bound for the linear classifier returned by PBGD3 and the other comparison algorithms (AdaBoost and SVM), we performed a line search, along the direction of the weight vector w of the returned classifier, to find the norm ‖w‖ that minimizes the bound of Corollary 2.1.³ For each bound computation, we used the non-informative prior P_0.

² This is true if w_p can be expanded in terms of examples that do not belong to the training set.

4.2. PBGD with Respect to Primal Variables

For the sake of comparison, all learning algorithms of this subsection produce a linear classifier h_w on the set of basis functions {φ_1, φ_2, ...} known as decision stumps. Each decision stump φ_i is a threshold classifier that depends on a single attribute: its output is +b if the tested attribute exceeds a threshold value t, and -b otherwise, where b ∈ {-1, +1}. For each attribute, at most ten equally spaced possible values for t were determined a priori.

We have compared the three PBGD algorithms to AdaBoost (Schapire et al., 1998) because the latter is a standard and efficient algorithm when used with decision stumps. Since AdaBoost is an algorithm that minimizes the exponential risk (1/m) ∑_{i=1}^m exp(-y_i w·φ(x_i)), it never chooses a w for which there exists a training example where -y_i w·φ(x_i) is very large. This is to be contrasted with the PBGD algorithms, for which the empirical risk R_S(G_{Q_w}) has the "sigmoidal" shape of Equation (2) (and never exceeds one). We thus anticipate that AdaBoost and the PBGD algorithms will select different weight vectors w on many data sets.

The results obtained for all three algorithms are summarized in Table 1. Except for MNIST, all data sets were taken from the UCI repository. Each data set was randomly split into a training set S of |S| examples and a testing set T of |T| examples. The number n of attributes for each data set is also specified. For AdaBoost, the number of boosting rounds was fixed to 200. For all algorithms, R_T(w) refers to the frequency of errors, measured on the testing set T, of the linear classifier h_w returned by the learner. For the PBGD algorithms, G_T(w) ≝ R_T(G_{Q_w}) refers to the empirical risk on T of the Gibbs classifier. The "Bnd" columns refer to the PAC-Bayes bound of Corollary 2.1, computed on the training set. All bounds hold with confidence 1 - δ = 0.95.
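To fix ideas, a decision-stump feature map of the kind described above can be sketched as follows. This is our own simplification (the paper fixes at most ten equally spaced threshold values per attribute a priori; the function name and threshold scheme here are ours):

```python
import numpy as np

def stump_features(X, n_thresholds=10):
    """Decision-stump basis: for each attribute j and each of n_thresholds
    equally spaced interior thresholds t, output +1 if x_j > t and -1
    otherwise.  The b = -1 polarity is absorbed by the sign of the learned
    weight, so only one polarity needs to be generated."""
    columns = []
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        for t in np.linspace(lo, hi, n_thresholds + 2)[1:-1]:
            columns.append(np.where(X[:, j] > t, 1.0, -1.0))
    return np.column_stack(columns)

# A linear classifier h_w on these stumps outputs sgn(w . phi(x)).
X = np.array([[0.0, 5.0], [1.0, 3.0], [2.0, 1.0]])
phi = stump_features(X)   # shape (3, 2 * 10): 2 attributes, 10 thresholds each
```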
For PBGD1, PBGD3 and AdaBoost, the bound is computed on all the training data with the non-informative prior P_0. For PBGD2, the bound is computed on the second half of the training data with the prior P_{w_p} constructed from the first half and, as explained in Section 4.1, with δ replaced by δ/7.

³ This is justified by the fact that the bound holds uniformly for all weight vectors w.

Note that the bound values for the classifiers returned by PBGD2 are generally much lower than those for the classifiers produced by the other algorithms. This almost always materializes in a smaller testing error for the linear classifier produced by PBGD2. To our knowledge, these training-set bounds for PBGD2 are the smallest ones obtained for any learning algorithm producing linear classifiers.

To determine whether or not a difference of empirical risk measured on the testing set T is statistically significant, we have used the test-set bound method of Langford (2005) (based on the binomial tail inversion) with a confidence level of 95%. It turns out that no algorithm has succeeded in choosing a linear classifier h_w which was statistically significantly better (SSB) than the one chosen by another algorithm, except for the few cases listed in the column "SSB" of Table 1. Overall, AdaBoost, PBGD2 and PBGD3 are very competitive with one another (with no clear winner) and are generally superior to PBGD1. We therefore see an advantage in using half of the training data to learn a good prior, over using a non-informative prior and keeping all the data to learn the posterior.

Table 1. Summary of results for linear classifiers on decision stumps. In the SSB column, (a), (1), (2), (3) denote AdaBoost, PBGD1, PBGD2 and PBGD3 respectively.

Dataset         |S|    |T|    n |  AdaBoost   |        PBGD1        |        PBGD2        |        PBGD3        | SSB
                                |  RT    Bnd  |  RT     GT    Bnd   |  RT     GT    Bnd   |  RT     GT    Bnd   |
Usvotes         235    200   16 | 0.055 0.346 | 0.085  0.103  0.207 | 0.060  0.058  0.165 | 0.060  0.057  0.261 |
Credit-A        353    300   15 | 0.170 0.504 | 0.177  0.243  0.375 | 0.187  0.191  0.272 | 0.143  0.159  0.420 |
Glass           107    107    9 | 0.178 0.636 | 0.196  0.346  0.562 | 0.168  0.176  0.395 | 0.150  0.226  0.581 |
Haberman        144    150    3 | 0.260 0.590 | 0.273  0.283  0.422 | 0.267  0.287  0.465 | 0.273  0.386  0.424 |
Heart           150    147   13 | 0.259 0.569 | 0.170  0.250  0.461 | 0.190  0.205  0.379 | 0.184  0.214  0.473 |
Sonar           104    104   60 | 0.231 0.644 | 0.269  0.376  0.579 | 0.173  0.168  0.547 | 0.125  0.209  0.622 |
BreastCancer    343    340    9 | 0.053 0.295 | 0.041  0.058  0.129 | 0.047  0.054  0.104 | 0.044  0.048  0.190 |
Tic-tac-toe     479    479    9 | 0.357 0.483 | 0.294  0.384  0.462 | 0.207  0.208  0.302 | 0.207  0.217  0.474 | (2,3)<(a,1)
Ionosphere      176    175   34 | 0.120 0.602 | 0.120  0.223  0.425 | 0.109  0.129  0.347 | 0.103  0.125  0.557 |
Wdbc            285    284   30 | 0.049 0.447 | 0.042  0.099  0.272 | 0.049  0.048  0.147 | 0.035  0.051  0.319 |
MNIST:0vs8      500   1916  784 | 0.008 0.528 | 0.015  0.052  0.191 | 0.011  0.016  0.062 | 0.006  0.011  0.262 |
MNIST:1vs7      500   1922  784 | 0.013 0.541 | 0.020  0.055  0.184 | 0.015  0.016  0.050 | 0.016  0.017  0.233 |
MNIST:1vs8      500   1936  784 | 0.025 0.552 | 0.037  0.097  0.247 | 0.027  0.030  0.087 | 0.018  0.037  0.305 | (3)<(1)
MNIST:2vs3      500   1905  784 | 0.047 0.558 | 0.046  0.118  0.264 | 0.040  0.044  0.105 | 0.034  0.048  0.356 |
Letter:AvsB     500   1055   16 | 0.010 0.254 | 0.009  0.050  0.180 | 0.007  0.011  0.065 | 0.007  0.044  0.180 |
Letter:DvsO     500   1058   16 | 0.036 0.378 | 0.043  0.124  0.314 | 0.033  0.039  0.090 | 0.024  0.038  0.360 |
Letter:OvsQ     500   1036   16 | 0.038 0.431 | 0.061  0.170  0.357 | 0.053  0.053  0.106 | 0.042  0.049  0.454 |
Adult          1809  10000   14 | 0.149 0.394 | 0.168  0.196  0.270 | 0.169  0.169  0.209 | 0.159  0.160  0.364 | (a)<(1,2)
Mushroom       4062   4062   22 | 0.000 0.200 | 0.046  0.065  0.130 | 0.016  0.017  0.030 | 0.002  0.004  0.150 | (a,3)<(2)<(1)

4.3. PBGD with Respect to Dual Variables

In this subsection, we compare the PBGD algorithms to the soft-margin SVM. Here, all four learning algorithms produce a linear classifier on a feature space defined by the RBF kernel k satisfying k(x, x') = exp( -(1/2)‖x - x'‖²/γ² ) for all (x, x') ∈ X × X. The results obtained for all algorithms are summarized in Table 2. All the data sets are the same as those of the previous subsection.
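In the dual representation, w = ∑_i α_i y_i φ(x_i) and the quantities entering the objective functions reduce to kernel evaluations. A minimal sketch (our own illustration, with hypothetical variable names) using the RBF kernel of this subsection:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 * gamma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * gamma ** 2))

def dual_margins(alpha, X, y, gamma):
    """Margins y_i * w.phi(x_i) / ||phi(x_i)|| for w = sum_j alpha_j y_j phi(x_j).
    With the RBF kernel, ||phi(x)||^2 = k(x, x) = 1, so only kernel
    evaluations appear and no extra normalization is needed."""
    K = rbf_kernel(X, X, gamma)
    return y * (K @ (alpha * y))

# Toy usage on random data (shapes only; not the paper's experiments).
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
y = rng.choice([-1.0, 1.0], size=10)
alpha = rng.random(10)
margins = dual_margins(alpha, X, y, gamma=1.0)
```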
The notation used in this table is identical to the one used for Table 1. For the SVM and PBGD3, the kernel parameter γ and the soft-margin parameter C were chosen by 10-fold cross-validation (on the training set S) among the set of values proposed by Ambroladze et al. (2006). PBGD1 and PBGD2 also tried the same set of values for γ but used the bound of Corollary 2.1 to select a good value. Since we consider 15 different values of γ, the bound of PBGD1 is therefore computed with δ replaced by δ/15. For PBGD2, the bound is, as stated before, computed on the second half of the training data, but with δ replaced by δ/(7·15).

Again, the bound values for the classifiers returned by PBGD2 are generally much lower than those for the classifiers produced by the other algorithms. To our knowledge, the training-set bounds for PBGD2 are the smallest ones obtained for any learning algorithm producing linear classifiers.

The same method as in the previous subsection was used to determine whether or not a difference of empirical risk measured on the testing set T is statistically significant. It turns out that no algorithm has chosen a linear classifier h_w which was statistically significantly better than the choices of the others, except for the few cases listed in the "SSB" column. Thus, the SVM and PBGD3 are very competitive with one another (with no clear winner) and are both slightly superior to PBGD2, and a bit more than slightly superior to PBGD1.

Table 2. Summary of results for linear classifiers with a RBF kernel. In the SSB column, (s), (1), (2), (3) denote the SVM, PBGD1, PBGD2 and PBGD3 respectively.

Dataset         |S|    |T|    n |    SVM      |        PBGD1        |        PBGD2        |        PBGD3        | SSB
                                |  RT    Bnd  |  RT     GT    Bnd   |  RT     GT    Bnd   |  RT     GT    Bnd   |
Usvotes         235    200   16 | 0.055 0.370 | 0.080  0.117  0.244 | 0.050  0.050  0.153 | 0.075  0.085  0.332 |
Credit-A        353    300   15 | 0.183 0.591 | 0.150  0.196  0.341 | 0.150  0.152  0.248 | 0.160  0.267  0.375 |
Glass           107    107    9 | 0.178 0.571 | 0.168  0.349  0.539 | 0.215  0.232  0.430 | 0.168  0.316  0.541 |
Haberman        144    150    3 | 0.280 0.423 | 0.280  0.285  0.417 | 0.327  0.323  0.444 | 0.253  0.250  0.555 |
Heart           150    147   13 | 0.197 0.513 | 0.190  0.236  0.441 | 0.184  0.190  0.400 | 0.197  0.246  0.520 |
Sonar           104    104   60 | 0.163 0.599 | 0.250  0.379  0.560 | 0.173  0.231  0.477 | 0.144  0.243  0.585 |
BreastCancer    343    340    9 | 0.038 0.146 | 0.044  0.056  0.132 | 0.041  0.046  0.101 | 0.047  0.051  0.162 |
Tic-tac-toe     479    479    9 | 0.081 0.555 | 0.365  0.369  0.426 | 0.173  0.193  0.287 | 0.077  0.107  0.548 | (s,3)<(2)<(1)
Ionosphere      176    175   34 | 0.097 0.531 | 0.114  0.242  0.395 | 0.103  0.151  0.376 | 0.091  0.165  0.465 |
Wdbc            285    284   30 | 0.074 0.400 | 0.074  0.204  0.366 | 0.067  0.119  0.298 | 0.074  0.210  0.367 |
MNIST:0vs8      500   1916  784 | 0.003 0.257 | 0.009  0.053  0.202 | 0.007  0.015  0.058 | 0.004  0.011  0.320 |
MNIST:1vs7      500   1922  784 | 0.011 0.216 | 0.014  0.045  0.161 | 0.009  0.015  0.052 | 0.010  0.012  0.250 |
MNIST:1vs8      500   1936  784 | 0.011 0.306 | 0.014  0.066  0.204 | 0.011  0.019  0.060 | 0.010  0.024  0.291 |
MNIST:2vs3      500   1905  784 | 0.020 0.348 | 0.038  0.112  0.265 | 0.028  0.043  0.096 | 0.023  0.036  0.326 | (s)<(1)
Letter:AvsB     500   1055   16 | 0.001 0.491 | 0.005  0.043  0.170 | 0.003  0.009  0.064 | 0.001  0.408  0.485 |
Letter:DvsO     500   1058   16 | 0.014 0.395 | 0.017  0.095  0.267 | 0.024  0.030  0.086 | 0.013  0.031  0.350 |
Letter:OvsQ     500   1036   16 | 0.015 0.332 | 0.029  0.130  0.299 | 0.019  0.032  0.078 | 0.014  0.045  0.329 |
Adult          1809  10000   14 | 0.159 0.535 | 0.173  0.198  0.274 | 0.180  0.181  0.224 | 0.164  0.174  0.372 | (s,3)<(2)
Mushroom       4062   4062   22 | 0.000 0.213 | 0.007  0.032  0.119 | 0.001  0.003  0.011 | 0.000  0.001  0.167 | (s,2,3)<(1)

5. Conclusion

We have shown that the standard PAC-Bayes risk bounds (McAllester, 2003; Seeger, 2002; Langford, 2005; Catoni, 2007) are specializations of Theorem 2.1 that are obtained by choosing a particular convex function D that binds Gibbs' true risk to its empirical estimate.
Moreover, when spherical Gaussians over spaces of linear classifiers are used for priors and posteriors, we have shown that the Gibbs classifier G_{Q_w} that minimizes the PAC-Bayes bound of Corollary 2.1 (resp. 2.2) is obtained from the weight vector w that minimizes the (non-convex) objective function F2.1 (resp. F2.2). When the prior is non-informative, a simple convex relaxation of F2.2 gives the objective function which is minimized by the soft-margin SVM.

We have proposed two learning algorithms (PBGD1 and PBGD2) for finding linear classifiers that minimize the bound of Corollary 2.1, and another algorithm (PBGD3) that uses the cross-validation methodology to determine the value of the parameter C in the objective function F2.2. PBGD1 uses a non-informative prior to construct the final classifier from all the training data. In contrast, PBGD2 uses a fraction of the training set to construct an informative prior that is used to learn the final linear classifier on the remaining fraction of the training data. Our extensive experiments indicate that PBGD2 is generally much more effective than PBGD1 at producing classifiers with small true risk. Moreover, the training-set risk bounds obtained for PBGD2 are, to our knowledge, the smallest obtained so far for any learning algorithm producing linear classifiers. In fact, PBGD2 is a learning algorithm producing classifiers having a good guarantee without the need of using any test set for that purpose. This opens the way to a feasible learning strategy that uses all the available data for training. Our results also indicate that PBGD2 and PBGD3 are competitive with both AdaBoost and the soft-margin SVM at producing classifiers with small true risk. However, as a consequence of the non-convexity of the objective function F2.2, PBGD2 and PBGD3 are slower than AdaBoost and the SVM.

Acknowledgements

Work supported by NSERC Discovery grants 262067 and 0122405.

References

Ambroladze, A., Parrado-Hernández, E., & Shawe-Taylor, J.
(2006). Tighter PAC-Bayes bounds. Proceedings of the 2006 Conference on Neural Information Processing Systems (NIPS-06) (pp. 9-16).

Banerjee, A. (2006). On Bayesian bounds. ICML '06: Proceedings of the 23rd International Conference on Machine Learning (pp. 81-88).

Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Monograph series of the Institute of Mathematical Statistics, http://arxiv.org/abs/0712.0248.

Langford, J. (2005). Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6, 273-306.

Langford, J., & Shawe-Taylor, J. (2003). PAC-Bayes & margins. In S. Becker, S. Thrun and K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15 (pp. 423-430). Cambridge, MA: MIT Press.

McAllester, D. (2003). PAC-Bayesian stochastic model selection. Machine Learning, 51, 5-21.

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26, 1651-1686.

Seeger, M. (2002). PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3, 233-269.