A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning

Massih R. Amini
Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie, Paris, France
massih-reza.amini@lip6.fr

François Laviolette
Université Laval, Québec (QC), Canada
francois.laviolette@ift.ulaval.ca

Nicolas Usunier
Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie, Paris, France
nicolas.usunier@lip6.fr

Abstract

We propose two transductive bounds on the risk of majority votes that are estimated over partially labeled training sets. The first one involves the margin distribution of the classifier and a risk bound on its associated Gibbs classifier. The bound is tight when the Gibbs bound is tight and when the errors of the majority vote classifier are concentrated on a zone of low margin. In semi-supervised learning, considering the margin as an indicator of confidence constitutes the working hypothesis of algorithms which search for the decision boundary in low density regions. Following this assumption, we propose to bound the error probability of the voted classifier on the examples whose margins are above a fixed threshold. As an application, we propose a self-learning algorithm which iteratively assigns pseudo-labels to the unlabeled training examples whose margins are above a threshold obtained from this bound. Empirical results on different datasets show the effectiveness of our approach compared to TSVM and to the same algorithm with a manually fixed threshold.

1 Introduction

Ensemble methods [5] return a weighted vote of baseline classifiers. It is well known that under the PAC-Bayes framework [9], one can obtain an estimation of the generalization error (also called risk) of such majority votes (referred to as the Bayes classifier). Unfortunately, those bounds are generally not tight, mainly because they are indirectly obtained via a bound on a randomized combination of the baseline classifiers (called the Gibbs classifier). Although the PAC-Bayes theorem gives tight risk bounds for Gibbs classifiers, the bounds for their associated Bayes classifiers come at the cost of a worse risk (trivially a factor of 2 or, under some margin assumption, a factor of 1+ε). In practice the Bayes risk is often smaller than the Gibbs risk. In this paper we present a transductive bound on the Bayes risk. This bound is also based on the risk of the associated Gibbs classifier, but it takes as additional information the exact knowledge of the margin distribution of the unlabeled data. This bound is obtained by analytically solving a linear program. The intuitive idea is that, given the risk of the Gibbs classifier and the margin distribution, the risk of the majority vote classifier is maximized when all its errors are located on low margin examples. We show that our bound is tight when the associated Gibbs risk can accurately be estimated and when the Bayes classifier makes most of its errors on low margin examples. The proof of this transductive bound makes use of the (joint) probability, over an unlabeled data set, that the majority vote classifier makes an error and the margin is above a given threshold. This second result naturally leads to consider the conditional probability that the majority vote classifier makes an error knowing that the margin is above a given threshold.
This conditional probability is related to the idea that the margin is an indicator of confidence, which is recurrent in semi-supervised self-learning algorithms [3,6,10,11,12]. These methods first train a classifier on the labeled training examples. The classifier's outputs then serve to assign pseudo-class labels to the unlabeled data having a margin above a given threshold. The supervised method is retrained using the initial labeled set and its previous predictions on unlabeled data as additional labeled examples. In practice, these algorithms almost always fix the margin threshold manually. In the second part of the paper, we propose instead to find this margin threshold by minimizing the bound on the conditional probability. Empirical results on different datasets show the effectiveness of our approach compared to TSVM [7] and to the same algorithm with a manually fixed threshold as in [11].

In the remainder of the paper, we present, in section 2, our transductive bounds, show their outcomes in terms of sufficient conditions under which unlabeled data may be of help in the learning process, and describe the linear programming argument from which they are derived. Section 3 relates these bounds to margin-based self-learning algorithms. In section 4, we present experimental results obtained with a self-learning algorithm on different datasets, in which the bound of section 2 is used to choose the threshold that serves in the label assignment step of the algorithm. Finally, in section 5, we discuss the outcomes of this study and give some pointers to further research.

2 Transductive Bounds on the Risk of the Voted Classifier

We are interested in the study of binary classification problems where the input space X is a subset of R^d and the output space is Y = {-1,+1}. We furthermore suppose that the training set is composed of a labeled set Z_ℓ = ((x_i, y_i))_{1≤i≤ℓ} ∈ Z^ℓ and an unlabeled set X_U = (x'_i)_{ℓ+1≤i≤ℓ+u} ∈ X^u, where Z denotes the set X × Y. We suppose that each pair (x, y) ∈ Z_ℓ is drawn i.i.d. with respect to a fixed, but unknown, probability distribution D over X × Y, and we denote the marginal distribution over X by D_X. To simplify the notation and the proofs, we restrict ourselves to the deterministic labeling case, that is, for each x' ∈ X_U there is exactly one possible label that we will denote by y'.¹

In this study, we consider learning algorithms that work in a fixed hypothesis space H of binary classifiers (defined without reference to the training data). After observing the training set S = Z_ℓ ∪ X_U, the task of the learner is to choose a posterior distribution Q over H such that the Q-weighted majority vote classifier B_Q (also called the Bayes classifier) will have the smallest possible risk on the examples of X_U. Recall that the Bayes classifier is defined by

  B_Q(x) = sgn[ E_{h∼Q} h(x) ],  ∀x ∈ X,   (1)

where sgn(x) = +1 if the real number x > 0 and -1 otherwise. We further denote by G_Q the associated Gibbs classifier which, for classifying any example x ∈ X, chooses a classifier h randomly according to the distribution Q. We accordingly define the transductive risk of G_Q over the unlabeled set by

  R_u(G_Q) =def (1/u) Σ_{x'∈X_U} E_{h∼Q} [[h(x') ≠ y']],   (2)

where [[π]] = 1 if the predicate π holds and 0 otherwise, and for every unlabeled example x' ∈ X_U we refer to y' as its true, unknown class label.
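To make these definitions concrete, here is a minimal sketch (ours, not the authors' code) of equations (1) and (2) for a finite set of base classifiers: the hypotheses are assumed to be given as a matrix of ±1 predictions on X_U, the posterior Q as a weight vector, and the true labels are passed only to evaluate the risks.

```python
import numpy as np

def vote_and_risks(H, Q, y_true):
    """H: (n_hyp, u) array of +/-1 predictions of the base classifiers on X_U.
    Q: (n_hyp,) non-negative posterior weights.
    y_true: (u,) true labels of X_U, used here only to evaluate the two risks."""
    Q = np.asarray(Q, dtype=float)
    Q = Q / Q.sum()
    vote = Q @ H                             # E_{h~Q} h(x') for every x' in X_U
    bayes_pred = np.where(vote > 0, 1, -1)   # B_Q(x') = sgn(E_{h~Q} h(x')), eq. (1)
    margins = np.abs(vote)                   # unsigned margins |E_{h~Q} h(x')|
    gibbs_risk = np.mean(Q @ (H != y_true))  # R_u(G_Q), eq. (2)
    bayes_risk = np.mean(bayes_pred != y_true)
    return bayes_pred, margins, gibbs_risk, bayes_risk
```

The vector of unsigned margins computed here is the quantity m_Q(·) whose distribution over X_U drives the bounds of the next subsection.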
In section 2.1 we show that if we consider the margin as an indicator of confidence and if we dispose of a tight upper bound R_u^δ(G_Q) on the risk of G_Q which holds with probability 1-δ over the random choice of Z_ℓ and X_U (for example using Theorem 17 or 18 of Derbeko et al. [4]), then we are able to accurately bound the transductive risk of the Bayes classifier,

  R_u(B_Q) =def (1/u) Σ_{x'∈X_U} [[B_Q(x') ≠ y']].   (3)

This result follows from a bound on the joint Bayes risk,

  R_{u∧θ}(B_Q) =def (1/u) Σ_{x'∈X_U} [[B_Q(x') ≠ y' ∧ m_Q(x') > θ]],   (4)

where m_Q(·) = |E_{h∼Q} h(·)| denotes the unsigned margin function. One of the practical issues that arises from this result is the possibility of defining a threshold for which the bound is optimal, which we use in a self-learning algorithm by iteratively assigning pseudo-labels to the unlabeled examples having a margin above this threshold. We finally denote by E_u z the expectation of a random variable z with respect to the uniform distribution over X_U and, for notational convenience, we write P_u for the corresponding uniform probability distribution over X_U, i.e., for any subset A ⊆ X_U, P_u(A) = (1/u) card(A).

¹ The proofs can be extended to the more general noisy case, but one has to replace the summation (1/u) Σ_{x'∈X_U} by (1/u) Σ_{(x',y)∈X_U×{-1,+1}} P_{(x,y)∼D}(y|x') in the definitions of equations (3) and (4).

2.1 Main Result

Our main result is the following theorem, which provides two bounds: one on the transductive risk of the Bayes classifier (3) and one on the joint Bayes risk (4).

Theorem 1  Suppose that B_Q is as in (1). Then for all Q and all δ ∈ (0,1], with probability at least 1-δ:

  R_u(B_Q) ≤ inf_{γ∈(0,1]} { P_u(m_Q(x') < γ) + (1/γ) ⌊ K_u(Q) - M_Q^<(γ) ⌋_+ },   (5)

where K_u(Q) = R_u^δ(G_Q) + (1/2)(E_u m_Q(x') - 1), M_Q^<(t) = E_u[ m_Q(x') [[m_Q(x') < t]] ], M_Q^≤(t) = E_u[ m_Q(x') [[m_Q(x') ≤ t]] ], and ⌊·⌋_+ denotes the positive part (i.e. ⌊x⌋_+ = [[x > 0]] x). More generally, with probability at least 1-δ, for all Q and all θ ≥ 0:

  R_{u∧θ}(B_Q) ≤ inf_{γ∈(θ,1]} { P_u(θ < m_Q(x') < γ) + (1/γ) ⌊ K_u(Q) + M_Q^≤(θ) - M_Q^<(γ) ⌋_+ }.   (6)

In section 2.2 we will prove that the bound (5) simply follows from (6). In order to better understand the former bound on the risk of the Bayes classifier, denote by F_u^δ(Q) the right-hand side of equation (5),

  F_u^δ(Q) =def inf_{γ∈(0,1]} { P_u(m_Q(x') < γ) + (1/γ) ⌊ K_u(Q) - M_Q^<(γ) ⌋_+ },

and consider the following special case where the classifier makes most of its errors on unlabeled examples with low margin. Proposition 2, together with the explanations that follow, makes this idea clearer.

Proposition 2  Assume that ∀x' ∈ X_U, m_Q(x') > 0 and that there exists C ∈ (0,1] such that for all γ > 0:

  P_u(B_Q(x') ≠ y' ∧ m_Q(x') = γ) ≠ 0  ⟹  P_u(B_Q(x') ≠ y' ∧ m_Q(x') < γ) ≥ C · P_u(m_Q(x') < γ).

Then, with probability at least 1-δ:

  F_u^δ(Q) - R_u(B_Q) ≤ ((1-C)/C) R_u(B_Q) + (1/γ*) ( R_u^δ(G_Q) - R_u(G_Q) ),   (7)

where γ* = sup{ γ | P_u(B_Q(x') ≠ y' ∧ m_Q(x') = γ) ≠ 0 }.

Now, suppose that the margin is an indicator of confidence. Then a Bayes classifier that makes its errors mostly on low margin regions will admit a coefficient C in inequality (7) close to 1, and the bound of (5) becomes tight (provided we have an accurate upper bound R_u^δ(G_Q)). In the next section we provide the proofs of all the statements above and show, in Lemma 4, a simple way to compute the best margin threshold for which the general bound on the joint Bayes risk is the lowest.
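Both bounds can be evaluated directly from the empirical margin distribution on X_U. The sketch below is an illustration under our own naming, not the authors' implementation: it evaluates the right-hand side of (6) for a given θ by scanning candidate values of γ, which can be restricted to the distinct margin values above θ plus γ = 1; with θ = 0 and strictly positive margins it also yields the bound (5) on R_u(B_Q).

```python
import numpy as np

def transductive_bound(margins, gibbs_risk_bound, theta=0.0):
    """Right-hand side of equation (6), a bound on the joint Bayes risk.
    margins: unsigned margins m_Q(x') on X_U; gibbs_risk_bound: R_u^delta(G_Q).
    Returns the value of the bound and the gamma achieving it."""
    m = np.asarray(margins, dtype=float)
    K = gibbs_risk_bound + 0.5 * (m.mean() - 1.0)          # K_u(Q)
    M_leq_theta = np.mean(m * (m <= theta))                # M_Q^{<=}(theta)
    best_val, best_gamma = np.inf, None
    for gamma in np.unique(np.append(m[m > theta], 1.0)):  # candidate gammas in (theta, 1]
        P_between = np.mean((m > theta) & (m < gamma))     # P_u(theta < m_Q(x') < gamma)
        M_lt_gamma = np.mean(m * (m < gamma))              # M_Q^{<}(gamma)
        val = P_between + max(K + M_leq_theta - M_lt_gamma, 0.0) / gamma
        if val < best_val:
            best_val, best_gamma = val, gamma
    return best_val, best_gamma
```

Dividing this quantity by P_u(m_Q(x') > θ) for different values of θ is how the self-learning algorithm of section 3 will pick its margin threshold.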
2.2 Proofs

All our proofs are based on the relationship between R_u(G_Q) and R_{u∧θ}(B_Q) given by the following lemma.

Lemma 3  Let (γ_1, ..., γ_N) be the ordered sequence of the distinct strictly positive values of the margin on X_U, that is {γ_i, i = 1..N} = { m_Q(x') | x' ∈ X_U ∧ m_Q(x') > 0 } and ∀i ∈ {1, ..., N-1}, γ_i < γ_{i+1}. Denote moreover b_i = P_u(B_Q(x') ≠ y' ∧ m_Q(x') = γ_i) for i ∈ {1, ..., N}. Then,

  R_u(G_Q) = Σ_{i=1}^N b_i γ_i + (1/2)(1 - E_u m_Q(x')),   (8)

  ∀θ ∈ [0,1],  R_{u∧θ}(B_Q) = Σ_{i=k+1}^N b_i  with k = max{ i | γ_i ≤ θ }.   (9)

Proof  Equation (9) follows from the definition R_{u∧θ}(B_Q) = P_u(B_Q(x') ≠ y' ∧ m_Q(x') > θ). Equation (8) is obtained from the definition of the margin m_Q, which writes, for all x' ∈ X_U, as

  m_Q(x') = | E_{h∼Q}[[h(x') = 1]] - E_{h∼Q}[[h(x') = -1]] | = | 1 - 2 E_{h∼Q}[[h(x') ≠ y']] |.

By noticing that for all x' ∈ X_U the condition E_{h∼Q}[[h(x') ≠ y']] > 1/2 is equivalent to the statement E_{h∼Q} y' h(x') < 0, that is B_Q(x') ≠ y', we can rewrite m_Q without absolute values and hence get

  E_{h∼Q}[[h(x') ≠ y']] = (1/2)(1 + m_Q(x')) [[B_Q(x') ≠ y']] + (1/2)(1 - m_Q(x')) [[B_Q(x') = y']].   (10)

Finally, equation (8) follows by taking the mean over x' ∈ X_U and reorganizing the resulting expression using the notations b_i and γ_i. Recall that the examples x' for which m_Q(x') = 0 count for 0 in the margin-weighted sum, which is why only the strictly positive margin values appear in (8) (see equation (2) and the definition of m_Q).

Proof of Theorem 1  First, we notice that equation (5) follows from equation (6), the fact that M_Q^≤(0) = 0, and the inequality

  R_u(B_Q) = R_{u∧0}(B_Q) + P_u(B_Q(x') ≠ y' ∧ m_Q(x') = 0) ≤ R_{u∧0}(B_Q) + P_u(m_Q(x') = 0),

since P_u(m_Q(x') = 0) + P_u(0 < m_Q(x') < γ) = P_u(m_Q(x') < γ) for every γ > 0. For proving equation (6), we know from Lemma 3 that for a fixed θ ∈ [0,1] there exist (b_1, ..., b_N) such that 0 ≤ b_i ≤ P_u(m_Q(x') = γ_i) which satisfy equations (8) and (9). Let k = max{ i | γ_i ≤ θ }. Assuming that we can obtain an upper bound R_u^δ(G_Q) of R_u(G_Q) which holds with probability 1-δ over the random choices of Z_ℓ and X_U, the true values (b_1, ..., b_N) are, by (8) and (9), feasible for the following maximization problem with probability 1-δ, so that from the definition (4) of R_{u∧θ}(B_Q) we have, with probability 1-δ,

  R_{u∧θ}(B_Q) ≤ max_{b̃_1,..,b̃_N} Σ_{i=k+1}^N b̃_i   u.c.  ∀i, 0 ≤ b̃_i ≤ P_u(m_Q(x') = γ_i)  and  Σ_{i=1}^N b̃_i γ_i ≤ K_u(Q),   (11)

where K_u(Q) = R_u^δ(G_Q) - (1/2)(1 - E_u m_Q(x')). It turns out that the right-hand side of equation (11) is the solution of a linear program that can be solved analytically (Lemma 4 below) and which is attained for

  b̃_i = 0 if i ≤ k,  and  b̃_i = min( P_u(m_Q(x') = γ_i), (1/γ_i) ⌊ K_u(Q) + M_Q^≤(θ) - M_Q^<(γ_i) ⌋_+ ) if i > k.   (12)

Let I = max{ i > k | K_u(Q) + M_Q^≤(θ) - M_Q^<(γ_i) > 0 } (when this set is empty, the right-hand side of (11) is 0 and (6) trivially holds). From equations (11) and (12), with b̃_i = P_u(m_Q(x') = γ_i) for k < i < I, b̃_I ≤ (1/γ_I)⌊K_u(Q) + M_Q^≤(θ) - M_Q^<(γ_I)⌋_+ and b̃_i = 0 for i > I, we hence obtain a bound on R_{u∧θ}(B_Q):

  R_{u∧θ}(B_Q) ≤ P_u(θ < m_Q(x') < γ_I) + (1/γ_I) ⌊ K_u(Q) + M_Q^≤(θ) - M_Q^<(γ_I) ⌋_+.   (13)

The second claim of Theorem 1 is then just a rewriting of this result: from the definition of γ_I, for any γ > γ_I the right-hand side of equation (6) is equal to P_u(θ < m_Q(x') < γ), which is greater than the right-hand side of equation (13); moreover, for γ < γ_I, the quantity P_u(θ < m_Q(x') < γ) + (1/γ)⌊K_u(Q) + M_Q^≤(θ) - M_Q^<(γ)⌋_+ decreases as γ increases, so that the infimum in (6) is attained at γ = γ_I and coincides with the right-hand side of (13).

Lemma 4 (equation (12))  Let g_i, i = 1...N, be such that 0 < g_i < g_{i+1}, let p_i ≥ 0, i = 1...N, B ≥ 0 and k ∈ {1, ..., N}. Then the optimal value of the linear program

  max_{q_1,...,q_N} Σ_{i=k+1}^N q_i   u.c.  ∀i, 0 ≤ q_i ≤ p_i  and  Σ_{i=1}^N q_i g_i ≤ B   (14)

is attained for q* defined by: ∀i ≤ k, q*_i = 0, and ∀i > k, q*_i = min( p_i, (1/g_i) ⌊ B - Σ_{j=k+1}^{i-1} q*_j g_j ⌋_+ ).

Proof  Let F(q) = Σ_{i=k+1}^N q_i denote the objective of (14) and let O be the set of feasible solutions q such that q_i = 0 for all i ≤ k. Given an optimal solution q^opt, define q^{opt,O} by q_i^{opt,O} = q_i^opt for i > k and q_i^{opt,O} = 0 otherwise. Then q^{opt,O} ∈ O, it is clearly feasible, and F(q^{opt,O}) = F(q^opt). Therefore, there is an optimal solution in O. Now, for (q, q') ∈ R^N × R^N, define I(q, q') = { i | q_i > q'_i }, and consider the lexicographic order ⪰:

  ∀(q, q') ∈ R^N × R^N,  q ⪰ q'  ⟺  I(q', q) = ∅  or  ( I(q, q') ≠ ∅ and min I(q, q') < min I(q', q) ).

The crucial point is that q* is the greatest feasible solution in O for ⪰. Indeed, notice first that q* is necessarily feasible. Let M̃ = { i > k | q*_i < p_i }; we then have two possibilities to consider. (a) M̃ = ∅. In this case q* is simply the maximal element of O for ⪰. (b) M̃ ≠ ∅. In this case, let K = min M̃, and note that, by construction of q*, the budget constraint is saturated at K, that is Σ_{i=1}^N q*_i g_i = B and q*_i = 0 for i > K. We claim that there is no feasible q' ∈ O such that q' ≻ q*. By way of contradiction, suppose such a q' exists and let M = min I(q', q*), so that q'_i = q*_i for i < M and q'_M > q*_M. Then, if M ≤ k, we have q'_M > 0 and therefore q' is not in O; if k < M < K, we have q'_M > q*_M = p_M, which contradicts the box constraints; and finally, if M ≥ K, we have Σ_{i=1}^N q'_i g_i > Σ_{i=1}^N q*_i g_i = B, and q' is not feasible.

We now show that if q ∈ O is feasible and q* ≻ q, then q is not optimal (which is equivalent to showing that an optimal solution in O must be the greatest feasible solution for ⪰). Let q ∈ O be a feasible solution such that q* ≻ q. Since q ≠ q*, I(q*, q) is not empty. If I(q, q*) = ∅, we have F(q*) > F(q), and therefore q is not optimal. We now treat the case where I(q, q*) ≠ ∅. Let K = min I(q*, q) and M = min I(q, q*). We have q_M > 0 by definition, and K < M because q* ≻ q and q, q* ∈ O. Let ε = min( q_M, (g_K/g_M)(p_K - q_K) ) and define q' by: q'_i = q_i if i ∉ {K, M}, q'_K = q_K + ε g_M/g_K, and q'_M = q_M - ε. We can see that q' satisfies the box constraints by the definition of ε, and that Σ_i q'_i g_i = Σ_i q_i g_i + ε g_M - ε g_M ≤ B, so q' is feasible. Moreover F(q') = F(q) + ε (g_M/g_K - 1) > F(q) since g_K < g_M and ε > 0. Thus, q is not optimal. In summary, we have shown that there is an optimal solution in O, and that for a feasible solution in O to be optimal it must be the greatest feasible solution of O for the lexicographic order ⪰, which is q*.
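The analytic solution of Lemma 4 is a greedy water-filling: starting from the smallest admissible margin value, each coordinate takes as much mass as the box constraint and the remaining budget allow. A minimal sketch of this greedy step (our own naming, 0-based indices) is:

```python
import numpy as np

def solve_margin_lp(g, p, B, k):
    """Greedy optimum q* of the linear program (14):
    maximize sum_{i>k} q_i  s.t.  0 <= q_i <= p_i  and  sum_i q_i g_i <= B,
    with 0 < g_1 < ... < g_N.  With 0-based arrays, `k` is the number of leading
    coordinates forced to zero (the margin values gamma_i <= theta)."""
    g = np.asarray(g, dtype=float)
    p = np.asarray(p, dtype=float)
    q = np.zeros_like(p)
    budget = float(B)
    for i in range(k, len(g)):                   # fill the smallest margins first
        q[i] = min(p[i], max(budget, 0.0) / g[i])
        budget -= q[i] * g[i]
    return q
```

Called with g_i = γ_i, p_i = P_u(m_Q(x') = γ_i) and B = K_u(Q), the sum of the returned vector is the right-hand side of (11), which is how the closed form (13), and hence the transductive_bound routine given earlier, was obtained.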
Proof of Proposition 2  First, let us claim that

  R_u(B_Q) ≥ P_u(B_Q(x') ≠ y' ∧ m_Q(x') < γ*) + (1/γ*) ⌊ K̄_u(Q) - M_Q^<(γ*) ⌋_+,   (15)

where γ* = sup{ γ | P_u(B_Q(x') ≠ y' ∧ m_Q(x') = γ) ≠ 0 } and K̄_u(Q) = R_u(G_Q) + (1/2)(E_u m_Q(x') - 1) is the analogue of K_u(Q) computed with the true Gibbs risk instead of its upper bound. Indeed, assume for now that equation (15) is true. Then, by the assumption of the proposition applied at γ = γ* (the supremum is attained since the margin takes finitely many values on X_U), we have

  R_u(B_Q) ≥ C · P_u(m_Q(x') < γ*) + (1/γ*) ⌊ K̄_u(Q) - M_Q^<(γ*) ⌋_+.   (16)

Since F_u^δ(Q) ≤ P_u(m_Q(x') < γ*) + (1/γ*)⌊K_u(Q) - M_Q^<(γ*)⌋_+, with probability at least 1-δ we obtain

  F_u^δ(Q) - R_u(B_Q) ≤ (1 - C) P_u(m_Q(x') < γ*) + (1/γ*) ( R_u^δ(G_Q) - R_u(G_Q) ).   (17)

This is due to the fact that ⌊K_u(Q) - M_Q^<(γ*)⌋_+ - ⌊K̄_u(Q) - M_Q^<(γ*)⌋_+ ≤ K_u(Q) - K̄_u(Q) = R_u^δ(G_Q) - R_u(G_Q) when R_u^δ(G_Q) ≥ R_u(G_Q). Taking once again equation (16), we have P_u(m_Q(x') < γ*) ≤ (1/C) R_u(B_Q). Plugging this back into equation (17) yields Proposition 2.

Now, let us prove the claim (equation (15)). Since ∀x' ∈ X_U, m_Q(x') > 0, we have R_u(B_Q) = R_{u∧0}(B_Q). Using the notations of Lemma 3, denote by K the index such that γ_K = γ* and note that b_i = 0 for i > K by definition of γ*. Then it follows from equation (8) of Lemma 3 that R_u(G_Q) = Σ_{i=1}^K b_i γ_i + (1/2)(1 - E_u m_Q(x')). Solving for b_K in this equality yields b_K = ( K̄_u(Q) - Σ_{i=1}^{K-1} b_i γ_i ) / γ_K, and we therefore have b_K ≥ (1/γ_K) ⌊ K̄_u(Q) - M_Q^<(γ_K) ⌋_+ since b_K ≥ 0 and ∀i, b_i ≤ P_u(m_Q(x') = γ_i). Finally, from equation (9), we have R_u(B_Q) = Σ_{i=1}^K b_i, which implies equation (15) by using the lower bound on b_K and the fact that Σ_{i=1}^{K-1} b_i = P_u(B_Q(x') ≠ y' ∧ m_Q(x') < γ*).

In general, good PAC-Bayesian approximations of R_u(G_Q) are difficult to obtain in supervised learning [4], mostly because of the huge number of instances needed to accurately approximate the distribution of the absolute values of the margin. In this section we have shown that the transductive setting allows high precision in passing from the risk R_u(G_Q) of the Gibbs classifier to the risk R_u(B_Q) of the Bayes classifier, provided that the Bayes classifier makes its errors mostly on low margin regions.
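As a quick numerical illustration of this claim (a toy check of ours, not an experiment reported in the paper; it assumes the transductive_bound sketch given after Theorem 1 is in scope), the snippet below draws synthetic strictly positive margins, places the Bayes errors either on the lowest or on the highest margins, derives the corresponding Gibbs risk exactly from equation (8), and compares the true R_u(B_Q) with the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
u = 2000
margins = rng.uniform(0.05, 0.95, size=u)   # synthetic strictly positive margins m_Q(x')
order = np.argsort(margins)

for name, err_idx in [("errors on low margins", order[:200]),
                      ("errors on high margins", order[-200:])]:
    err = np.zeros(u, dtype=bool)
    err[err_idx] = True                      # B_Q(x') != y' exactly on these examples
    bayes_risk = err.mean()                  # true R_u(B_Q)
    # Lemma 3, eq. (8): R_u(G_Q) = E_u[m_Q * 1{Bayes error}] + (1 - E_u m_Q)/2
    gibbs_risk = np.mean(margins * err) + 0.5 * (1.0 - margins.mean())
    bound, gamma = transductive_bound(margins, gibbs_risk)  # exact Gibbs risk as its own bound
    print(f"{name}: R_u(B_Q) = {bayes_risk:.3f}, bound = {bound:.3f} (gamma = {gamma:.2f})")
```

With the errors concentrated on the low margins (C close to 1 in Proposition 2) the bound essentially matches the true Bayes risk, whereas placing the same number of errors on high margins makes it loose, as the discussion above predicts.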
3 Relationship with margin-based self-learning algorithms

In Proposition 2 we considered the hypothesis that the margin is an indicator of confidence as one of the sufficient conditions leading to a tight approximation of the risk of the Bayes classifier, R_u(B_Q). This assumption constitutes the working hypothesis of margin-based self-learning algorithms, in which a classifier is first built on the labeled training set. The output of the learner can then be used to assign pseudo-labels to the unlabeled examples having a margin above a fixed threshold (these pseudo-labeled examples form the set denoted Z̃_U in what follows), and the supervised method is repeatedly retrained upon the set of the initial labeled examples and the unlabeled examples classified in the previous steps. The idea behind this pseudo-labeling is that unlabeled examples having a margin above the threshold are less likely to receive erroneous labels or, equivalently, are those which have a small conditional Bayes error, defined as

  R_{u|θ}(B_Q) =def P_u( B_Q(x') ≠ y' | m_Q(x') > θ ) = R_{u∧θ}(B_Q) / P_u(m_Q(x') > θ).   (18)

In this case the assignment of labels to unlabeled examples upon a margin criterion has the effect of pushing the decision boundary away from the unlabeled data. This strategy follows the cluster assumption [10] used in the design of some semi-supervised learning algorithms, where the decision boundary is supposed to pass through a region of low pattern density. Though margin-based self-learning algorithms are inductive in essence, their learning phase is closely related to transductive learning, which predicts the labels of a given unlabeled set; indeed, in both cases the pseudo class-labels assigned to unlabeled examples are tied to their margins.

For all these algorithms the choice of the threshold is a crucial point: with a low threshold the risk of assigning false labels to examples is high, while a higher value of the threshold would not provide enough examples to enhance the current decision function. In order to examine the effect of fixing the threshold versus computing it automatically, we considered the margin-based self-training algorithm proposed by Tür et al. [11, Figure 6] (referred to as SLA in the following), in which unlabeled examples having a margin above a fixed threshold are iteratively added to the labeled set and are not considered again in the following rounds. In our approach, the best threshold minimizing the conditional Bayes error (18), with its numerator bounded using equation (6) of Theorem 1, is computed at each round of the algorithm (line 3, Figure 1 - SLA^θ), while the threshold is kept fixed in [11, Figure 6] (line 3 is outside of the repeat loop). The bound R_u^δ(G_Q) on the risk of the Gibbs classifier, which is involved in the computation of the threshold through equation (18), was fixed to its worst value, 0.5.

Figure 1: Self-learning algorithm (SLA^θ)

  Input: labeled and unlabeled training sets Z_ℓ, X_U
  Initialize:
    (1) Train a classifier H on Z_ℓ
    (2) Set Z̃_U ← ∅
  repeat
    (3) Compute the margin threshold θ minimizing the bound on (18) obtained from (6)
    (4) S ← { (x', y') | x' ∈ X_U; m_Q(x') ≥ θ ∧ y' = sgn(H(x')) }
    (5) Z̃_U ← Z̃_U ∪ S,  X_U ← X_U \ S
    (6) Learn a classifier H by optimizing a global loss function on Z_ℓ and Z̃_U
  until X_U is empty or there are no additions to Z̃_U
  Output: the final classifier H
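A schematic rendering of SLA^θ in code may help fix ideas. This is only a sketch under our own naming: fit and predict_margin stand for any voted classifier (the paper uses the boosting learner of section 4), the pseudo-labeled examples are simply appended to the labeled set rather than weighted as in the loss (19) below, and the threshold of line (3) reuses the transductive_bound routine given after Theorem 1 with R_u^δ(G_Q) set to its worst value 0.5.

```python
import numpy as np

def best_threshold(margins, gibbs_risk_bound=0.5):
    """Line (3) of Figure 1: choose theta minimizing the bound on the conditional
    Bayes error (18), i.e. [bound (6) at theta] / P_u(m_Q(x') > theta)."""
    best_theta, best_val = 0.0, np.inf
    for theta in np.unique(np.append(margins, 0.0)):  # candidate thresholds
        denom = np.mean(margins > theta)              # P_u(m_Q(x') > theta)
        if denom == 0:
            break
        joint_bound, _ = transductive_bound(margins, gibbs_risk_bound, theta)
        if joint_bound / denom < best_val:
            best_theta, best_val = theta, joint_bound / denom
    return best_theta

def sla_theta(fit, predict_margin, X_lab, y_lab, X_unlab):
    """Sketch of SLA^theta.  `fit` trains a voted classifier; `predict_margin`
    returns (sign predictions, unsigned margins m_Q) as NumPy arrays."""
    X_lab, y_lab, X_unlab = list(X_lab), list(y_lab), list(X_unlab)
    H = fit(X_lab, y_lab)                             # (1) train on the labeled set
    while X_unlab:                                    # until X_U is empty or no additions
        pred, marg = predict_margin(H, X_unlab)
        theta = best_threshold(marg)                  # (3) threshold from the bound
        keep = marg >= theta                          # (4) confident pseudo-labels
        if not keep.any():
            break
        for i in np.flatnonzero(keep):                # (5) move them to the training set
            X_lab.append(X_unlab[i])
            y_lab.append(int(pred[i]))
        X_unlab = [x for i, x in enumerate(X_unlab) if not keep[i]]
        H = fit(X_lab, y_lab)                         # (6) retrain
    return H
```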
4 Experiments and Results

In our experiments, we employed a boosting algorithm optimizing the following exponential loss² as the baseline learner (line (6), Figure 1):

  L_c(H, Z_ℓ, Z̃_U) = (1/ℓ) Σ_{(x,y)∈Z_ℓ} e^{-y H(x)} + (1/|Z̃_U|) Σ_{(x',y')∈Z̃_U} e^{-y' H(x')}.   (19)

Bennett et al. [1] have shown that the minimization of (19) allows to reach a local minimum of the margin loss function L_M(H, Z_ℓ, Z̃_U) = (1/ℓ) Σ_{(x,y)∈Z_ℓ} e^{-y H(x)} + (1/|Z̃_U|) Σ_{x'∈Z̃_U} e^{-|H(x')|}.

² Here H = Σ_t α_t h_t is a linear weighted sum of decision stumps h_t, each uniquely defined by an input feature j_t ∈ {1, ..., d} and a threshold θ_t as h_t(x) = 2[[φ_{j_t}(x) > θ_t]] - 1, with φ_j(x) the j-th feature characteristic of x.

Within this setting, the Gibbs classifier is defined as a random choice from the set of baseline classifiers {h_t}_{t=1}^T according to Q, such that ∀t, P_Q(h_t) = |α_t| / Σ_{t'=1}^T |α_{t'}|. Accordingly, the Bayes classifier is simply the weighted voting classifier B_Q = sgn(H). Although the self-learning model SLA^θ is an inductive algorithm, we carried out experiments in a transductive setting in order to compare results with the transductive SVM of Joachims [7] and the self-learning algorithm (SLA) described in [11, Figure 6]. For the latter, after training a classifier H on Z_ℓ (Figure 1, step 1), we fixed different margin thresholds by considering the lowest and the highest output values of H over the labeled training examples.

We evaluated the performance of the algorithms on 4 collections from the benchmark data sets³ used in [3], as well as on 2 data sets from the UCI repository [2]. In this case, we chose sets that are large enough for a reasonable labeled/unlabeled partitioning and that represent binary classification problems. Each experiment was repeated 20 times by partitioning, each time, the data set into two random labeled and unlabeled training sets.

³ http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html

For each data set, the means and standard deviations of the classification error on unlabeled training data over the 20 trials are shown in Table 1 for 2 different splits of the labeled and unlabeled sets. The symbol ↓ indicates that the performance is significantly worse than the best result, according to a Wilcoxon rank sum test used at a p-value threshold of 0.01 [8].

Table 1: Means and standard deviations of the classification error on unlabeled training data over the 20 trials for each data set. d denotes the dimension, ℓ and u refer respectively to the numbers of labeled and unlabeled examples in each data set.

  Dataset  d    ℓ+u   |  ℓ    SLA         SLA^θ       TSVM        |  ℓ    SLA         SLA^θ       TSVM
  COIL2    241  1500  |  10   .302±.042↓  .255±.019   .286±.031↓  |  100  .148±.015↓  .134±.011   .152±.016↓
  DIGIT    241  1500  |  10   .201±.038↓  .149±.012   .156±.014   |  100  .091±.010↓  .071±.005   .087±.009↓
  G241c    241  1500  |  10   .314±.037↓  .248±.018   .252±.021   |  100  .201±.017↓  .191±.014   .196±.022
  USPS     241  1500  |  10   .342±.024↓  .278±.022↓  .261±.019   |  100  .114±.012↓  .112±.012↓  .103±.011
  PIMA     8    768   |  10   .379±.026↓  .305±.021   .318±.018↓  |  50   .284±.019↓  .266±.018   .276±.021
  WDBC     30   569   |  10   .168±.016↓  .124±.011   .141±.016↓  |  50   .112±.011↓  .079±.007   .108±.010↓

In addition, we show in Figure 2 the evolution, on the COIL2, DIGIT and USPS data sets, of the classification error and of both risks of the Gibbs classifier (on the labeled and on the unlabeled training sets) over the rounds of the SLA^θ algorithm. These curves are obtained from one of the 20 trials that we ran for these collections. The most important conclusion from these empirical results is that, for all data sets, the self-learning algorithm becomes competitive when the margin threshold is found automatically rather than fixed manually. The augmented self-learning algorithm achieves performance statistically better than or equivalent to that of TSVM in most cases, while it outperforms the initial method over all runs.

Figure 2: Classification error, train and test Gibbs errors with respect to the iterations of the SLA^θ algorithm for a fixed number of labeled training data ℓ = 10.

In SLA^θ, the automatic choice of the margin threshold has the effect of selecting, in the first rounds of the algorithm, many unlabeled examples for which the class labels can be predicted with high confidence by the voted classifier.
The exponential fall of the classification error in Figure 2 can be explained by the addition of these highly informative pseudo-labeled examples in the first steps of the learning process (Figure 1). After this fall, few examples are added because the learning algorithm does not increase the margin on the unlabeled data. Hence, the number of additional pseudo-labeled examples decreases, resulting in a plateau in the classification error curves of Figure 2. We further notice that the error of the Gibbs classifier on labeled data increases quickly to a stationary error point, while its error on the unlabeled examples hardly varies over the iterations.

5 Conclusions

The contribution of this paper is twofold. First, we proposed a bound on the risk of the voted classifier using the margin distribution of the unlabeled examples and an estimation of the risk of the Gibbs classifier. We have shown that our bound is a good approximation of the true risk when the risk of the associated Gibbs classifier can accurately be estimated and when the voted classifier makes most of its errors on low margin examples. The proof of the bound passes through a second bound, on the joint probability that the voted classifier makes an error and that the margin is above a given threshold. This tool led us to the conditional probability that the voted classifier makes an error knowing that the margin is above a given threshold. We showed that the margin threshold minimizing this conditional probability can be found by analytically solving a linear program. This result led to our second contribution, which is to automatically set the margin threshold in a self-learning algorithm. Empirical results on a number of data sets have shown that the adaptive threshold improves the performance of the self-learning algorithm.

References

[1] Bennett, K., Demiriz, A. & Maclin, R. (2002) Exploiting unlabeled data in ensemble methods. In Proc. ACM Int. Conf. on Knowledge Discovery and Data Mining, 289-296.
[2] Blake, C., Keogh, E. & Merz, C.J. (1998) UCI repository of machine learning databases. University of California, Irvine. [on-line] http://www.ics.uci.edu/~mlearn/MLRepository.html
[3] Chapelle, O., Schölkopf, B. & Zien, A. (2006) Semi-supervised learning. Cambridge, MA: MIT Press.
[4] Derbeko, P., El-Yaniv, R. & Meir, R. (2004) Explicit learning curves for transduction and application to clustering and compression algorithms. Journal of Artificial Intelligence Research 22:117-142.
[5] Dietterich, T.G. (2000) Ensemble methods in machine learning. In First International Workshop on Multiple Classifier Systems, 1-15.
[6] Grandvalet, Y. & Bengio, Y. (2005) Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems 17, 529-536. Cambridge, MA: MIT Press.
[7] Joachims, T. (1999) Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, 200-209.
[8] Lehmann, E.L. (1975) Nonparametric Statistical Methods Based on Ranks. New York: McGraw-Hill.
[9] McAllester, D. (2003) Simplified PAC-Bayesian margin bounds. In Proc. of the 16th Annual Conference on Learning Theory, Lecture Notes in Artificial Intelligence, 203-215.
[10] Seeger, M. (2002) Learning with labeled and unlabeled data. Technical report, Institute for Adaptive and Neural Computation, University of Edinburgh.
[11] Tür, G., Hakkani-Tür, D.Z. & Schapire, R.E. (2005) Combining active and semi-supervised learning for spoken language understanding.
Speech Communication 45(2):171-186.
[12] Vittaut, J.-N., Amini, M.-R. & Gallinari, P. (2002) Learning classification with both labeled and unlabeled data. In European Conference on Machine Learning, 468-476.