The Asymptotics of Semi-Supervised Learning in Discriminative Probabilistic Models

Nataliya Sokolovska (sokolovska@telecom-paristech.fr)
Olivier Cappé (cappe@telecom-paristech.fr)
LTCI, TELECOM ParisTech and CNRS, 46 rue Barrault, 75013 Paris, France

François Yvon (yvon@limsi.fr)
Université Paris-Sud 11 and LIMSI-CNRS, 91403 Orsay, France

Abstract

Semi-supervised learning aims at taking advantage of unlabeled data to improve the efficiency of supervised learning procedures. For discriminative models, however, this is a challenging task. In this contribution, we introduce an original methodology for using unlabeled data through the design of a simple semi-supervised objective function. We prove that the corresponding semi-supervised estimator is asymptotically optimal. The practical consequences of this result are discussed for the case of the logistic regression model.

1. Introduction

In most real-world pattern classification problems (e.g., for text, image or audio data), unannotated data is plentiful and can be collected at almost no cost, whereas labeled data are comparatively rarer and more costly to gather. A sensible question is thus to find ways to exploit the unlabeled data in order to improve the performance of supervised training procedures. Many proposals have been made in recent years to devise effective semi-supervised training schemes (see (Chapelle et al., 2006) for an up-to-date panorama).
In this contribution, we focus on methods applicable to probabilistic classifiers, that is, classifiers designed to provide a probabilistic confidence measure associated with each decision. These classifiers do not necessarily perform better than other alternatives (particularly since probabilistic classification and minimum error classification are related, but different, tasks), but they are important in some applications, for instance when it comes to predicting the generalization error, dealing with uneven error costs, ranking, or combining decisions from multiple sources.

Probabilistic generative models fare easily with the use of unlabeled data, usually through Expectation-Maximization (see, e.g., (Nigam et al., 2000; Klein & Manning, 2004) for successful implementations of this idea). It is, however, an extensively documented fact that discriminative models perform better than generative models for classification tasks (Ng & Jordan, 2002). Integrating unlabeled data into discriminative models is a much more challenging issue. Put in probabilistic terms, when learning to predict an output $y$ from an observation $x$, a discriminative model attempts to fit $P(y|x;\theta)$, where $\theta$ denotes the parameter. The role to be played by any available prior knowledge about the marginal probability $P(x)$ in this context is not obvious. Several authors indeed claim that knowledge of $P(x)$ is basically useless (Seeger, 2002; Lasserre et al., 2006), although one of the contributions of this paper is to show that this intuition relies on the implicit assumption that the model is well-specified, in the sense of allowing a perfect fit of the conditional probability.

The most common approach is to make the unknown parameter vector depend on the unlabeled data, either directly or indirectly. One way to achieve this goal is to use the unlabeled data to enforce constraints on the shape of $P(y|x)$: the cluster assumption, for instance, stipulates that the decision boundary should be located in low-density regions (Seeger, 2002; Chapelle & Zien, 2005). (Grandvalet & Bengio, 2004) use this intuition to devise a semi-supervised training method, termed entropy regularization, which combines the usual log-likelihood term with an entropy-based penalty; see also (Jiao et al., 2006), who extend this methodology to Conditional Random Fields (Lafferty et al., 2001), or (Corduneanu & Jaakkola, 2003) for related ideas. This approach, like any attempt to distort the supervised training criterion with supplementary terms, faces two risks: (i) turning a well-behaved convex optimization problem into a non-convex one, fraught with local optima, thus making the results highly dependent on a proper initialization; (ii) losing the asymptotic consistency property of the usual (conditional maximum likelihood) estimator. As a result, these methods are not guaranteed to improve over a trivial baseline that would only use the available annotated data. They furthermore require a fine tuning of the various optimization parameters (Mann & McCallum, 2007).

The cluster assumption is also used in graph-based methods, which exploit the intuition that unlabeled data points should receive the same label as their labeled neighbors: in (Zhu & Ghahramani, 2002), a neighborhood graph is used to iteratively propagate labels from labeled to unlabeled data points until convergence.

(Lasserre et al., 2006) explore yet another avenue, introducing two sets of parameters: one for the conditional $P(y|x;\theta)$ and one for the marginal $P(x;\lambda)$. The case where $\theta$ and $\lambda$ are unrelated corresponds to the purely discriminative model, where unlabeled data are of no help; taking $\theta = \lambda$ recovers the traditional generative model; introducing dependencies between $(\theta, \lambda)$ via their Bayesian prior distribution makes it possible to build a full range of hybrid models. Finally, we also mention (Mann & McCallum, 2007), who additionally exploit prior knowledge on the distribution of the labels $Y$, which may be available in some specific applications.

In this paper, we challenge the view that unlabeled data cannot help purely discriminative models by exhibiting a semi-supervised estimator of the parameter which is asymptotically optimal and, in some situations, preferable to the usual maximum (conditional) likelihood estimator. To this aim, we make the simplifying assumption that the marginal $P(x)$ is fully known, which is true in the limit of infinitely many unlabeled data. An interesting observation about the proposed method is that it is most efficient when the Bayes error is very small, which correlates well with the intuition, underlying most semi-supervised approaches, that unlabeled data is most useful if one can assume that the classes are "well-separated". In addition to the asymptotic results, we also discuss a number of empirical findings pertaining to logistic regression.

This paper is organized as follows: in Section 2, we introduce our formal framework and formulate the main result of the paper (Theorem 1), which is first exposed in its full generality and then particularized to the case of logistic regression. Experiments with the logistic regression model are discussed in Section 3. Concluding remarks and perspectives close the paper.
2. Semi-Supervised Estimator

Let $g(y|x;\theta)$ denote the conditional probability density function (pdf) corresponding to a discriminative probabilistic model parameterized by $\theta \in \Theta$. In the following, we will always assume that the class variable $Y$ takes its values in a finite set $\mathcal{Y}$, with a special interest for the binary case where $\mathcal{Y} = \{0, 1\}$. We will further assume that the input (or explanatory) variable $X$ also takes its values in a finite set $\mathcal{X}$, which may be arbitrarily large. The training procedure has access to a set of $n$ i.i.d. labeled observations, $(X_i, Y_i)_{1 \leq i \leq n}$, as well as to a potentially unlimited number of unlabeled observations, where the quantity of unlabeled data is so large that we may consider the marginal probability of $X$ to be fully known. Finally, for a function $f: \mathbb{R}^p \to \mathbb{R}$, we denote by $\nabla_z f(z)$ the $p \times 1$ gradient vector and by $\nabla_{z^T} \nabla_z f(z)$ the $p \times p$ Hessian matrix in $z$. When $f: \mathbb{R}^p \to \mathbb{R}^r$, the notation $\nabla_{z^T} f(z)$ will be used to denote the $r \times p$ Jacobian matrix in $z$.

2.1. Preliminary: A Simple Case

We first consider the case where the "model" of interest is very basic and simply consists in estimating the complete joint probability of $X$ and $Y$, which is denoted by $\pi(x, y)$. We will also denote by $\eta(y|x)$ and $q(x)$, respectively, the conditional and the marginal probabilities associated with $\pi$. Although this case is not directly of interest for real-life statistical learning tasks, it highlights the role played by the knowledge of the marginal $q$ in semi-supervised learning.

It is well known that the maximum likelihood estimator of $\pi(x, y)$ defined by
$$\hat{\pi}_n(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i = x, Y_i = y\} \qquad (1)$$
is asymptotically efficient with asymptotic variance $\upsilon(x, y) = \pi(x, y)\,(1 - \pi(x, y))$ (assuming that $0 < \pi(x, y) < 1$).

Assume now that we are given $q(x)$, the marginal distribution of $X$, and that $0 < q(x) < 1$. It is easily checked that the maximum likelihood estimator of $\pi(x, y)$ subject to the marginal constraint that $\sum_{y \in \mathcal{Y}} \pi(x, y) = q(x)$ is given by
$$\hat{\pi}^s_n(x, y) = \frac{\sum_{i=1}^{n} \mathbf{1}\{X_i = x, Y_i = y\}}{\sum_{i=1}^{n} \mathbf{1}\{X_i = x\}}\; q(x) \qquad (2)$$
where the superscript $s$ stands for "semi-supervised" and the ratio is recognized as the maximum likelihood estimate of the conditional probability $\eta(y|x)$. As $\hat{\pi}^s_n(x, y)$ is a ratio of two simple estimators, its asymptotic variance can be computed using the $\Delta$-method, yielding
$$\upsilon^s(x, y) = \pi(x, y)\,\bigl(1 - \pi(x, y)/q(x)\bigr)$$
As $0 < \pi(x, y) \leq q(x) < 1$, $\upsilon^s(x, y)$ is less than $\upsilon(x, y)$. Hence, in general, the semi-supervised estimator $\hat{\pi}^s_n(x, y)$ and $\hat{\pi}_n(x, y)$ are not asymptotically equivalent, and $\hat{\pi}^s_n(x, y)$ is preferable. More precisely,
$$\upsilon^s(x, y)/\upsilon(x, y) = \bigl(1 - \pi(x, y)/q(x)\bigr)\big/\bigl(1 - \pi(x, y)\bigr)$$
which tends to zero as $\pi(x, y)$ gets closer to $q(x)$. In other words, the advantage of $\hat{\pi}^s_n(x, y)$ over $\hat{\pi}_n(x, y)$ is all the more pronounced when $y$ is a frequent label for $x$: in this case, knowledge of the marginal $q(x)$ makes it possible to obtain a precise estimate of $\pi(x, y) \approx q(x)$ even with a very limited number of observations of $x$.
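As a quick numerical check of the variance comparison above, the following Python sketch simulates both estimators on an arbitrary toy joint distribution (the values of $\pi$, and all variable names, are our own and are not taken from the paper) and compares the scaled squared errors with the asymptotic variances $\pi(1-\pi)$ and $\pi(1-\pi/q)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution pi(x, y) on a small discrete space (illustrative values).
pi = np.array([[0.28, 0.02],    # x = 0
               [0.05, 0.25],    # x = 1
               [0.10, 0.30]])   # x = 2
q = pi.sum(axis=1)              # known marginal q(x)

def estimators(n):
    """Return the unconstrained MLE of Eq. (1) and the constrained MLE of Eq. (2)."""
    flat = rng.choice(pi.size, size=n, p=pi.ravel())
    counts = np.bincount(flat, minlength=pi.size).reshape(pi.shape)
    pi_hat = counts / n                                   # Eq. (1)
    n_x = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        cond = np.where(n_x > 0, counts / n_x, 0.0)       # MLE of eta(y|x)
    pi_hat_s = cond * q[:, None]                          # Eq. (2)
    return pi_hat, pi_hat_s

n, reps = 200, 2000
errs = np.array([[(est - pi) ** 2 for est in estimators(n)] for _ in range(reps)])
print("empirical n*MSE, MLE (1):\n", n * errs[:, 0].mean(axis=0))
print("theory  pi(1-pi):\n", pi * (1 - pi))
print("empirical n*MSE, semi-supervised (2):\n", n * errs[:, 1].mean(axis=0))
print("theory  pi(1-pi/q):\n", pi * (1 - pi / q[:, None]))
```

For cells where $\pi(x, y)$ is close to $q(x)$, the scaled error of the constrained estimator collapses, in line with the ratio $\upsilon^s/\upsilon$ derived above.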
2.2. General Discriminative Model

We now consider the extension of the previous simple observation to the case of a general discriminative probabilistic model, the main difference being the fact that a given parametric model $\{g(y|x;\theta), \theta \in \Theta\}$ will generally not be able to fit exactly the actual conditional distribution $\eta(y|x)$ of the data. As in the fully specified case above, it is nonetheless possible to exhibit a semi-supervised estimator which is asymptotically optimal and preferable to the usual conditional maximum likelihood estimator defined by
$$\hat{\theta}_n = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i|X_i;\theta) \qquad (3)$$
where $\ell(y|x;\theta) = -\log g(y|x;\theta)$ denotes the negated conditional log-likelihood function. Under the (classical) assumptions of Theorem 1 below, $\frac{1}{n}\sum_{i=1}^{n} \ell(Y_i|X_i;\theta)$ tends, uniformly in $\theta$, to $\mathrm{E}_\pi[\ell(Y|X;\theta)]$, and thus the limiting value of $\hat{\theta}_n$ is given by
$$\theta^* = \arg\min_{\theta} \mathrm{E}_\pi[\ell(Y|X;\theta)] \qquad (4)$$

The maximum likelihood estimator in (3) may also be interpreted as $\hat{\theta}_n = \arg\min_{\theta} \mathrm{E}_{\hat{\pi}_n}[\ell(Y|X;\theta)]$, where
$$\hat{\pi}_n(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i = x, Y_i = y\}$$
denotes the empirical measure associated with the sample $(X_i, Y_i)_{1 \leq i \leq n}$, which also coincides with the maximum likelihood estimate of $\pi(x, y)$ defined in (1). If we now assume that the marginal $q(x)$ is available, we know that $\hat{\pi}_n(x, y)$ is dominated (asymptotically) by the estimator $\hat{\pi}^s_n(x, y)$ defined in (2), which we here particularize to
$$\hat{\pi}^s_n(x, y) = \begin{cases} \dfrac{\sum_{i=1}^{n} \mathbf{1}\{X_i = x, Y_i = y\}}{\sum_{i=1}^{n} \mathbf{1}\{X_i = x\}}\; q(x) & \text{if } \sum_{i=1}^{n} \mathbf{1}\{X_i = x\} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

By analogy with the construction used in the absence of information on $q$, we now define the corresponding semi-supervised estimator as $\hat{\theta}^s_n = \arg\min_{\theta} \mathrm{E}_{\hat{\pi}^s_n}[\ell(Y|X;\theta)]$, where the notation $\mathrm{E}_{\hat{\pi}^s_n}[f(X, Y)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \hat{\pi}^s_n(x, y)\, f(x, y)$ is used somewhat loosely here, as it may happen that, for finite $n$, $\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \hat{\pi}^s_n(x, y) < 1$, although $\hat{\pi}^s_n(x, y)$ sums to one with probability one for sufficiently large $n$. It is easily checked that $\hat{\theta}^s_n$ may also be rewritten as
$$\hat{\theta}^s_n = \arg\min_{\theta} \sum_{i=1}^{n} \frac{q(X_i)}{\sum_{j=1}^{n} \mathbf{1}\{X_j = X_i\}}\; \ell(Y_i|X_i;\theta) \qquad (6)$$
Eq. (6) is a weighted version of (3) in which all observations that share the same input $x$ receive a common weight, which reflects our prior knowledge of the marginal $q(x)$.
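Since (6) is simply a reweighted version of the usual convex criterion, it can be minimized with any standard optimizer. The following is a minimal sketch for binary logistic regression, not the authors' implementation: the function names, the optional L2 term, and the use of scipy's L-BFGS-B are our own choices, and `q_of_x` stands for whatever representation of the known marginal is available.

```python
import numpy as np
from scipy.optimize import minimize

def weighted_nll(theta, X, y, w):
    """Weighted negated conditional log-likelihood for binary logistic
    regression: sum_i w_i * ( log(1 + exp(theta^T x_i)) - y_i * theta^T x_i )."""
    z = X @ theta
    return np.sum(w * (np.logaddexp(0.0, z) - y * z))

def semi_supervised_weights(X, q_of_x):
    """Weights of Eq. (6): q(x_i) divided by the number of training points
    that share the same input x_i (inputs are compared as whole rows)."""
    keys = [tuple(row) for row in np.asarray(X)]
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return np.array([q_of_x(k) / counts[k] for k in keys])

def fit(X, y, w, l2=0.0):
    """Minimize the weighted criterion (plus an optional L2 penalty)."""
    X, y, w = np.asarray(X, float), np.asarray(y, float), np.asarray(w, float)
    obj = lambda th: weighted_nll(th, X, y, w) + l2 * th @ th
    return minimize(obj, np.zeros(X.shape[1]), method="L-BFGS-B").x

# Usage: theta_sup  = fit(X, y, np.full(len(y), 1.0 / len(y)))       # Eq. (3)
#        theta_semi = fit(X, y, semi_supervised_weights(X, q_of_x))  # Eq. (6)
```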
Theorem 1 Let the joint probability of $X$ and $Y$ factorize as $\pi(x, y) = \eta(y|x)\,q(x)$, where $q$ is known, and define the following matrices:
$$H(\theta^*) = \mathrm{E}_q\bigl(\mathrm{V}_\eta[\nabla_\theta \ell(Y|X;\theta^*)\,|\,X]\bigr) \qquad (7)$$
$$I(\theta^*) = \mathrm{E}_\pi\bigl(\nabla_\theta \ell(Y|X;\theta^*)\,\{\nabla_\theta \ell(Y|X;\theta^*)\}^T\bigr) \qquad (8)$$
$$J(\theta^*) = \mathrm{E}_\pi\bigl[\nabla_{\theta^T} \nabla_\theta \ell(Y|X;\theta^*)\bigr] \qquad (9)$$
Assume that (1) $\mathcal{X}$ and $\mathcal{Y}$ are finite sets; (2) $\pi(x, y) > 0$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$; (3) for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $\ell(y|x;\theta)$ is bounded on $\Theta$; (4) $\theta^*$ is the unique minimizer of $\mathrm{E}_\pi[\ell(Y|X;\theta)]$ on $\Theta$; (5) for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $\ell(y|x;\theta)$ is twice continuously differentiable on $\Theta$; (6) the matrices $H(\theta^*)$ and $J(\theta^*)$ are non-singular.
Then $\hat{\theta}_n$ and $\hat{\theta}^s_n$ are consistent and asymptotically normal estimators of $\theta^*$, which satisfy
$$\sqrt{n}\,(\hat{\theta}_n - \theta^*) \xrightarrow{\;L\;} \mathcal{N}\bigl(0,\; J^{-1}(\theta^*)\,I(\theta^*)\,J^{-1}(\theta^*)\bigr) \qquad (10)$$
$$\sqrt{n}\,(\hat{\theta}^s_n - \theta^*) \xrightarrow{\;L\;} \mathcal{N}\bigl(0,\; J^{-1}(\theta^*)\,H(\theta^*)\,J^{-1}(\theta^*)\bigr) \qquad (11)$$
Furthermore, $\hat{\theta}^s_n$ is asymptotically efficient.

Theorem 1 asserts that the asymptotic covariance matrix associated with $\hat{\theta}^s_n$ is optimal. Understanding the relations between $H(\theta^*)$ and $I(\theta^*)$ is thus important to assess the asymptotic performance achievable by any semi-supervised training method which assumes prior knowledge of $q(x)$. Indeed, the well-known Rao-Blackwell variance decomposition shows that
$$I(\theta^*) - H(\theta^*) = \mathrm{V}_q\bigl(\mathrm{E}_\eta[\nabla_\theta \ell(Y|X;\theta^*)\,|\,X]\bigr)$$
As a result, the difference between both estimators will mostly depend on whether $\mathrm{E}_\eta[\nabla_\theta \ell(Y|X;\theta^*)\,|\,X = x]$ varies significantly around $0$ as a function of $x$, given that, by definition, $\theta^*$ is such that $\mathrm{E}_q(\mathrm{E}_\eta[\nabla_\theta \ell(Y|X;\theta^*)\,|\,X]) = 0$. Note that in the particular case where the model is well-specified, in the sense that $\theta^*$ is such that $g(y|x;\theta^*) = \eta(y|x)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, not only is $\mathrm{E}_q(\mathrm{E}_\eta[\nabla_\theta \ell(Y|X;\theta^*)\,|\,X])$ null, but one indeed has the stronger result that $\mathrm{E}_\eta[\nabla_\theta \ell(Y|X;\theta^*)\,|\,X = x] = 0$ for all $x \in \mathcal{X}$. This is the only case for which $H(\theta^*) = I(\theta^*)$, and hence the only case where both estimators are asymptotically equivalent; it is also well known that in this case $J(\theta^*) = I(\theta^*)$, so that all asymptotic covariance matrices coincide with the usual expression of the inverse of the Fisher information matrix for $\theta$. Theorem 1 thus gives formal support to the intuition that it is impossible to improve over the classic maximum likelihood estimator for large $n$ when the model is well-specified, even when the marginal $q$ is known.

The results of Theorem 1 are stated in terms of parameter estimation, which is usually not the primary interest for statistical learning tasks. Due to the non-differentiability of the 0-1 loss, it is not directly possible to derive results pertaining to the error probability from Theorem 1. One may, however, state the following result in terms of the logarithmic risk, in which the negated log-likelihood $\ell(y|x;\theta)$ is interpreted as a loss function.

Corollary 2 In addition to the assumptions of Theorem 1, assume that $\ell(y|x;\theta)$ has bounded second derivative on $\Theta$. Then the logarithmic risk admits the following asymptotic equivalent:
$$\mathrm{E}_n\bigl\{\mathrm{E}_\pi[\ell(Y|X;\hat{\theta}_n)]\bigr\} = \mathrm{E}_\pi[\ell(Y|X;\theta^*)] + \frac{1}{2n}\,\mathrm{trace}\bigl(I(\theta^*)\,J^{-1}(\theta^*)\bigr) + o\!\left(\tfrac{1}{n}\right)$$
where $\mathrm{E}_n$ denotes the expectation with respect to the training data $(X_i, Y_i)_{1 \leq i \leq n}$; for the semi-supervised estimator $\hat{\theta}^s_n$, the first-order term is given by $\frac{1}{2n}\,\mathrm{trace}\bigl(H(\theta^*)\,J^{-1}(\theta^*)\bigr)$.

As a final comment on Theorem 1, note that the form of the semi-supervised estimator in (6) shows that $\hat{\theta}^s_n$ will be consistent also in the presence of covariate shift (i.e., when the marginal distribution of the training sample differs from $q$), whereas the logistic regression estimates can only be consistent in this case if we assume that the model is well-specified (Shimodaira, 2000). In the presence of covariate shift, however, the expressions of the asymptotic covariance matrices will be different.

2.3. Application to Logistic Regression

To gain further insight into the results summarized in Theorem 1, we consider the example of the logistic regression model with binary labels $Y$ and input variables $X$ in $\mathbb{R}^p$; the parameter $\theta$ is thus $p$-dimensional. In this model, the negated log-likelihood function is given by $\ell(y|x;\theta) = -y\,\theta^T x + \log(1 + e^{\theta^T x})$, or $\log(1 + e^{-y\,\theta^T x})$ when the labels are coded as $\{-1, 1\}$ rather than $\{0, 1\}$. Thus, the estimation equation which implicitly defines the optimal fit $\theta^*$ as the value for which $\mathrm{E}_\pi[\nabla_\theta \ell(Y|X;\theta^*)] = 0$ may be rewritten as
$$\mathrm{E}_q\bigl[X\,\bigl(g(1|X;\theta^*) - \eta(1|X)\bigr)\bigr] = 0 \qquad (12)$$
Similar direct calculations yield
$$H(\theta^*) = \mathrm{E}_q\bigl[\eta(1|X)\,(1 - \eta(1|X))\,XX^T\bigr] \qquad (13)$$
$$I(\theta^*) = \mathrm{E}_q\bigl[\bigl(\eta(1|X)\,(1 - \eta(1|X)) + (\eta(1|X) - g(1|X;\theta^*))^2\bigr)\,XX^T\bigr] \qquad (14)$$
$$J(\theta^*) = \mathrm{E}_q\bigl[g(1|X;\theta^*)\,\{1 - g(1|X;\theta^*)\}\,XX^T\bigr] \qquad (15)$$
$J(\theta^*)$ is the Fisher information matrix traditionally found in logistic regression. Interestingly, $H(\theta^*)$ is recognized as the Fisher information matrix for $\theta$ corresponding to the fully supervised logistic regression model in the well-specified case (i.e., assuming that $g(y|x;\theta^*) = \eta(y|x)$), although we made no such assumption here.
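To make the quantities in (12)-(15) concrete, the sketch below (our own illustration; the finite input set, the marginal $q$ and the conditional $\eta(1|x)$ are arbitrary values, not the paper's) first locates $\theta^*$ by minimizing the population risk over a small finite $\mathcal{X}$, then evaluates $H$, $I$, $J$ and the two sandwich covariances of Theorem 1 by direct summation.

```python
import numpy as np
from scipy.optimize import minimize

# Arbitrary finite problem: 6 input vectors in R^3 (first column is an intercept),
# a known marginal q(x) and a true conditional eta(1|x). Illustrative values only.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 2, 0], [1, 0, 2]], float)
q = np.array([0.25, 0.20, 0.15, 0.15, 0.15, 0.10])
eta1 = np.array([0.05, 0.30, 0.65, 0.90, 0.95, 0.10])   # deliberately not logistic in x

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def population_risk(theta):
    """E_pi[l(Y|X;theta)] = sum_x q(x) * ( log(1+exp(theta^T x)) - eta(1|x) theta^T x )."""
    z = X @ theta
    return np.sum(q * (np.logaddexp(0.0, z) - eta1 * z))

theta_star = minimize(population_risk, np.zeros(X.shape[1]), method="BFGS").x
g1 = sigmoid(X @ theta_star)                              # g(1|x; theta*)

def weighted_outer(w):
    return (X * (w * q)[:, None]).T @ X                   # sum_x q(x) w(x) x x^T

H = weighted_outer(eta1 * (1 - eta1))                     # Eq. (13)
I = weighted_outer(eta1 * (1 - eta1) + (eta1 - g1) ** 2)  # Eq. (14)
J = weighted_outer(g1 * (1 - g1))                         # Eq. (15)

Jinv = np.linalg.inv(J)
print("estimating equation (12) residual:", X.T @ (q * (g1 - eta1)))   # ~ 0
print("diag J^-1 I J^-1 (MLE, Eq. 10):          ", np.diag(Jinv @ I @ Jinv))
print("diag J^-1 H J^-1 (semi-supervised, Eq. 11):", np.diag(Jinv @ H @ Jinv))
print("trace(I J^-1) vs trace(H J^-1) (Corollary 2 rates):",
      np.trace(I @ Jinv), np.trace(H @ Jinv))
```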
For logistic regression, the difference
$$I(\theta^*) - H(\theta^*) = \mathrm{E}_q\bigl[\{\eta(1|X) - g(1|X;\theta^*)\}^2\,XX^T\bigr]$$
is clearly a term that becomes all the more significant as the fit achievable by the model gets poorer. The second important factor that can lead to substantial differences between the asymptotic performances of $\hat{\theta}_n$ and $\hat{\theta}^s_n$ is revealed by the following observation: for a given distribution $\pi$, the largest (in a matrix sense) achievable value for $I(\theta^*)$ is given by
$$\mathrm{E}_q\bigl[\max\{\eta(1|X),\, 1 - \eta(1|X)\}\,XX^T\bigr]$$
whereas $H(\theta^*)$ in (13) may be rewritten as
$$H(\theta^*) = \mathrm{E}_q\bigl[\max\{\eta(1|X),\, 1 - \eta(1|X)\}\,\min\{\eta(1|X),\, 1 - \eta(1|X)\}\,XX^T\bigr]$$
Hence the difference between $I(\theta^*)$ and $H(\theta^*)$ can only become very significant in cases where $\min\{\eta(1|X = x),\, 1 - \eta(1|X = x)\}$ is small, that is, when the probability of incorrect decision is small for some values of $x$. The overall effect will be all the more significant when this situation happens for many values of $x$, in other words, when the Bayes error associated with $\pi$ is small.

3. Experiments

3.1. A Small Scale Experiment

We consider here experiments on artificial data which correspond to the case of binary logistic regression discussed in Section 2.3. We focus on a small-scale problem where it is possible to compute error probabilities and risks exactly, so as to completely bypass the empirical evaluation of trained classifiers. This setting makes it possible to obtain an accurate assessment of the performance, as the only source of Monte Carlo error lies in the choice of the training corpus. More precisely, we consider the case where each observation consists of a vector of $p = 10$ positive counts which sum to $k = 3$. Hence the logistic regression parameter is ten-dimensional and the set $\mathcal{X}$ of possible count vectors contains exactly $\binom{p+k-1}{k} = \frac{(p+k-1)!}{(p-1)!\,k!} = 220$ different vectors. In this case, it is well known that one can simulate data from well-specified logistic models by resorting to mixtures of multinomial distributions. Denote by $\rho_1$ the prior probability of class 1, and by $\beta_0$ and $\beta_1$ the vectors of multinomial parameters. Count vectors $X$ generated from the mixture of multinomials have marginal probabilities
$$q(x) = \rho_1\,\mathrm{mult}(x;\beta_1) + (1 - \rho_1)\,\mathrm{mult}(x;\beta_0)$$
and conditional probabilities
$$\mathrm{P}(Y = 1|X = x) = \Bigl\{1 + \exp\bigl(-\bigl[(\log \beta_1 - \log \beta_0)^T x + \log \tfrac{\rho_1}{1 - \rho_1}\bigr]\bigr)\Bigr\}^{-1}$$
where the log is to be understood componentwise. In the following, we take $\rho_1 = 0.5$, i.e., balanced classes, so as to avoid the bias term. In order to generate misspecified scenarios, we simply flipped the labels of a few $x$'s (three in the case shown in Figures 1 and 2) taken among the most likely ones.
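The mixture construction above is straightforward to simulate. The sketch below is a minimal illustration only: the multinomial parameters $\beta_0$ and $\beta_1$ used in the paper are not reported, so arbitrary Dirichlet draws stand in for them. It enumerates the 220-element support, returns the exact $q(x)$ and $\eta(1|x)$, and provides a sampler for labeled pairs.

```python
import itertools
import numpy as np
from scipy.stats import multinomial

p, k, rho1 = 10, 3, 0.5
rng = np.random.default_rng(0)
beta0 = rng.dirichlet(np.ones(p))        # class-0 multinomial parameters (arbitrary)
beta1 = rng.dirichlet(np.ones(p))        # class-1 multinomial parameters (arbitrary)

# Enumerate the 220 count vectors with p = 10 nonnegative entries summing to k = 3.
support = np.array([x for x in itertools.product(range(k + 1), repeat=p) if sum(x) == k])
assert len(support) == 220

mult0 = multinomial(k, beta0).pmf(support)
mult1 = multinomial(k, beta1).pmf(support)
q = rho1 * mult1 + (1 - rho1) * mult0                 # exact marginal q(x)
eta1 = rho1 * mult1 / q                               # exact eta(1|x), logistic in x

def sample(n):
    """Draw n labeled observations (count vector, label) from the mixture."""
    y = rng.binomial(1, rho1, size=n)
    counts = np.where(y[:, None] == 1,
                      rng.multinomial(k, beta1, size=n),
                      rng.multinomial(k, beta0, size=n))
    return counts, y

X_train, y_train = sample(100)
```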
This label-flipping transformation leaves the Bayes error unchanged with respect to that of the underlying unperturbed logistic model, but the performance achievable by logistic regression is of course reduced.

Figure 1. Boxplots of the scaled squared parameter estimation error as a function of the number of observations. Left: for logistic regression, $n\,\|\hat{\theta}_n - \theta^*\|^2$; right: for the semi-supervised estimator, $n\,\|\hat{\theta}^s_n - \theta^*\|^2$.

Figure 2. Boxplots of the scaled excess logarithmic risk as a function of the number of observations. Left: for logistic regression, $n\,(\mathrm{E}_n\{\mathrm{E}_\pi[\ell(Y|X;\hat{\theta}_n)]\} - \mathrm{E}_\pi[\ell(Y|X;\theta^*)])$; right: for the semi-supervised estimator, $n\,(\mathrm{E}_n\{\mathrm{E}_\pi[\ell(Y|X;\hat{\theta}^s_n)]\} - \mathrm{E}_\pi[\ell(Y|X;\theta^*)])$.

Figures 1 and 2 correspond to a case where the underlying unperturbed logistic model has a Bayes error of 1.7% and the probability of error associated with the best-fitting logistic model is 9.4%. Remember that in these figures, the only source of randomness is the choice of the training sample, which is drawn 1000 times independently for each size of the training sample, from $n = 10$ to $n = 5000$ observations. As logistic regression is very sensitive to the use of regularization for small sample sizes (here, when $n$ is less than one thousand), both (3) and (6) were regularized by adding an L2 penalty term of the form $\lambda_n \|\theta\|^2$, where $\lambda_n$ has been calibrated independently for each value of $n$. This being said, the optimal regularization parameter was always found to be within a factor of 2 of $\lambda_n = 1/n$ for (3) and $\lambda_n = \frac{1}{n} \sum_{x:\, \sum_{i=1}^{n} \mathbf{1}\{X_i = x\} > 0} q(x)$ for (6). The effect of regularization is also negligible for the two rightmost boxplots in each graph (i.e., when $n$ is greater than 1000).

In Figures 1 and 2, the superimposed horizontal dashed lines correspond to the theoretical averages computed from Theorem 1 and Corollary 2, respectively. When $n$ is larger than one thousand, Figures 1 and 2 correlate perfectly with the theory, which predicts some advantage for the semi-supervised estimator since we are considering a case where the Bayes error is small and the model misspecification is significant. For large values of $n$, the semi-supervised estimator not only achieves better average performance but also does so more consistently, with a reduced variability. For smaller values of $n$, the picture is more contrasted, particularly when $n$ ranges from 50 to 100, where the semi-supervised estimator may perform comparatively worse than logistic regression. In this example, in terms of the probability of error, the semi-supervised estimator performs marginally better than logistic regression when $n = 10$ and $n = 5000$ (although the difference is bound to be very small in the latter case) and somewhat worse in between.

As expected, the difference between both approaches for large values of $n$ decreases for scenarios with larger error probabilities. In those scenarios, the semi-supervised estimator performs worse than logistic regression for smaller values of $n$ and equivalently for large values of $n$.

Figure 3. Boxplots of the probability of error as a function of the number of observations for a well-specified model. Left: for logistic regression; right: for the semi-supervised estimator.

A finding of interest is the fact that for well-specified models (i.e., with data generated from a multinomial mixture model) with low Bayes error, the semi-supervised approach does perform better than logistic regression for small values of $n$. This effect can be significant even when considering the probability of error of the trained classifiers, as exemplified in Figure 3 in a case where the Bayes error is 6.3%. This observation is promising and deserves further investigation, as the analysis of Section 2 only explains the behavior observed for large values of $n$, which in the case of well-specified models results in the two approaches being equivalent.
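Because $\mathcal{X}$ contains only 220 vectors and $q$ and $\eta$ are known exactly, the probability of error and the logarithmic risk of any fitted $\theta$ can be computed by direct enumeration rather than estimated on held-out data, which is one way the exact evaluation described above can be carried out. A minimal sketch (reusing the `support`, `q` and `eta1` arrays from the previous snippet, which are our own constructions):

```python
import numpy as np

def exact_metrics(theta, support, q, eta1):
    """Exact error probability and logarithmic risk of the logistic model
    g(1|x; theta) = sigmoid(theta^T x), obtained by summing over all of X."""
    z = support @ theta
    g1 = 1.0 / (1.0 + np.exp(-z))
    # Probability of error: predict 1 when g(1|x) >= 1/2, and pay the mass of
    # the other label under the true conditional eta(1|x).
    p_err = np.sum(q * np.where(g1 >= 0.5, 1.0 - eta1, eta1))
    # Logarithmic risk E_pi[l(Y|X;theta)], with l(y|x;theta) = log(1+e^z) - y z.
    log_risk = np.sum(q * (np.logaddexp(0.0, z) - eta1 * z))
    # Bayes error of the true conditional, for reference.
    bayes_error = np.sum(q * np.minimum(eta1, 1.0 - eta1))
    return p_err, log_risk, bayes_error
```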
3.2. Text Classification Experiment

To evaluate our methodology on a more realistic test bed, we have used a simple binary classification task consisting in classifying mails as spam or ham based on their textual content. The corpus used is the SpamAssassin corpus (Mason, 2002), which contains approximately 6,000 documents. Adapting our technique to real-world data requires providing an estimate of the marginal $q(x)$. This was carried out by performing a discrete quantization of the data vectors as follows. We first use unsupervised clustering techniques to partition the available unlabeled collection of documents into $k$ clusters. More specifically, we used a mixture of multinomials model, as in (Nigam et al., 2000), with $k = 10$ components. We then simply adapt (6) by replacing $q(X_i)$ by the empirical frequency of the cluster to which $X_i$ belongs; likewise, the denominator $\sum_{j=1}^{n} \mathbf{1}\{X_j = X_i\}$ is replaced by the number of training documents belonging to the same cluster as $X_i$. We believe that this methodology is very general and makes the proposed approach applicable to a large variety of data. In effect, observations belonging to clusters which are underrepresented in the training corpus receive higher relative weights, while the converse is true for observations belonging to overrepresented clusters. Note that, at this stage, no attempt has been made at tuning the number $k$ of clusters, although intuition suggests that it would probably be reasonable to increase $k$ (slowly) with $n$.
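The cluster-based adaptation of (6) described above amounts to replacing the per-input weights by per-cluster weights. A minimal sketch, assuming cluster assignments are available from whatever clustering model was fitted on the large unlabeled collection (all names below are our own):

```python
import numpy as np

def cluster_weights(train_clusters, unlabeled_clusters, n_clusters):
    """Per-observation weights for the cluster-based variant of Eq. (6):
    q(cluster) is estimated by its frequency in the large unlabeled collection,
    and the denominator is the number of training documents in that cluster."""
    train_clusters = np.asarray(train_clusters)
    unlabeled_clusters = np.asarray(unlabeled_clusters)
    q_hat = np.bincount(unlabeled_clusters, minlength=n_clusters) / len(unlabeled_clusters)
    n_in_cluster = np.bincount(train_clusters, minlength=n_clusters)
    return q_hat[train_clusters] / np.maximum(n_in_cluster[train_clusters], 1)

# These weights can be passed to the weighted logistic regression objective
# sketched after Eq. (6); supervised training corresponds to uniform weights 1/n.
```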
We tested the method with $n = 50$ and $n = 300$ randomly chosen training documents, the remaining mails serving as the test set; each setting gave rise to 50 Monte Carlo replications. For each value of $n$, the best regularization parameter was determined experimentally, both for the usual logistic regression and for the semi-supervised estimator. Each document is here represented as a count vector of dimension 1500. The resulting error rates are plotted as boxplots in Figure 4. Although the difference between both methods is certainly not very significant in this preliminary experiment, we note that, as in the simple case of Section 3.1, the semi-supervised estimator provides a less variable performance when $n$ is small.

Figure 4. Boxplots of the error rates. L50: logistic regression with n = 50; S50: semi-supervised estimator with n = 50; L300 and S300: idem with n = 300.

4. Conclusion

In this contribution, we have tried to address the problem of semi-supervised learning without using any prior idea on what type of information is to be provided by the unlabeled data. The result of Theorem 1 provides both proper theoretical support for the claim that the unlabeled data does not matter asymptotically when the model is well-specified and a better understanding of the cases where the unlabeled data does matter. In particular, it confirms the intuition that unlabeled data is most useful when the Bayes error is small. One advantage of the proposed method is that it does not compromise the simplicity of the maximum likelihood approach, because the weighted semi-supervised criterion stays convex. In addition, one could easily incorporate prior knowledge as used in other semi-supervised approaches: for instance, the "cluster assumption" can be implemented by modifying (5) so as to incorporate a Bayesian prior that connects conditional probabilities for neighboring values of the input vector. In Section 3.2, we suggested a means by which the method can be extended to larger-scale problems, including applications in which the feature vector is either continuous or has a more complex structure. We are in particular currently investigating the extension of the proposed approach to the case of sequence labelling with conditional random fields. Another open issue is the theoretical analysis of the behavior of the proposed criterion when $n$ is small, which cannot be deduced from the asymptotic analysis presented here.

Appendix: Sketch of Proofs

First note that (10) is the well-known result that pertains to the behavior of the maximum likelihood estimator in misspecified models; see, for instance, (White, 1982) or Lemma 1 of (Shimodaira, 2000).

Now, the fact that $\hat{\theta}^s_n = \arg\min_\theta \mathrm{E}_{\hat{\pi}^s_n}[\ell(Y|X;\theta)]$ implicitly defines the semi-supervised estimator $\hat{\theta}^s_n$ as a function of the maximum likelihood estimator of the conditional probabilities
$$\hat{\eta}_n(y|x) = \frac{\sum_{i=1}^{n} \mathbf{1}\{X_i = x, Y_i = y\}}{\sum_{i=1}^{n} \mathbf{1}\{X_i = x\}}$$
In our setting, the conditional probability may be represented by a finite-dimensional vector $\eta$, block-defined by $\eta = (\eta(x_1), \ldots, \eta(x_d))^T$, where $\eta(x_i) = (\eta(y_1|x_i), \ldots, \eta(y_k|x_i))^T$, $\{x_1, \ldots, x_d\}$ denote the elements of $\mathcal{X}$, and $\{y_0, \ldots, y_k\}$ denote the elements of $\mathcal{Y}$. As usual in polytomous regression models, we omit one of the possible values of $Y$ (by convention, $y_0$) due to the constraint that $\sum_{y \in \mathcal{Y}} \eta(y|x) = 1$ for all $x \in \mathcal{X}$. The estimator $\hat{\eta}_n$ is defined similarly, with $\hat{\eta}_n(y|x)$ substituted for $\eta(y|x)$. $\hat{\eta}_n$ is the maximum likelihood estimator of $\eta$, and it is asymptotically efficient with asymptotic covariance matrix given by $K^{-1}(\eta)$, the inverse of the Fisher information matrix for $\eta$, block-defined by
$$K^{-1}(\eta) = \mathrm{diag}\bigl(K^{-1}(x_1;\eta), \ldots, K^{-1}(x_d;\eta)\bigr)$$
where
$$K^{-1}(x_i;\eta) = q(x_i)^{-1}\,\bigl(\mathrm{diag}(\eta(x_i)) - \eta(x_i)\,\eta^T(x_i)\bigr)$$

To obtain the asymptotic behavior of the semi-supervised estimator $\hat{\theta}^s_n$, remark that $\hat{\theta}^s_n$ is obtained as a function of $\hat{\eta}_n$, where $\theta(\eta)$ is implicitly defined by the optimality equation $s(\theta(\eta), \eta) = 0$, with $s$ the (negative of the) score function defined by
$$s(\theta, \eta) = \nabla_\theta \mathrm{E}_\pi[\ell(Y|X;\theta)] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} q(x)\,\eta(y|x)\,\nabla_\theta \ell(y|x;\theta) \qquad (16)$$
Because $\theta^* = \theta(\eta)$ and $\hat{\theta}^s_n = \theta(\hat{\eta}_n)$, $\hat{\theta}^s_n$ is an asymptotically efficient estimator of $\theta^*$ with asymptotic covariance matrix given by $\nabla_{\eta^T}\theta(\eta)\; K^{-1}(\eta)\; \{\nabla_{\eta^T}\theta(\eta)\}^T$. The Jacobian matrix $\nabla_{\eta^T}\theta(\eta)$ may be evaluated thanks to the implicit function theorem as
$$\nabla_{\eta^T}\theta(\eta) = -\{\nabla_{\theta^T} s(\theta, \eta)\}^{-1}\; \nabla_{\eta^T} s(\theta, \eta)$$
From the definition of the score function in (16), it is obvious that $\nabla_{\theta^T} s(\theta^*, \eta) = J(\theta^*)$. In order to calculate $\nabla_{\eta^T} s(\theta, \eta)$, we differentiate the rightmost expression in (16), using the fact that $\eta(y_0|x) = 1 - \sum_{y \neq y_0} \eta(y|x)$, to obtain
$$\frac{\partial s(\theta, \eta)}{\partial \eta(y|x)} = q(x)\,\bigl[\nabla_\theta \ell(y|x;\theta) - \nabla_\theta \ell(y_0|x;\theta)\bigr]$$
The expression given in Theorem 1 for the asymptotic variance of $\hat{\theta}^s_n$ follows by computing the product $\nabla_{\eta^T} s(\theta^*, \eta)\; K^{-1}(\eta)\; \{\nabla_{\eta^T} s(\theta^*, \eta)\}^T$ (which factorizes into blocks of size $k$) and using the fact that $\eta(y_0|x) = 1 - \sum_{y \neq y_0} \eta(y|x)$.

Corollary 2 is based on the classical asymptotic expansion of $\mathrm{E}_\pi[\ell(Y|X;\hat{\theta}_n)] - \mathrm{E}_\pi[\ell(Y|X;\theta^*)]$ as $\frac{1}{2}(\hat{\theta}_n - \theta^*)^T J(\theta^*)(\hat{\theta}_n - \theta^*) + o_P(1/n)$; see, for instance, (Bach, 2006).
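As a numerical sanity check on the derivation above, one can verify, for a small binary example, that the middle factor $\nabla_{\eta^T}s\; K^{-1}\; \{\nabla_{\eta^T}s\}^T$ indeed reduces to $H(\theta^*)$, so that the delta-method covariance coincides with $J^{-1}(\theta^*)\,H(\theta^*)\,J^{-1}(\theta^*)$ in (11). The sketch below (our own construction, with the same arbitrary toy values as in the earlier snippets) does this for binary $\mathcal{Y}$ with $y_0 = 0$.

```python
import numpy as np

# Arbitrary small binary example: finite support, known q(x) and eta(1|x).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 2, 0], [1, 0, 2]], float)
q = np.array([0.25, 0.20, 0.15, 0.15, 0.15, 0.10])
eta1 = np.array([0.05, 0.30, 0.65, 0.90, 0.95, 0.10])

# For binary Y with y0 = 0, the derivative of the score w.r.t. eta(1|x) is
# q(x) * [grad l(1|x;theta) - grad l(0|x;theta)] = -q(x) * x, independent of theta.
D = -(q[:, None] * X).T                       # p x d matrix, one column per x

# Block-diagonal K^{-1}: one scalar block eta(1|x)(1 - eta(1|x)) / q(x) per x.
K_inv = np.diag(eta1 * (1 - eta1) / q)

middle = D @ K_inv @ D.T                      # = nabla_eta s  K^{-1}  (nabla_eta s)^T
H = (X * (q * eta1 * (1 - eta1))[:, None]).T @ X   # Eq. (13), computed directly
print(np.allclose(middle, H))                 # True: the two expressions agree
```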
References

Bach, F. (2006). Active learning for misspecified generalized linear models. NIPS.

Chapelle, O., Schölkopf, B., & Zien, A. (2006). Semi-supervised learning. MIT Press.

Chapelle, O., & Zien, A. (2005). Semi-supervised classification by low density separation. In Proc. of the Tenth International Workshop on Artificial Intelligence and Statistics.

Corduneanu, A., & Jaakkola, T. (2003). On information regularization. In Proc. of the 19th Conference on Uncertainty in Artificial Intelligence (UAI).

Grandvalet, Y., & Bengio, Y. (2004). Semi-supervised learning by entropy minimization. NIPS.

Jiao, F., Wang, S., Lee, C. H., Greiner, R., & Schuurmans, D. (2006). Semi-supervised conditional random fields for improved sequence segmentation and labeling. ACL/COLING.

Klein, D., & Manning, C. D. (2004). Corpus-based induction of syntactic structure: models of dependency and constituency. ACL.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML.

Lasserre, J. A., Bishop, C. M., & Minka, T. P. (2006). Principled hybrids of generative and discriminative models. IEEE CVPR.

Mann, G., & McCallum, A. (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. ICML.

Mason, J. (2002). SpamAssassin corpus.

Ng, A., & Jordan, M. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. NIPS.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103-134.

Seeger, M. (2002). Learning with labeled and unlabeled data (Technical Report). University of Edinburgh, Institute for Adaptive and Neural Computation.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227-244.

White, H. (1982). Maximum likelihood estimation in misspecified models. Econometrica, 50, 1-25.

Zhu, X., & Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation (Technical Report). Carnegie Mellon University.