The Value of Labeled and Unlabeled Examples when the Model is Imperfect Kaushik Sinha Department of Computer Science and Engineering Ohio State University Columbus, OH 43210 sinhak@cse.ohio-state.edu Mikahil Belkin Department of Computer Science and Engineering Ohio State University Columbus, OH 43210 mbelkin@cse.ohio-state.edu Abstract Semi-supervised learning, i.e. learning from both labeled and unlabeled data has received significant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications ware analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, "cluster" and "manifold" assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situation, when, for example, a mixture of Gaussians is used to model the data. the contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model. 1 Introduction In recent years semi-supervised learning, i.e. learning from labeled and unlabeled data, has drawn significant attention. The ubiquity and easy availability of unlabeled data together with the increased computational power of modern computers, make the paradigm attractive in various applications, while connections to natural learning make it also conceptually intriguing. See [15] for a survey on semi-supervised learning. 1 From the theoretical point of view, semi-supervised learning is simple to describe. Suppose the data is sampled from the joint distribution p(x, y ), where x is a feature and y is the label. The unlabeled data comes from the marginal distribution p(x). Thus the the usefulness of unlabeled data is tied to how much information about joint distribution can be extracted from the marginal distribution. Therefore, in order to make unlabeled data useful, an assumption on the connection between these distributions needs to be made. In the non-parametric setting several such assumptions have been recently proposed, including the the cluster assumption and its refinement, the low-density separation assumption [7, 6], and the manifold assumption [3]. These assumptions relate the shape of the marginal probability distribution to class labels. The low-density separation assumption states that the class boundary passes through the low density regions, while the manifold assumption proposes that the proximity of the points should be measured along the data manifold. However, while these assumptions has motivated several algorithms and have been shown to hold empirically, few theoretical results on the value of unlabeled data in the non-parametric setting are available so far. We note the work of Balcan and Blum ([2]), which attempts to unify several frameworks by introducing a notion of compatibility between labeled and unlabeled data. In a slightly different setting some theoretical results are also available for co-training ([4, 8]). Far more complete results are available in the parametric setting. There one assumes that the distribution p(x, y ) is a mixture of two parametric distribution p1 and p2 , each corresponding to a different class. Such mixture is called identifiable, if parameters of each component can be uniquely determined from the marginal distribution p(x). The study of usefulness of unlabeled data under this assumption was undertaken by Castelli and Cover ([5]) and Ratsaby and Venkatesh ([11]). Among several important conclusions from their study was the fact under a certain range of conditions, labeled data is exponentially more important for approximating the Bayes optimal classifier than unlabeled data. Roughly speaking, unlabeled data may be used to identify the parameters of each mixture component, after which the class attribution can be established exponentially fast using only few labeled examples. While explicit mixture modeling is of great theoretical and practical importance, in many applications there is no reason to believe that the model provides a precise description of the phenomenon. Often it is more reasonable to think that our models provide a rough approximation to the underlying probability distribution, but do not necessarily represent it exactly. In this paper we investigate the limits of usefulness of unlabeled data as a function of how far the best fitting model strays from the underlying probability distribution. The rest of the paper is structured as follows: we start with an overview of the results available for identifiable mixture models together with some extensions of these results. We then describe how the relative value of labeled and unlabeled data changes when the true distribution is a perturbation of a parametric model. Finally we discuss various regimes of usability for labeled and unlabeled data and represent our findings in Fig 1. 2 Relative Value of Labeled and Unlabeled Examples Our analysis is conducted in the standard classification framework and studies the behavior P error - PB ayes , where Perror is probability of misclassification for a given classifier and PB ayes is the classification error of the optimal classifier. The quantity Perror - PB ayes is often referred to as the excess probability of error and expresses how far our classifier is from the best possible. In what follows, we review some theoretical results that describe behavior of the excess error probability as a function of the number of labeled and unlabeled examples. We will denote number of labeled examples by l and the number of unlabeled examples by u. We omit certain minor technical details to simplify the exposition. The classifier, for which Perror is computed is based on the underlying model. Theorem 2.1. (Ratsaby and Venkatesh [11]) In a two class identifiable mixture model, let the equiprobable class densities p1 (x), p2 (x) be d-dimensional Gaussians with ul it covarilance matrin ces. Then for sufficiently small > 0 and arbitrary > 0, given l = O og -1 abeled and 2 u d2 u = O 3 (d log -1 + log -1 ) nlabeled examples respectively, with confidence at least 1 - , probability of error Perror PB ayes (1 + c ) for some positive constant c. Since the mixture is identifiable, parameters can be estimated from unlabeled examples alone. Labeled examples are not required for this purpose. Therefore, unlabeled examples are used to estimate the mixture and hence the two decision regions. Once the decision regions are established, labeled examples are used to label them. An equivalentdform of the above result in terms of labeled and + unlabeled examples is Perror - PB ayes = O u1/3 O (exp(-l)). For a fixed dimension d, this indicates that labeled examples are exponentially more valuable than the unlabeled examples in reducing the excess probability of error, however, when d is not fixed, higher dimensions slower these rates. Independently, Cover and Castelli provided similar results in a different setting under Bayesian framework. Theorem 2.2. (Cover and Castelli [5]) In a two class mixture model, let the parametric class densities, p1 (x), p2 (x) be let the prior h( ) over the unknown mixing parameter . Then 1+ exp{-Dl + o(l)} Perror - PB ayes = O u p where D = - log{2 (1 - ) 1 (x)p2 (x)dx} In their framework, Cover and Castelli [5] assumed that parameters of individual class densities are known, however the associated class labels and mixing parameter are unknown. Under such assumption their result shows that the above rate is obtained when l 3+ u-1 0 as l + u . In particular this implies that, if ue-Dl 0 and l = o(u) the excess error is essentially determined by the number of unlabeled examples. On the other hand if u grows faster than e Dl , then excess error is determined by the number of labeled examples. For detailed explanation of the above statements see pp-2103 [5]. The effect of dimensionality is not captured in their result. Both results indicate that if the parametric model assumptions are satisfied, labeled examples are exponentially more valuable than unlabeled examples in reducing the excess probability of error. In this paper we investigate the situation when the parametric model assumptions are only satisfied to a certain degree of precision, which seems to be a natural premise in a variety of practical settings. It is interesting to note that uncertainty can appear for different reasons. One source of uncertainty is a lack of examples, which we call Type-A. Imperfection of the model is another source of uncertainty, which we will refer to as Type-B. · Type-A uncertainty for perfect model with imperfect information: Individual class densities follow the assumed parametric model. Uncertainty results from finiteness of examples. Perturbation size specifies how well parameters of the individual class densities can be estimated from finite data. · Type-B uncertainty for imperfect model: Individual class densities does not follow the assumed parametric model. Perturbation size specifies how well the best fitting model can approximate the underlying density. Before proceeding further, we describe our model and notations. We take the instance space X R d with labels {-1, 1}. True class densities are always represented by p 1 (x) and p2 (x) respectively. In case of Type-A uncertainty they are simply p1 (x|1 ) and p2 (x|2 ). In case of Type-B uncertainty p1 (x), p2 (x) are perturbations of two d-dimensional densities from a parametric family F . We will denote the mixing parameter by t and the individual parametric class densities by f 1 (x|1 ), f2 (x|2 ) respectively and the resulting mixture density as tf1 (x|1 ) + (1 - t)f2 (x|2 ). We will show some specific results when F consists of spherical Gaussian distributions with unit covariance matrix and 1 t = 2 . In such a case 1 , 2 Rd represent the means of the corresponding densities and the mixture density is indexed by a 2d dimensional vector = [1 , 2 ]. The class of such mixtures is ^ identifiable and hence using unlabeled examples alone can be estimated by R2d and so can be individual class densities. By || · || we represent the standard Euclidean norm in R d and by || · || d ,2 2 the Sobolev norm. Note that for some > 0, || · || d ,2 < implies || · ||1 < and || · || < . We 2 3 will frequently use the following term L(a, t, e) = resent the optimal number of labeled examples for correctly classifying estimated decision regions with high probability (as will be clear in the next section) where, t represents mixing parameter, e represents perturbation size and a is an integer variable and A, B are constants. 2.1 Type-A Uncertainty : Perfect Model Imperfect Information Due to finiteness of unlabeled examples, density parameters can not be estimates arbitrarily close to the true parameters in terms of Euclidean norm. Clearly, how close they can be estimated depends on the number of unlabeled examples used u, dimension d and confidence probability . Thus, Type-I uncertainty inherently gives rise to a perturbation size defined by 1(u, d, ) such that a fixed u defines a perturbation size 1(d, ). Because of this perturbation, estimated decision regions differ from the true decision regions. From [11] it is clear that only very few labeled examples are good enough to label these two estimated decision regions reasonably well with high probability. Let such a number of labeled examples be l . But what happens if the number of labeled examples available is greater than l ? Since the individual densities follow the parametric model exactly, these extra labeled examples can be used to estimate the density parameters and hence the decision regions. However, using a simple union bound it cd n be shown ([10]) that the asymptotic rate for a . d Thus, provided we have u unlabeled convergence of such estimation procedure is O l log( ) log( d ) (t-Ae)(1-2 log( a ) (PB ayes +B e)(1-PB ayes -B e)) to rep- number of labeled examples (following [11]) but then it reduces only at a rate O Provided we use the following strategy, this extends the result of [11] as given in the Theorem below. We adopt the following strategy to utilize labeled and unlabeled examples in order to learn a classification rule. Strategy 1: 1. Given u unlabeled examples, and confidence probability > 0 use maximum likelihood estimation method to learn the parameters of the mixture model such that the estimates dc ^^ 1 , 2 are only 1(u, d, ) = O u1/3 lose to the actual parameters with probability at least 1 - 4 . 3. If l > l examples are available use them to estimate the individual density parameters with probability at least 1 - 2 . examples if we want to represent the rate at which excess probability of error reduces as a function of the number of labeled examples, it is clear that initially the error reduces exponentid lly fast in a . l 2. Use l labeled examples to label the estimated decision regions with probability of incorrect labeling no greater than 4 . Theorem 2.3. Let the model be a mixture of two equiprobable d dimensional spherical Gaussians p1 (x|1 ), p2 (x|2 ) having unit covariance matrices and means 1 , 2 Rd . For any arbitrary 1 > > 0, if strategy 1 is used with u unlabeled examples then there exists a perturbation size 1(u, d, ) > 0 and positive constants A, B such that using l l = L(24, 0.5, 1) labeled examples, Perror - PB ayes reduces exponentially fast in the number of labeled examples with confidence at least (1 - ). If more labeled examples l > l are provided then widh confidence at least (1 - 2 ), t 2 a Perror - PB ayes asymptotically converges to zero at a rate O l log( d ) represent the reduction rate of this excess error(Perror - PB ayes ) as a function of labeled examples Ree (l), then this can be compactly represented as, ( O (expd-l)) O l s l . If we Ree (l) = log( d ) if l l i f l > l After using l labeled examples Perror = PB ayes + O( 1). 4 2.2 Type-B Uncertainty: Imperfect Model In this section we address the main question raised in this paper. Here the individual class densities do not follow the assumed parametric model exactly but are a perturbed version of the assumed model. The uncertainty in this case is specified by the perturbation size 2 which roughly indicates by what extent the true class densities differ form that of the best fitting parametric model densities. For any mixing parameter t (0, 1) let us consider a two class mixture model with individual class densities p1 (x), p2 (x) respectively. Suppose the best knowledge available about this mixture model is that individual class densities approximately follow some parametric form from a class F . We assume that best approximations of p1 , p2 within F are f1 (x|1 ), f2 (x|2 ) respectively, such that d for i {1, 2}, (fi - pi ) are in Sobolev class H 2 and there exists a perturbation size 2 > 0 such that ||p1 - f1 || d ,2 2 and ||p2 - f2 || d ,2 2. Here, the Sobolev norm is used as a smoothness 2 2 condition and implies that true densities are smooth and not "too different" from the best fitting parametric model densities and in particular, if ||fi - pi || d ,2 2 then ||fi - pi ||1 2 and 2 ||fi - pi || 2. We first show that due to the presence of this perturbation size, even complete knowledge of the best fitting model parameters does not help in learning optimum classification rule in the following sense. In the absence of any perturbation, complete knowledge of model parameters implies that the decision boundary and hence two decision regions are explicitly known but not their labels. Thus, using only a very small number of labeled examples Perror reduces exponentially fast in the number of labeled examples to PB ayes as number of labeled examples increases. However, due to the presence of perturbation size, Perror reduces exponentially fast in number of labeled examples only up to PB ayes + O( 2). Since beyond this, parametric model assumptions do not hold due to the presence of perturbation size, some non parametric technique must be used to estimate the actual decision boundary. For any such nonparametric technique Perror now reduces at a much slower rate. This trend is roughly what the following theorem says. Here f1 , f2 are general parametric densities not necessarily Gaussians. In what follows we assum1 th.at p1 , p2 C and hence convergence rate for e non parametric classification (refer [14]) is O l Slower rate results if infinite differentiability condition is not satisfied. Theorem 2.4. In a two class mixture model with individual class densities p 1 (x), p2 (x) and mixing parameter t (0, 1), let the mixture density of best fitting parametric model be tf 1 (x|1 ) + (1 - t)f2 (x|2 ) where f1 , f2 belongs to some parametric class F and true densities p1 , p2 are perturbed version of f1 , f2 . For a perturbation size 2 > 0, if ||f1 - p1 || d ,2 2, ||f2 - p2 || d ,2 2 2 2 and 1 , 2 are known then for any 0 < < 1, there exists positive constants A, B such that for l l = L(6, t, 2) labeled example, Perror - PB ayes reduces exponentially fast in the number of labeled examples with confidence at least (1 - ). If more labe1ed axamples l > l are provided le Perror - PB ayes asymptotically converges to zero at a rate O s l . l After using l labeled examples Perror = PB ayes + O ( 2). Thus, it can be thought that as labeled examples are added, initially the excess error reduces at a very fast rate (exponentially in the number of labeled examples) until Perror - PB ayes = O ( 2). After that the excess error reduces only polynomially fast in the number of labeled examples. In proving of the above theorem we used first order Taylor series approximation to get an crude upper bound for decision boundary movement. However, in case of a specific class of parametric densities such a crude approximation may not be necessary. In particular, as we show next, if the best fitting model is a mixture of spherical Gaussians where the boundary is linear hyperplane, explicit upper bound of boundary movement can be found. In the following we assume the class F to be a class of d dimensional spherical Gaussians with identity covariance matrix. However, the true model is an equiprobable mixture of perturbed versions of these individual class densities. As before, given u unlabeled examples and l labeled examples we want a strategy to learn a classification rule and analyze the effect of these examples and also of perturbation size 2 in reducing excess probability of error. One option to achieve this task is to use the unlabeled examples to estimate the true mixture density 1 1 2 p1 + 2 p2 , however number of unlabeled examples required to estimate mixture density using non parametric kernel density estimation is exponential to the number of dimensions [10]. Thus, for high dimensional data this is not an attractive option and also such an estimate does not provide any 5 clue as to where the decision boundary is. A better option will be to use the unlabeled examples to estimate the best fitting Gaussians within F . Number of unlabeled examples needed to estimate such a mixture of Gaussians is only polynomial in the number of dimensions [10] and it is easy to show that the distance between the Bayesian decision function and the decision function due to Gaussian approximation is at most 2 away in ||.|| d ,2 norm sense. 2 Now suppose we use the following strategy to use labeled and unlabeled examples. Strategy 2: After using l labeled examples, Perror = PB ayes + O( + 2). Note that when number of unlabeled examples is infinite, parameters of the best fitting model can be estimated arbitrarily well, i.e., 0 and hence Perror - PB ayes reduces exponentially fast in the number of labeled examples until Perror - PB ayes = O( 2). On the other hand if = O( 2), Perror - PB ayes still reduces exponentially fast in the number of labeled examples until Perror - PB ayes = O( 2). This implies that O( 2) close estimate of parameters of the best fitting model is "good" enough. A more precise estimate of parameters of the best fitting model using more unlabeled examples does not help reducing Perror - PB ayes at the same exponential rate beyond Perror - PB ayes = O( 2). The following Corollary states this important fact. Corollary 2.6. For a perturbation size 2 > 0, let the best fitting model for a mixture of equiprobable densities be a mixture of equiprobable d dimensionalu spherical Gaussians with unit covariance d2 matrices. If using u = O 3 (d log 1 + log 1 ) nlabeled examples parameters of the best 2 fitting model can be estimated O( 2) close in Euclidean norm sense, then any additional unlabeled examples u > u does not help in reducing the excess error. 2 to zero at most at a rate O l s l . If we represent the reduction rate of this excess error (Perror - PB ayes ) as a function of labeled examples as Ree (l), then this can compactly represented as, O (exp(-l)) if l l i 1 Ree (l) = f l > l O l 1. Assume the examples are distributed according to a mixture of equiprobable Gaussians with unit covariance matrices and apply maximum likelihood estimation method find the best Gaussian approximation of the densities. 2. Using small number of labeled examples l to label the two approximate decision regions correctly with high probability. 3. If more (l > l ) labeled examples are available use them to learn a better decision function using some nonparametric technique. Theorem 2.5. In a two class mixture model with equiprobable class densities p 1 (x), p2 (x), let the 1 mixture density of the best fitting parametric model be 2 f1 (x|1 ) + 1 f2 (x|2 ) where f1 , f2 are 2 d dimensional spherical Gaussians with means 1 , 2 Rd and p1 , p2 are perturbed version of f1 , f2 , such that for a perturbation size 2 > 0, ||f1 - g1 || d ,2 2, ||f2 - p2 || d ,2 2. For any 2 2 > 0 and 0 < < 1, there exists positive constants A, B such that if strategy 2 is used with d2 a u = O 3 (d log 1 + log 1 ) nd l = L (0.5, 12, ( + 2)) labeled examples then for l l , Perror - PB ayes reduces exponentially fast in the number of labeled examples with confidence at least (1 - ). If more labeled examples l > l are provided Perror - PB ayes asymptotically converges 1a 3 Discussion on different rates of convergence In this section we discuss the effect of perturbation size 2 on the behavior of Perror - PB ayes and its effect on controlling the value of labeled and unlabeled examples. Different combinations of number of labeled and unlabeled examples give rise to four different regions where P error - PB ayes behaves differently as shown in Figure 1 where the x axis corresponds to the number of unlabeled examples and the y axis corresponds to the number of labeled examples. Let u be the number of unlabeled examples required to estimate the parameters of the best fitting model O( 2) close in Euclidean norm sense. Using O notation to hide the log factors, according 6 d3 . When u > u , unlabeled examples have no role to play in to Theorem 2.5, u = O 3 2 reducing Perror - PB ayes as shown in region II and part of III in Figure 1. For u u , unlabeled examples becomes useful only in region I and IV. If u unlabeled examples are available to estimate the parameters of the best fitting model O( 2) close, let the number of labeled examples required to label the estimated decision regions so that Perror - PB ayes = O( 2) be l . The figure is just for graphical representation of different regions where Perror - PB ayes reduces at different rates. l l1 I O " d " l IV : O l on-parametric methods 1 F l II : +O 2 " d u1/3 " I : O (exp(-l)) + O u1/3 d N u = O I : O (exp(-l)) I d3 3 2 u igure 1: The Big Picture. Behavior of Perror - PB ayes for different labeled and unlabeled examples 3.1 Behavior of Perror - PB ayes in Region-I In this region u u unlabeled examples estimate the decision regions and lu labeled examples, which depends on u, are requird d tof correctly label these estimated regions. P error - PB ayes reduces e at a rate O (exp(-l)) + O or u < u and l < lu . This rate can be interpreted as the rate 1 u3 at which unlabeled examples estimate the parameters of the best fitting model and rate at which labeled examples correctly label these estimated decision regions. However, for small u estimation of the decision regions will be bad and and corresponding lu > l . Instead of using these large number labeled examples to label poorly estimated decision regions, they can instead be used to estimate the parameters of the best fitting model and as will be seen next, this is precisely what happens in region III.dThus in region I, l is restricted to l < l and Perror - PB ayes reduces at a rate . exp (-O(l)) + O 1 u3 3.2 Behavior of Perror - PB ayes in Region-II In this section l l and u > u . As shown in Corollary 2.6, using u unlabeled examples parameters of the best fitting model can be estimated O( 2) close in Euclidean norm sense and more precise estimate of the best fitting model parameters using more unlabeled examples u > u does not help reducing Perror - PB ayes . Thus, unlabeled examples have no role to play in this region and for small number of labeled examples l l , Perror - PB ayes reduces at a rate O (exp(-l)). 3.3 Behavior of Perror - PB ayes in Region-III In this region u u and hence model parameters have not been estimated O( 2) close to the parameters of the best fitting model. Thus, in some sense model assumptions are still valid and there is a scope for better estimation of the parameters. Number of labeled examples available in this region is greater than what is required for mere labeling the estimated decision regions using u unlabeled examples and hence these excess labeled examples can be used to estimate the model parameters. Note that once the parameters have been estimated O( 2) close to the parameters of the best fitting model using labeled examples, parametric model assumptions are no longer valid. If l 1 is the number of such labeled examples, then in this region l < l l1 . Also note that depending on number of unlabeled examples u u , l , and l1 are not fixed numbers but will depend on 7 u. In prd se. ce of labeled examples alone, using Theorem 2.3, P error - PB ayes reduces at a rate en Since parameters are being estimated both using labeled and unlabeled examples, the O l effective rate at which Perror - PB ayes reduces at this region can be thought of as the mean of the two. 3.4 Behavior of Perror - PB ayes in Region-IV In this region when u > u , l > l and when u u , l > l1 . In either case, since the parameters of the best fitting model have been estimated O( 2) close to the parameters of the best fitting model, parametric model assumptions are also no longer valid and excess labeled examples must be used in nonparametric way. For nonparametric classification technique either one of the two basic families of classifiers, plug-in classifiers or empirical risk minimization (ERM) classifiers can be used [13, 9]. A nice discussion on the rate and fast rate of convergence of both these types of classifiers can be found in [1, 12]. The general convergence rate i.e. the rate at which expected value of (Perror - PB ayes ) reduces is of the order O(l - ) as l where > 0 is some exponent and is typically 0.5. Also it was shown in [14] that under general conditions this bound can not be improved in a minimax sense. In particular it was shown that if the true densities belong to C class 1 then this rate is O( l ). However, if infinite differentiability condition is not satisfied then this rate is much slower. References [1] J. Y. Audibert and A. Tsybakov. Fast convergence rate for plug-in estimators under margin conditions. In Unpublished manuscript, 2005. [2] M-F. Balcan and A. Blum. A PAC-style model for learning from labeled and unlabeled data. In 18th Annual Conference on Learning Theory, 2005. [3] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56, Invited, Special Issue on Clustering:209­239, 2004. [4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In 11th Annual Conference on Learning Theory, 1998. [5] V. Castelli and T. M. Cover. The relative values of labeled and unlabeld samples in pattern recognition with an unknown mixing parameters. IEEE Trans. Information Theory, 42((6):2102­2117, 1996. [6] O. Chapelle, J. Weston, and B. Scholkopf. Cluster kernels for semi-supervised learning. NIPS, 15, 2002. [7] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In 10th International Workshop on Artificial Intelligence and Statistics, 2005. [8] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. NIPS, 14, 2001. [9] L. Devroye, L. Gyorfi, and G. Lugosi. A probabilistic theory of pattern recognition. Springer, New York, Berlin, Heidelberg, 1996. [10] J. Ratsaby. The complexity of learning from a mixture of labeled and unlabeled examples. In Phd Thesis, 1994. [11] J. Ratsaby and S. S. Venkatesh. Learning from a mixture of labeled and unlabeled examples with parametric side information. In 8th Annual Conference on Learning Theory, 1995. [12] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135­166, 1996. [13] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. [14] Y. Yang. Minimax nonparametric classification- part I: Rates of convergence, part II: Model selection for adaptation. IEEE Trans. Inf. Theory, 45:2271­2292, 1999. [15] X. Zhu. Semi-supervised literature survey. Technical Report 1530, Department of Computer Science, University of Wisconsin Madison, December 2006. 8