Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning Shai Ben-David and Tyler Lu and Dávid Pál David R. Cheriton School of Computer Science University of Waterloo Waterloo, ON, Canada {shai,ttlu,dpal}@cs.uwaterloo.ca Abstract We study the potential benefits to classification prediction that arise from having access to unlabeled samples. We compare learning in the semi-supervised model to the standard, supervised PAC (distribution free) model, considering both the realizable and the unrealizable (agnostic) settings. Roughly speaking, our conclusion is that access to unlabeled samples cannot provide sample size guarantees that are better than those obtainable without access to unlabeled data, unless one postulates very strong assumptions about the distribution of the labels. In particular, we prove that for basic hypothesis classes over the real line, if the distribution of unlabeled data is 'smooth', knowledge of that distribution cannot improve the labeled sample complexity by more than a constant factor (e.g., 2). We conjecture that a similar phenomenon holds for any hypothesis class and any unlabeled data distribution. We also discuss the utility of semi-supervised learning under the common cluster assumption concerning the distribution of labels, and show that even in the most accommodating cases, where data is generated by two uni-modal label-homogeneous distributions, common SSL paradigms may be misleading and may result in poor prediction performance. 1 Introduction While the problem of classification prediction based on labeled training samples has received a lot of research attention and is reasonably well understood, in many practical learning scenarios, labeled data is hard to come by and unlabeled data is more readily available. Consequently, users try to utilize available unlabeled data to assist with the classification learning process.
Learning from both labeled and unlabeled data is commonly called semi-supervised learning (SSL). Due to its wide potential applications, this approach is gaining attention in both the application oriented and the theoretical machine learning communities. However, theoretical analysis of semi-supervised learning has, so far, been scarce, and it falls short of providing an unequivocal explanation of the merits of using unlabeled examples in learning. We take steps toward rectifying this theory-practice gap by providing formal analysis of some semi-supervised learning settings. The question we focus on is whether unlabeled data can be utilized to provably improve the sample complexity of classification learning. We investigate what type of assumptions about the data generating distribution (or which circumstances) are sufficient to make the SSL approach yield better bounds on the prediction accuracy than fully supervised learning. The bulk of this paper focuses on showing that without prior knowledge about the distribution of labels, SSL cannot guarantee any significant advantages in sample complexity (e.g., no more than a constant factor for learning tasks over the real line). We carry out our analysis in a simplified, utopian, model of semi-supervised learning, in which the learning algorithm has perfect knowledge of the probability distribution of the unlabeled data. We focus on estimating the labeled sample complexity of learning. Since our model provides the learner with more information than just a sample of the unlabeled data distribution, lower bounds on the labeled sample complexity of learning in our model imply similar lower bounds for common notions of semi-supervised learning. Upper bounds, or sample size sufficiency results (for the labeled samples), in our model apply to the common SSL setting only once sufficiently large unlabeled samples are available to the learner.
In this paper we mainly discuss lower bounds, and when we address upper bounds we settle for stating that they apply eventually, as the unlabeled sample sizes grow. Our model of semi-supervised learning can be viewed as learning with respect to a fixed distribution (see Benedek and Itai [5]). However, our emphasis is different. Our goal is to compare how the knowledge of the unlabeled distribution helps, as opposed to learning when the only access to the underlying unlabeled data distribution is via the labeled training sample. We call the former setting semi-supervised and the latter supervised or fully supervised learning. We present explicit formalizations of different ways in which the merits of the semi-supervised paradigm can be measured. We then investigate the extent to which SSL can provide provable advantages over fully supervised learning with respect to these measures. Roughly speaking, we conclude that no special unlabeled data distribution (like, say, one that breaks into clear data clusters) suffices to give SSL an advantage over fully supervised learning. Unlabeled data can make a difference only under strong assumptions (or prior knowledge) about the conditional label distribution. One should note, however, that in many cases such knowledge can also be utilized by a fully supervised algorithm. The search for a justification of the SSL paradigm therefore leaves us with one setting: the cases where there exists prior knowledge about the relationship between the labels and the unlabeled data structure (and not just about the labels per se).
However, we show in Section 3 that common applications of SSL paradigms for utilizing such a relationship (like the popular cluster assumption, or the related algorithmic bias towards class boundaries that pass through low-density data regions) may lead to poor prediction accuracy, even when the data does comply with the underlying data model (say, the data is generated by a mixture of two Gaussian distributions, one for each label, each generating a homogeneously labeled set of examples). The potential merits of SSL, in both settings, either with or without making assumptions about the label distribution, have been investigated before. Vapnik's model of transductive learning [15], as well as Kääriäinen's paper [12], address the setting without restrictions on the way labels are generated, while Balcan and Blum's augmented PAC model for semi-supervised learning [3, 4] offers a framework for formalizing prior knowledge about the relationship between labels and the structure of the unlabeled distribution. We elaborate more about these in the next section on related work. One basic difference between these works and ours is that they try to provide explanations of the success of the SSL paradigm, while we focus on investigating its inherent limitations. This paper does not resolve the issue of the utility of unlabeled data in full generality. Rather, we provide answers for relatively simple classes of concepts over the real line (thresholds and unions of d intervals). We believe that these answers generalize to other classes in an obvious way. We also pose some conjectures and open questions. The paper is organized as follows. We start by discussing previous related work in Section 2. In Section 3 we show that a commonly held assumption can result in performance degradation of SSL. We continue on our main path in Section 4, where we formally define our model of semi-supervised learning and introduce notation.
Section 5 casts the previous paradigms in our model and formally poses the question of the utility of unlabeled data for sample based label prediction. This question guides the rest of the paper. Section 6 analyzes this question for basic learning tasks over the real line. The section concludes by asking a slightly different question about the possible meaningful formalizations of the comparison between SSL and supervised learning. We conclude our paper in Section 7, where we also discuss open questions and directions for further research. 2 Related Work Analysis of performance guarantees for semi-supervised learning can be carried out in two main setups. The first focuses on the unlabeled marginal data distribution and does not make any prior assumptions about the conditional label distribution. The second approach focuses on assumptions about the conditional label distribution, under which the SSL approach has potentially better label prediction performance than learning based on just labeled samples. The investigation of the first setup was pioneered by Vapnik in the late 70s in his model of transductive learning, e.g. [15]. There has been growing interest in this model in recent years due to the popularity of using unlabeled data in practical label prediction tasks. This model assumes that unlabeled examples are drawn IID from an unknown distribution, and then the labels of some randomly picked subset of these examples are revealed to the learner. The goal of the learner is to label the remaining examples, minimizing the error. The main difference between this model and SSL is that the error of the learner's hypothesis is judged only with respect to the known initial sample. However, there are no known bounds in the transductive setting that are strictly better than supervised learning bounds. Vapnik's bounds [15] are almost identical.
El-Yaniv and Pechyony [10] prove bounds that are similar to the usual margin bounds using Rademacher complexity, except that the learner is allowed to decide a posteriori the concept class given the unlabeled examples. But they do not show whether it can be advantageous to choose the class in this way. Their earlier paper [9] gave bounds in terms of a notion of uniform stability of the learning algorithm, in the broader setting where examples are not assumed to come IID from an unknown distribution. But again, it is not clear whether and when the resulting bounds beat the supervised learning bounds. Kääriäinen [12] proposes a method for semi-supervised learning without prior assumptions on the conditional label distributions. The algorithm of Kääriäinen is based on the observation that one can output the function of the version space that minimizes the maximum unlabeled-data weight of its symmetric differences to all other functions of the version space. This algorithm can reduce the error of supervised ERM by a factor of 2. For more details on these algorithms, see Section 5. Earlier, Benedek and Itai [5] discuss a model of "learning over a fixed distribution". Such a model can be viewed as SSL learning, since once the unlabeled data distribution is fixed, it can be viewed as being known to the learner. The idea of Benedek and Itai's algorithm is to construct a minimum ε-cover of the hypothesis space under the pseudo-metric induced by the data distribution. The learning algorithm they propose is to apply empirical risk minimization (ERM) on the functions in such a cover. Of course this cover algorithm requires knowledge of the unlabeled distribution, without which the algorithm reduces to ERM over the original hypothesis class. The second, certainly more popular, set of semi-supervised approaches focuses on assumptions about the conditional label distributions. A recent PAC model of SSL proposed by Balcan and Blum [3, 4] attempts to formally capture such assumptions.
They propose a notion of a compatibility function that assigns a higher score to classifiers which "fit nicely" with respect to the unlabeled distribution. The rationale is that by narrowing down the set of classifiers to only compatible ones, the capacity of the set of potential classifiers goes down and the generalization bounds of empirical risk minimization improve. However, since the set of potential classifiers is trimmed down by a compatibility threshold, if the presumed label-structure relationship fails to hold, the learner may be left with only poorly performing classifiers. One serious concern about this approach is that it provides no way of verifying these crucial modeling assumptions. In Section 3 we demonstrate that this approach may damage learning even when the underlying assumptions seem to hold. In Claim 3 we show that without prior knowledge of such a relationship, the Balcan and Blum approach has poor worst-case generalization performance. Common assumptions include the smoothness assumption and the related low density assumption [7], which suggests that the decision boundary should lie in a low density region. In Section 3, we give examples of mixtures of two Gaussians showing that the low density assumption may be misleading even under favourable data generation models, resulting in low density boundary SSL classifiers with larger error than the outcome of straightforward supervised learning that ignores the unlabeled data. Many other assumptions about the relationship between the labels and the unlabeled data structure have been investigated, most notably co-training [6] and explicit generative data models [8]. However, all these approaches are based on very strong assumptions about the data generating distributions, assumptions that are hard to verify or to justify on the basis of the prior knowledge of a realistic learner.
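The ε-cover idea recurs throughout the paper, so a concrete rendering may help. Below is a minimal sketch for the class of thresholds under a known uniform unlabeled distribution on (0, 1), in the spirit of the Benedek-Itai learner: a grid of thresholds spaced ε apart is an ε-cover, and ERM is run over that cover. The function names, the toy target, and the sample size are our own illustration, not the paper's.

```python
import random

def eps_cover_thresholds(eps):
    """An eps-cover of H = {1(-inf, t]} under the uniform distribution
    on (0, 1): any threshold is within symmetric-difference mass eps
    of some grid point."""
    n = int(1 / eps)
    return [i * eps for i in range(n + 1)]

def erm_over_cover(sample, cover):
    """Empirical risk minimization restricted to the cover; knowledge
    of the unlabeled distribution is what fixed the cover."""
    def emp_err(t):
        return sum((x <= t) != y for x, y in sample) / len(sample)
    return min(cover, key=emp_err)

# toy run: target threshold 0.37, uniform unlabeled distribution
random.seed(0)
target = 0.37
sample = [(x, x <= target) for x in (random.random() for _ in range(200))]
t_hat = erm_over_cover(sample, eps_cover_thresholds(0.05))
```

Without the distribution knowledge the cover cannot be built in advance, which is exactly why the algorithm degenerates to plain ERM over the full class in the supervised setting.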
3 On SSL and the Cluster Assumption This paper has several results of the form "as long as one does not make any assumptions about the behavior of the labels, SSL cannot help much over algorithms that ignore the unlabeled data." However, two arguments can be raised against such claims. First, SSL is not really intended to be used without any prior assumption about the distribution of labels. In fact, SSL can be viewed as applying some prior knowledge (or just belief) that the labels are somehow correlated with the unlabeled structure of the data. Can we say anything (anything negative, naturally . . . ) under such an assumption? Second, maybe using unlabeled data can't always help you, but if it can sometimes help, why not use it (always)? Well, can we show that in some cases the use of unlabeled data can indeed hurt the learner? Of course, nothing of that kind can apply to all potential learners, since a learner can choose to ignore the unlabeled data and then of course not get hurt by "using" it. We are therefore left with asking, "can the use of unlabeled data hurt the performance of concrete common SSL paradigms?" We briefly address these two questions below by demonstrating that for certain common SSL strategies ("low density cut" and Balcan-Blum style use of a "compatibility threshold") SSL can sometimes hurt you, even when the (vaguely stated) "cluster assumption" does hold (when the data breaks into clear uni-modal distributions, each labeled homogeneously). We also show a general lower bound on the sample complexity of SSL under a general model of the cluster assumption. In Figures 1, 2, and 3 we depict three examples of simple data distributions over the real line. In all of these examples, the data is generated by a mixture of two uni-modal distributions; each of these modes generates examples labeled homogeneously, each by a different label.
However, the minimum density point of the unlabeled mixture data is significantly off the optimal label prediction decision boundary. Figure 1 shows a mixture of two equal-variance symmetric Gaussians, Figure 2 is a mixture of two Gaussians with different variances, and Figure 3 shows an extreme case of uni-modal density functions for which the minimum density partition has classification error that is twice that of the optimal decision boundary. Note that in all such examples, not only does the minimum-density bias mislead the learning process, but also, if one follows the paradigm suggested by Balcan and Blum [4], a wrong choice of the compatibility threshold level will doom the learning process to failure (whereas a simple empirical risk minimization that ignores unlabeled data will succeed based on a small number of labeled samples). In [13] Rigollet presents a formal model of the cluster assumption. Given a probability distribution D over some Euclidean data domain, define, for any positive real number a, L(a) = {x : p(x) > a}. The cluster assumption says that points in each of the connected components of L(a) [after removal of "lines or thin ribbons"] have the same Bayesian optimum label. This is a quite strong assumption under which one can apply an SSL approach. However, in spite of this strong cluster assumption, we can prove that the ratio between the sample complexity of SSL and SL is at most d, the Euclidean dimension of the data. Namely, on one hand, the results of Section 6, below, provide a lower bound of Ω((k + ln(1/δ))/ε²) on the sample complexity of SSL learning under this cluster assumption, where k is the number of connected components of L(a). On the other hand, a learner that has access to only labeled examples can apply the basic ERM algorithm to the class of all k-cell Voronoi partitions of the space.
Since the VC-dimension of the class of all k-cell Voronoi partitions in ℝ^d is of order kd, the usual VC bounds imply that the sample complexity of such an SL learner is O((kd + ln(1/δ))/ε²) examples. 4 A No-Prior-Knowledge Model of Semi-Supervised Learning We work in the common (agnostic) PAC framework, in which a learning problem is modeled by a probability distribution P over X × {0, 1} for some domain set X. Any function from X to {0, 1} is called a hypothesis. Examples are pairs (x, y) ∈ X × {0, 1}, and a sample is a finite sequence S = {(x_i, y_i)}_{i=1}^m of examples. Definition 1 (SL and SSL). · A supervised learning (SL) algorithm is a function L : ⋃_{m∈ℕ} (X × {0, 1})^m → {0, 1}^X, mapping samples to hypotheses. · A semi-supervised learning (SSL) algorithm is a function L : ⋃_{m∈ℕ} (X × {0, 1})^m × P → {0, 1}^X, where P is a set of probability distributions over X. Namely, an SSL algorithm takes as input not only a finite labeled sample but also a probability distribution over the domain set (and outputs a hypothesis, as before). For a distribution P over X × {0, 1}, let D(P) denote the marginal distribution over X. That is, formally, for X' ⊆ X we define D(P)(X') = P(X' × {0, 1}) (provided that X' × {0, 1} is P-measurable). For a learning problem P, we call D(P) the unlabeled distribution of P. Following the common PAC terminology and notation, the error of a hypothesis h, with respect to P, is Err^P(h) = Pr_{(x,y)∼P}[h(x) ≠ y]. Similarly, the empirical error, Err^S(h), of a hypothesis h on a sample S is defined as Err^S(h) = (1/m) |{i : i ∈ {1, 2, . . . , m}, h(x_i) ≠ y_i}|. Definition 2 (The sample complexities (SSL and SL) of a class). For a class H of hypotheses, the sample complexity of a semi-supervised learning algorithm A with respect to P, confidence δ > 0 and accuracy ε > 0, is m(A, H, P, ε, δ) = min{m ∈ ℕ : Pr_{S∼P^m}[Err^P(A(S, D(P))) − inf_{h'∈H} Err^P(h') > ε] < δ}. Figure 1: Mixture of two Gaussians N(0, 1) (labeled '−') and N(2, 1) (labeled '+') shows that the optimum threshold is at x = 1, the densest point of the unlabeled distribution. The sum of these two Gaussians is unimodal. The sample complexity of a supervised learning algorithm A is defined similarly, except that the second input parameter D(P) is omitted. We consider two settings, realizable and agnostic. In the agnostic setting, P can be arbitrary. The realizable setting is defined by assuming that there exists a hypothesis h ∈ H such that Err^P(h) = 0; consequently inf_{h'∈H} Err^P(h') = 0. In particular, this implies that for any x ∈ X, the conditional probabilities P(y = 0 | x) and P(y = 1 | x) are always either 0 or 1; in the agnostic setting the conditionals can be arbitrary. Without reference to any learning problem, an unlabeled distribution D is simply any distribution over X. We use Ext(D) to denote all possible extensions of D, that is, Ext(D) is the family of all possible distributions P over X × {0, 1} such that D(P) = D. For an unlabeled distribution D and a hypothesis h, D_h denotes the probability distribution in Ext(D) such that D_h(y = h(x) | x) = 1. For a hypothesis h and an "unlabeled sample" S = {x_i}_{i=1}^m, where x_i ∈ X, we denote by (S, h(S)) the sample {(x_i, h(x_i))}_{i=1}^m. For a subset T of some domain set, we use 1_T to denote its characteristic function. In particular, if T ⊆ X then 1_T is a hypothesis over X. For two hypotheses g, h we use g △ h to denote their "symmetric difference", that is, g △ h is the hypothesis 1{x ∈ X : g(x) ≠ h(x)}. Finally, VC(H) denotes the VC-dimension [14] of hypothesis class H. Figure 2: Mixture of two Gaussians N(0, 1) (labeled '−') and N(4, 2) (labeled '+') with different variances.
The minimum density point of the unlabeled data (the sum of the two distributions) does not coincide with the optimum label-separating threshold where the two Gaussians intersect. The classification error of the optimum is 0.17 and that of the minimum density partition is 0.21. 5 Previous No Prior Knowledge Paradigms Previous approaches to SSL algorithms in the no prior knowledge paradigm have used the unlabeled sample to figure out the "geometry" of the hypothesis space with respect to the unlabeled (marginal) distribution. A common approach is to use that knowledge to reduce the hypothesis search space. In doing so, one may improve the generalization upper bounds. Recall that given an unlabeled distribution D and a hypothesis class H, an ε-cover is a subset H' ⊆ H such that for any h ∈ H there exists g ∈ H' such that D(g △ h) ≤ ε. Note that if H' is an ε-cover for H with respect to D, then for every extension P ∈ Ext(D), inf_{g∈H'} Err^P(g) ≤ inf_{h∈H} Err^P(h) + ε. Figure 3: The solid line indicates the distribution P_1 (labeled '−') and the dotted line is P_2 (labeled '+'). The x coordinate of their intersection is the optimum label prediction boundary. The slope of the solid line is slightly steeper than that of the dotted line (|−1| > 1 − ε). The minimum density point occurs where the density of P_1 reaches 0. The error of the minimum density partition is twice that of the optimum, Err(min density) ≈ 2 Err(OPT). In some cases the construction of a small ε-cover is a major use of unlabeled data. Benedek and Itai [5] analyze this approach in the case when the unlabeled distribution is fixed and therefore can be thought of as known to the learner. They show that the smaller an ε-cover is, the better the generalization bound for the ERM algorithm over this cover. Balcan and Blum [4] suggest a different way of using the unlabeled data to reduce the hypothesis space.
However, we claim that without making any prior assumptions about the relationship between the labeled and unlabeled distributions, their approach boils down to the ε-cover construction described above. Claim 3. Let H be any hypothesis class, ε, δ > 0, and let D be any unlabeled distribution. Let H' ⊆ H be the set of "compatible hypotheses." Suppose A is an SSL algorithm that outputs any hypothesis in H'. If H' does not contain an ε-cover of H with respect to D, then the error of the hypothesis that A outputs is at least ε, regardless of the size of the labeled sample. Proof. Since H' does not contain an ε-cover of H, there exists a hypothesis h ∈ H such that for all g ∈ H', D(g △ h) > ε. Thus, for any g ∈ H', Err^{D_h}(g) > ε. Algorithm A outputs some g ∈ H' and the proof follows. Kääriäinen [12] utilizes the unlabeled data in a different way. Given the labeled data, his algorithm constructs the version space F ⊆ H of all sample-consistent hypotheses, and then applies the knowledge of the unlabeled distribution D to find the "center" of that version space, namely a hypothesis g ∈ F that minimizes max_{h∈F} D(g △ h). Clearly, all the above paradigms depend on the knowledge of the unlabeled distribution D. In return, better upper bounds on the sample complexity of the respective algorithms (or, equivalently, on the errors of the hypotheses produced by such algorithms) can be shown. For example, Benedek and Itai give (for the realizable case) an upper bound on the sample complexity that depends on the size of the ε-cover: the smaller the ε-cover, the smaller the upper bound. In the next section we analyze the gains that such knowledge of the unlabeled data distribution can provide in the no prior knowledge setting. We prove that, over the real line, for any "smooth" unlabeled distribution D, ERM over the full hypothesis class H has worst-case sample complexity that is at most a constant factor bigger than the worst-case sample complexity of any SSL algorithm. We conjecture that this is a more general phenomenon. Conjecture 4.
For any hypothesis class H, there exists a constant c ≥ 1 and a supervised algorithm A, such that for any distribution D over the domain and any semi-supervised learning algorithm B, sup_{h∈H} m(A, H, D_h, ε, δ) ≤ c · sup_{h∈H} m(B, H, D_h, ε, δ) for any ε and δ small enough, say smaller than 1/c. Conjecture 5. For any hypothesis class H, there exists a constant c ≥ 1 and a supervised algorithm A, such that for any distribution D over the domain and any semi-supervised learning algorithm B, sup_{P∈Ext(D)} m(A, H, P, ε, δ) ≤ c · sup_{P∈Ext(D)} m(B, H, P, ε, δ) for any ε and δ small enough, say smaller than 1/c. 6 Inherent Limitations of Semi-Supervised Learning This section is devoted to proving the inherent limitations of the SSL paradigm in the no prior knowledge model over the real line. In Section 6.2 we prove Conjecture 4 for thresholds on the real line in the realizable setting, under the condition that the unlabeled distribution is absolutely continuous. In Section 6.3 we prove Conjecture 5 for thresholds and unions of d intervals over the real line in the agnostic setting (under the same unlabeled distribution condition). The former follows from Theorems 8 and 10. The latter follows from Corollary 13 (for thresholds) and from Corollary 16 (for unions of d intervals). To prove the results we rely on a simple "rescaling trick" that we explain in Section 6.1. We briefly sketch the idea of the proofs. Let us start by defining the hypothesis classes. The class of thresholds is defined as H = {1(−∞, t] : t ∈ ℝ} and the class of unions of d intervals is UI_d = {1([a_1, a_2) ∪ [a_3, a_4) ∪ · · · ∪ [a_{2ℓ−1}, a_{2ℓ})) : ℓ ≤ d, a_1 ≤ a_2 ≤ · · · ≤ a_{2ℓ}}. The rescaling trick says that the SSL sample complexity of learning H (resp. UI_d) under any two absolutely continuous unlabeled distributions is exactly the same. We can thus focus on the sample complexity of learning under some fixed absolutely continuous distribution; for concreteness and convenience we chose the uniform distribution over (0, 1).
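The reduction to the uniform distribution can be seen concretely: pushing samples from an absolutely continuous distribution D through its cumulative distribution function F yields uniform samples on (0, 1), and threshold hypotheses are carried along by the same map. A small numerical illustration (the choice of the exponential distribution, the sample size, and the variable names are ours):

```python
import math
import random

# Take D = exponential(1) on I = (0, inf); its cdf F(t) = 1 - exp(-t)
# is a rescaling (continuous, increasing) from I onto (0, 1).
def F(t):
    return 1.0 - math.exp(-t)

random.seed(1)
xs = [random.expovariate(1.0) for _ in range(10000)]  # unlabeled sample from D
us = [F(x) for x in xs]                               # pushed through F

# the transformed sample behaves like a uniform sample on (0, 1)
mean_u = sum(us) / len(us)

# a threshold 1(-inf, t] over D agrees with the threshold 1(-inf, F(t)]
# over the uniform distribution on every transformed point,
# because F is strictly increasing
t = 1.3
labels_agree = all((x <= t) == (u <= F(t)) for x, u in zip(xs, us))
```

Monotonicity of F is what makes the correspondence between hypotheses exact, which is the content of the closure-under-rescaling condition below.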
By proving a sample complexity lower bound on learning under the uniform distribution over (0, 1), we are effectively proving a lower bound on the sample complexity of SSL under any absolutely continuous distribution. Through the use of techniques from the probabilistic method, we obtain lower bounds on the SSL sample complexity that are within a constant factor of the well-known upper bounds on the SL sample complexity (e.g. VC upper bounds on the sample complexity of ERM for any unknown distribution). In Section 6.4 we discuss other possible formulations of the comparison between SL and SSL algorithms. 6.1 Rescaling Trick In this section we show that learning any "natural" hypothesis class on the real line has the same sample complexity for any absolutely continuous unlabeled distribution, independent of its shape. Intuitively, if we imagine the real axis made of rubber, then a natural hypothesis class is one that is closed under rescaling (stretching) of the axis. The classes of thresholds and unions of d intervals are examples of such natural classes, since under any rescaling an interval remains an interval. The rescaling will apply also to the unlabeled distribution over the real line, and it will allow us to go from any absolutely continuous distribution to the uniform distribution over (0, 1). More formally, a rescaling is a continuous increasing function f from an open interval I onto an open interval J. We denote by H|_A the restriction of a class H to a subset A, that is, H|_A = {h|_A : h ∈ H}. We use ∘ to denote function composition. We say that a hypothesis class H over ℝ is closed under rescaling whenever for any rescaling f : I → J, if h|_J ∈ H|_J, then h|_J ∘ f ∈ H|_I. If H is any class closed under rescaling, then any rescaling f induces a bijection h|_J ↦ h|_J ∘ f between H|_J and H|_I.
(This follows since f⁻¹ is also a rescaling.) Clearly, the class of thresholds and the class of unions of d intervals are closed under rescaling. We show that the sample complexity is unaffected by rescalings, provided that the hypothesis class is closed under them. We split the result into two lemmas, Lemma 6 and Lemma 7. The first shows that a supervised algorithm with a certain sample complexity for the case when the unlabeled distribution is uniform over (0, 1) can be translated into an SSL algorithm with the same sample complexity for the case when the unlabeled distribution is any absolutely continuous distribution. The second lemma gives the translation in the other direction: an SSL algorithm with a certain sample complexity on some absolutely continuous unlabeled distribution can be translated into a supervised algorithm for the case when the unlabeled distribution is uniform over (0, 1).

Lemma 6 (Rescaling trick I). Let H be a hypothesis class over ℝ closed under rescaling, let U be the uniform distribution over (0, 1), and let ε, δ > 0.

(a) (Realizable case) If A is any supervised or semi-supervised algorithm, then there exists a semi-supervised algorithm B such that for any distribution D over an open interval I that is absolutely continuous with respect to the Lebesgue measure on I,

  sup_{h ∈ H} m(B, H, D_h, ε, δ) ≤ sup_{g ∈ H} m(A, H, U_g, ε, δ).  (1)

(b) (Agnostic case) If A is any supervised or semi-supervised algorithm, then there exists a semi-supervised algorithm B such that for any distribution D over an open interval I that is absolutely continuous with respect to the Lebesgue measure on I,

  sup_{P ∈ Ext(D)} m(B, H, P, ε, δ) ≤ sup_{Q ∈ Ext(U)} m(A, H, Q, ε, δ).  (2)

Proof. Fix H and A. We construct the algorithm B as follows. B has two inputs, a sample S = {(x_i, y_i)}_{i=1}^m and a distribution D. Based on D, the algorithm computes the cumulative distribution function F : I → (0, 1), F(t) = D(I ∩ (−∞, t]). Then B computes from S the transformed sample S′ = {(x′_i, y_i)}_{i=1}^m, where x′_i = F(x_i). On the sample S′ the algorithm B simulates A and computes h = A(S′). (If A is semi-supervised, we fix its second input to be U.) Finally, B outputs g = h ∘ F.

It remains to show that for any D with continuous cumulative distribution function, (1) and (2) hold for any ε, δ > 0. We prove (2); the other inequality is proved similarly. Let P ∈ Ext(D). Slightly abusing notation, we define the "image" distribution F(P) over (0, 1) × {0, 1} by

  F(P)(M) = P({(x, y) : (F(x), y) ∈ M})

for any measurable M ⊆ (0, 1) × {0, 1}. It is not hard to see that if S is distributed according to P^m, then S′ is distributed according to (F(P))^m. Clearly, D(F(P)) = U, i.e. F(P) ∈ Ext(U). Further note that since D is absolutely continuous, F is a rescaling. Hence Err_{F(P)}(h) = Err_P(h ∘ F) and inf_{h ∈ H} Err_P(h) = inf_{h ∈ H} Err_{F(P)}(h). Therefore, for any ε and any m ∈ ℕ,

  Pr_{S ∼ P^m} [Err_P(B(S, D)) − inf_{h ∈ H} Err_P(h) > ε]
   = Pr_{S ∼ P^m} [Err_{F(P)}(A(S′)) − inf_{h ∈ H} Err_{F(P)}(h) > ε]
   = Pr_{S′ ∼ F(P)^m} [Err_{F(P)}(A(S′)) − inf_{h ∈ H} Err_{F(P)}(h) > ε].

Consequently, for any ε, δ > 0,

  m(B, H, P, ε, δ) = m(A, H, F(P), ε, δ) ≤ sup_{Q ∈ Ext(U)} m(A, H, Q, ε, δ).

Taking the supremum over P ∈ Ext(D) finishes the proof.

Lemma 7 (Rescaling trick II). Let H be a hypothesis class over ℝ closed under rescaling, let U be the uniform distribution over (0, 1), and let ε, δ > 0.

(a) (Realizable case) If B is any supervised or semi-supervised algorithm and D is any distribution over an open interval I that is absolutely continuous with respect to the Lebesgue measure on I, then there exists a supervised algorithm A such that

  sup_{g ∈ H} m(A, H, U_g, ε, δ) ≤ sup_{h ∈ H} m(B, H, D_h, ε, δ).  (3)

(b) (Agnostic case) If B is any supervised or semi-supervised algorithm and D is any distribution over an open interval I that is absolutely continuous with respect to the Lebesgue measure on I, then there exists a supervised algorithm A such that

  sup_{Q ∈ Ext(U)} m(A, H, Q, ε, δ) ≤ sup_{P ∈ Ext(D)} m(B, H, P, ε, δ).  (4)

Proof. Fix H, B and D. Let F : I → (0, 1) be the cumulative distribution function of D, that is, F(t) = D(I ∩ (−∞, t]). Since D is absolutely continuous, F is a rescaling and its inverse F⁻¹ exists. We construct the algorithm A as follows. A maps its input sample S = {(x_i, y_i)}_{i=1}^m to the sample S′ = {(x′_i, y_i)}_{i=1}^m, where x′_i = F⁻¹(x_i). On the sample S′ the algorithm A simulates B and computes g = B(S′, D). (If B is supervised, the second input is omitted.) Finally, A outputs h = g ∘ F⁻¹.

It remains to show that (3) and (4) hold for any ε, δ > 0. We prove (4); the other inequality is proved similarly. Let Q ∈ Ext(U). Slightly abusing notation, we define the "pre-image" distribution F⁻¹(Q) over I × {0, 1} by

  F⁻¹(Q)(M) = Q({(F(x), y) : (x, y) ∈ M})

for any measurable M ⊆ I × {0, 1}. It is not hard to see that if S is distributed according to Q^m, then S′ is distributed according to (F⁻¹(Q))^m. Clearly, D(F⁻¹(Q)) = D, i.e. F⁻¹(Q) ∈ Ext(D). Since F⁻¹ is a rescaling, Err_{F⁻¹(Q)}(h) = Err_Q(h ∘ F⁻¹) and inf_{h ∈ H} Err_Q(h) = inf_{h ∈ H} Err_{F⁻¹(Q)}(h). Therefore, for any ε > 0 and any m ∈ ℕ,

  Pr_{S ∼ Q^m} [Err_Q(A(S)) − inf_{h ∈ H} Err_Q(h) > ε]
   = Pr_{S ∼ Q^m} [Err_{F⁻¹(Q)}(B(S′, D)) − inf_{h ∈ H} Err_{F⁻¹(Q)}(h) > ε]
   = Pr_{S′ ∼ F⁻¹(Q)^m} [Err_{F⁻¹(Q)}(B(S′, D)) − inf_{h ∈ H} Err_{F⁻¹(Q)}(h) > ε].

Consequently, for any ε, δ > 0,

  m(A, H, Q, ε, δ) = m(B, H, F⁻¹(Q), ε, δ) ≤ sup_{P ∈ Ext(D)} m(B, H, P, ε, δ).

Taking the supremum over Q ∈ Ext(U) finishes the proof.

6.2 Sample Complexity of Learning Thresholds in the Realizable Case

In this section we consider learning the class of thresholds, H = {1(−∞, t] : t ∈ ℝ}, on the real line in the realizable setting, and show that for absolutely continuous unlabeled distributions SSL has at most a factor-2 advantage over SL in the sample complexity. First, in Theorem 8, we show a ln(1/δ)/ε upper bound on the sample complexity of supervised learning; this seems to be a folklore result. Second, we consider the sample complexity of semi-supervised learning in the case when D(P) is absolutely continuous with respect to the Lebesgue measure on ℝ. In Theorems 9 and 10 we show that this sample complexity lies between ln(1/δ)/(2.01ε) − O(1/ε) and ln(1/δ)/(2ε) + O(1/ε).¹ Ignoring the lower order terms, we see that the sample complexity of supervised learning is (asymptotically) at most 2 times larger than that of semi-supervised learning.

We will make use of the following two algorithms: the supervised algorithm L, and the semi-supervised algorithm B proposed by Kääriäinen [12]. On a sample S = ((x₁, y₁), (x₂, y₂), …, (x_m, y_m)), both algorithms first compute

  ℓ = max{x_i : i ∈ {1, 2, …, m}, y_i = 1},
  r = min{x_i : i ∈ {1, 2, …, m}, y_i = 0}.

Algorithm L simply outputs the hypothesis 1(−∞, ℓ]. Algorithm B makes use of its second input, the distribution D: provided that ℓ < r, B computes t′ = sup{t′ : D((ℓ, t′]) ≤ D((ℓ, r])/2} and outputs the hypothesis 1(−∞, t′].

Theorem 8 (SL upper bound). Let H be the class of thresholds and L the supervised algorithm defined above. For any D, any ε, δ > 0, and any "target" h ∈ H,

  m(L, H, D_h, ε, δ) ≤ ln(1/δ)/ε.

Proof. Let h = 1(−∞, t] and let s = sup{s′ : D((s′, t]) ≥ ε}. The event Err_{D_h}(L(S)) ≥ ε occurs precisely when the interval (s, t] contains no sample points. This happens with probability (1 − D((s, t]))^m ≤ (1 − ε)^m. If m ≥ ln(1/δ)/ε, then (1 − ε)^m ≤ exp(−εm) ≤ δ.
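The algorithms L and B defined above are easy to implement. The following sketch (ours, not the paper's code) represents the unlabeled distribution D by its CDF and finds B's point t′ by bisection; the sample and all numeric values are illustrative assumptions.

```python
# Sketch (ours, not from the paper) of the supervised algorithm L and the
# semi-supervised algorithm B for thresholds 1(-inf, t]. The unlabeled
# distribution D is represented by its CDF.

def algorithm_L(sample):
    """Output the threshold l of the hypothesis 1(-inf, l]."""
    return max(x for x, y in sample if y == 1)

def algorithm_B(sample, cdf):
    """Output t' = sup{t : D((l, t]) <= D((l, r]) / 2}, found by bisection."""
    l = max(x for x, y in sample if y == 1)  # rightmost positive example
    r = min(x for x, y in sample if y == 0)  # leftmost negative example
    target = (cdf(r) - cdf(l)) / 2           # half of the D-mass of (l, r]
    lo, hi = l, r
    for _ in range(60):                      # bisection on [l, r]
        mid = (lo + hi) / 2
        if cdf(mid) - cdf(l) <= target:
            lo = mid
        else:
            hi = mid
    return lo

sample = [(0.2, 1), (0.4, 1), (0.7, 0)]
uniform_cdf = lambda x: x  # CDF of the uniform distribution on (0, 1)
print(algorithm_L(sample))                         # 0.4
print(round(algorithm_B(sample, uniform_cdf), 6))  # 0.55
```

For the uniform distribution, B's output is simply the midpoint (ℓ + r)/2; for a general absolutely continuous D it is the D-median of the interval (ℓ, r].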
Theorem 9 (SSL upper bound). Let H be the class of thresholds and B the semi-supervised algorithm defined above. For any absolutely continuous distribution D over an open interval, any ε ∈ (0, 1/4), any δ ∈ (0, 1/2), and any "target" h ∈ H,

  m(B, H, D_h, ε, δ) ≤ ln(2/δ)/(2ε) + 2.

Proof. By the rescaling trick (Lemma 6, part (a)) we can assume that D is uniform over (0, 1). Fix ε ∈ (0, 1/4), δ ∈ (0, 1/2) and h ∈ H. We show that, for any m ≥ 2,

  Pr_{S ∼ D_h^m} [Err_{D_h}(B(S, D)) ≥ ε] ≤ 2(1 − 2ε)^m,  (5)

from which the theorem easily follows, since if m ≥ ln(2/δ)/(2ε) + 2, then m ≥ 2 and 2(1 − 2ε)^m ≤ 2 exp(−2mε) ≤ δ.

In order to prove (5), let h = 1(−∞, t] be the "target"; without loss of generality t ∈ [0, 1/2]. With a little abuse of notation, we assume ℓ ∈ [0, t] and r ∈ [t, 1]. For convenience, we define a : [0, t] → [t, 1] and b : [0, t] → [t, 1] by a(ℓ) = max(2t − ℓ − 2ε, t) and b(ℓ) = min(2t − ℓ + 2ε, 1), respectively. It is easily verified that Err_{D_h}(B(S, D)) < ε if and only if r ∈ [a(ℓ), b(ℓ)]. We lower bound the probability of success,

  p = Pr_{S ∼ D_h^m} [Err_{D_h}(B(S, D)) < ε].

There are two cases.

Case 1: t > 2ε. We integrate over all possible choices of the rightmost positive example in S (which determines ℓ) and of the leftmost negative example in S (which determines r); there are m(m − 1) such choices. We have

  p ≥ p₁ = m(m − 1) ∫₀^t ∫_{a(ℓ)}^{b(ℓ)} (1 − r + ℓ)^{m−2} dr dℓ.

Case 2: t ≤ 2ε. We again integrate over all possible choices of the rightmost positive and the leftmost negative example; additionally, we also consider samples without positive examples and integrate over all possible choices of the leftmost (negative) example. We have

  p ≥ p₂ = m(m − 1) ∫₀^t ∫_{a(ℓ)}^{b(ℓ)} (1 − r + ℓ)^{m−2} dr dℓ + m ∫_{a(0)}^{b(0)} (1 − r)^{m−1} dr.

Both cases split into further subcases, according to the relative positions of 0, t − 2ε, 2t + 2ε − 1 and t (equivalently, according to whether t + ε ≤ 1/2 and whether t + 4ε ≤ 1). In every subcase the integrals evaluate in closed form and yield p ≥ 1 − 2(1 − 2ε)^m. For instance, in the subcase t > 2ε, t + 4ε ≤ 1, t + ε ≤ 1/2 one obtains

  p₁ = 1 − ½(1 − 2t − 2ε)^m − ½(2t + 6ε − 1)^m − (1 − 2ε)^m ≥ 1 − 2(1 − 2ε)^m,

and in the subcase t > 2ε, t + ε > 1/2,

  p₁ = 1 − (3/2)(1 − 2ε)^m − ½(2t + 2ε − 1)^m ≥ 1 − 2(1 − 2ε)^m;

the remaining subcases are analogous. This establishes (5).

¹ The constant 2.01 in the lower bound can be replaced by any number strictly greater than 2. This slight imperfection is a consequence of the fact that the true dependence of the sample complexity on ε is of the form 1/ln(1 − 2ε) and not 1/(2ε).

Theorem 10 (SSL lower bound). For any (possibly randomized) semi-supervised algorithm A, any ε ∈ (0, 0.001), any δ > 0, and any absolutely continuous probability distribution D over an open interval, there exists h ∈ H such that

  m(A, H, D_h, ε, δ) ≥ ln(1/δ)/(2.01ε) − ln 2/(2.01ε).

Proof. By the rescaling trick (Lemma 7, part (a)) we can assume that D is uniform over (0, 1). Fix A, ε, δ. We show the existence of the required h by a probabilistic argument: we pick t uniformly at random from (0, 1) and let h = 1(−∞, t]. We prove that, for any m ≥ 0,

  E_t Pr_{S ∼ D_h^m} [Err_{D_h}(A(S, D)) ≥ ε] ≥ ½(1 − 2ε)^m.  (6)

The left-hand side can be rewritten as

  E_t Pr_{S ∼ D_h^m} [Err_{D_h}(A(S, D)) ≥ ε]
   = E_t E_{S ∼ D^m} 1{(t, S) : Err_{D_h}(A((S, h(S)), D)) ≥ ε}
   = E_{S ∼ D^m} E_t 1{(t, S) : Err_{D_h}(A((S, h(S)), D)) ≥ ε}
   = E_{S ∼ D^m} Pr_t [Err_{D_h}(A((S, h(S)), D)) ≥ ε],

where (S, h(S)) denotes the unlabeled sample S labeled according to h. To lower bound the last expression, fix unlabeled points 0 ≤ x₁ ≤ x₂ ≤ ⋯ ≤ x_m ≤ 1 and, for convenience, let x₀ = 0 and x_{m+1} = 1. We claim that

  Pr_t [Err_{D_h}(A((S, h(S)), D)) ≥ ε] ≥ Σ_{i=0}^m max(x_{i+1} − x_i − 2ε, 0).  (7)

To prove this, fix i ∈ {0, 1, …, m} and restrict t to lie in the interval (x_i, x_{i+1}]. The labels in (S, h(S)) are then fixed, and hence the hypothesis g = A((S, h(S)), D) is fixed. It is not hard to see that, regardless of g,

  Pr_t [t ∈ (x_i, x_{i+1}] : Err_{D_h}(g) ≥ ε] ≥ max(x_{i+1} − x_i − 2ε, 0),

which follows from the fact that the set {t : Err_{D_h}(g) < ε} is contained in an interval of length at most 2ε. Summing over all i we obtain (7).

In order to prove (6), we compute the expectation over S ∼ D^m of both sides of (7). The expectation of the left side of (7) equals the left side of (6). The expectation of the right side of (7) equals

  I_m = m! ∫₀^{x_{m+1}} ∫₀^{x_m} ⋯ ∫₀^{x₂} Σ_{i=0}^m max(x_{i+1} − x_i − 2ε, 0) dx₁ ⋯ dx_{m−1} dx_m,

since there are m! equiprobable choices for the order of the points x₁, x₂, …, x_m, among which we choose, without loss of generality, the one with x₁ ≤ x₂ ≤ ⋯ ≤ x_m. We look at I_m as a function of x_{m+1} and prove that

  I_m(x_{m+1}) = (max(x_{m+1} − 2ε, 0))^{m+1}  (8)

for any m ≥ 0 and any x_{m+1} ∈ [0, 1]. The bound (6) follows from (8), since

  I_m = I_m(1) = (1 − 2ε)^{m+1} ≥ ½(1 − 2ε)^m  for ε ≤ 1/4.

In turn, (8) follows, by induction on m, from the recurrence

  I_m(x_{m+1}) = m ∫₀^{x_{m+1}} [ I_{m−1}(x_m) + max(x_{m+1} − x_m − 2ε, 0) · x_m^{m−1} ] dx_m,

which is valid for all m ≥ 1. In the base case, m = 0, I₀(x₁) = max(x₁ − 2ε, 0) follows trivially from the definition. In the inductive case, m ≥ 1, we consider two cases. In the first case, x_{m+1} < 2ε, every max(x_{i+1} − x_i − 2ε, 0) = 0 and hence, by definition, I_m(x_{m+1}) = 0. In the second case, x_{m+1} ≥ 2ε, from the recurrence and the induction hypothesis we have

  I_m(x_{m+1}) = m ∫₀^{x_{m+1}} (max(x_m − 2ε, 0))^m dx_m + m ∫₀^{x_{m+1}} max(x_{m+1} − x_m − 2ε, 0) · x_m^{m−1} dx_m
   = m ∫_{2ε}^{x_{m+1}} (x_m − 2ε)^m dx_m + m ∫₀^{x_{m+1} − 2ε} (x_{m+1} − x_m − 2ε) · x_m^{m−1} dx_m
   = (m/(m+1)) (x_{m+1} − 2ε)^{m+1} + (1/(m+1)) (x_{m+1} − 2ε)^{m+1}
   = (x_{m+1} − 2ε)^{m+1}.

To finish the proof of the theorem, suppose m < ln(1/δ)/(2.01ε) − ln 2/(2.01ε). Then ½(1 − 2ε)^m > δ, since

  ln(½(1 − 2ε)^m) = −ln 2 + m ln(1 − 2ε) > −ln 2 − 2.01εm > ln δ,

where we have used that ln(1 − 2ε) > −2.01ε for any ε ∈ (0, 0.001). Therefore, by (6), for at least one target h = 1(−∞, t], with probability greater than δ, algorithm A fails to output a hypothesis with error less than ε.
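As a quick numerical sanity check of the constants in Theorems 8–10 (our own illustration, not part of the paper's argument), one can tabulate the three bounds for small ε and δ; the ratio of the supervised to the semi-supervised upper bound approaches 2:

```python
import math

def m_sl_upper(eps, delta):
    """Theorem 8: supervised upper bound ln(1/delta) / eps."""
    return math.log(1 / delta) / eps

def m_ssl_upper(eps, delta):
    """Theorem 9: semi-supervised upper bound ln(2/delta) / (2 eps) + 2."""
    return math.log(2 / delta) / (2 * eps) + 2

def m_ssl_lower(eps, delta):
    """Theorem 10: semi-supervised lower bound (ln(1/delta) - ln 2) / (2.01 eps)."""
    return (math.log(1 / delta) - math.log(2)) / (2.01 * eps)

eps, delta = 0.0005, 1e-6  # illustrative values
print(m_ssl_lower(eps, delta) <= m_ssl_upper(eps, delta))  # True
print(m_sl_upper(eps, delta) / m_ssl_upper(eps, delta))    # about 1.9
```

As δ → 0 with ε fixed, the lower-order terms become negligible and the ratio tends to 2, matching the "factor 2" discussion above.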
Remark. The ln(1/δ)/(2.01ε) − O(1/ε) lower bound applies to supervised learning as well. However, we do not know of any supervised algorithm (deterministic or randomized) that has asymptotic sample complexity c · ln(1/δ)/ε for any constant c < 1. For example, the randomized algorithm that outputs with probability 1/2 the hypothesis 1(−∞, ℓ] and with probability 1/2 the hypothesis 1(−∞, r) still cannot achieve the SSL sample complexity. We conjecture that all supervised algorithms for learning thresholds on the real line in the realizable setting have asymptotic sample complexity at least ln(1/δ)/ε.

6.3 Sample Complexity in the Agnostic Case

In this section we show that, even in the agnostic setting, SSL does not give more than a constant-factor improvement over SL, and we prove lower bounds for some classes over the real line. We introduce the notion of a b-shatterable distribution, which, intuitively, is a distribution with b "clusters" that can be shattered by the concept class. The main lower bound of this section is for such distributions (see Theorem 15).
We show how this lower bound results in tight sample complexity bounds for two concrete problems. The first is learning thresholds on the real line, where we show a bound of Θ(ln(1/δ)/ε²). Then we show a sample complexity of Θ((d + ln(1/δ))/ε²) for unions of d intervals on the real line. The sample complexity of unions of d intervals for a fixed distribution in a noisy setting has also been investigated by Gentile and Helmbold [11]. They show a lower bound of Ω(d log(1/Δ)/(Δ(1 − 2η)²)), where Δ is the distance to the target that the learning algorithm should guarantee with high probability, and η is the probability of a wrong label appearing (see the classification noise model of [1]). This notation implies that the difference between the true error of the target and that of the algorithm's output is ε = Δ(1 − 2η). Setting η = 1/2 − ε/4 gives a bound of Ω(d/ε²). We note that we do not make the assumption of a constant level of noise for each unlabeled example; it turns out, however, that in our proofs we do construct worst-case distributions that have a constant noise rate slightly below 1/2.

We point out two main differences between our results and those of Gentile and Helmbold. The first is that we explicitly construct noisy distributions to obtain the ε² in the denominator. The second is that our technique appears to be quite different from theirs, which uses an information-theoretic approach, whereas we make use of known techniques based on lower bounding how well one can distinguish similar noisy distributions, followed by an averaging argument.

The main tools used in this section come from Anthony and Bartlett [2, Chapter 5]. We first cite, in Lemma 11, a result on how many examples are needed to distinguish two similar Bernoulli distributions. Then, in Lemma 12, we prove an analogue of this for arbitrary unlabeled distributions. The latter result is used to obtain the lower bound of Theorem 15 for b-shatterable distributions (see Definition 14).
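As an illustration (ours, not from the paper), the quantity F(m, γ) = (1/4)(1 − √(1 − e^{−4mγ²/(1−4γ²)})) from the two-coin lemma cited below is easy to evaluate numerically, and one can check that at any sample size m not exceeding ((1 − 4γ²)/(4γ²)) ln(1/(8δ)), the probability of misidentifying the coin still exceeds δ; the values of γ and δ are illustrative assumptions.

```python
import math

def F(m, gamma):
    """Lower bound on the probability of misidentifying the biased coin."""
    x = 4 * m * gamma**2 / (1 - 4 * gamma**2)
    return 0.25 * (1 - math.sqrt(1 - math.exp(-x)))

def m_required(gamma, delta):
    """The necessary sample size (1 - 4 gamma^2)/(4 gamma^2) * ln(1/(8 delta))."""
    return (1 - 4 * gamma**2) / (4 * gamma**2) * math.log(1 / (8 * delta))

gamma, delta = 0.05, 0.01
m = math.floor(m_required(gamma, delta))
print(m)                     # 250
print(F(m, gamma) > delta)   # True: at (or below) this m, failure still exceeds delta
```

This matches the role of the bound as a necessary condition: F(m, γ) is decreasing in m, and only sample sizes strictly above the threshold can push the failure probability below δ.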
Corollaries 13 and 16 give tight sample complexity bounds for thresholds and for unions of intervals on ℝ.

Lemma 11 (Anthony and Bartlett [2]). Suppose that P is a random variable uniformly distributed on {P₁, P₂}, where P₁, P₂ are Bernoulli distributions over {0, 1} with P₁(1) = 1/2 − γ and P₂(1) = 1/2 + γ for 0 < γ < 1/2. Suppose that ξ₁, …, ξ_m are i.i.d. {0, 1}-valued random variables with Pr(ξᵢ = 1) = P(1) for each i. Let f be any function from {0, 1}^m to {P₁, P₂}. Then

  E_P Pr_{ξ ∼ P^m} [f(ξ) ≠ P] > (1/4)(1 − √(1 − exp(−4mγ²/(1 − 4γ²)))) =: F(m, γ).

One can view the lemma this way: if one randomly picks one of two coins with similar biases, then there is a lower bound on the confidence with which any rule can identify, from m tosses, which coin was picked. The next result is similar, except that an unlabeled distribution D is fixed, and the distributions we want to distinguish are extensions of D.

Lemma 12. Fix X, H, a distribution D over X, and m > 0. Suppose there exist h, g ∈ H with D(h△g) > 0. Let P_h and P_g be the extensions of D such that P_h((x, h(x)) | x) = P_g((x, g(x)) | x) = 1/2 + γ. Let A_D : (h△g × {0, 1})^m → H be any function. Then for any x₁, …, x_m ∈ h△g there exists P ∈ {P_h, P_g} such that if yᵢ ∼ P_{xᵢ} for all i, then

  Pr_{yᵢ ∼ P_{xᵢ}} [Err_P(A_D((x₁, y₁), …, (x_m, y_m))) − OPT_P > γ · D(h△g)] > F(m, γ),

where P_x is the conditional distribution of P given x, and OPT_P = 1/2 − γ. Thus, if the probability of failure is to be at most δ, we require

  m ≥ ((1 − 4γ²)/(4γ²)) ln(1/(8δ)).  (9)

Proof. Suppose for a contradiction that this is not true. Let 𝒫 = {P_h, P_g}. Then there exist A_D and x₁, …, x_m such that for all P ∈ 𝒫,

  Pr_{yᵢ ∼ P_{xᵢ}} [Err_P(A_D((x₁, y₁), …, (x_m, y_m))) − OPT_P > γ · D(h△g)] ≤ F(m, γ).  (10)

We will show that the lower bound of Lemma 11 is then violated. The set h△g can be partitioned into Δ₀ = {x : h(x) = 0} and Δ₁ = {x : h(x) = 1}. Without loss of generality, assume {x₁, …, x_l} ⊆ Δ₀ and {x_{l+1}, …, x_m} ⊆ Δ₁. Let A = A_D((x₁, y₁), …, (x_m, y_m)). From the triangle inequality, D(A△h) + D(A△g) ≥ D(h△g); thus if A is closer to h then D(A△g) ≥ D(h△g)/2, and vice versa. Let P be a random variable uniformly distributed on 𝒫, and let ξ₁, …, ξ_m be i.i.d. {0, 1}-valued random variables with Pr(ξᵢ = 1) = 1/2 − γ when P = P_h and Pr(ξᵢ = 1) = 1/2 + γ when P = P_g. We define the function f : {0, 1}^m → 𝒫 as follows. It takes as input ξ₁, …, ξ_m and transforms it into an input of A_D,

  I = ((x₁, ξ₁), …, (x_l, ξ_l), (x_{l+1}, 1 − ξ_{l+1}), …, (x_m, 1 − ξ_m)),

so that ξᵢ and 1 − ξⱼ have the same distribution as yᵢ and yⱼ, respectively, for i ≤ l and j > l. Then define

  f(ξ₁, …, ξ_m) = P_h if D(A_D(I)△h) < D(A_D(I)△g), and P_g otherwise.

Writing h_P for the target (h or g) corresponding to P, we have

  E_P Pr_{ξ ∼ P^m} [f(ξ) ≠ P] ≤ E_P Pr [D(A_D(I)△h_P) ≥ D(h△g)/2]
   ≤ E_P Pr [Err_P(A_D(I)) − OPT_P > γ · D(h△g)]
   ≤ F(m, γ),

where the second inequality uses Err_P(A_D(I)) − OPT_P = 2γ · D(A_D(I)△h_P), and the last inequality follows from (10). This contradicts Lemma 11, so the lower bound from Lemma 11 must apply. If the probability of failure F(m, γ) is to be at most δ, solving the inequality for m gives (9).

Corollary 13. The SSL sample complexity of learning thresholds over the uniform distribution on (0, 1) is Θ(ln(1/δ)/ε²).

Proof. The upper bound is achieved by any ERM algorithm. For the lower bound, let h = 1(−∞, 0] and g = 1(−∞, 1], so that D(h△g) = 1, and set γ = ε in Lemma 12.

Definition 14. The triple (X, H, D) is b-shatterable if there exist disjoint sets C₁, C₂, …, C_b with D(Cᵢ) = 1/b for each i, such that for each S ⊆ {1, 2, …, b} there exists h ∈ H with

  h ∩ (⋃_{i=1}^b Cᵢ) = ⋃_{i ∈ S} Cᵢ.
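Definition 14 is easy to experiment with numerically. The following sketch (ours; the cell construction is the one used below for unions of d intervals) takes b = 2d equal-mass cells Cᵢ = [(i − 1)/b, i/b) under the uniform distribution and verifies, by counting maximal runs of consecutive cell indices, that every subset S of cells induces a union of at most d intervals:

```python
from itertools import combinations

# Sketch (ours): cells C_i = [(i-1)/b, i/b), i = 1..b, each of uniform mass 1/b.
# h_S = union of the cells C_i with i in S; its number of maximal intervals
# equals the number of maximal runs of consecutive indices in S.

def num_runs(S, b):
    """Count maximal blocks of consecutive indices of S within {1, ..., b}."""
    return sum(1 for i in range(1, b + 1) if i in S and (i - 1) not in S)

d = 3
b = 2 * d
all_ok = all(num_runs(set(S), b) <= d
             for r in range(b + 1)
             for S in combinations(range(1, b + 1), r))
print(all_ok)   # True: every h_S is a union of at most d intervals
```

The worst case is an alternating subset such as S = {1, 3, 5}, which attains exactly d runs; no subset of {1, …, 2d} can exceed d maximal runs.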
Theorem 15. If (X, H, D) is b-shatterable and H contains h, g with D(h△g) = 1, then a lower bound on the SSL sample complexity, for 0 < ε, δ < 1/64, is

  Ω((b + ln(1/δ))/ε²).

Proof. The proof is similar to that of Theorem 5.2 in Anthony and Bartlett [2]. Let G = {h₁, h₂, …, h_{2^b}} be the class of functions that b-shatters D with respect to C = {C₁, …, C_b}. We construct noisy extensions of D, 𝒫 = {P₁, P₂, …, P_{2^b}}, so that for each i, Pᵢ((x, hᵢ(x))) = (1 + 2γ)/(2b). For any h ∈ H let snap(h) = argmin_{h′ ∈ G} D(h△h′). Suppose P ∈ 𝒫, and let h* denote the optimal classifier, which is some element of G depending on the choice of P. If i ≠ j and N(hᵢ, hⱼ) is the number of sets in C on which hᵢ and hⱼ disagree, then D(hᵢ△hⱼ) = N(hᵢ, hⱼ)/b, and since G is a (1/b)-packing,

  Err_P(h) ≥ Err_P(h*) + γ · N(snap(h), h*)/b = ½ (Err_P(snap(h)) + Err_P(h*)).  (11)

Modifying the proof of Anthony and Bartlett to use Lemma 12 rather than Lemma 11, we get that there exists P ∈ 𝒫 such that whenever m ≤ b/(320ε²),

  Pr_{S ∼ P^m} [Err_P(snap(A(D, S))) − Err_P(h*) > 2ε] > δ.

Whenever A fails in this sense, we get from (11) that

  Err_P(A(D, S)) − Err_P(h*) ≥ ½ (Err_P(snap(A(D, S))) − Err_P(h*)) > ε.

To get the Ω(ln(1/δ)/ε²) part of the bound, apply Lemma 12 with h and g.

We will now apply the above theorem to give the sample complexity of learning unions of intervals on the real line. Recall that, by the rescaling trick, we only need to consider the sample complexity with respect to the uniform distribution on (0, 1).

Corollary 16. The SSL sample complexity of learning the class of unions of at most d intervals,

  UI_d = {[a₁, a₂) ∪ ⋯ ∪ [a_{2l−1}, a_{2l}) : l ≤ d, 0 ≤ a₁ ≤ a₂ ≤ ⋯ ≤ a_{2l} ≤ 1},

over the uniform distribution on (0, 1) is

  Θ((d + ln(1/δ))/ε²).

Proof. We have VC(UI_d) = 2d, so the upper bound follows immediately. For the lower bound, construct 2d-shatterable sets by letting Cᵢ = [(i − 1)/(2d), i/(2d)) for i = 1, …, 2d, and for any S ⊆ {1, …, 2d} define h_S = ⋃_{i ∈ S} Cᵢ. If |S| ≤ d then clearly h_S ∈ UI_d. If |S| > d, then the complement {1, …, 2d} \ S has fewer than d elements, so [0, 1) \ h_S can be covered by at most d intervals; but then h_S itself consists of at most d maximal intervals, so again h_S ∈ UI_d. Thus the set {h_S : S ⊆ {1, …, 2d}} 2d-shatters D on [0, 1). Finally, let h = [0, 0) = ∅ and g = [0, 1), and apply Theorem 15 for the bound.

6.4 No Optimal Semi-Supervised Algorithm

One could imagine a different formulation of the comparison between the SL and SSL paradigms. For example, one might ask naively whether, for a given class H, there is a semi-supervised algorithm A such that for any supervised algorithm B, any ε, δ, and any probability distribution P, the sample complexity of A is no higher than the sample complexity of B. The answer to this question is easily seen to be negative, because for any P there exists a supervised learning algorithm B_P that ignores the labeled examples and simply outputs the hypothesis h ∈ H with minimum error Err_P(h) (or even the Bayes optimal classifier for P). On P the sample complexity of B_P is zero; unfortunately, on a P′ sufficiently different from P, the sample complexity of B_P is infinite.

One might disregard algorithms such as B_P and ask the same question as above, except quantifying over only the subset of algorithms that, on any distribution over X × {0, 1}, have sample complexity polynomial in 1/ε and ln(1/δ). Such algorithms are often called PAC (Probably Approximately Correct). The following theorem demonstrates that this restriction does not help and the answer to the question is still negative.

Theorem 17. Let H = {1(−∞, t] : t ∈ ℝ} be the class of thresholds over the real line. For any absolutely continuous distribution D (with respect to the Lebesgue measure on ℝ), any semi-supervised algorithm A, any ε > 0 and any δ ∈ (0, 1/2), there exist a distribution P ∈ Ext(D) and a supervised PAC learning algorithm B such that

  m(A, H, P, ε, δ) > m(B, H, P, ε, δ).

Proof. Fix A, D, ε and δ. Let L be the algorithm that chooses the leftmost empirical error minimizer; that is, on a sample S, L outputs 1(−∞, t̂], where

  t̂ = inf{t ∈ ℝ : Err_S(1(−∞, t]) = min_{h ∈ H} Err_S(h)}.

For any h ∈ H we also define the algorithm L_h, which outputs h if Err_S(h) = 0, and otherwise outputs L(S). First, for any h, L_h outputs a hypothesis that minimizes the empirical error, and since VC(H) = 1, it is a PAC algorithm. Second, the sample complexity of L_h on D_h is clearly zero (regardless of ε and δ). On the other hand, Theorem 10 shows that there exists h ∈ H such that the sample complexity of A on D_h is positive; in fact, it increases as ε and δ approach zero. Thus there exists a supervised PAC algorithm B = L_h with lower sample complexity than A.
7 Conclusion

We provide a formal analysis of the sample complexity of semi-supervised learning compared to that of learning from labeled data only. We focus on bounds that do not depend on assumptions concerning the relationship between the labels and the unlabeled data distribution. Our main conclusion is that in such a setting semi-supervised learning has limited advantage: formally, we show that for basic concept classes over the real line this advantage is never more than a constant factor in the sample size. We believe that this phenomenon applies much more widely.

We also briefly address the error bounds under common assumptions on the relationship between unlabeled data and the labels. We demonstrate that even when such assumptions apply, common SSL paradigms may be inferior to standard empirical risk minimization. We conclude that prior beliefs like the cluster assumption should be formulated more precisely to reflect the known practical merits of SSL. This discussion highlights a dire deficiency in the current approach to semi-supervised learning: common assumptions about these label–unlabeled structure relationships do not offer any method for reliably checking whether they hold in a given learning problem.

The paper calls attention to, and formalizes, some natural fundamental questions about the theory-practice gap concerning semi-supervised learning. The major open question we raise is whether any semi-supervised learning algorithm can achieve sample size guarantees that are unattainable without access to unlabeled data. This is formalized in Conjectures 4 and 5.

Acknowledgements. We would like to thank Nati (Nathan) Srebro and Vitaly Feldman for useful discussions.

References

[1] Dana Angluin and Philip D. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1987.
[2] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, January 1999.
[3] Maria-Florina Balcan and Avrim Blum. A PAC-style model for learning from labeled and unlabeled data. In Proceedings of the 18th Annual Conference on Learning Theory, pages 111–126. Springer, 2005.
[4] Maria-Florina Balcan and Avrim Blum. An augmented PAC model for semi-supervised learning. In Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors, Semi-Supervised Learning, chapter 21, pages 61–89. MIT Press, September 2006.
[5] Gyora M. Benedek and Alon Itai. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2):377–389, 1991.
[6] Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.
[7] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, September 2006.
[8] Fabio Cozman and Ira Cohen. Risks of semi-supervised learning: how unlabeled data can degrade performance of generative classifiers. In Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors, Semi-Supervised Learning, chapter 4, pages 57–72. MIT Press, September 2006.
[9] Ran El-Yaniv and Dmitry Pechyony. Stable transductive learning. In COLT, pages 35–49, 2006.
[10] Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications. In COLT, pages 157–171, 2007.
[11] Claudio Gentile and David P. Helmbold. Improved lower bounds for learning from noisy examples: an information-theoretic approach. In Proceedings of COLT 1998, pages 104–115. ACM, 1998.
[12] Matti Kääriäinen. Generalization error bounds using unlabeled data. In Proceedings of COLT 2005, pages 127–142. Springer, 2005.
[13] Philippe Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8:1369–1392, 2007.
[14] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.
[15] Vladimir N. Vapnik. Transductive inference and semi-supervised learning. In Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors, Semi-Supervised Learning, chapter 24, pages 453–472. MIT Press, September 2006.