Statistical Analysis of Semi-Supervised Regression

John Lafferty
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
lafferty@cs.cmu.edu

Larry Wasserman
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
larry@stat.cmu.edu

Abstract

Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on manifold regularization and graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only the labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.

1 Introduction

Suppose that we have labeled data $L = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ and unlabeled data $U = \{X_{n+1}, \ldots, X_N\}$, where $N \gg n$ and $X_i \in \mathbb{R}^D$. Ordinary regression and classification techniques use L to predict Y from X. Semi-supervised methods also use the unlabeled data U in an attempt to improve the predictions. To justify these procedures, it is common to invoke one or both of the following assumptions:

Manifold Assumption (M): The distribution of X lives on a low dimensional manifold.

Semi-Supervised Smoothness Assumption (SSS): The regression function $m(x) = \mathbb{E}[Y \mid X = x]$ is very smooth where the density $p(x)$ of X is large. In particular, if there is a path connecting $X_i$ and $X_j$ on which $p(x)$ is large, then $Y_i$ and $Y_j$ should be similar with high probability.

While these assumptions are somewhat intuitive, and synthetic examples can easily be constructed to demonstrate good performance of various techniques, there has been very little theoretical analysis of semi-supervised learning that rigorously shows how the assumptions lead to improved performance of the estimators.

In this paper we provide a statistical analysis of semi-supervised methods for regression, and propose some new techniques that provably lead to better inferences, under appropriate assumptions. In particular, we explore precise formulations of SSS, which is motivated by the intuition that high density level sets correspond to clusters of similar objects, but which, as stated above, is quite vague. To the best of our knowledge, no papers have made the assumption precise and then explored its consequences in terms of rates of convergence, with the exception of one of the first papers on semi-supervised learning, by Castelli and Cover (1996), which evaluated a simple mixture model, and the recent paper of Rigollet (2006) in the context of classification. This situation is striking, given the level of activity in this area within the machine learning community; for example, the recent survey of semi-supervised learning by Zhu (2006) contains 163 references.
Among our findings are the following:

1. Under the manifold assumption M, the semi-supervised smoothness assumption SSS is superfluous. This point was made heuristically by Bickel and Li (2006), but we show that in fact ordinary regression methods are automatically adaptive if the distribution of X concentrates on a manifold.

2. Without the manifold assumption M, the semi-supervised smoothness assumption SSS as usually defined is too weak, and current methods do not lead to improved inferences. In particular, methods that use manifold regularization based on graph Laplacians do not achieve faster rates of convergence.

3. Assuming specific conditions that relate m and p, we develop new semi-supervised methods that lead to improved estimation. In particular, we propose estimators that reduce bias by estimating the Hessian of the regression function, improve the choice of bandwidths using unlabeled data, and estimate the regression function on level sets.

The focus of the paper is on a theoretical analysis of semi-supervised regression techniques, rather than on the development of practical new algorithms. While we emphasize regression, most of our results have analogues for classification. Our intent is to bring the statistical perspective of minimax analysis to bear on the problem, in order to study the interplay between the labeled sample size and the unlabeled sample size, and between the regression function and the data density. By studying simplified versions of the problem, our analysis suggests how precise formulations of assumptions M and SSS can be made and exploited to lead to improved estimators.

2 Preliminaries

The data are $(X_1, Y_1, R_1), \ldots, (X_N, Y_N, R_N)$, where $R_i \in \{0, 1\}$ and we observe $Y_i$ only if $R_i = 1$. The labeled data are $L = \{(X_i, Y_i) : R_i = 1\}$ and the unlabeled data are $U = \{X_i : R_i = 0\}$. For convenience, assume that the data are indexed so that $R_i = 1$ for $i = 1, \ldots, n$ and $R_i = 0$ for $i = n+1, \ldots, N$. Thus, the labeled sample size is $n$ and the unlabeled sample size is $u = N - n$. Let $p(x)$ be the density of X and let $m(x) = \mathbb{E}(Y \mid X = x)$ denote the regression function. Assume that $R \perp Y \mid X$ (missing at random) and that $R_i \mid X_i \sim \mathrm{Bernoulli}(\pi(X_i))$. Finally, let $\mu = \mathbb{P}(R_i = 1) = \int \pi(x)\, p(x)\, dx$. For simplicity we assume that $\pi(x) = \mu$ for all $x$. The missing at random assumption $R \perp Y \mid X$ is crucial, although this point is rarely emphasized in the machine learning literature.

It is clear that without some further conditions, the unlabeled data are useless. The key assumption we need is that there is some correspondence between the shape of the regression function m and the shape of the data density p.

We will use minimax theory to judge the quality of an estimator. Let $\mathcal{R}$ denote a class of regression functions and let $\mathcal{F}$ denote a class of density functions. In the classical setting, we observe labeled data $(X_1, Y_1), \ldots, (X_n, Y_n)$. The pointwise minimax risk, or mean squared error (MSE), is defined by

$$R_n(x) = \inf_{\hat m_n} \sup_{m \in \mathcal{R},\, p \in \mathcal{F}} \mathbb{E}\big(\hat m_n(x) - m(x)\big)^2 \qquad (1)$$

where the infimum is over all estimators. The global minimax risk is defined by

$$R_n = \inf_{\hat m_n} \sup_{m \in \mathcal{R},\, p \in \mathcal{F}} \mathbb{E}\int \big(\hat m_n(x) - m(x)\big)^2\, dx. \qquad (2)$$

A typical assumption is that $\mathcal{R}$ is the Sobolev space of order two, meaning essentially that m has smooth second derivatives. In this case we have¹ $R_n \asymp n^{-4/(4+D)}$. The minimax rate is achieved by kernel estimators and local polynomial estimators. In particular, for kernel estimators, if we use a product kernel with a common bandwidth $h_n$ for each variable, choosing $h_n \asymp n^{-1/(4+D)}$ yields an estimator that attains the minimax rate (a small sketch is given below). The difficulty is that the rate $R_n \asymp n^{-4/(4+D)}$ is extremely slow when D is large.

¹We write $a_n \asymp b_n$ to mean that $a_n/b_n$ is bounded away from 0 and infinity for large n. We have suppressed some technicalities such as moment assumptions on $\epsilon = Y - m(X)$.
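To make the rate discussion concrete, here is a minimal sketch (our own illustration, not part of the paper) of the locally constant kernel estimator with a Gaussian product kernel and the rate-optimal bandwidth $h_n \asymp n^{-1/(4+D)}$. All function names and the toy data are assumptions made for the example.

```python
import numpy as np

def kernel_regression(X, Y, x0, h):
    """Locally constant (Nadaraya-Watson) estimate of m(x0) with a Gaussian product kernel."""
    # product kernel: K_h(X_i, x0) = prod_d exp(-(X_id - x0_d)^2 / (2 h^2))
    weights = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2.0 * h ** 2))
    return np.sum(weights * Y) / np.sum(weights)

# toy data in D = 2 dimensions
rng = np.random.default_rng(0)
n, D = 200, 2
X = rng.uniform(-1, 1, size=(n, D))
m_true = lambda x: np.sin(np.pi * x[:, 0]) * x[:, 1]
Y = m_true(X) + 0.1 * rng.standard_normal(n)

h_n = n ** (-1.0 / (4 + D))          # rate-optimal bandwidth h_n ~ n^{-1/(4+D)}
x0 = np.array([0.3, -0.2])
print(kernel_regression(X, Y, x0, h_n))
```

As the text notes, the same bandwidth order, with a local linear rather than locally constant fit, attains the minimax rate $n^{-4/(4+D)}$ over the Sobolev class.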
In more detail, let $C > 0$, let $B$ be a positive definite matrix, and define

$$\mathcal{R} = \Big\{ m : \big|m(x) - m(x_0) - (x - x_0)^T \nabla m(x_0)\big| \le \tfrac{C}{2}\, (x - x_0)^T B\, (x - x_0) \Big\} \qquad (3)$$

$$\mathcal{F} = \Big\{ p : p(x) \ge b > 0,\ |p(x_1) - p(x_2)| \le c\, \|x_1 - x_2\| \Big\}. \qquad (4)$$

Fan (1993) shows that the local linear estimator is asymptotically minimax for this class. This estimator is given by $\hat m_n(x) = \hat a_0$, where $(\hat a_0, \hat a_1)$ minimizes $\sum_{i=1}^n \big(Y_i - a_0 - a_1^T (X_i - x)\big)^2 K\big(H^{-1/2}(X_i - x)\big)$, where $K$ is a symmetric kernel and $H$ is a matrix of bandwidths. The asymptotic MSE of the local linear estimator $\hat m(x)$ using the labeled data is

$$R(H) = \frac{1}{4}\,\mu_2(K)^2\,\mathrm{tr}^2\big(H_m(x) H\big) + \frac{\sigma_0^2 \int K^2(u)\, du}{n\, |H|^{1/2}\, p(x)} + o\big(\mathrm{tr}(H)\big) \qquad (5)$$

where $H_m(x)$ is the Hessian of m at x, $\mu_2(K) = \int u^2 K(u)\, du$ and $\sigma_0^2$ is a constant. The optimal bandwidth matrix $H_*$ is given by

$$H_* = \left(\frac{\sigma_0^2\, |H_m|^{1/2}}{\mu_2(K)^2\, n\, D\, p(x)}\right)^{2/(D+4)} (H_m)^{-1} \qquad (6)$$

and $R(H_*) = O(n^{-4/(4+D)})$. This result is important to what follows, because it suggests that if the Hessian $H_m$ of the regression function is related to the Hessian $H_p$ of the data density, one may be able to estimate the optimal bandwidth matrix from unlabeled data in order to reduce the risk.

3 The Manifold Assumption

It is common in the literature to invoke both M and SSS. But if M holds, SSS is not needed. This is argued by Bickel and Li (2006), who say, "We can unwittingly take advantage of low dimensional structure without knowing it." Suppose $X \in \mathbb{R}^D$ has support on a manifold $M$ with dimension $d < D$. Let $\hat m_h$ be the local linear estimator with diagonal bandwidth matrix $H = h^2 I$. Then Bickel and Li show that the bias and variance are

$$b(x) = h^2 J_1(x)\,(1 + o_P(1)) \quad \text{and} \quad v(x) = \frac{J_2(x)}{n h^d}\,(1 + o_P(1)) \qquad (7)$$

for some functions $J_1$ and $J_2$. Choosing $h \asymp n^{-1/(4+d)}$ yields a risk of order $n^{-4/(4+d)}$, which is the optimal rate for data that lie on a manifold of dimension d.

To use the above result we would need to know d. Bickel and Li argue heuristically that the following procedure will lead to a reasonable bandwidth. First, estimate d using the procedure in Levina and Bickel (2005). Now let $B = \{c_1/n^{1/(\hat d + 4)}, \ldots, c_B/n^{1/(\hat d + 4)}\}$ be a set of bandwidths, scaling the asymptotic order $n^{-1/(\hat d + 4)}$ by different constants. Finally, choose the bandwidth $\hat h \in B$ that minimizes a local cross-validation score.

We now show that, in fact, one can skip the step of estimating d. Let $E_1, \ldots, E_n$ be independent Bernoulli$(\xi = \tfrac12)$ random variables. Split the data into two groups, so that $I_0 = \{i : E_i = 0\}$ and $I_1 = \{i : E_i = 1\}$. Let $\mathcal{H} = \{n^{-1/(4+d)} : 1 \le d \le D\}$. Construct $\hat m_h$ for $h \in \mathcal{H}$ using the data in $I_0$, and estimate the risk from $I_1$ by setting $\hat R(h) = |I_1|^{-1}\sum_{i \in I_1}(Y_i - \hat m_h(X_i))^2$. Finally, let $\hat h$ minimize $\hat R(h)$ and set $\hat m = \hat m_{\hat h}$. For simplicity, let us assume that both $Y_i$ and $X_i$ are bounded by a finite constant B; a code sketch of this selection procedure is given below.
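The following sketch is our own illustration of the data-splitting selection rule just described; it is not the paper's implementation. For brevity it uses a locally constant Nadaraya-Watson smoother in place of the local linear estimator, and all names are hypothetical.

```python
import numpy as np

def nw_predict(X_tr, Y_tr, X_te, h):
    """Locally constant (Nadaraya-Watson) predictions with bandwidth h."""
    # pairwise squared distances between test and training points
    d2 = np.sum((X_te[:, None, :] - X_tr[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / (2.0 * h ** 2))
    return W @ Y_tr / np.maximum(W.sum(axis=1), 1e-12)

def select_bandwidth(X, Y, D, rng):
    """Choose h from {n^{-1/(4+d)} : d = 1,...,D} by a Bernoulli(1/2) sample split."""
    n = len(Y)
    E = rng.integers(0, 2, size=n).astype(bool)      # E_i ~ Bernoulli(1/2)
    I0, I1 = ~E, E                                    # fit on I_0, estimate risk on I_1
    H = [n ** (-1.0 / (4 + d)) for d in range(1, D + 1)]
    risks = [np.mean((Y[I1] - nw_predict(X[I0], Y[I0], X[I1], h)) ** 2) for h in H]
    return H[int(np.argmin(risks))]                   # h_hat minimizing R_hat(h)
```

No estimate of the intrinsic dimension d is required: the candidate set $\mathcal{H}$ simply contains one bandwidth per possible dimension, and the held-out risk picks among them.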
Theorem 1. Suppose that $|Y_i| \le B$ and $|X_{ij}| \le B$ for all i and j. Assume the conditions in Bickel and Li (2006), and suppose that the data density $p(x)$ is supported on a manifold of dimension $d \ge 4$. Then we have that

$$\mathbb{E}\big(\hat m(x) - m(x)\big)^2 = \widetilde O\!\left(\frac{1}{n^{4/(4+d)}}\right). \qquad (8)$$

The notation $\widetilde O$ allows for logarithmic factors in n.

Proof. The risk is, up to a constant, $R(h) = \mathbb{E}(Y - \hat m_h(X))^2$, where $(X, Y)$ is a new pair and $Y = m(X) + \epsilon$. Note that $(Y - \hat m_h(X))^2 = Y^2 - 2Y\hat m_h(X) + \hat m_h^2(X)$, so $R(h) = \mathbb{E}(Y^2) - 2\mathbb{E}(Y\hat m_h(X)) + \mathbb{E}(\hat m_h^2(X))$. Let $n_1 = |I_1|$. Then,

$$\hat R(h) = \frac{1}{n_1}\sum_{i \in I_1} Y_i^2 - \frac{2}{n_1}\sum_{i \in I_1} Y_i\, \hat m_h(X_i) + \frac{1}{n_1}\sum_{i \in I_1}\hat m_h^2(X_i). \qquad (9)$$

By conditioning on the data in $I_0$ and applying Bernstein's inequality, we have

$$\mathbb{P}\Big(\max_{h \in \mathcal{H}} |\hat R(h) - R(h)| > \epsilon\Big) \le D\, e^{-n c \epsilon^2} \qquad (10)$$

for some $c > 0$. Setting $\epsilon_n = C\sqrt{\log n / n}$ for suitably large C, we conclude that

$$\mathbb{P}\Big(\max_{h \in \mathcal{H}} |\hat R(h) - R(h)| > C\sqrt{\tfrac{\log n}{n}}\Big) \to 0. \qquad (11)$$

Let $h_*$ minimize $R(h)$ over $\mathcal{H}$. Then, except on a set of probability tending to 0,

$$R(\hat h) \le \hat R(\hat h) + C\sqrt{\tfrac{\log n}{n}} \le \hat R(h_*) + C\sqrt{\tfrac{\log n}{n}} \qquad (12)$$

$$\le R(h_*) + 2C\sqrt{\tfrac{\log n}{n}} = O\!\left(\frac{1}{n^{4/(4+d)}}\right) + \widetilde O\!\left(\frac{1}{\sqrt{n}}\right) = \widetilde O\!\left(\frac{1}{n^{4/(4+d)}}\right) \qquad (13)$$

where we used the assumption $d \ge 4$ in the last equality. If $d = 4$ then $\widetilde O(\sqrt{\log n / n}) = \widetilde O(n^{-4/(4+d)})$; if $d > 4$ then $\widetilde O(\sqrt{\log n / n}) = o(n^{-4/(4+d)})$. □

We conclude that ordinary regression methods are automatically adaptive, and achieve the low-dimensional minimax rate if the distribution of X concentrates on a manifold; there is no need for semi-supervised methods in this case. Similar results apply to classification.

4 Kernel Regression with Manifold Regularization

In practice, it is unlikely that the distribution of X would be supported exactly on a low-dimensional manifold. Nevertheless, the shape of the data density $p(x)$ might provide information about the regression function $m(x)$, in which case the unlabeled data are informative. Several recent methods for semi-supervised learning attempt to exploit the smoothness assumption SSS using regularization operators defined with respect to graph Laplacians (Zhu et al., 2003; Zhou et al., 2004; Belkin et al., 2005).

The technique of Zhu et al. (2003) is based on Gaussian random fields and harmonic functions defined with respect to discrete Laplace operators. To express this method in statistical terms, recall that standard kernel regression corresponds to the locally constant estimator

$$\hat m_n(x) = \arg\min_{m(x)} \sum_{i=1}^n K_h(X_i, x)\big(Y_i - m(x)\big)^2 = \frac{\sum_{i=1}^n K_h(X_i, x)\, Y_i}{\sum_{i=1}^n K_h(X_i, x)} \qquad (14)$$

where $K_h$ is a symmetric kernel depending on bandwidth parameters h. In the semi-supervised approach of Zhu et al. (2003), the locally constant estimate $m(x)$ is formed using not only the labeled data, but also the estimates at the unlabeled points. Suppose that the first n data points $(X_1, Y_1), \ldots, (X_n, Y_n)$ are labeled, and the next $u = N - n$ points $X_{n+1}, \ldots, X_{n+u}$ are unlabeled. The semi-supervised regression estimate $(\hat m(X_1), \hat m(X_2), \ldots, \hat m(X_N))$ is then given by

$$\hat m = \arg\min_{m} \sum_{i=1}^N \sum_{j=1}^N K_h(X_i, X_j)\,\big(m(X_i) - m(X_j)\big)^2 \qquad (15)$$

where the minimization is carried out subject to the constraint $m(X_i) = Y_i$, $i = 1, \ldots, n$. Thus, the estimates are coupled, unlike the standard kernel regression estimate (14), where the estimate at each point x can be formed independently given the labeled data.

The estimator can be written in closed form as a linear smoother $\hat m = -C^{-1} B\, Y \equiv G\, Y$, where $\hat m = (\hat m(X_{n+1}), \ldots, \hat m(X_{n+u}))^T$ is the vector of estimates over the unlabeled test points, and $Y = (Y_1, \ldots, Y_n)^T$ is the vector of labeled values. The $(N-n) \times (N-n)$ matrix C and the $(N-n) \times n$ matrix B denote blocks of the combinatorial Laplacian on the data graph corresponding to the labeled and unlabeled data:

$$\Delta = \begin{pmatrix} A & B^T \\ B & C \end{pmatrix} \qquad (16)$$

where the Laplacian $\Delta = [\Delta_{ij}]$ has entries

$$\Delta_{ij} = \begin{cases} \sum_k K_h(X_i, X_k) & \text{if } i = j \\ -K_h(X_i, X_j) & \text{otherwise.} \end{cases} \qquad (17)$$

This expresses the effective kernel G in terms of geometric objects such as heat kernels for the discrete diffusion equations (Smola and Kondor, 2003).
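As a concrete illustration of the harmonic-function smoother, the sketch below (our own; a Gaussian kernel and all function names are assumptions made for the example) builds the Laplacian of (17), extracts the blocks B and C of (16), and solves the resulting linear system for the estimates at the unlabeled points.

```python
import numpy as np

def harmonic_estimate(X_lab, Y_lab, X_unlab, h):
    """Harmonic-function predictions at the unlabeled points, in the style of Zhu et al. (2003)."""
    X = np.vstack([X_lab, X_unlab])          # labeled points first, then unlabeled
    n = len(X_lab)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2.0 * h ** 2))         # kernel weights K_h(X_i, X_j)
    np.fill_diagonal(K, 0.0)                 # no self-loops on the data graph
    L = np.diag(K.sum(axis=1)) - K           # combinatorial Laplacian Delta, eq. (17)
    B = L[n:, :n]                            # unlabeled-by-labeled block of (16)
    C = L[n:, n:]                            # unlabeled-by-unlabeled block of (16)
    # stationarity subject to m(X_i) = Y_i for labeled i:  C m_u = -B Y
    return np.linalg.solve(C, -B @ Y_lab)    # the linear smoother G Y
```

The coupling noted in the text is visible here: every unlabeled prediction depends on all the others through the single linear solve.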
This estimator assumes the noise is zero, since $\hat m(X_i) = Y_i$ for $i = 1, \ldots, n$. To work in the standard model $Y = m(X) + \epsilon$, the natural extension of the harmonic function approach is manifold regularization (Belkin et al., 2005; Sindhwani et al., 2005; Tsang and Kwok, 2006). Here the estimator is chosen to minimize the regularized empirical risk functional

$$\widehat R_\gamma(m) = \sum_{i=1}^N \sum_{j=1}^n K_H(X_i, X_j)\,\big(Y_j - m(X_i)\big)^2 + \gamma \sum_{i=1}^N \sum_{j=1}^N K_H(X_i, X_j)\,\big(m(X_j) - m(X_i)\big)^2 \qquad (18)$$

where H is a matrix of bandwidths and $K_H(X_i, X_j) = K(H^{-1/2}(X_i - X_j))$. When $\gamma = 0$ the standard kernel smoother is obtained. The regularization term is

$$J_\Delta(m) = \sum_{i=1}^N \sum_{j=1}^N K_H(X_i, X_j)\,\big(m(X_j) - m(X_i)\big)^2 = 2\, m^T \Delta\, m \qquad (19)$$

where $\Delta$ is the combinatorial Laplacian associated with $K_H$. This regularization term is motivated by the semi-supervised smoothness assumption--it favors functions m for which $m(X_i)$ is close to $m(X_j)$ when $X_i$ and $X_j$ are similar, according to the kernel function. The name manifold regularization is justified by the fact that $\tfrac12 J_\Delta(m) \approx \int_M \|\nabla m(x)\|^2\, dM_x$, the energy of m over the manifold. While this regularizer has primarily been used for SVM classifiers (Belkin et al., 2005), it can be used much more generally. For an appropriate choice of $\gamma$, minimizing the functional (18) can be expected to give essentially the same results as the harmonic function approach that minimizes (15); a code sketch of the estimator defined by (18) is given at the end of this section.

Theorem 2. Suppose that $D \ge 2$. Let $\hat m_{H,\gamma}$ minimize (18), and let $\Delta_{p,H}$ be the differential operator defined by

$$\Delta_{p,H} f(x) = \frac{1}{2}\,\mathrm{trace}\big(H_f(x)\, H\big) + \frac{\nabla p(x)^T H\, \nabla f(x)}{p(x)}. \qquad (20)$$

Then the asymptotic MSE of $\hat m_{H,\gamma}(x)$ is

$$M_\gamma = \frac{c_1\, \mu}{n(\mu + \gamma)\, p(x)\, |H|^{1/2}} + c_2^2 \left(\frac{\mu}{\mu + \gamma}\right)^2 \left[\Big(I - \frac{\gamma}{\mu}\,\Delta_{p,H}\Big)^{-1} \Delta_{p,H}\, m(x)\right]^2 + o\big(\mathrm{tr}(H)\big) \qquad (21)$$

where $\mu = \mathbb{P}(R_i = 1)$.

The proof is given in the appendix. The implication of this theorem is that the estimator that uses manifold regularization has the same rate of convergence as the usual kernel estimator, and thus the unlabeled data have not improved the estimator asymptotically.

Note that the bias of the standard kernel estimator, in the notation of this theorem, is $b(x) = c_2\, \Delta_{p,H}\, m(x)$, and the variance is $V(x) = c_1/(n\, p(x)\, |H|^{1/2})$. Thus, this result agrees with the standard supervised MSE in the special case $\gamma = 0$. It follows from this theorem that $M_\gamma = M + o(\mathrm{tr}(H))$, where M is the usual MSE for a kernel estimator. Therefore, the minimum of $M_\gamma$ has the same leading order in H as the minimum of M.
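The functional (18) is quadratic in the vector $(m(X_1), \ldots, m(X_N))$, so its minimizer is available in closed form. The sketch below is our own illustration under that observation; the Gaussian kernel, the scalar bandwidth, and all names are assumptions, and the normal equations are our derivation rather than anything stated in the paper.

```python
import numpy as np

def manifold_regularized_smoother(X_lab, Y_lab, X_unlab, h, gamma):
    """Minimize the quadratic functional (18) over m = (m(X_1),...,m(X_N))."""
    X = np.vstack([X_lab, X_unlab])
    n = len(X_lab)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2.0 * h ** 2))           # K_H with H = h^2 I
    Delta = np.diag(K.sum(axis=1)) - K          # combinatorial Laplacian of eq. (19)
    D_lab = np.diag(K[:, :n].sum(axis=1))       # kernel mass each point places on labels
    # setting the gradient of (18) to zero gives (D_lab + 2*gamma*Delta) m = K[:, :n] Y
    m = np.linalg.solve(D_lab + 2.0 * gamma * Delta, K[:, :n] @ Y_lab)
    return m                                    # estimates at all N points
```

Setting gamma to 0 recovers the standard kernel smoother at the data points, matching the remark after (18); larger gamma pushes the fit toward the harmonic-function solution of (15).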
5 Semi-Supervised Methods With Improved Rates

The previous result is negative, in the sense that it shows that unlabeled data do not help to improve the rate of convergence. This is because the bias and variance of a manifold regularized kernel estimator are of the same order in H as the bias and variance of standard kernel regression. We now demonstrate how improved rates of convergence can be obtained by formulating and exploiting appropriate SSS assumptions. We describe three different approaches: semi-supervised bias reduction, improved bandwidth selection, and averaging over level sets.

5.1 Semi-Supervised Bias Reduction

We first show a positive result by formulating an SSS assumption that links the shape of p to the shape of m, positing a relationship between the Hessian $H_m$ of m and the Hessian $H_p$ of p. Under this SSS assumption, we can improve the rate of convergence by reducing the bias. To illustrate the idea, take $p(x)$ known (i.e., $N = \infty$) and suppose that $H_m(x) = H_p(x)$. Define

$$\widetilde m_n(x) = \hat m_n(x) - \frac{1}{2}\,\mu_2(K)\,\mathrm{tr}\big(H_m(x)\, H\big) \qquad (22)$$

where $\hat m_n(x)$ is the local linear estimator.

Theorem 3. The risk of $\widetilde m_n(x)$ is $O\big(n^{-8/(8+D)}\big)$.

Proof. First note that the variance of the estimator $\widetilde m_n$, conditional on $X_1, \ldots, X_n$, is $\mathrm{Var}(\widetilde m_n(x) \mid X_1, \ldots, X_n) = \mathrm{Var}(\hat m_n(x) \mid X_1, \ldots, X_n)$. Now, the term $\frac{1}{2}\mu_2(K)\,\mathrm{tr}(H_m(x) H)$ is precisely the bias of the local linear estimator, so under the SSS assumption that $H_p(x) = H_m(x)$ the first order bias term has been removed. The result now follows from the fact that the next term in the bias of the local linear estimator is of order $O(\mathrm{tr}(H)^2)$, so that the squared bias is $O(\mathrm{tr}(H)^4)$. □

By assuming that 2 derivatives are matched, we get the rate $n^{-(4+4)/(4+4+D)}$. When p is estimated from the data, the risk will be inflated by $N^{-4/(4+D)}$, assuming standard smoothness assumptions on p. This term will not dominate the improved rate $n^{-(4+4)/(4+4+D)}$ as long as N grows quickly enough that $N^{-4/(4+D)} = O(n^{-8/(8+D)})$.

The assumption that $H_m = H_p$ can be replaced by the more realistic assumption that $H_m = g(p; \theta)$ for some parameterized family of functions $g(\cdot\,; \theta)$. Semiparametric methods can then be used to estimate $\theta$. This approach is taken in the following section.

5.2 Improved Bandwidth Selection

Let $\hat H$ be the estimated bandwidth using the labeled data. We will now show how a bandwidth $\widetilde H$ can be estimated using the labeled and unlabeled data together, such that, under appropriate assumptions,

$$\limsup_{n \to \infty}\, \frac{|R(\widetilde H) - R(H_*)|}{|R(\hat H) - R(H_*)|} = 0, \quad \text{where } H_* = \arg\min_H R(H). \qquad (23)$$

Therefore, the unlabeled data allow us to construct an estimator that gets closer to the oracle risk. The improvement is weaker than the bias adjustment method. But it has the virtue that the optimal local linear rate is maintained even if the proposed model linking $H_m$ to p is incorrect.

We begin in one dimension to make the ideas clear. Let $\hat m_H$ denote the local linear estimator with bandwidth $H \in \mathbb{R}$, $H > 0$. To use the unlabeled data, note that the optimal (global) bandwidth is

$$H_* = \left(\frac{c_2 B}{4 n c_1 A}\right)^{1/5}$$

where $A = \int m''(x)^2\, dx$ and $B = \int dx / p(x)$. Let $\hat p(x)$ be the kernel density estimator of p using $X_1, \ldots, X_N$ and bandwidth $h = O(N^{-1/5})$. We assume

(SSS) $\quad m''(x) = G_\theta(p(x))$ for some function $G_\theta$ depending on finitely many parameters $\theta$.

Now let $\widetilde m''(x) = G_{\hat\theta}(\hat p(x))$, and define

$$\widetilde H = \left(\frac{c_2 \hat B}{4 n c_1 \hat A}\right)^{1/5}$$

where $\hat A = \int \big(\widetilde m''(x)\big)^2\, dx$ and $\hat B = \int dx / \hat p(x)$.

Theorem 4. Suppose that $\widetilde m''(x) - m''(x) = O_P(N^{-\xi})$ where $\xi > \tfrac{2}{5}$. Let $N = N(n) \to \infty$ as $n \to \infty$. If $N/n^{1/4} \to \infty$, then

$$\limsup_{n \to \infty}\, \frac{|R(\widetilde H) - R(H_*)|}{|R(\hat H) - R(H_*)|} = 0. \qquad (24)$$

Proof. The risk is

$$R(H) = c_1 H^4 \int \big(m''(x)\big)^2\, dx + \frac{c_2}{nH}\int \frac{dx}{p(x)} + o\!\left(H^4 + \frac{1}{nH}\right). \qquad (25)$$

The oracle bandwidth is $H_* = c_3/n^{1/5}$ and then $R(H_*) = O(n^{-4/5})$. Now let $\hat H$ be the bandwidth estimated by cross-validation. Then, since $R'(H_*) = 0$ and $H_* = O(n^{-1/5})$, we have

$$R(\hat H) = R(H_*) + \frac{(\hat H - H_*)^2}{2}\, R''(H_*) + O\big(|\hat H - H_*|^3\big) \qquad (26)$$

$$= R(H_*) + O(n^{-2/5})\,\frac{(\hat H - H_*)^2}{2} + O\big(|\hat H - H_*|^3\big). \qquad (27)$$

From Girard (1998), $\hat H - H_* = O_P(n^{-3/10})$. Hence, $R(\hat H) - R(H_*) = O_P(n^{-1})$. Also, $\hat p(x) - p(x) = O(N^{-2/5})$. Since $\widetilde m''(x) - m''(x) = O_P(N^{-\xi})$,

$$\widetilde H - H_* = O_P\!\left(\frac{N^{-2/5}}{n^{1/5}}\right) + O_P\!\left(\frac{N^{-\xi}}{n^{1/5}}\right). \qquad (28)$$

The first term is $o_P(n^{-3/10})$ since $N > n^{1/4}$. The second term is $o_P(n^{-3/10})$ since $\xi > 2/5$. Thus $R(\widetilde H) - R(H_*) = o_P(1/n)$ and the result follows. □

The proof in the multidimensional case is essentially the same as in the one dimensional case, except that we use the multivariate version of Girard's result, namely, $\hat H - H_* = O_P(n^{-(D+2)/(2(D+4))})$. This leads to the following result.

Theorem 5. Let $N = N(n)$. If $N/n^{D/4} \to \infty$ and $\widetilde m'' - m'' = O_P(N^{-\xi})$ for some $\xi > \frac{2}{4+D}$, then

$$\limsup_{n \to \infty}\, \frac{|R(\widetilde H) - R(H_*)|}{|R(\hat H) - R(H_*)|} = 0. \qquad (29)$$
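The following one-dimensional sketch shows how the plug-in bandwidth of Section 5.2 might be computed from the pooled sample. It is our own illustration under simplifying assumptions: the link function G standing in for $G_{\hat\theta}$ is passed in rather than estimated semiparametrically, the kernel constants c1 and c2 are treated as 1, and the integrals A and B are approximated by Riemann sums on a grid.

```python
import numpy as np

def plug_in_bandwidth(X_all, n, G, c1=1.0, c2=1.0, grid=None):
    """Plug-in bandwidth H_tilde = (c2*B_hat / (4*n*c1*A_hat))**(1/5) in one dimension.

    X_all : all N covariates (labeled and unlabeled), shape (N,)
    n     : number of labeled points (only n enters the bandwidth formula)
    G     : assumed link m''(x) = G(p(x)); a hypothetical, user-supplied function
    """
    N = len(X_all)
    h = N ** (-1.0 / 5)                        # pilot KDE bandwidth, h = O(N^{-1/5})
    if grid is None:
        grid = np.linspace(X_all.min(), X_all.max(), 400)
    # Gaussian kernel density estimate p_hat on the grid, built from all N points
    p_hat = np.mean(
        np.exp(-(grid[:, None] - X_all[None, :]) ** 2 / (2 * h ** 2)), axis=1
    ) / (np.sqrt(2 * np.pi) * h)
    dx = grid[1] - grid[0]
    A_hat = np.sum(G(p_hat) ** 2) * dx                     # estimate of A = int (m'')^2 dx
    B_hat = np.sum(1.0 / np.maximum(p_hat, 1e-8)) * dx     # estimate of B = int dx / p(x)
    return (c2 * B_hat / (4.0 * n * c1 * A_hat)) ** (1.0 / 5)
```

Because the unlabeled points enter only through the density estimate, the local linear rate is retained even when the assumed link G is wrong, which is the robustness property emphasized in the text.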
5.3 Averaging over Level Sets

Recall that SSS is motivated by the intuition that high density level sets should correspond to clusters of similar objects. Another approach to quantifying SSS is to make this cluster assumption explicit. Rigollet (2006) shows one way to do this in classification. Here we focus on regression. Suppose that $L_\lambda = \{x : p(x) > \lambda\}$ can be decomposed into a finite number of connected, compact, convex sets $C_1, \ldots, C_g$, where $\lambda$ is chosen so that $L_\lambda^c$ has negligible probability. For N large we can replace $L_\lambda$ with $\hat L_\lambda = \{x : \hat p(x) > \lambda\}$ with small loss in accuracy, where $\hat p$ is an estimate of p using the unlabeled data; see Rigollet (2006) for details. Let $k_j = \sum_{i=1}^n I(X_i \in C_j)$ and for $x \in C_j$ define

$$\hat m(x) = \frac{\sum_{i=1}^n Y_i\, I(X_i \in C_j)}{k_j}. \qquad (30)$$

Thus, $\hat m(x)$ simply averages the labels of the data that fall in the set to which x belongs. If the regression function is slowly varying over this set, the risk should be small. A similar estimator is considered by Cortes and Mohri (2006), but they do not provide estimates of the risk.

Theorem 6. The risk of $\hat m(x)$ for $x \in L_\lambda \cap C_j$ is bounded by

$$O\!\left(\frac{1}{n \pi_j}\right) + O\big(\delta_j^2\, \xi_j^2\big) \qquad (31)$$

where $\xi_j = \sup_{x \in C_j}\|\nabla m(x)\|$, $\delta_j = \mathrm{diameter}(C_j)$ and $\pi_j = \mathbb{P}(X \in C_j)$.

Proof. Since the $k_j$ are Binomial, $k_j = n\pi_j(1 + o(1))$ almost surely. Thus, the variance of $\hat m(x)$ is $O(1/(n\pi_j))$. The mean, given $X_1, \ldots, X_n$, is

$$\frac{1}{k_j}\sum_{i : X_i \in C_j} m(X_i) = m(x) + \frac{1}{k_j}\sum_{i : X_i \in C_j}\big(m(X_i) - m(x)\big). \qquad (32)$$

Now $m(X_i) - m(x) = (X_i - x)^T \nabla m(u_i)$ for some $u_i$ between x and $X_i$. Hence, $|m(X_i) - m(x)| \le \|X_i - x\|\, \sup_{x \in C_j}\|\nabla m(x)\|$, and so the bias is bounded by $\delta_j \xi_j$. □

This result reveals an interesting bias-variance tradeoff. Making $\lambda$ smaller decreases the variance and increases the bias. Suppose the two terms are balanced at $\lambda = \lambda_*$. Then we will beat the usual rate of convergence if $\pi_j(\lambda_*) > n^{-D/(4+D)}$.
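The sketch below is our own illustration of the estimator (30). The paper does not specify how the connected components of the estimated level set are found; here single-linkage clustering of the high-density points is used as a stand-in, and any other decomposition of $\hat L_\lambda$ could be substituted. All names and the linkage radius are assumptions made for the example.

```python
import numpy as np

def level_set_average(X_lab, Y_lab, X_all, x_query, lam, h):
    """Level-set averaging estimator (30): average Y over the high-density
    component of {x : p_hat(x) > lam} that contains x_query."""
    # kernel density estimate p_hat from all N points (labeled + unlabeled)
    def p_hat(x):
        return np.mean(np.exp(-np.sum((X_all - x) ** 2, axis=1) / (2 * h ** 2)))

    dens = np.array([p_hat(x) for x in X_all])
    high = X_all[dens > lam]                       # points in the estimated level set
    # single-linkage components: points within distance h are linked
    labels = -np.ones(len(high), dtype=int)
    comp = 0
    for i in range(len(high)):
        if labels[i] >= 0:
            continue
        stack, labels[i] = [i], comp
        while stack:
            j = stack.pop()
            near = np.where(np.sum((high - high[j]) ** 2, axis=1) < h ** 2)[0]
            for k in near:
                if labels[k] < 0:
                    labels[k] = comp
                    stack.append(k)
        comp += 1
    nearest = lambda x: labels[np.argmin(np.sum((high - x) ** 2, axis=1))]
    j_query = nearest(x_query)                     # component C_j containing x_query
    in_C = np.array([nearest(x) for x in X_lab]) == j_query
    return Y_lab[in_C].mean() if in_C.any() else Y_lab.mean()
```

The tradeoff in Theorem 6 is visible in the tuning parameter lam: a smaller threshold merges components (more labeled points per average, lower variance) at the price of larger diameters (higher bias).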
6 Conclusion

Semi-supervised methods have been very successful in many problems. Our results suggest that the standard explanations for this success are not correct. We have indicated some new approaches to understanding and exploiting the relationship between the labeled and unlabeled data. Of course, we make no claim that these are the only ways of incorporating unlabeled data. But our results indicate that decoupling the manifold assumption and the semi-supervised smoothness assumption is crucial to clarifying the problem.

References

Belkin, M., Niyogi, P. and Sindhwani, V. (2005). On manifold regularization. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005).

Bickel, P. and Li, B. (2006). Local polynomial regression on unknown manifolds. Tech. rep., Department of Statistics, UC Berkeley.

Castelli, V. and Cover, T. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans. on Info. Theory 42 2101-2117.

Cortes, C. and Mohri, M. (2006). On transductive regression. In Advances in Neural Information Processing Systems (NIPS), vol. 19.

Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics 21 196-216.

Girard, D. (1998). Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression. Ann. Statist. 26 315-334.

Levina, E. and Bickel, P. (2005). Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems (NIPS), vol. 17.

Rigollet, P. (2006). Generalization error bounds in semi-supervised classification under the cluster assumption. arxiv.org/math/0604233.

Sindhwani, V., Niyogi, P., Belkin, M. and Keerthi, S. (2005). Linear manifold regularization for large scale semi-supervised learning. In Proc. of the 22nd ICML Workshop on Learning with Partially Classified Training Data.

Smola, A. and Kondor, R. (2003). Kernels and regularization on graphs. In Conference on Learning Theory, COLT/KW.

Tsang, I. and Kwok, J. (2006). Large-scale sparsified manifold regularization. In Advances in Neural Information Processing Systems (NIPS), vol. 19.

Zhou, D., Bousquet, O., Lal, T., Weston, J. and Schölkopf, B. (2004). Learning with local and global consistency. In Advances in Neural Information Processing Systems (NIPS), vol. 16.

Zhu, X. (2006). Semi-supervised learning literature review. Tech. rep., University of Wisconsin.

Zhu, X., Ghahramani, Z. and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML-03, 20th International Conference on Machine Learning.