Semi-Supervised Learning Using Label Mean Yu-Feng Li LIYF @ LAMDA . NJU . EDU . CN National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China James T. Kwok JAMESK @ CSE . UST. HK Dept. Computer Science & Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong Zhi-Hua Zhou ZHOUZH @ LAMDA . NJU . EDU . CN National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China Abstract Semi-Supervised Support Vector Machines (S3VMs) typically directly estimate the label assignments for the unlabeled instances. This is often inefficient even with recent advances in the efficient training of the (supervised) SVM. In this paper, we show that S3VMs, with knowledge of the means of the class labels of the unlabeled data, is closely related to the supervised SVM with known labels on all the unlabeled data. This motivates us to first estimate the label means of the unlabeled data. Two versions of the meanS3VM, which work by maximizing the margin between the label means, are proposed. The first one is based on multiple kernel learning, while the second one is based on alternating optimization. Experiments show that both of the proposed algorithms achieve highly competitive and sometimes even the best performance as compared to the state-of-the-art semi-supervised learners. Moreover, they are more efficient than existing S3VMs. data to regularize the decision boundary. Specifically, these methods prefer the decision boundary to pass through lowdensity regions (Chapelle & Zien, 2005). The Laplacian SVM is a S3VM that exploits the data's manifold structure via the graph Laplacian. It encodes both the labeled and unlabeled data by a connected graph, where each instance is represented as a vertex and two vertices are connected by an edge if they have large similarity. The goal is to find class labels for the unlabeled data such that their inconsistencies with both the supervised data and the underlying graph structure are minimized. However, while many efficient SVM softwares have been developed for supervised learning, S3VMs still suffer from inefficiency issues. In particular, the optimization problem of Bennett and Demiriz's S3VM is formulated as a mixedinteger programming problem and so is computationally intractable in general. TSVM, on the other hand, iteratively solves standard supervised SVM problems. However, the number of iterations required may be large since the TSVM is based on a local combinatorial search that is guided by a label switching procedure. Unlike the TSVM, the Laplacian SVM only needs to solve one small SVM with the labeled data only. However, it needs to calculate the inverse of an n × n matrix, where n is the size of the whole data set. This costs O(n3 ) time and O(n2 ) memory. In this paper, we propose a new approach which may lead to efficient designs of semi-supervised learning algorithms. In contrast to directly estimating the labels of the unlabeled examples, we propose a new S3VM which first estimates the label means of the unlabeled data. We show that this S3VM, referred to as the meanS3VM, is nearly the same as the supervised SVM that is provided with all the labels of the unlabeled examples. Roughly speaking, when the data set is separable, the meanS3VM is equivalent to the supervised SVM; otherwise, the loss function of the meanS3VM is no more than twice of that of the supervised SVM. So, instead of estimating the label of ev- 1. Introduction During the past decade, many semi-supervised learning algorithms have been proposed, among which a very popular type of algorithms is the semi-supervised support vector machines (S3VMs). Examples include the semi-supervised SVM (Bennett & Demiriz, 1999), the transductive SVM (TSVM) (Joachims, 1999), and the Laplacian SVM (Belkin et al., 2006). Bennett & Demiriz's S3VM and the TSVM are built upon the cluster assumption and use the unlabeled Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s). Semi-Supervised Learning Using Label Mean ery unlabeled example, we estimate the label means of the whole unlabeled data. This will be more efficient for semisupervised learning. Based on this observation, we propose two efficient algorithms by maximizing the margin between the label means of the unlabeled data. Experimental comparisons with state-of-the-art semi-supervised learning methods show that our proposed algorithms achieve highly competitive or even the best performance on a broad range of data sets, including the semi-supervised learning benchmark data sets, UCI data sets and the newsgroup data sets. In particular, on relatively large data (e.g., those with more than 1,000 instances), our proposed algorithms are one hundred times faster than the TSVM and ten times faster than the Laplacian SVM. The rest of this paper is organized as follows. Section 2 briefly introduces the semi-supervised support vector machines. Section 3 studies the usefulness of the label mean of the unlabeled data. Section 4 proposes two algorithms which use the label mean for the S3VM. Experimental results are reported in Section 5. The last section concludes this paper. where = [1 , . . . , l+u ]. The last constraint (with userdefined parameter r) is a balance constraint that avoids the trivial solution of assigning all patterns to the same class and thus achieves "infinite" margin (Chapelle et al., 2008). It is worth noting that the effect of the objective in (2) has been well studied and many algorithms have been developed (Chapelle et al., 2008). 3. Usefulness of the Label Mean Consider the following optimization problem: w,b,,p min 1 w 2 +C2 2 + C1 iIl i (3) (i + pi-l - |f (xi )|) iIu s.t. yi (w (xi ) + b) 1 - i , i Il , w (xi ) + b pi-l , -w (xi ) - b pi-l , pi-l 1 - i , i Iu ; i 0, i Il Iu , sgn(w (xi ) + b) = r. iIu 2. Semi-Supervised SVM (S3VM) In semi-supervised learning, we are given a set of labeled data {(x1 , y1 ), · · · , (xl , yl )} and a set of unlabeled data {xl+1 , · · · , xl+u }, where l and u are the sizes of the labeled data and unlabeled data, respectively, and yi {±1}. Let Il = {1, 2, . . . , l} be the set containing indices of the labeled data, and Iu = {l + 1, l + 2, . . . , l + u} be the set for the unlabeled data. The goal is to find f that minimizes min f Lemma 1. Let (w , b , , p ) be the optimal solution of (3). Then, for i Iu , i + p = i-l 1 |f (xi )| 1, |f (xi )| otherwise. 1 f 2 2 H + C1 iIl (yi , f (xi )) + C2 iIu sym (f (xi )), (1) where H is a reproducing kernel Hilbert space (RKHS) induced by the kernel k, (y, f (x)) = max{0, 1 - yf (x)} is the hinge loss, sym (f (x)) = max{0, 1 - |f (x)|} is the symmetric hinge loss, and C1 , C2 are regularization parameters that balance model complexity with the empirical risks on the labeled and unlabeled data, respectively. Let the decision function be f (x) = w (x) + b, where is the feature map induced by the kernel k. By introducing slack variables, (1) can be reformulated as min 1 w 2 2 Proof. When |f (xi )| 1, it is easy to verify that setting p = |f (xi )| + and i = 1 - |f (xi )| - (where 0 i-l 1 - |f (xi )|) satisfies all the constraints in (3). When |f (xi )| > 1, given p |f (xi )| 1, then the second i-l condition p 1 - i is satisfied for any i 0, and i-l i = 0 is the choice that minimizes Eq.3. Proposition 1. Problem (3) is equivalent to problem (2). Proof. The key is to show that the loss functions in (2) and (3) are the same. Obviously, this is the case for the labeled data. For the unlabeled data (i Iu ), let sym (f (xi )) = i + pi-l - |f (xi )|. From Lemma 1, i + pi-l - |f (xi )| = 1 - |f (xi )| 0 |f (xi )| 1, otherwise. w,b, + C1 iIl i + C2 iIu i (2) Note that the RHS is the same as the symmetric hinge loss sym (f (xi )), while the LHS is the same as sym (f (xi )). Therefore, sym (f (xi )) = sym (f (xi )). From the balance constraint, the numbers of positive and negative instances in the unlabeled data can be obtained as u+ = r+u and u- = -r+u , respectively. Let the 2 2 true label of the unlabeled pattern i be yi . The (true) means for the positive and negative classes are then m+ = s.t. yi (w (xi ) + b) 1 - i , i Il , |w (xi ) + b| 1 - i , i Iu , i 0, i Ii Iu , sgn(w (xi ) + b) = r, iIu Semi-Supervised Learning Using Label Mean (xi ) and m- = tively. Now, yi =1 1 u+ 1 u- yi =-1 (xi ), respec- |f (xi )| - (u- - u+ )b iIu (xi ) - iIu ,f (xi )<0 iIu ,f (xi )0 (xi ) (4) = w ^ ^ = u + w m+ - u - w m - , ^ where m+ 1 u- ^ (xi ) and m- = iIu ,f (xi )<0 (xi ) are estimates of m+ and m- . Using (4), the objective of (3) can be rewritten as 1 w 2 + C1 i + C2 (i + pi-l ) 2 = iIu ,f (xi )0 iIl iIu 1 u+ Figure 1. Plots of yf (x), for the loss in (6) and the hinge loss. 4. S3VM Training via the Label Mean Results in Section 3 suggests that the label mean, being a simple statistic, can be useful in building a semi-supervised learner. To find the label means of the unlabeled data, we propose in this section a margin-based framework such that the margin between the means is maximized. Mathematically, it is formulated as: min min s.t. 1 w 2 l 2 2 ^ ^ -C2 (u+ w m+ - u- w m- + (u+ - u- )b). ^ ^ On replacing the estimates m+ and m- by the ground truth values, we obtain the following optimization problem: 1 (i + pi-l ) (5) min w 2 + C1 i + C2 w,b,,p 2 iIl iIu s.t. -C2 (u+ w m+ - u- w m- + (u+ - u- )b) constraints in (3) . Proposition 2. In (5), the loss for an unlabeled xi is ~(xi ) = -2yi f (xi ) yi f (xi ) < 0, |f (xi )| 1, (yi , f (xi )) otherwise. d w,b,, + C1 i=1 i - C2 (7) (6) Proof. The objective in (5) can also be rewritten as 1 2 + C1 iIl i + C2 iIu (i + pi-l - yi f (xi )). 2 w Hence, ~(xi ) = i + pi-l - yi f (xi ). From Lemma 1, 1 - yi f (xi ) |f (xi )| 1, ~(xi ) = 0 |f (xi )| 1, yi f (xi ) 0, -2yi f (xi ) |f (xi )| 1, yi f (xi ) < 0, which is equivalent to (6). Corollary 1. (Separable case) If the data is separable, the loss in (6) is the same as the hinge loss for sample xi w.r.t. its true label. Proof. When the data is separable, (6) becomes ~(xi ) = (yi , f (xi )) which is the same as the hinge loss in the standard SVM. Corollary 2. (Non-separable case) If the data is nonseparable, the loss in (6) is no more than twice of that of the hinge loss w.r.t. the true label. Proof. (5) is upper-bounded by max{-2yi f (xi ), 1 - yi f (xi )}, while the hinge loss is (yi , f (xi )) = 1 - yi f (xi ). Since -2yi f (xi ) < 2(1 - yi f (xi )), hence ~(xi ) < 2 (y , f (xi )). i Figure 1 compares the loss in (6) with the hinge loss. yi (w (xi ) + b) 1 - i , i = 1, . . . , l, l+u 1 w dj-l (xj ) + b , u+ j=l+1 1 w u- l+u (1 - dj-l )(xj ) + b -. j=l+1 u Here, = {d | di {0, 1}, i=1 di = u+ }. Note that the balance constraint is incorporated into . Moreover, by focusing on the label means, we no longer need to have one constraint for each unlabeled sample. Consequently, (7) has much fewer constraints than (3). As will be verified empirically in Section 5, this allows (7) to be trained much faster than existing S3VM implementations. Besides, (7) can also be motivated from the Hilbert space embedding of distributions (Gretton et al., 2006). Here, we map the positive class to one point in the RKHS and the negative class to another point. Then, we try to separate the means with maximum margin. Intuitively, the further part these two class distributions are, the easier the classification problem becomes. However, note that (7) is non-convex due to the bilinear constraint between w and d. In the following, we propose two algorithms to solve (7). The first one is based on convex relaxation (Li et al., 2009), while the second one is based on alternating optimization. Semi-Supervised Learning Using Label Mean 4.1. Convex Relaxation 4.1.1. F ORMULATING AS K ERNEL L EARNING First, we replace the inner problem of (7) by its dual: d A Algorithm 1 Cutting plane algorithm for solving the convex relaxation in (11). 1: Initialize = min max ~ 1 1 - 2 ( ~ y) Kd ( ~ y), (8) where denotes the element-wise product of two matrices, ~ = [1 , . . . , l+2 ] Rl+2 , y = [y1 , . . . , yl , 1, -1] ~ Rl+2 , 1 = [1l , 02 ] Rl+2 , l+2 l+2 find the most violate d0 , and set C = {d0 } 2: Run MKL for the subset of kernel matrices selected in C and obtain from dual problem of (7). 3: Find the most violated d and set C = d C. 4: Repeat steps 2-3 until convergence. Tsochantaridis et al., 2005), to handle the exponential number of constraints. The algorithm is shown in Algorithm 1. First, we initialize and the set of cutting planes C. Since the size of C is no longer exponential, we can perform MKL with the subset of kernel matrices in C and obtain from (13). The most violated d is then added to C, and the process repeated until convergence. There are two important issues in the cutting plane algorithm. First, how to efficiently solve the MKL problem in step 2? Second, how to efficiently find the most violated d in step 3? These will be addressed in the next two sections. 4.1.2. S OLVING THE MKL P ROBLEM IN S TEP 2 In recent years, a number of MKL methods have been proposed in the literature (Bach et al., 2004; Lanckriet et al., 2004; Sonnenburg et al., 2006; Rakotomamonjy et al., 2008; Xu et al., 2009). In this paper, we use an adaptation of the SimpleMKL algorithm (Rakotomamonjy et al., 2008) to solve the MKL problem in Step 2 of Algorithm 1. Recall that the feature map induced by the base kernel matrix Kdt is given in (10). As in the derivation of the SimpleMKL algorithm, we consider the following MKL problem in (13). min min 1 2 T T t=1 C2 2 1, A = | i=1 i yi = 0, ~ i=l+1 i = C2 , 0 i C1 , i = 1, . . . , l, 0 l+1 , l+2 C2 }, (9) and Kd R(l+2)×(l+2) is the kernel matrix with elements Kd = (d ) (d ), where ij i j (xi ) i = 1, . . . , l, l+u d dj-l (xj )/u+ i = l + 1, i = (10) j=l+1 l+u (1 - dj-l )(xj )/u- i = l + 2. j=l+1 Note that (8) is a mixed-integer programming problem, and so is computationally intractable in general. Here, we consider a minimax convex relaxation of (8) (Li et al., 2009). Using the minimax inequality, (8) is lowerbounded by ~ 1 max min 1 - ( A d 2 ~ y) Kd ( ~ y). (11) The inner minimization (over d) can also be written as max s.t. ~ 1- 1 ( 2 (12) ~ y) K ( dt ~ y), dt . Let µt 0 be the dual variable for each constraint in (12). It can be shown that (11) can be rewritten as ~ 1 min max 1 - ( µM A 2 ~ y) t:dt µM w, wt µt 2 + C1 iIl i - C2 (14) s.t. yi t=1 wt (xi ) + b 1 - i , i Il , µt K dt ( ~ y). (13) Here, M is the simplex {µ | µt = 1, µt 0}. Note that (13) is of the same form as the optimization problem in multiple kernel learning (MKL) (Lanckriet et al., 2004). However, note that (12) involves minimizing over all dt . There are an exponential number of feasible dt 's, and hence the set of base kernels is also exponential in size and so direct MKL is computationally intractable. In this paper, we apply the cutting plane method (Kelly, 1960), which has been successfully used in many optimization problems in machine learning (Joachims et al., 2009; T 1 dt (xj ) + b , w j-l u+ t=1 t jIu T 1 w (1 - dt )(xj ) + b -. j-l u- t=1 t jIu It is easy to verify that its dual is (11). Following the SimpleMKL, we solve (13), or equivalently, (14), iteratively. First, we fix µ = [µ1 , . . . , µT ] and solve the dual of the inner problem in (14): ~ max 1 - 1 ( 2 A ~ y) T t=1 µt Kdt ( ~ y). Semi-Supervised Learning Using Label Mean Note that this is simply the SVM's dual, with kernel matrix T dt ~ and label vector y. Hence, it can be solved t=1 µt K efficiently with existing SVM solvers. Then, we fix and use the reduced gradient method to update µ. These two steps are iterated until convergence. 4.1.3. F INDING THE MOST VIOLATED d IN S TEP 3 ~ From (13), the most violated d maximizes ( y) Kd ( l+2 d d ~ y) = i,j=1 i j yi yj (i ) (j ). However, this is a con~~ cave QP and so cannot be solved efficiently. Note, however, that the cutting plane algorithm only requires the addition of a violated constraint at each iteration. Hence, we propose in the following a simple and efficient method for finding a good approximation of the most violated d. Using (10), it is easy to verify that l+2 i,j=1 Algorithm 2 Alternating optimization. 1: Run the SVM with the labeled data only. 2: Set the d entries of those unlabeled patterns with the largest u+ predictions to 1, otherwise set to 0. 3: Fix d, and optimize the QP in (8). 4: Repeat steps 2-3 until convergence. 4.1.4. F INAL A SSIGNMENT OF d After training, the prediction of x is given by f (x) = T t=1 wt (x) + b. Unlabeled patterns whose predictions are among the u+ largest will have their corresponding dj entries assigned 1, while the others will be assigned 0. 4.2. Alternating Optimization Another natural way to solve (7) is based on alternating optimization. Note that for a fixed d, the inner maximization of (8) w.r.t. can be solved efficiently with standard QP solvers. On the other hand, when is fixed, w and b can be determined from the KKT conditions, and (7) reduces to d, i j yi yj (d ) (d ) ~~ i j l (15) i yi (d ) (d ) ~ i l+1 i=1 2 2. = i yi (xi ) ~ l 2 2 + l+1 -l+2 i=1 i yi (d ) (d ) + l+1 d - l+2 d ~ i l+2 l+1 l+2 max 1 u+ w l+u j=l+1 l+u j=l+1 The first term is not related to d and so can be removed. Moreover, since labeled data are often more reliable than unlabeled data (Joachims, 1999), so C2 C1 . Consequently, from (9), l+1 , l+2 are much smaller than the other i 's and the last term in (15) can be dropped. Thus, as an approximation, we simply maximize l l s.t. dj-l (xi ) + b , 1 w u- (1 - dj-l )(xi ) + b -. l+1 i=1 i yi (d ) ~ i (d ) l+1 - l+2 i=1 i yi (d ) (d ). ~ i l+2 Proposition 3. At optimality, di-l dj-l if f (xi ) > f (xj ), i, j Iu Proof. In order to maximize , we want to inl+u l+u crease q=l+1 dq-l f (xq ) and decrease q=l+1 (1 - dq-l )f (xq ) as much as possible. Assume, to the contrary, that the optimal d does not have the same sorted order as {f (xj )}l+u . Then, there are two j=l+1 entries of d, say, di-l and dj-l (both 0), with f (xi ) f (xj ) but di-l < dj-l . Then l+u q=l+1 dq-l f (xq ) = q=i,j dq-l f (xq ) + di-l f (xi ) + dj-l f (xj ) < q=i,j dq-l f (xq ) + dj-l f (xi ) + di-l f (xj ). A similar result holds for q=l+1 (1 - dq-l )f (xq ). Thus, interchanging the values of di-l and dj-l will increase l+u l+u q=l+1 dq-l f (xq ) and decrease q=l+1 (1-dq-l )f (xq ), and so d is not optimal, a contradiction. Using this proposition, unlabeled patterns whose predictions are among the u+ largest will have their corresponding entries in d set to one; otherwise they will be set to zero (Algorithm 2). Note that, unlike convex relaxation, alternating optimization may get stuck in a local minimum. On the other hand, its computation is very simple and, empirically, much faster. l+u On using (10) again, this can be rewritten as l+1 u+ - l+2 u- l i=1 l i yi ~ l+u l+u j=l+1 dj-l k(xi , xj ) i yi ~ i=1 j=l+1 l+u (1 - dj-l )k(xi , xj ) l = ( l+1 l+2 - ) u+ u- l i=1 dj-l j=l+1 l+u j=l+1 i=1 i yi k(xi , xj ) ~ k(xi , xj ). (16) l+2 - u- i yi ~ Since the second term and ( l+1 - l+2 ) are not related to u+ u- d, the maximization of (16) can be rewritten as max d l+u j=l+1 dj-l l i=1 i yi k(xi , xj ). ~ This can be easily solved as follows. First, we sort the l ~ i=1 i yi k(xi , xj )'s in descending order. Then, those dj 's with the corresponding j among the largest u+ values are assigned 1, while all others are assigned 0. Semi-Supervised Learning Using Label Mean Table 1. Accuracy (%) on benchmark data sets. The number in parentheses shows the relative rank of the algorithm (performance-wise) on the corresponding data set. The smaller the rank, the better the relative performance. The best performance on each data set is bolded. #labeled 10 Method 1-NN SVM TSVM Cluster-Kernel LDS Laplacian RLS Laplacian SVM meanS3vm-iter meanS3vm-mkl 1-NN SVM TSVM Cluster-Kernel LDS Laplacian RLS Laplacian SVM meanS3vm-iter meanS3vm-mkl g241c 52.12(8) 52.66(7) 75.29(1) 51.72(9) 71.15(3) 56.05(5) 53.79(6) 72.22(2) 65.48(4) 56.07(9) 76.89(6) 81.54(3) 86.51(1) 81.96(2) 75.64(8) 76.18(7) 80.00(5) 80.25(4) g241d 53.28(7) 53.34(6) 49.92(8) 57.95(2) 49.37(9) 54.32(5) 54.85(4) 57.00(3) 58.94(1) 57.55(9) 75.36(6) 77.58(2) 95.05(1) 76.26(5) 73.54(8) 73.64(7) 77.52(4) 77.58(2) Digit1 86.35(3) 69.40(9) 82.23(7) 81.27(8) 84.37(4) 94.56(1) 91.03(2) 82.98(6) 83.00(5) 96.11(5) 94.47(8) 93.85(9) 96.21(4) 96.54(3) 97.08(1) 96.87(2) 95.68(7) 95.91(6) USPS 83.34(1) 79.97(6) 74.80(9) 80.59(5) 82.43(2) 81.01(3) 80.95(4) 76.34(8) 77.84(7) 94.19(4) 90.25(8) 90.23(9) 90.32(7) 95.04(3) 95.32(1) 95.30(2) 93.83(5) 93.17(6) BCI 51.00(5) 50.15(9) 50.85(6) 51.69(3) 50.73(8) 51.03(4) 50.75(7) 51.88(2) 52.07(1) 51.33(9) 65.69(6) 66.75(5) 64.83(7) 56.03(8) 68.64(3) 67.61(4) 71.31(2) 71.44(1) Text 61.82(7) 54.63(9) 68.79(2) 57.28(8) 63.85(5) 66.32(4) 62.72(6) 69.57(1) 66.91(3) 69.89(9) 73.55(8) 75.48(7) 75.62(6) 76.85(1) 76.43(4) 76.14(5) 76.74(2) 76.60(3) Total rank 31 46 33 35 31 22 29 22 21 45 42 35 26 22 25 27 25 22 100 4.3. MeanS3VM The obtained d vector can be used to assign labels for the unlabeled data. We take these labels, together with the labels of the labeled data, to train a final SVM. The final SVMs, obtained with the d from Algorithms 1 and 2, will be denoted means3vm-mkl and means3vm-iter, respectively. It is worth noting that the MeanS3VM bears some resemblance with (Grandvalet et al., 2009), where one does not estimate the labels of some examples. is fixed to 0.1. Following (Chapelle et al., 2008), µ+ , µ- are set to the ground truth. When there are only 10 labeled examples, we simply set the Gaussian kernel width to the average distance between patterns. When there are 100 labeled examples, the kernel parameter is selected via ten-fold cross-validation. Experimental results are shown in Table 1. We also include four other SVM-type methods from (Chapelle et al., 2006), including (1) SVM using labeled data only, (2) TSVM (Joachims, 1999), (3) SVM with the cluster kernel (Weston et al., 2005), and (4) Laplacian RLS/SVM (Belkin et al., 2006). In addition, we also compare with (5) the classic 1nearest neighbor classifier, and (6) low-density separation (LDS) (Chapelle & Zien, 2005). Since the setup is the same as in (Chapelle et al., 2006), the results for these methods are simply taken from (Chapelle et al., 2006). As can be seen, the proposed methods achieve highly competitive performance with the other state-of-the-art semisupervised learning methods. In particular, means3vm-mkl is one of the most accurate algorithms. 5.2. UCI Data Sets In this section, we evaluate the proposed algorithms on 9 UCI data sets. The experiment setup follows that in (Mallapragada et al., 2009). Each data set is split into two equal halves, with one for training and the other for testing. Each training data set contains ten labeled examples and the rest are used as unlabeled examples. The experiment is repeated 20 times and the average results reported. In addition to the SVM (using labeled data only), TSVM 5. Experiments The proposed algorithms are evaluated on the benchmark data sets used in (Chapelle et al., 2006) (Section 5.1), a number of UCI data sets (Section 5.2) and a text data set (Section 5.3). We also empirically study the computational efficiency in Section 5.4. 5.1. Benchmark Data Sets We first perform experiments on the benchmark data sets (http://www.kyb.tuebingen.mpg.de/ssl-book/) in (Chapelle et al., 2006). These include g241c, g241d, Digit1, USPS, BCI and Text. There are two settings, one using 10 labeled examples and the other using 100 labeled examples. For each data set and each setting, there are twelve subsets. The average accuracy on the unlabeled data will be reported. The settings of the proposed methods are the same as that of the TSVM in (Chapelle et al., 2006). Specifically, C1 is fixed to 100, both the linear and Gaussian kernels are used and then have the better results reported. Besides, C2 Semi-Supervised Learning Using Label Mean Table 2. Accuracy (%) on UCI data. The numbers in parentheses show the number of instances (n) and dimensionality (d). The best performance on each data set is bolded. Data set (n, d) house (232,16) heart (270,9) vehicle (435,26) wdbc (569,14) isolet (600,51) austra (690,15) optdigits (1143,42) ethn (2630,30) sat (3041,36) SVM 91.16 70.59 78.28 75.74 89.58 65.64 90.31 67.04 99.13 SB-SVM 90.65 79.00 72.29 88.82 95.12 71.36 96.35 67.57 87.71 LDS 89.35 77.11 66.28 85.07 92.07 66.00 96.40 67.16 94.20 TSVM 86.55 77.63 63.62 86.40 90.38 73.38 92.34 54.69 98.26 LapSVM 89.95 77.96 71.38 91.07 93.93 74.38 98.34 74.60 99.12 means3vm-iter 91.72 74.56 82.47 79.39 98.75 68.12 98.93 73.21 99.56 means3vm-mkl 91.90 73.22 82.15 80.19 98.98 67.59 99.09 73.57 99.56 Table 3. Accuracy (%) on text categorization tasks. The best performance on each data set is bolded. SBSVM 70.74 74.83 78.47 82.64 64.06 74.85 80.12 75.26 78.31 68.07 LapSVM 68.23 71.34 74.67 78.01 61.68 70.95 74.79 71.45 74.91 65.05 means3vm -iter -mkl 84.72 84.27 90.54 90.83 88.33 88.76 91.10 91.14 66.48 66.73 81.77 81.71 77.13 77.37 84.47 84.12 81.65 80.36 66.45 72.85 Table 4. Wall clock time (in seconds). The smallest time cost on each data set is bolded. Data set BCI Text g241d g241c Digit1 USPS house heart vehicle wdbc isolet austra optdigits ethn sat (1,2) (1,3) TSVM 73.88 6181.12 596.23 552.19 1222.90 560.05 3.19 13.12 34.46 123.02 62.10 44.37 114.93 355.30 494.38 2176.65 2151.67 LapSVM 0.19 17.27 5.88 7.08 6.54 7.48 0.09 0.06 0.20 0.29 0.55 0.40 1.53 11.70 18.78 13.46 13.48 means3vm -iter -mkl 0.27 2.45 0.55 14.12 0.53 0.94 1.77 2.17 0.50 0.83 0.58 1.25 0.09 0.77 0.09 0.52 0.11 0.65 0.50 0.56 0.19 0.97 0.26 0.80 0.39 0.94 1.09 2.16 1.08 1.86 0.81 3.33 0.75 3.09 Classes (1,2) (1,3) (1,4) (1,5) (2,3) (2,4) (2,5) (3,4) (3,5) (4,5) TSVM 75.44 89.34 88.71 92.35 66.05 81.50 84.94 81.98 77.38 67.54 LDS 55.10 58.88 61.72 66.45 50.76 50.32 53.94 50.08 53.83 52.39 and Laplacian SVM (LapSVM), we also compare with the Semi-Boost SVM (SB-SVM) (Mallapragada et al., 2009) and Inductive LDS (Chapelle & Zien, 2005). As in (Mallapragada et al., 2009), parameter C1 is set to 1 and the linear kernel is used for all the SVMs. C2 is fixed to 0.1. As can be seen from Table 2, the proposed algorithms achieve highly competitive performance on all data sets. In particular, means3vm-iter is the best on 2 of the 9 tasks, while means3vm-mkl is the best on 4 tasks. 5.3. Text Categorization In this section, we evaluate the various methods using the popular 20-newsgroups data set (http://people.csail.mit. edu/jrennie/20Newsgroups/). Following Mallapragada et al. (2009), we use five of the twenty classes, and generate ten tasks from these five classes by the one-vs-one strategy. The experimental setup is as same as that in Section 5.2, except that each training data set now has only two labeled examples per class. Table 3 shows the results. As can be seen, the proposed algorithms achieve highly competitive or sometimes even the best performance on these data sets. 5.4. Speed In this section, we study the computational efficiency of the various methods1 . All the experiments are performed on a PC with 2GHz Intel Xeon(R)2-Duo running Windows XP with 4GB main memory. As can be seen from Table 4, TSVM is always slower than the Laplacian SVM and the proposed algorithms. The Laplacian SVM is slightly faster than the means3vm-mkl when the data set is small. However, on large data sets (with more than 1,000 instances), the Laplacian SVM is much slower than means3vm-mkl. Moreover, means3vmiter is consistently faster than means3vm-mkl and is almost always the fastest among all methods. In particular, on the large data sets, means3vm-iter is about 100 times faster than TSVM and 10 times faster than Laplacian SVM. The TSVM and Laplacian SVM implementations are downloaded from http://svmlight.joachims.org/ and http://manifold.cs.uchicago. edu/manifoldregularization/software.html, respectively. 1 Semi-Supervised Learning Using Label Mean 6. Conclusion In contrast to previous semi-supervised support vector machines that directly estimate the label of every unlabeled example, we propose in this paper a new approach which works by first estimating the label means of the unlabeled data. We show that the semi-supervised SVM with known label means of unlabeled data is closely related to the supervised SVM that has access to all the labels of the unlabeled examples. Based on this observation, we propose two algorithms that maximize the margin between the label means of the unlabeled data. Experiments on a broad range of data sets show that, in comparison with state-of-the-art semi-supervised learning methods, both of our proposed algorithms achieve highly competitive or sometimes even the best performance, and are much faster to train. The idea of using the label means can be applied to many other learning scenarios, and other approaches for estimating the label means will also be investigated in the future. Grandvalet, Y., Rakotomamonjy, A., Keshet, J., & Canu, S. (2009). Support vector machines with a reject option. In Advances in Neural Information Processing Systems 21, 537­544. Gretton, A., Borgwardt, K. M., Rasch, M., Sch¨ lkopf, B., o & Smola, A. J. (2006). A kernel method for the twosample-problem. In Advances in Neural Information Processing Systems 19, 513­520. Joachims, T. (1999). Transductive inference for text classification using support vector machines. Proceeding of the 16th International Conference on Machine Learning (pp. 200­209). Joachims, T., Finley, T., & Yu, C.-N. (2009). Cutting-plane training of structural SVMs. Machine Learning, to appear. Kelly, J. E. (1960). The cutting plane method for solving convex programs. Journal of Society for Industrial and Applied Mathematics, 8, 703­712. Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27­72. Li, Y.-F., Tsang, I. W., Kwok, J. T., & Zhou, Z.-H. (2009). Tighter and convex maximum margin clustering. Proceeding of the 12th International Conference on Artificial Intelligence and Statistics (pp. 344­351). Mallapragada, P., Jin, R., Jain, A., & Liu, Y. (2009). SemiBoost: Boosting for semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear. Rakotomamonjy, A., Bach, F. R., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9, 2491­2521. Sonnenburg, S., R¨ tsch, G., Sch¨ fer, C., & Sch¨ lkopf, B. a a o (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531­1565. Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453­1484. Weston, J., Leslie, C., Ie, E., Zhou, D., Elisseeff, A., & Noble, W. S. (2005). Semi-supervised protein classification using cluster kernels. Bioinformatics, 21, 3241­3247. Xu, Z., Jin, R., King, I., & Lyu, M. R. (2009). An extended level method for efficient multiple kernel learning. In Advances in Neural Information Processing Systems 21, 1825­1832. Acknowledgments Thanks to anonymous reviews for their helpful suggestions. This work was supported by the NSFC (60635030, 60721002), JiangsuSF (BK2008018), Jiangsu 333 Program and Hong Kong RGC (614508). References Bach, R. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. Proceeding of the 21st International Conference on Machine Learning (pp. 41­48). Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399­2434. Bennett, K., & Demiriz, A. (1999). Semi-supervised support vector machines. In Advances in Neural Information Processing Systems 11, 368­374. Chapelle, O., Sch¨ lkopf, B., & Zien, A. (Eds.). (2006). o Semi-supervised learning. Cambridge, MA: MIT Press. Chapelle, O., Sindhwani, V., & Keerthi, S. S. (2008). Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research, 9, 203­233. Chapelle, O., & Zien, A. (2005). Semi-supervised learning by low density separation. Proceeding of the 8th International Conference on Artificial Intelligence and Statistics (pp. 57­64).