Efficient Convex Relaxation for Transductive Support Vector Machine

Zenglin Xu
Dept. of Computer Science & Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong
zlxu@cse.cuhk.edu.hk

Rong Jin
Dept. of Computer Science & Engineering
Michigan State University
East Lansing, MI 48824
rongjin@cse.msu.edu

Jianke Zhu, Irwin King, Michael R. Lyu
Dept. of Computer Science & Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong
{jkzhu,king,lyu}@cse.cuhk.edu.hk

Abstract

We consider the problem of Support Vector Machine transduction, which involves a combinatorial problem whose computational complexity is exponential in the number of unlabeled examples. Although several studies have been devoted to Transductive SVM, they suffer either from high computational complexity or from solutions that are only locally optimal. To address this problem, we propose solving Transductive SVM via a convex relaxation that converts the NP-hard problem into a semi-definite program. Compared with the existing SDP relaxation for Transductive SVM, the proposed algorithm is computationally more efficient, with the number of free parameters reduced from O(n^2) to O(n), where n is the number of examples. An empirical study with several benchmark data sets shows the promising performance of the proposed algorithm in comparison with other state-of-the-art implementations of Transductive SVM.

1 Introduction

Semi-supervised learning has attracted an increasing amount of research interest recently [3, 15]. An important semi-supervised learning paradigm is the Transductive Support Vector Machine (TSVM), which maximizes the margin in the presence of unlabeled data and keeps the decision boundary traversing low-density regions, while respecting the labels in the input space. Since TSVM requires solving a combinatorial optimization problem, extensive research efforts have been devoted to efficiently finding approximate solutions. The popular version of TSVM proposed in [8] uses a label-switching-retraining procedure to speed up the computation. In [5], the hinge loss in TSVM is replaced by a smooth loss function, and a gradient descent method is used to find the decision boundary in a region of low density. Chapelle et al. [2] employ an iterative approach for TSVM: it begins by minimizing an easy convex objective function and then gradually approximates the objective of TSVM with more complicated functions, with the solution of each simpler problem used to initialize the solution of the more complicated one. Other iterative methods, such as deterministic annealing [11] and the concave-convex procedure (CCCP) [6], are also employed to solve the optimization problem related to TSVM. The main drawback of the approximation methods listed above is that they are susceptible to local optima, and therefore are sensitive to the initialization of solutions. To address this problem, a branch-and-bound search method is developed in [4] to find the exact solution.

[Figure 1: Computation time of the proposed convex relaxation approach for TSVM (i.e., CTSVM) and the semi-definite relaxation approach for TSVM (i.e., RTSVM) versus the number of unlabeled examples (x-axis: number of samples, 50-300; y-axis: time in seconds, 0-2000). The Course data set is used, and the number of labeled examples is 20.]
In [14], the authors approximate TSVM by a semi-definite programming problem, which leads to a relaxed solution of TSVM (denoted RTSVM) and thereby avoids locally optimal solutions. However, both approaches suffer from high computational cost and can only be applied to small data sets.

To this end, we present a convex relaxation for Transductive SVM (CTSVM). The key idea of our method is to approximate the non-convex optimization problem of TSVM by its dual problem. The advantage of doing so is twofold:

• Unlike the semi-definite relaxation [14], which approximates TSVM by dropping the rank constraint, the proposed approach approximates TSVM by its dual problem. By a basic result of convex analysis, the conjugate of the conjugate of any function f(x) is the convex envelope of f(x), and therefore provides a tight convex relaxation of f(x) [7]. Hence, the proposed approach provides a better convex relaxation than that in [14] for the optimization problem of TSVM.

• Compared to the semi-definite relaxation of TSVM, the proposed algorithm involves fewer free parameters and therefore significantly improves the efficiency, reducing the worst-case computational complexity from O(n^6.5) to O(n^4.5).

Figure 1 shows the running time of both the semi-definite relaxation of TSVM in [14] and the proposed convex relaxation for TSVM versus an increasing number of unlabeled examples. The data set used in this example is the Course data set (see the experiment section), and the number of labeled examples is 20. We clearly see that the proposed convex relaxation approach is considerably more efficient than the semi-definite relaxation approach.

The rest of this paper is organized as follows. Section 2 reviews the related work on the semi-definite relaxation for TSVM. Section 3 presents the convex relaxation approach for Transductive SVM. Section 4 presents the empirical studies that verify the effectiveness of the proposed relaxation for TSVM. Section 5 concludes the paper.

2 Related Work

In this section, we review the key formulae for Transductive SVM, followed by the semi-definite programming relaxation for TSVM.

Let X = (x_1, …, x_n) denote the entire data set, including both the labeled examples and the unlabeled ones. We assume that the first l examples within X are labeled by ŷ = (ŷ_1, ŷ_2, …, ŷ_l), where ŷ_i ∈ {−1, +1} represents the binary class label assigned to x_i. We further denote by y = (y_1, y_2, …, y_n) ∈ {−1, +1}^n the binary class labels predicted for all the data points in X. The goal of TSVM is to estimate y by using both the labeled examples and the unlabeled ones.

Following the maximum-margin framework, TSVM aims to identify the classification model that results in the maximum classification margin for both labeled and unlabeled examples, which amounts to solving the following optimization problem:

    min_{w, b, ξ, y ∈ {−1,+1}^n}  (1/2) ‖w‖^2 + C Σ_{i=1}^{n} ξ_i
    s.t.  y_i (w^⊤ x_i − b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, 2, …, n,
          y_i = ŷ_i,  i = 1, 2, …, l,

where C ≥ 0 is the trade-off parameter between the complexity of the function w and the margin errors. The prediction function is f(x) = w^⊤ x − b. Evidently, the above problem is non-convex because of the product between y_i and w in the constraints. In order to approximate the above problem by a convex program, we first rewrite it in the following form using the Lagrangian:

    min_{y ∈ {−1,+1}^n, δ, ν, λ}  (1/2) (e + δ − ν + λy)^⊤ D(y) K^{−1} D(y) (e + δ − ν + λy) + C ν^⊤ e        (1)
    s.t.  δ ≥ 0,  ν ≥ 0,  y_i = ŷ_i,  i = 1, 2, …, l,

where δ, ν and λ are the dual variables, e is the n-dimensional column vector of all ones, and K is the kernel matrix. D(y) denotes the diagonal matrix whose diagonal elements form the vector y. The detailed derivation can be found in [9, 13].
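Before turning to the relaxations, it may help to see why the problem above is combinatorial. The following minimal sketch is our own illustration, not part of the original derivation: the helper names and the use of the cvxpy package are assumptions. It solves the inner SVM for one fixed labeling and enumerates every assignment of the unlabeled points, which requires 2^(n−l) solves and is exactly the exponential search that the relaxations discussed below are designed to avoid.

```python
import itertools
import numpy as np
import cvxpy as cp

def svm_objective(X, y, C):
    """Optimal value of the standard soft-margin SVM primal for a FIXED labeling y:
    min 0.5*||w||^2 + C*sum(xi)  s.t.  y_i (w^T x_i - b) >= 1 - xi_i, xi_i >= 0."""
    n, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n, nonneg=True)
    constraints = [cp.multiply(y, X @ w - b) >= 1 - xi]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints)
    problem.solve()
    return problem.value

def tsvm_brute_force(X, y_labeled, C):
    """Exhaustive TSVM: try all 2^(n-l) labelings of the unlabeled points."""
    n, l = X.shape[0], len(y_labeled)
    best_y, best_val = None, np.inf
    for y_u in itertools.product((-1.0, 1.0), repeat=n - l):
        y = np.concatenate([np.asarray(y_labeled, dtype=float), y_u])
        val = svm_objective(X, y, C)
        if val < best_val:
            best_y, best_val = y, val
    return best_y, best_val
```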
Using the Schur complement, problem (1) can be reformulated as follows:

    min_{y ∈ {−1,+1}^n, t, δ, ν, λ}  t        (2)
    s.t.  [ K ∘ (y y^⊤)            e + δ − ν + λy ]
          [ (e + δ − ν + λy)^⊤     t − 2C ν^⊤ e   ]  ⪰ 0,
          δ ≥ 0,  ν ≥ 0,  y_i = ŷ_i,  i = 1, 2, …, l,

where the operator ∘ represents the element-wise product. To convert the above problem into a convex optimization problem, the key idea is to replace the quadratic term y y^⊤ by a linear matrix variable. Based on the result that the set S_a = {M = y y^⊤ | y ∈ {−1,+1}^n} is equivalent to the set S_b = {M | M_{i,i} = 1, rank(M) = 1}, we can approximate the problem in (2) as follows:

    min_{M, t, δ, ν}  t        (3)
    s.t.  [ K ∘ M            e + δ − ν     ]
          [ (e + δ − ν)^⊤    t − 2C ν^⊤ e  ]  ⪰ 0,
          δ ≥ 0,  ν ≥ 0,  M ⪰ 0,  M_{i,i} = 1,  i = 1, 2, …, n,

where M_{ij} = ŷ_i ŷ_j for 1 ≤ i, j ≤ l. Note that the key differences between (2) and (3) are that (a) the rank constraint rank(M) = 1 is removed, and (b) the variable λ is set to zero, which is equivalent to setting b = 0. The above approximation is often referred to as the Semi-Definite Programming (SDP) relaxation. As revealed by previous studies [14, 1], the SDP problem resulting from this approximation is computationally expensive: there are O(n^2) parameters in the SDP cone and O(n) linear inequality constraints, which implies a worst-case computational complexity of O(n^6.5). To avoid this high computational complexity, we present a different approach for relaxing TSVM into a convex problem. Compared to the SDP relaxation approach, it is advantageous in that (1) it produces the best convex approximation for TSVM, and (2) it is computationally more efficient than the previous SDP relaxation.

3 Relaxed Transductive Support Vector Machine

In this section, we follow the work on generalized maximum margin clustering [13] by first studying the case of hard margin, and then extending it to the case of soft margin.

3.1 Hard Margin

In the hard margin case, SVM does not penalize classification errors, which corresponds to ν = 0 in (1). The resulting formulation of TSVM becomes

    min_{δ, y, λ}  (1/2) (e + δ + λy)^⊤ D(y) K^{−1} D(y) (e + δ + λy)        (4)
    s.t.  δ ≥ 0,  y_i = ŷ_i,  i = 1, 2, …, l,
          y_i^2 = 1,  i = l+1, l+2, …, n.

Instead of employing the SDP relaxation as in [14], we follow the work in [13] and introduce a variable z = D(y)(e + δ) = y ∘ (e + δ). Given that δ ≥ 0, the constraints in (4) can be written as y_i z_i ≥ 1 for the labeled examples and z_i^2 ≥ 1 for all the unlabeled examples. Hence, z can be used as the prediction function, i.e., f = z. Using this new notation, the optimization problem in (4) can be rewritten as follows:

    min_{z, λ}  (1/2) (z + λe)^⊤ K^{−1} (z + λe)        (5)
    s.t.  y_i z_i ≥ 1,  i = 1, 2, …, l,
          z_i^2 ≥ 1,  i = l+1, l+2, …, n.

One problem with Transductive SVMs is that it is possible to assign all the unlabeled data to one of the classes with a very large margin, owing to the high dimensionality and the small number of labeled examples; this leads to poor generalization. To address this problem, we introduce the following balance constraint to ensure that no single class takes all the unlabeled examples:

    −ε ≤ (1/l) Σ_{i=1}^{l} z_i − (1/(n−l)) Σ_{i=l+1}^{n} z_i ≤ ε,        (6)

where ε ≥ 0 is a constant. Through the above constraint, we aim to ensure that the difference between the labeled data and the unlabeled data in their average class assignment is small.
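As a brief aside (our own addition, using only the definitions above, with amsmath assumed), the substitution z = D(y)(e + δ) and the balance constraint can be written out explicitly, which makes clear why the constraints of (4) turn into those of (5) and why (6) is simply a pair of linear inequalities in z:

```latex
% Since z_i = y_i (1 + \delta_i) with \delta_i \ge 0 and y_i \in \{-1,+1\}:
\begin{align*}
  y_i z_i &= y_i^2 (1+\delta_i) = 1+\delta_i \;\ge\; 1, & i &= 1,\dots,l,\\
  z_i^2   &= y_i^2 (1+\delta_i)^2 = (1+\delta_i)^2 \;\ge\; 1, & i &= l+1,\dots,n,\\
  \operatorname{sign}(z_i) &= y_i, & i &= 1,\dots,n,
\end{align*}
% and the balance constraint (6) is the pair of linear inequalities
\[
  c^\top z \le \epsilon, \qquad -c^\top z \le \epsilon,
  \qquad \text{where } c = \bigl(\tfrac{1}{l}\mathbf{1}_l,\ -\tfrac{1}{n-l}\mathbf{1}_{n-l}\bigr).
\]
```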
To simplify the expression, we further define w = (z, λ) ∈ R^{n+1} and P = (I_n, e) ∈ R^{n×(n+1)}, so that Pw = z + λe. Then the problem in (5) becomes:

    min_{w}  w^⊤ P^⊤ K^{−1} P w        (7)
    s.t.  y_i w_i ≥ 1,  i = 1, 2, …, l,
          w_i^2 ≥ 1,  i = l+1, l+2, …, n,
          −ε ≤ (1/l) Σ_{i=1}^{l} w_i − (1/(n−l)) Σ_{i=l+1}^{n} w_i ≤ ε.

Once this problem is solved, the label vector y can be determined directly by the sign of the prediction function, i.e., y = sign(z) with z the first n entries of w. This is because w_i = (1 + δ_i) y_i for i = l+1, …, n and δ_i ≥ 0. The following theorem shows that the problem in (7) can be relaxed to a semi-definite program.

Theorem 1. Given a sample X = {x_1, …, x_n} and a partial set of labels ŷ = (ŷ_1, ŷ_2, …, ŷ_l), where 1 ≤ l ≤ n, the variable w that optimizes (7) can be calculated as

    w = (1/2) [A − D(α ∘ b)]^{−1} (α ∘ a − (γ − μ) c),        (8)

where a = (ŷ, 0_{n−l}, 0) ∈ R^{n+1}, b = (0_l, 1_{n−l}, 0) ∈ R^{n+1}, c = ((1/l) 1_l, −(1/(n−l)) 1_{n−l}, 0) ∈ R^{n+1}, A = P^⊤ K^{−1} P, and α is determined by the following semi-definite program:

    max_{α, γ, μ, t}  −(1/4) t + Σ_{i=1}^{n} α_i − ε (γ + μ)        (9)
    s.t.  [ A − D(α ∘ b)               α ∘ a − (γ − μ) c ]
          [ (α ∘ a − (γ − μ) c)^⊤      t                 ]  ⪰ 0,
          γ ≥ 0,  μ ≥ 0,  α_i ≥ 0,  i = 1, 2, …, n.

Proof sketch. We define the Lagrangian of the minimization problem (7) as follows:

    min_w max_{α, γ, μ}  F(w, α, γ, μ) = w^⊤ P^⊤ K^{−1} P w + Σ_{i=1}^{l} α_i (1 − y_i w_i) + Σ_{i=l+1}^{n} α_i (1 − w_i^2) + γ (c^⊤ w − ε) + μ (−c^⊤ w − ε),

where α_i ≥ 0 for i = 1, …, n, and γ, μ ≥ 0. It follows from duality that min_w max_{α,γ,μ} F(w, α, γ, μ) = max_{α,γ,μ} min_w F(w, α, γ, μ). At the optimum, the derivative of F with respect to w vanishes:

    ∂F/∂w = 2 [A − D(α ∘ b)] w − α ∘ a + (γ − μ) c = 0,

where A = P^⊤ K^{−1} P; the inverse of A − D(α ∘ b) can be computed after adding a small regularization term. Therefore, w can be calculated as

    w = (1/2) [A − D(α ∘ b)]^{−1} (α ∘ a − (γ − μ) c).

Thus, the dual form of the problem becomes

    max_{α, γ, μ}  L(α, γ, μ) = −(1/4) (α ∘ a − (γ − μ) c)^⊤ [A − D(α ∘ b)]^{−1} (α ∘ a − (γ − μ) c) + Σ_{i=1}^{n} α_i − ε (γ + μ).

We introduce a variable t such that

    (α ∘ a − (γ − μ) c)^⊤ [A − D(α ∘ b)]^{−1} (α ∘ a − (γ − μ) c) ≤ t,

i.e., −(1/4)(α ∘ a − (γ − μ) c)^⊤ [A − D(α ∘ b)]^{−1} (α ∘ a − (γ − μ) c) ≥ −(1/4) t. Applying the Schur complement to this constraint yields the semi-definite cone, from which the optimization problem (9) follows.

Remark I. The problem in (9) is a convex optimization problem — more specifically, a semi-definite program — and can be efficiently solved by the interior-point method [10] implemented in optimization packages such as SeDuMi [12]. Moreover, our relaxation has O(n) parameters in the SDP cone and O(n) linear inequality constraints, which leads to a worst-case computational complexity of O(n^4.5). In the previous relaxation algorithms [1, 14], by contrast, there are approximately O(n^2) parameters in the SDP cone, which leads to a worst-case computational complexity on the order of O(n^6.5). Therefore, our proposed convex relaxation algorithm is more efficient. In addition, as analyzed in Section 2, the approximation in [1, 14] drops the rank constraint on the matrix y y^⊤, which does not lead to a tight approximation. Our relaxation, on the other hand, implements the conjugate of the conjugate of the prediction function f(x), which is the convex envelope of f(x) [7]. Thus, our proposed convex approximation provides a tighter approximation than the previous method.

Remark II. It is interesting to discuss the connection between the solution of the proposed algorithm and that of harmonic functions. We consider the special case of (8) where λ = 0 (which implies no bias term in the primal SVM) and there is no balance constraint. Then the solution of (9) can be expressed as follows:

    z = (1/2) [K^{−1} − D(α ∘ (0_l, 1_{n−l}))]^{−1} (α ∘ (ŷ, 0_{n−l})).        (10)

It can be further rewritten as

    z = (1/2) (I_n − Σ_{i=l+1}^{n} α_i K I_i^n)^{−1} Σ_{i=1}^{l} α_i ŷ_i K(x_i, ·),        (11)

where I_i^n is defined as the n × n matrix whose elements are all zero except the i-th diagonal element, which is 1, and K(x_i, ·) is the i-th column of K. Similarly to the harmonic-function solution, we first propagate the class labels from the labeled examples to the unlabeled ones through the term Σ_{i=1}^{l} α_i ŷ_i K(x_i, ·), and then adjust the predictions by the factor (I_n − Σ_{i=l+1}^{n} α_i K I_i^n)^{−1}. The key differences in our solution are that (1) different weights (i.e., α_i) are assigned to the labeled examples, and (2) the adjustment factor is different from that of the harmonic function [16].
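As a small illustration of Remark II (our own sketch, not code from the paper), the special case (10)-(11) can be evaluated in a few lines of numpy once the multipliers α are available; here they are simply passed in, whereas in practice they would be obtained by solving (9) with the bias term and balance constraint removed.

```python
import numpy as np

def ctsvm_closed_form(K, y_labeled, alpha):
    """Special case of Remark II: no bias term, no balance constraint.
    K is the n x n kernel matrix, y_labeled holds the l given labels in {-1,+1},
    and alpha holds the n nonnegative multipliers obtained from (9)."""
    n, l = K.shape[0], len(y_labeled)
    y_labeled = np.asarray(y_labeled, dtype=float)
    # Label-propagation term: sum_{i<=l} alpha_i * y_i * K(x_i, .)
    propagate = K[:, :l] @ (alpha[:l] * y_labeled)
    # Adjustment factor: (I_n - sum_{i>l} alpha_i * K * I_i^n)^{-1};
    # the sum scales column i of K by alpha_i (zero for the labeled columns).
    col_scale = np.concatenate([np.zeros(l), alpha[l:]])
    adjust = np.linalg.inv(np.eye(n) - K * col_scale)
    z = 0.5 * adjust @ propagate   # the constant 1/2 does not change the signs
    return np.sign(z), z
```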
3.2 Soft Margin

We extend TSVM to the soft-margin case by considering the following problem:

    min_{δ, ν, y, λ}  (1/2) (e + δ − ν + λy)^⊤ D(y) K^{−1} D(y) (e + δ − ν + λy) + C_l Σ_{i=1}^{l} ν_i^2 + C_u Σ_{i=l+1}^{n} ν_i^2
    s.t.  δ ≥ 0,  ν ≥ 0,  y_i = ŷ_i,  1 ≤ i ≤ l,
          y_i^2 = 1,  l+1 ≤ i ≤ n,

where ν_i is related to the margin error. Note that we distinguish the labeled examples from the unlabeled examples by introducing different penalty constants for margin errors: C_l for labeled examples and C_u for unlabeled examples. Similarly to the hard-margin case, we introduce the slack variable z and then derive the following dual problem:

    max_{α, γ, μ, t}  −(1/4) t + Σ_{i=1}^{n} α_i − ε (γ + μ)        (12)
    s.t.  [ A − D(α ∘ b)               α ∘ a − (γ − μ) c ]
          [ (α ∘ a − (γ − μ) c)^⊤      t                 ]  ⪰ 0,
          0 ≤ α_i ≤ C_l,  i = 1, 2, …, l,
          0 ≤ α_i ≤ C_u,  i = l+1, l+2, …, n,
          γ ≥ 0,  μ ≥ 0,

which is also a semi-definite programming problem and can be solved similarly.
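Both (9) and (12) are semi-definite programs of the same shape, so one solver sketch covers them. The following cvxpy formulation is a minimal illustration under our own naming assumptions (the function name, the small regularizer reg used to invert K, and the choice of the SCS solver), not the SeDuMi-based implementation used in the experiments; leaving C_l and C_u unset removes the upper bounds on α and recovers the hard-margin problem (9).

```python
import numpy as np
import cvxpy as cp

def ctsvm_relaxation(K, y_labeled, eps=0.1, C_l=None, C_u=None, reg=1e-6):
    """Sketch of the CTSVM relaxation: the soft-margin dual (12), which reduces
    to the hard-margin problem (9) when C_l and C_u are None (no upper bounds)."""
    n, l = K.shape[0], len(y_labeled)
    u = n - l
    y_labeled = np.asarray(y_labeled, dtype=float)
    # Quantities from Theorem 1; vectors live in R^{n+1}, the last coordinate
    # corresponding to the bias-like variable lambda.
    a = np.concatenate([y_labeled, np.zeros(u + 1)])
    b = np.concatenate([np.zeros(l), np.ones(u), [0.0]])
    c = np.concatenate([np.ones(l) / l, -np.ones(u) / u, [0.0]])
    P = np.hstack([np.eye(n), np.ones((n, 1))])                  # P = (I_n, e)
    A = P.T @ np.linalg.inv(K + reg * np.eye(n)) @ P             # A = P^T K^{-1} P

    alpha = cp.Variable(n + 1, nonneg=True)                      # last entry is padding
    gamma, mu, t = cp.Variable(nonneg=True), cp.Variable(nonneg=True), cp.Variable()
    v = cp.multiply(alpha, a) - (gamma - mu) * c                 # alpha o a - (gamma - mu) c
    M = cp.bmat([[A - cp.diag(cp.multiply(alpha, b)), cp.reshape(v, (n + 1, 1))],
                 [cp.reshape(v, (1, n + 1)),           cp.reshape(t, (1, 1))]])
    constraints = [0.5 * (M + M.T) >> 0, alpha[n] == 0]          # symmetrized LMI of (9)/(12)
    if C_l is not None:
        constraints += [alpha[:l] <= C_l]                        # labeled:   0 <= alpha_i <= C_l
    if C_u is not None:
        constraints += [alpha[l:n] <= C_u]                       # unlabeled: 0 <= alpha_i <= C_u
    objective = cp.Maximize(-0.25 * t + cp.sum(alpha[:n]) - eps * (gamma + mu))
    cp.Problem(objective, constraints).solve(solver=cp.SCS)

    # Recover w from (8) and predict labels by the sign of its first n entries.
    av = alpha.value
    lhs = A - np.diag(av * b) + reg * np.eye(n + 1)
    w = 0.5 * np.linalg.solve(lhs, av * a - (gamma.value - mu.value) * c)
    return np.sign(w[:n]), w
```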
4 Experiments

In this section, we report an empirical study of the proposed method on several benchmark data sets.

4.1 Data Sets Description

To make the evaluation comprehensive, we collected four UCI data sets and three text data sets as our experimental testbeds. The UCI data sets are Iono, Sonar, Banana, and Breast, which are widely used in data classification. The WinMac data set consists of the classes mswindows and mac of the Newsgroup20 data set. The IBM data set contains the classes IBM and non-IBM of the Newsgroup20 data set. The Course data set is made up of the course pages and non-course pages of the WebKB corpus. For each text data set, we randomly sample the data with sample sizes of 60, 300 and 1000, respectively. Each resulting sample is denoted by the suffix "-s", "-m", or "-l", depending on whether the sample size is small, medium, or large. Table 1 summarizes these data sets, where d represents the data dimensionality, l the number of labeled data points, and n the total number of examples.

Table 1: Data sets used in the experiments.

    Data set    d      l   n    |  Data set    d      l   n
    Iono        34     20  351  |  WinMac-m    7511   20  300
    Sonar       60     20  208  |  IBM-m       11960  10  300
    Banana      4      20  400  |  Course-m    1800   20  300
    Breast      9      20  300  |  WinMac-l    7511   50  1000
    IBM-s       11960  10  60   |  IBM-l       11960  50  1000
    Course-s    1800   10  60   |  Course-l    1800   50  1000

4.2 Experimental Protocol

To evaluate the effectiveness of the proposed CTSVM method, we choose the conventional SVM as our baseline method. In our experiments, we also compare with three state-of-the-art methods: the SVM-light algorithm [8], the Gradient Descent TSVM (∇TSVM) algorithm [5], and the Concave-Convex Procedure (CCCP) [6]. Since the SDP-relaxation TSVM [14] has a very high time complexity of O(n^6.5), it is difficult to apply to data sets with more than a few hundred examples; it is therefore only evaluated on the smallest data sets, i.e., "IBM-s" and "Course-s".

The experimental setup is as follows. For each data set, we conduct 10 trials. In each trial, the labeled training set contains examples from both classes, and the remaining data are used as the unlabeled (test) data. The RBF kernel is used for "Iono", "Sonar" and "Banana", and the linear kernel is used for the other data sets, since the linear kernel performs better than the RBF kernel on them. The width of the RBF kernel is chosen by 5-fold cross-validation on the labeled data. The margin parameter C_l is tuned by using the labeled data in all algorithms. Because of the small number of labeled examples, the margin parameter for the unlabeled data, C_u, is set equal to C_l for CTSVM and CCCP. Other parameters in these algorithms are set to the default values given in the corresponding literature.

4.3 Experimental Results

Table 2: Classification accuracy (%, mean ± standard deviation) of Transductive SVMs on the benchmark data sets. The dash for SVM-light on "Banana" indicates that the algorithm failed to converge (see text).

    Data Set    SVM            SVM-light      ∇TSVM          CCCP           CTSVM
    IBM-s       52.75±15.01    67.60±9.29     65.80±6.56     65.62±14.83    75.25±7.49
    Course-s    63.52±5.82     76.82±4.78     75.80±12.87    74.20±11.50    79.75±8.45
    Iono        78.55±4.83     78.25±0.36     81.72±4.50     82.11±3.83     80.09±2.63
    Sonar       51.76±5.05     55.26±5.88     69.36±4.69     56.01±6.70     67.39±6.26
    Banana      58.45±7.15     –              71.54±7.28     79.33±4.22     79.51±3.02
    Breast      96.46±1.18     95.68±1.82     97.17±0.35     96.89±0.67     97.79±0.23
    WinMac-m    57.64±9.58     79.42±4.60     81.03±8.23     84.28±8.84     84.82±2.12
    IBM-m       53.00±6.83     67.55±6.74     64.65±13.38    69.62±11.03    73.17±0.89
    Course-m    80.18±1.27     93.89±1.49     90.35±3.59     88.78±2.87     92.92±2.28
    WinMac-l    60.86±10.10    89.81±2.10     90.19±2.65     91.00±2.42     91.25±2.67
    IBM-l       61.82±7.26     75.40±2.26     73.11±1.99     74.80±1.87     73.42±3.23
    Course-l    83.56±3.10     92.35±3.02     93.58±2.68     91.32±4.08     94.62±0.97

Table 2 summarizes the classification accuracy and the standard deviations of the proposed algorithm, the baseline method, and the state-of-the-art methods. It can be observed that our proposed algorithm performs significantly better than the standard SVM across all the data sets. Moreover, on the small-sized data sets, i.e., "IBM-s" and "Course-s", the results of the SDP-relaxation method are 68.57±22.73 and 64.03±7.65, respectively, which are worse than those of the proposed CTSVM method. In addition, the proposed CTSVM algorithm performs much better than the other TSVM methods on "WinMac-m" and "Course-l". As shown in Table 2, the SVM-light algorithm achieves the best results on "Course-m" and "IBM-l"; however, it fails to converge on "Banana". On the remaining data sets, our proposed algorithm obtains comparable results. Overall, the empirical evaluation indicates that our proposed CTSVM method achieves promising classification results compared with the state-of-the-art methods.

5 Conclusion and Future Work

This paper presents a novel method for Transductive SVM that relaxes the unknown labels to continuous variables. In contrast to the previous relaxation method, which involves O(n^2) free parameters in the semi-definite matrix, our method reduces the number of free parameters to O(n) and can therefore solve the optimization problem more efficiently. In addition, the proposed approach provides a tighter convex relaxation for the optimization problem of TSVM. Empirical studies on benchmark data sets demonstrate that the proposed method is more efficient than the previous semi-definite relaxation method and achieves promising classification results compared with the state-of-the-art methods. Since the current model is only designed for binary classification, we plan to develop a multi-class Transductive SVM model in the future.
Moreover, it is desirable to extend the current model so that it can classify new incoming data.

Acknowledgments

The work described in this paper is supported by a CUHK Internal Grant (No. 2050346) and a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4150/07E).

References

[1] T. De Bie and N. Cristianini. Convex methods for transduction. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[2] O. Chapelle, M. Chi, and A. Zien. A continuation method for semi-supervised SVMs. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 185–192, New York, NY, USA, 2006. ACM Press.
[3] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[4] O. Chapelle, V. Sindhwani, and S. Keerthi. Branch and bound for semi-supervised support vector machines. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
[5] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 57–64, 2005.
[6] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687–1712, 2006.
[7] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Springer-Verlag, New York, 1993.
[8] T. Joachims. Transductive inference for text classification using support vector machines. In ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning, pages 200–209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[9] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[10] Y. Nesterov and A. Nemirovsky. Interior Point Polynomial Methods in Convex Programming: Theory and Applications. Studies in Applied Mathematics. SIAM, Philadelphia, 1994.
[11] V. Sindhwani, S. S. Keerthi, and O. Chapelle. Deterministic annealing for semi-supervised kernel machines. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 841–848, New York, NY, USA, 2006. ACM Press.
[12] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11:625–653, 1999.
[13] H. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
[14] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In AAAI, pages 904–910, 2005.
[15] X. Zhu. Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison, 2005.
[16] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 912–919, 2003.