Large Scale Semi-supervised Linear SVMs

Vikas Sindhwani, Department of Computer Science, University of Chicago, Chicago, IL 60637, USA (vikass@cs.uchicago.edu)
S. Sathiya Keerthi, Yahoo! Research, Media Studios North, Burbank, CA 91504, USA (selvarak@yahoo-inc.com)

ABSTRACT

Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with possibly very large numbers of examples and features. At their core, our algorithms employ recently developed modified finite Newton techniques. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a Deterministic Annealing (DA) approach. This algorithm alleviates the problem of local minima in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning; I.5.2 [Pattern Recognition]: Design Methodology--Classifier design and evaluation

General Terms
Algorithms, Performance, Experimentation

Keywords
Support Vector Machines, Unlabeled data, Global optimization, Text Categorization

1. INTRODUCTION

Consider the following situation: In a single web-crawl, search engines like Yahoo! and Google index billions of documents. Only a very small fraction of these documents can possibly be hand-labeled by human editorial teams and assembled into topic directories. In information retrieval relevance feedback, a user labels a small number of documents returned by an initial query as being relevant or not. The remaining documents form a massive collection of unlabeled data. Despite its natural and pervasive need, solutions to the problem of utilizing unlabeled data with labeled examples have only recently emerged in the machine learning literature. Whereas the abundance of unlabeled data is frequently acknowledged as a motivation in most papers, the true potential of semi-supervised learning in large scale settings is yet to be systematically explored. This appears to be partly due to the lack of scalable tools to handle large volumes of data.

In this paper, we propose extensions of linear Support Vector Machines (SVMs) for semi-supervised classification. Linear techniques are often the method of choice in many applications due to their simplicity and interpretability. When data appears in a rich high-dimensional representation, linear functions often provide a sufficiently complex hypothesis space for learning high-quality classifiers. This has been established, for example, for document classification with linear SVMs in numerous studies. Our methods are motivated by the intuition of margin maximization for semi-supervised SVMs [13, 5, 1, 6, 3, 4].
The key idea is to bias the classification hyperplane to pass through a low data density region, keeping points in each data cluster on the same side of the hyperplane while respecting labels. This approach uses an extended SVM objective function with a non-convex loss term over the unlabeled examples to implement the cluster assumption in semi-supervised learning (the assumption that points in a cluster should have similar labels; the role of unlabeled data is to identify clusters and high density regions in the input space). This idea is of historical importance as one of the first concrete proposals for learning from unlabeled data; its popular implementation in [5] is considered state-of-the-art in text categorization, even in the face of increasing recent competition.

We highlight the main contributions of this paper. (1) We outline an implementation for a variant of Transductive SVM [5] designed for linear semi-supervised classification on large, sparse datasets. As compared to currently used dual techniques (e.g., in the SVMlight implementation of TSVM), our method effectively exploits data sparsity and the linearity of the problem to provide superior scalability. Additionally, we propose a multiple switching heuristic that further improves TSVM training by an order of magnitude. These speed enhancements turn TSVM into a feasible tool for large scale applications. (2) We propose a novel algorithm for semi-supervised SVMs inspired by Deterministic Annealing (DA) techniques. This approach generates a family of objective functions whose non-convexity is controlled by an annealing parameter. The global minimizer is parametrically tracked in this family. This approach alleviates the problem of local minima in the TSVM optimization procedure, which results in better solutions on some problems. A computationally attractive training algorithm is presented that involves a sequence of alternating convex optimizations. (3) We conduct an experimental study on many document classification tasks with several thousands of examples and features. This study clearly shows the utility of our tools for large scale problems.

The modified finite Newton algorithm (abbreviated l2-SVM-MFN) of Keerthi and Decoste [8] for fast training of linear SVMs is a key subroutine for our algorithms. This paper is arranged as follows. In Section 2 we describe the l2-SVM-MFN algorithm, and we present its semi-supervised extensions in Section 3. Experimental results are reported in Section 4. Section 5 contains some concluding comments. A more detailed version of this paper is available at [11].
2. MODIFIED FINITE NEWTON LINEAR L2-SVM

The modified finite Newton l2-SVM method [8] (l2-SVM-MFN) is a recently developed training algorithm for linear SVMs that is ideally suited to sparse datasets with a large number of examples and possibly a large number of features. Given a binary classification problem with l labeled examples {x_i, y_i}_{i=1}^l, where the input patterns x_i ∈ R^d (e.g., documents) and the labels y_i ∈ {+1, -1}, l2-SVM-MFN provides an efficient primal solution to the following SVM optimization problem:

    w* = argmin_{w ∈ R^d}  (1/2) Σ_{i=1}^l l2(y_i w^T x_i) + (λ/2) ||w||²    (1)

where l2 is the l2-SVM loss given by l2(z) = max(0, 1 - z)², λ is a real-valued regularization parameter, and the final classifier is given by sign(w^T x). This objective function differs from the standard SVM problem in some respects. First, instead of using the hinge loss as the data fitting term, the square of the hinge loss (the so-called quadratic soft margin loss function) is used. This makes the objective function continuously differentiable, allowing easier applicability of gradient techniques. Secondly, the bias term ("b") is also regularized. In the problem formulation of Eqn. 1, it is implicitly assumed that an additional component in the weight vector and a constant feature in the example vectors have been added to indirectly incorporate the bias. This formulation combines the simplicity of a least squares aspect with algorithmic advantages associated with SVMs. We also note that all the discussion in this paper can be applied to other loss functions, such as Huber's loss and the rounded hinge loss, using the modifications outlined in [8].

We consider a version of l2-SVM-MFN where a weighted quadratic soft margin loss function is used:

    min_w  f(w) = (1/2) Σ_{i ∈ j(w)} c_i l2(y_i w^T x_i) + (λ/2) ||w||²    (2)

Here we have rewritten Eqn. 1 in terms of the support vector set j(w) = {i : y_i (w^T x_i) < 1}. Additionally, the loss associated with the i-th example has a cost c_i. f(w) refers to the objective function being minimized, evaluated at a candidate solution w. Note that if the index set j(w) were independent of w and ran over all data points, this would simply be the objective function for weighted linear regularized least squares (RLS). Following [8], we observe that f is a strictly convex, piecewise quadratic, continuously differentiable function having a unique minimizer. The gradient of f at w is given by:

    ∇f(w) = λ w + X_{j(w)}^T C_{j(w)} [X_{j(w)} w - Y_{j(w)}]

where X_{j(w)} is a matrix whose rows are the feature vectors of training points corresponding to the index set j(w), Y_{j(w)} is a column vector containing the labels for these points, and C_{j(w)} is a diagonal matrix that contains the costs c_i for these points along its diagonal.

l2-SVM-MFN is a primal algorithm that uses Newton's method for unconstrained minimization of a convex function. The classical Newton's method is based on a second order approximation of the objective function and involves updates of the following kind:

    w^{k+1} = w^k + δ^k n^k    (3)

where the step size δ^k ∈ R, and the Newton direction n^k ∈ R^d is given by n^k = -[∇²f(w^k)]^{-1} ∇f(w^k). Here, ∇f(w^k) is the gradient vector and ∇²f(w^k) is the Hessian matrix of f at w^k. However, the Hessian does not exist everywhere, since f is not twice differentiable at those weight vectors w where w^T x_i = y_i for some index i (in the neighborhood of such a w, the index i leaves or enters j(w); however, at such w, y_i w^T x_i = 1, so f is continuously differentiable in spite of these index jumps). Thus a generalized definition of the Hessian matrix is used.
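To make these definitions concrete, the following NumPy sketch (ours, not the authors' implementation; the names X, Y, c and lam are illustrative) evaluates the support vector set j(w), the weighted objective of Eqn. 2, and the gradient formula above.

import numpy as np

def active_set(w, X, Y):
    # j(w) = {i : y_i (w^T x_i) < 1}
    return (Y * (X @ w)) < 1.0

def objective(w, X, Y, c, lam):
    # f(w) = (1/2) sum_{i in j(w)} c_i (1 - y_i w^T x_i)^2 + (lam/2) ||w||^2
    J = active_set(w, X, Y)
    resid = 1.0 - Y[J] * (X[J] @ w)
    return 0.5 * np.dot(c[J], resid ** 2) + 0.5 * lam * np.dot(w, w)

def gradient(w, X, Y, c, lam):
    # grad f(w) = lam*w + X_j^T C_j (X_j w - Y_j), restricted to the active set
    J = active_set(w, X, Y)
    Xj, Yj, cj = X[J], Y[J], c[J]
    return lam * w + Xj.T @ (cj * (Xj @ w - Yj))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    Y = np.where(rng.standard_normal(100) > 0, 1.0, -1.0)
    c = np.full(100, 1.0 / 100)          # e.g. uniform costs c_i = 1/l
    w = np.zeros(20)
    print(objective(w, X, Y, c, lam=1e-2),
          np.linalg.norm(gradient(w, X, Y, c, lam=1e-2)))

With the index set held fixed, the objective reduces to the weighted RLS problem mentioned above, which is exactly what the Newton step described next exploits.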
The modified finite Newton procedure [8] proceeds as follows. The step w̄^k = w^k + n^k in the Newton direction can be seen to be given by solving the following linear system, associated with a weighted linear regularized least squares problem over the data subset defined by the indices j(w^k):

    [λ I + X_{j(w^k)}^T C_{j(w^k)} X_{j(w^k)}] w̄^k = X_{j(w^k)}^T C_{j(w^k)} Y_{j(w^k)}    (4)

where I is the identity matrix. Once w̄^k is obtained, w^{k+1} is obtained from Eqn. 3 by setting w^{k+1} = w^k + δ^k (w̄^k - w^k) after performing an exact line search for δ^k, i.e., by exactly solving the one-dimensional minimization problem:

    δ^k = argmin_{δ ≥ 0}  φ(δ) = f(w^k + δ (w̄^k - w^k))    (5)

The modified finite Newton procedure has the property of finite convergence to the optimal solution. The key features that bring scalability and numerical robustness to l2-SVM-MFN are: (a) solving the regularized least squares system of Eqn. 4 by a numerically well-behaved conjugate gradient scheme referred to as CGLS, which is designed for large, sparse data matrices X; the benefit of the least squares aspect of the loss function comes in here to provide access to a powerful set of tools in numerical computation. (b) Due to the one-sided nature of margin loss functions, these systems are required to be solved over only restricted index sets j(w), which can be much smaller than the whole dataset. This also allows additional heuristics to be developed, such as terminating CGLS early when working with a crude starting guess like 0, and allowing the following line search step to yield a point where the index set j(w) is small. Subsequent optimization steps then work on smaller subsets of the data. Below, we briefly discuss the CGLS and line search procedures. We refer the reader to [8] for full details.

2.1 CGLS

CGLS is a special conjugate-gradient solver that is designed to solve, in a numerically robust way, large, sparse, weighted regularized least squares problems such as the one in Eqn. 4. Starting with a guess solution, several specialized conjugate-gradient iterations are applied to get the w̄^k that solves Eqn. 4. The major expense in each iteration consists of two operations of the form X_{j(w^k)} p and X_{j(w^k)}^T q. If there are n_0 non-zero elements in the data matrix, these involve O(n_0) cost. It is worth noting that, as a subroutine of l2-SVM-MFN, CGLS is typically called on a small subset X_{j(w^k)} of the full data set. To compute the exact solution of Eqn. 4, r iterations are needed, where r is the rank of X_{j(w^k)}. But, in practice, such an exact solution is unnecessary: CGLS uses an effective stopping criterion for early termination. The total cost of CGLS is O(t_cgls n_0), where t_cgls is the number of iterations, which depends on the practical rank of X_{j(w^k)} and is typically found to be very small relative to the dimensions of X_{j(w^k)} (number of examples and features). The memory requirements of CGLS are also minimal: only five vectors need to be maintained, including the outputs over the currently active set of data points.
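As an illustration of step (a), the following self-contained sketch applies a plain conjugate gradient iteration to Eqn. 4, using only matrix-vector products with X_{j(w)} and its transpose. This is an assumption-laden stand-in rather than the authors' solver: the actual CGLS algorithm of [8] is a numerically more careful reformulation of the same iteration. Names (Xj, Yj, cj, lam) are illustrative.

import numpy as np

def cg_regularized_ls(Xj, Yj, cj, lam, w0=None, tol=1e-6, max_iter=200):
    """Solve (lam*I + Xj^T Cj Xj) w = Xj^T Cj Yj by conjugate gradient."""
    d = Xj.shape[1]
    w = np.zeros(d) if w0 is None else w0.copy()   # seeding: pass a warm start as w0

    def A(v):                                      # v -> (lam*I + Xj^T Cj Xj) v
        return lam * v + Xj.T @ (cj * (Xj @ v))

    b = Xj.T @ (cj * Yj)
    r = b - A(w)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:                      # early termination on a small residual
            break
        Ap = A(p)
        alpha = rs / (p @ Ap)
        w += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

Passing the previous solution as w0 corresponds to the seeding feature discussed next.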
Finally, an important feature of CGLS is worth emphasizing. Suppose the solution w of a regularized least squares problem is available, i.e., the linear system in Eqn. 4 has been solved using CGLS. If there is a need to solve a perturbed linear system, it is greatly advantageous in many settings to start the CG iterations for the new system with w as the initial guess. This is called seeding. If the starting residual is small, CGLS can converge much faster than with a guess of the 0 vector. The utility of this feature depends on the nature and degree of perturbation. In l2-SVM-MFN, the candidate solution w^k obtained after line search in iteration k is seeded for the CGLS computation of w̄^k. Also, in tuning λ over a range of values, it is valuable to seed the solution for a particular λ onto the next value. For the semi-supervised SVM implementations with l2-SVM-MFN, we will seed solutions across linear systems with slightly perturbed label vectors, data matrices and costs.

2.2 Line Search

Given the vectors w^k, w̄^k in some iteration of l2-SVM-MFN, the line search step requires us to solve Eqn. 5. The one-dimensional function φ(δ) is the restriction of the objective function f to the ray from w^k onto w̄^k. Hence, like f, φ(δ) is also a continuously differentiable, strictly convex, piecewise quadratic function with a unique minimizer. φ' is a continuous piecewise linear function whose root, δ^k, can be easily found by sorting the break points where its slope changes and then performing a sequential search on that sorted list. The cost of this operation is negligible compared to the cost of the CGLS iterations.

l2-SVM-MFN alternates between calls to CGLS and line searches. Its computational complexity is O(t_mfn t̄_cgls n_0), where t_mfn is the number of outer iterations of CGLS calls and line search, and t̄_cgls is the average number of CGLS iterations. These depend on the data set and the tolerance desired in the stopping criterion, but are typically very small. For example, on a text classification experiment involving 198788 examples and 252472 features, t_mfn = 11 and t̄_cgls = 102. Therefore, the complexity of l2-SVM-MFN is effectively linear in the number of entries in the data matrix.
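A minimal sketch of the exact line search of Section 2.2, written by us under the weighted objective of Eqn. 2 (variable names are illustrative): the breakpoints of φ' occur where an index enters or leaves j(w) along the ray, and φ' is linear between consecutive breakpoints, so once its sign change is bracketed the root follows by interpolation.

import numpy as np

def exact_line_search(w, w_bar, X, Y, c, lam):
    """Exact minimizer of phi(delta) = f(w + delta*(w_bar - w)), delta >= 0."""
    d = w_bar - w
    o = X @ w                      # current outputs
    q = X @ d                      # change of outputs along the ray
    yo, yq = Y * o, Y * q

    def dphi(delta):
        # phi'(delta) = lam*(w+delta*d).d + sum_{i active} c_i*(y_i(o_i+delta*q_i) - 1)*y_i*q_i
        margins = yo + delta * yq
        act = margins < 1.0
        return lam * np.dot(w + delta * d, d) + np.dot(c[act] * (margins[act] - 1.0), yq[act])

    # breakpoints: delta values where y_i (o_i + delta q_i) = 1
    with np.errstate(divide="ignore", invalid="ignore"):
        bp = (1.0 - yo) / yq
    bp = np.sort(bp[np.isfinite(bp) & (bp > 0)])

    # walk the sorted breakpoints to bracket the root of the increasing function phi'
    candidates = np.concatenate(([0.0], bp))
    lo, dlo = 0.0, dphi(0.0)
    if dlo >= 0:
        return 0.0
    hi = None
    for t in candidates[1:]:
        if dphi(t) >= 0:
            hi = t
            break
        lo, dlo = t, dphi(t)
    if hi is None:                 # the root lies beyond the last breakpoint
        hi = candidates[-1] + 1.0
        while dphi(hi) < 0:
            hi *= 2.0
    dhi = dphi(hi)
    # phi' is linear on the bracketing segment, so interpolation gives the exact root
    return lo + (hi - lo) * (-dlo) / (dhi - dlo)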
3. SEMI-SUPERVISED LINEAR SVMS

We now assume we have l labeled examples {x_i, y_i}_{i=1}^l and u unlabeled examples {x_j}_{j=1}^u, with x_i, x_j ∈ R^d and y_i ∈ {-1, +1}. Our goal is to construct a linear classifier sign(w^T x) that utilizes unlabeled data, typically in situations where l is much smaller than u. We present semi-supervised algorithms that provide l2-SVM-MFN the capability of dealing with unlabeled data.

3.1 Transductive SVM

Transductive SVM appends an additional term to the SVM objective function whose role is to drive the classification hyperplane towards low data density regions. Variations of this idea have appeared in the literature [5, 1, 6]. Since [5] appears to be the most natural extension of standard SVMs among these methods, and is popularly used in text classification applications, we will focus on developing its large scale implementation. The following optimization problem is set up for standard TSVM (the bias term is typically excluded from the regularizer here, but this factor is not expected to make any significant difference):

    min_{w, {y_j}_{j=1}^u}  (λ/2) ||w||² + (1/(2l)) Σ_{i=1}^l l(y_i w^T x_i) + (λ'/(2u)) Σ_{j=1}^u l(y_j w^T x_j)

    subject to:  (1/u) Σ_{j=1}^u max[0, sign(w^T x_j)] = r

where the hinge loss function, l(z) = l1(z) = max(0, 1 - z), is normally used. The labels on the unlabeled data, y_1 ... y_u, are {+1, -1}-valued variables in the optimization problem. In other words, TSVM seeks a hyperplane w and a labeling of the unlabeled examples so that the SVM objective function is minimized, subject to the constraint that a fraction r of the unlabeled data be classified positive. SVM margin maximization in the presence of unlabeled examples can be interpreted as an implementation of the cluster assumption.

In the optimization problem above, λ' is a user-provided parameter that provides control over the influence of unlabeled data. For example, if the data has distinct clusters with a large margin, but the cluster assumption does not hold, then λ' can be set to 0 and the standard SVM is retrieved. If there is enough labeled data, λ and λ' can be tuned by cross-validation. An initial estimate of r can be made from the fraction of labeled examples that belong to the positive class, and subsequent fine tuning can be done based on validation performance.

This optimization is implemented in [5] by first using an inductive SVM to label the unlabeled data and then iteratively switching labels and retraining SVMs to improve the objective function. The TSVM algorithm wraps around an SVM training procedure. The original (and widely popular) implementation of TSVM uses the SVMlight software. There, the training of SVMs in the inner loops of TSVM uses dual decomposition techniques. As shown by experiments in [8], in sparse, linear settings one can obtain significant speed improvements with l2-SVM-MFN over SVMlight. Thus, by implementing TSVM with l2-SVM-MFN, we expect similar improvements for semi-supervised learning on large, sparse datasets. Note that l2-SVM-MFN can also be used to speed up other TSVM formulations, e.g., [4], in such cases. The l2-SVM-MFN retraining steps in the inner loop of TSVM are typically executed extremely fast by using seeding techniques. Additionally, we also propose a version of TSVM where more than one pair of labels may be switched in each iteration. These speed-enhancement details are discussed in the following subsections.

3.1.1 Implementing TSVM Using l2-SVM-MFN

To develop the TSVM implementation with l2-SVM-MFN, we consider the TSVM objective function but with the L2-SVM loss function, l = l2. Note that this objective function can also be equivalently written in terms of the following loss over each unlabeled example x:

    min[l2(w^T x), l2(-w^T x)] = max[0, 1 - |w^T x|]²

Here, we pick the value of the label variable y that minimizes the loss on the unlabeled example x, and rewrite the loss in terms of the absolute value of the output of the classifier on x. This loss function is shown in Fig. 1.

[Figure 1: l2 loss function for TSVM.]

We note in passing that the l1 and l2 loss terms over unlabeled examples are very similar on the interval [-1, +1]. The non-convexity of this loss function implies that the TSVM training procedure is susceptible to local optima issues. In Section 3.2, we will outline a deterministic annealing procedure that can overcome this problem.

The TSVM algorithm with l2-SVM-MFN closely follows the presentation in [5]. A classifier is obtained by first running l2-SVM-MFN on just the labeled examples. Temporary labels are assigned to the unlabeled data by thresholding the soft outputs of this classifier so that the fraction of the total number of unlabeled examples that are temporarily labeled positive equals the parameter r. Then, starting from a small value of λ', the unlabeled data is gradually brought in by increasing λ' by a certain factor in the outer loop. This gradual increase of the influence of the unlabeled data is a way to protect TSVM from being immediately trapped in a local minimum. An inner loop identifies pairs of unlabeled examples with positive and negative temporary labels such that switching these labels would decrease the objective function. l2-SVM-MFN is then retrained with the switched labels.

3.1.2 Multiple Switching

The TSVM algorithm presented in [5] involves switching a single pair of labels at a time. We propose a variant where up to S pairs are switched such that the objective function improves. Here, S is a user-controlled parameter. Setting S = 1 recovers the original TSVM algorithm, whereas setting S = u/2 switches as many pairs as possible in the inner loop of TSVM. The implementation is conveniently done as follows (a sketch is given after this list):

1. Identify unlabeled examples with active indices and currently positive labels. Sort the corresponding outputs in ascending order. Let the sorted list be L+.

2. Identify unlabeled examples with active indices and currently negative labels. Sort the corresponding outputs in descending order. Let the sorted list be L-.

3. Pick pairs of elements, one from each list, from the top of these lists until either a pair is found such that the output from L+ is greater than the output from L-, or S pairs have been picked.

4. Switch the current labels of these pairs.
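The switching step above can be sketched as follows (a minimal sketch with illustrative names: outputs holds w^T x_j for the unlabeled examples with active indices, labels their current temporary labels in {+1, -1}, and S the maximum number of pairs to switch; S = 1 corresponds to the single-switch algorithm of [5]).

import numpy as np

def multiple_switch(outputs, labels, S):
    pos = np.where(labels == +1)[0]
    neg = np.where(labels == -1)[0]
    Lpos = pos[np.argsort(outputs[pos])]           # positive labels, outputs ascending (L+)
    Lneg = neg[np.argsort(-outputs[neg])]          # negative labels, outputs descending (L-)

    new_labels = labels.copy()
    n_switched = 0
    for i, j in zip(Lpos, Lneg):
        if n_switched >= S or outputs[i] > outputs[j]:
            break                                  # stop: no further pair qualifies for a switch
        new_labels[i], new_labels[j] = -1, +1      # switch this pair of temporary labels
        n_switched += 1
    return new_labels, n_switched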
Using arguments similar to Theorem 2 in [5], we can show that transductive l2-SVM-MFN with multiple-pair switching converges in a finite number of steps. We are unaware of any prior work that suggests and evaluates this simple multiple-pair switching heuristic. Our experimental results in Section 4 establish that this heuristic is remarkably effective in speeding up TSVM training while maintaining generalization performance.

3.1.3 Seeding

The effectiveness of l2-SVM-MFN on large sparse datasets, combined with the efficiency gained from seeding w in the re-training steps (after switching labels or after increasing λ'), makes this algorithm quite attractive. The complexity of transductive l2-SVM-MFN is O(n_switches t̄_mfn t̄_cgls n_0), where n_switches is the number of label switches. Typically, n_switches is expected to strongly depend on the data set and also on the number of labeled examples. Since it is difficult to estimate the number of switches a priori, this is an issue that is best understood from empirical observations.

3.2 Deterministic Annealing

The transductive SVM loss function over the unlabeled examples can be seen from Fig. 1 to be non-convex. This makes the TSVM optimization procedure susceptible to local minimum issues, causing a loss in its performance in many situations, e.g., as recorded in [3]. We now present a new algorithm based on deterministic annealing that can potentially overcome this problem while also being computationally very attractive for large scale applications. Deterministic Annealing (DA) [2, 9] is an established tool for combinatorial optimization that approaches the problem from information theoretic principles. The discrete variables in the optimization problem are relaxed to continuous probability variables, and a non-negative temperature parameter T is used to track the global optimum. We begin by re-writing the TSVM objective function as follows:

    w* = argmin_{w, {μ_j}_{j=1}^u}  (λ/2) ||w||² + (1/(2l)) Σ_{i=1}^l l2(y_i w^T x_i) + (λ'/(2u)) Σ_{j=1}^u [μ_j l2(w^T x_j) + (1 - μ_j) l2(-w^T x_j)]

Here, we introduce binary valued variables μ_j = (1 + y_j)/2.
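Before the relaxation below, it is easy to check numerically that minimizing this binary-variable loss over μ ∈ {0, 1} recovers the effective TSVM loss over an unlabeled example from Section 3.1.1, max[0, 1 - |w^T x|]². The snippet below is illustrative only, with o standing for the classifier output w^T x.

import numpy as np

def l2(z):
    return np.maximum(0.0, 1.0 - z) ** 2

outputs = np.linspace(-2.0, 2.0, 9)
binary_min = np.minimum(1 * l2(outputs) + 0 * l2(-outputs),    # mu = 1
                        0 * l2(outputs) + 1 * l2(-outputs))    # mu = 0
effective = np.maximum(0.0, 1.0 - np.abs(outputs)) ** 2
print(np.allclose(binary_min, effective))   # True: the two loss expressions coincide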
Let p_j ∈ [0, 1] denote the belief probability that the unlabeled example x_j belongs to the positive class. The Ising model motivates the following objective function, where we relax the binary variables μ_j to probability variables p_j and include entropy terms for the distributions defined by p_j (a multiclass extension would use the Potts glass model; there, one would have to append the entropy of the distribution over multiple classes to a multi-class objective function):

    w*_T = argmin_{w, {p_j}_{j=1}^u}  (λ/2) ||w||² + (1/(2l)) Σ_{i=1}^l l2(y_i w^T x_i) + (λ'/(2u)) Σ_{j=1}^u [p_j l2(w^T x_j) + (1 - p_j) l2(-w^T x_j)] + (T/(2u)) Σ_{j=1}^u [p_j log p_j + (1 - p_j) log (1 - p_j)]    (6)

Here, the "temperature" T parameterizes a family of objective functions. The objective function for a fixed T is minimized under the following class balancing constraint:

    (1/u) Σ_{j=1}^u p_j = r    (7)

where r is the fraction of the unlabeled examples belonging to the positive class. As in TSVM, r is treated as a user-provided parameter. It may also be estimated from the labeled examples. The solution to the optimization problem above is tracked as the temperature parameter T is lowered to 0. We monitor the value of the objective function along the optimization path and return the solution corresponding to the minimum value achieved.

To develop an intuition for the working of this method, we consider the loss term in the objective function associated with an unlabeled example as a function of the output of the classifier. Fig. 2 plots this loss term for various values of T. As the temperature is decreased, the loss function deforms from a squared-loss shape, where a global optimum is easier to achieve, to the TSVM loss function in Fig. 1. At high temperatures a global optimum is easier to obtain. The minimizer is then slowly tracked as the temperature is lowered towards zero.

[Figure 2: DA loss function parameterized by T (curves shown for decreasing T).]

The optimization is done in stages, starting with high values of T and then gradually decreasing T towards 0. For each T, the problem in Eqns. 6 and 7 is optimized by alternating the minimization over w and p = [p_1 ... p_u] respectively. Fixing p, the optimization over w is done by l2-SVM-MFN with seeding. Fixing w, the optimization over p can also be done easily, as described below. Both these problems involve convex optimization and can be done exactly and efficiently. We now provide some details.

3.2.1 Optimizing w

We describe the steps to efficiently implement the l2-SVM-MFN loop for optimizing w keeping p fixed. The call to l2-SVM-MFN is made on the data X̂ = [X^T X'^T X'^T]^T, whose first l rows (the block X) are formed by the labeled examples, and whose next 2u rows are formed by the unlabeled examples (the block X') appearing as two repeated blocks. The associated label vector and cost matrix are given by:

    Ŷ = [y_1, y_2, ..., y_l, 1, 1, ..., 1, -1, -1, ..., -1]   (u entries equal to +1 followed by u entries equal to -1)

    C = diag[1/l, ..., 1/l, λ' p_1 / u, ..., λ' p_u / u, λ' (1 - p_1) / u, ..., λ' (1 - p_u) / u]    (8)

Even though each unlabeled example contributes two terms to the objective function, effectively only one term contributes to the complexity. This is because matrix-vector products, which form the dominant expense in l2-SVM-MFN, are performed only on unique rows of a matrix; the output may be duplicated for duplicate rows. In fact, we can re-write the CGLS calls in l2-SVM-MFN so that the unlabeled examples appear only once in the data matrix.
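A sketch of how the augmented problem handed to l2-SVM-MFN in this w-step can be assembled, with the costs as reconstructed in Eqn. 8 (function and variable names are ours; a real implementation would avoid materializing the duplicated unlabeled block, as noted above).

import numpy as np

def build_augmented_problem(Xl, Yl, Xu, p, lam_prime):
    l, u = Xl.shape[0], Xu.shape[0]
    X_hat = np.vstack([Xl, Xu, Xu])                         # (l + 2u) x d
    Y_hat = np.concatenate([Yl, np.ones(u), -np.ones(u)])   # label vector of Eqn. 8
    c_hat = np.concatenate([np.full(l, 1.0 / l),            # cost 1/l for labeled rows
                            lam_prime * p / u,              # lam' * p_j / u
                            lam_prime * (1.0 - p) / u])     # lam' * (1 - p_j) / u
    return X_hat, Y_hat, c_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xl, Xu = rng.standard_normal((5, 8)), rng.standard_normal((20, 8))
    Yl = np.where(rng.standard_normal(5) > 0, 1.0, -1.0)
    p = np.full(20, 0.5)
    X_hat, Y_hat, c_hat = build_augmented_problem(Xl, Yl, Xu, p, lam_prime=1.0)
    print(X_hat.shape, Y_hat.shape, c_hat.shape)            # (45, 8) (45,) (45,)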
3.2.2 Optimizing p

For the latter problem of optimizing p for a fixed w, we construct the Lagrangian:

    L = (λ'/(2u)) Σ_{j=1}^u [p_j l2(w^T x_j) + (1 - p_j) l2(-w^T x_j)] + (T/(2u)) Σ_{j=1}^u [p_j log p_j + (1 - p_j) log (1 - p_j)] - ν [ (1/u) Σ_{j=1}^u p_j - r ]

Setting ∂L/∂p_j = 0, we get:

    p_j = 1 / (1 + e^{(g_j - 2ν)/T})    (9)

where g_j = λ' [l2(w^T x_j) - l2(-w^T x_j)]. Substituting this expression in the balance constraint in Eqn. 7, we get a one-dimensional non-linear equation in 2ν:

    (1/u) Σ_{j=1}^u 1 / (1 + e^{(g_j - 2ν)/T}) = r

The root is computed by using a hybrid combination of Newton-Raphson iterations and the bisection method, together with a carefully set initial value.

Table 1: Two-class datasets. d: data dimensionality, n̄_0: average sparsity, l + u: number of labeled and unlabeled examples, t: number of test examples, r: positive class ratio.

    Dataset    d      n̄_0    l + u    t      r
    aut-avn    20707  51.32  35588    35587  0.65
    real-sim   20958  51.32  36155    36154  0.31
    ccat       47236  75.93  17332    5787   0.46
    gcat       47236  75.93  17332    5787   0.30
    33-36      59072  26.56  41346    41346  0.49
    pcmac      7511   54.58  1460     486    0.51

3.2.3 Stopping Criteria

For a fixed T, the alternate minimization of w and p proceeds until some stopping criterion is satisfied. A natural criterion is the mean Kullback-Leibler divergence (relative entropy) KL(p, q) between the current values p_j and the values, say q_j, at the end of the last iteration. Thus the stopping criterion for fixed T is:

    KL(p, q) = (1/u) Σ_{j=1}^u [ p_j log (p_j / q_j) + (1 - p_j) log ((1 - p_j) / (1 - q_j)) ] < ε

for a small tolerance ε.
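A compact sketch of this p-step and the stopping check follows. It is a simplification of what is described above: plain bisection is used for the root in ν instead of the hybrid Newton-Raphson/bisection scheme, and all names (outputs, lam_prime, etc.) are illustrative.

import numpy as np

def l2(z):
    return np.maximum(0.0, 1.0 - z) ** 2

def optimize_p(outputs, lam_prime, T, r, tol=1e-10):
    g = lam_prime * (l2(outputs) - l2(-outputs))      # g_j in Eqn. 9

    def p_of(nu):
        # p_j = 1 / (1 + exp((g_j - 2*nu)/T)), exponent clipped for numerical stability
        return 1.0 / (1.0 + np.exp(np.clip((g - 2.0 * nu) / T, -500, 500)))

    def balance(nu):                                  # (1/u) sum_j p_j - r, increasing in nu
        return p_of(nu).mean() - r

    lo, hi = -1.0, 1.0                                # expand until the root is bracketed
    while balance(lo) > 0:
        lo *= 2.0
    while balance(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:                              # bisection on the balance equation
        mid = 0.5 * (lo + hi)
        if balance(mid) < 0:
            lo = mid
        else:
            hi = mid
    return p_of(0.5 * (lo + hi))

def mean_kl(p, q, eps=1e-12):
    # mean KL divergence used as the stopping criterion for a fixed T
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))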