Large Scale Semi-supervised Linear SVMs

Vikas Sindhwani, Department of Computer Science, University of Chicago, Chicago, IL 60637, USA (vikass@cs.uchicago.edu)
S. Sathiya Keerthi, Yahoo! Research, Media Studios North, Burbank, CA 91504, USA (selvarak@yahoo-inc.com)

ABSTRACT

Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In many information retrieval and data mining applications, linear classifiers are strongly preferred because of their ease of implementation, interpretability and empirical performance. In this work, we present a family of semi-supervised linear support vector classifiers that are designed to handle partially-labeled sparse datasets with possibly very large numbers of examples and features. At their core, our algorithms employ recently developed modified finite Newton techniques. Our contributions in this paper are as follows: (a) We provide an implementation of Transductive SVM (TSVM) that is significantly more efficient and scalable than currently used dual techniques, for linear classification problems involving large, sparse datasets. (b) We propose a variant of TSVM that involves multiple switching of labels. Experimental results show that this variant provides an order of magnitude further improvement in training efficiency. (c) We present a new algorithm for semi-supervised learning based on a Deterministic Annealing (DA) approach. This algorithm alleviates the problem of local minima in the TSVM optimization procedure while also being computationally attractive. We conduct an empirical study on several document classification tasks which confirms the value of our methods in large scale semi-supervised settings.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning; I.5.2 [Pattern Recognition]: Design Methodology--Classifier design and evaluation

General Terms
Algorithms, Performance, Experimentation

Keywords
Support Vector Machines, Unlabeled data, Global optimization, Text Categorization

1. INTRODUCTION

Consider the following situation: In a single web-crawl, search engines like Yahoo! and Google index billions of documents. Only a very small fraction of these documents can possibly be hand-labeled by human editorial teams and assembled into topic directories. In information retrieval relevance feedback, a user labels a small number of documents returned by an initial query as being relevant or not. The remaining documents form a massive collection of unlabeled data. Despite its natural and pervasive need, solutions to the problem of utilizing unlabeled data with labeled examples have only recently emerged in the machine learning literature. Whereas the abundance of unlabeled data is frequently acknowledged as a motivation in most papers, the true potential of semi-supervised learning in large scale settings is yet to be systematically explored. This appears to be partly due to the lack of scalable tools to handle large volumes of data.

In this paper, we propose extensions of linear Support Vector Machines (SVMs) for semi-supervised classification. Linear techniques are often the method of choice in many applications due to their simplicity and interpretability. When data appears in a rich high-dimensional representation, linear functions often provide a sufficiently complex hypothesis space for learning high-quality classifiers. This has been established, for example, for document classification with linear SVMs in numerous studies. Our methods are motivated by the intuition of margin maximization for semi-supervised SVMs [13, 5, 1, 6, 3, 4].
The key idea is to bias the classification hyperplane to pass through a low data density region, keeping points in each data cluster on the same side of the hyperplane while respecting labels. This approach uses an extended SVM objective function with a non-convex loss term over the unlabeled examples to implement the cluster assumption in semi-supervised learning (the assumption that points in a cluster should have similar labels; the role of unlabeled data is to identify clusters and high density regions in the input space). This idea is of historical importance as one of the first concrete proposals for learning from unlabeled data; its popular implementation in [5] is considered state-of-the-art in text categorization, even in the face of increasing recent competition.

We highlight the main contributions of this paper. (1) We outline an implementation for a variant of Transductive SVM [5] designed for linear semi-supervised classification on large, sparse datasets. As compared to currently used dual techniques (e.g., in the SVMlight implementation of TSVM), our method effectively exploits data sparsity and the linearity of the problem to provide superior scalability. Additionally, we propose a multiple switching heuristic that further improves TSVM training by an order of magnitude. These speed enhancements turn TSVM into a feasible tool for large scale applications. (2) We propose a novel algorithm for semi-supervised SVMs inspired by Deterministic Annealing (DA) techniques. This approach generates a family of objective functions whose non-convexity is controlled by an annealing parameter. The global minimizer is parametrically tracked in this family. This approach alleviates the problem of local minima in the TSVM optimization procedure, which results in better solutions on some problems. A computationally attractive training algorithm is presented that involves a sequence of alternating convex optimizations. (3) We conduct an experimental study on many document classification tasks with several thousands of examples and features. This study clearly shows the utility of our tools for large scale problems.

The modified finite Newton algorithm (abbreviated l2-SVM-MFN) of Keerthi and Decoste [8] for fast training of linear SVMs is a key subroutine for our algorithms. This paper is arranged as follows. In Section 2 we describe the l2-SVM-MFN algorithm, and we present its semi-supervised extensions in Section 3. Experimental results are reported in Section 4. Section 5 contains some concluding comments. A more detailed version of this paper is available at [11].
2. MODIFIED FINITE NEWTON LINEAR L2-SVM

The modified finite Newton l2-SVM method [8] (l2-SVM-MFN) is a recently developed training algorithm for linear SVMs that is ideally suited to sparse datasets with a large number of examples and possibly a large number of features. Given a binary classification problem with l labeled examples {x_i, y_i}_{i=1}^l, where the input patterns x_i ∈ R^d (e.g., documents) and the labels y_i ∈ {+1, -1}, l2-SVM-MFN provides an efficient primal solution to the following SVM optimization problem:

    w* = argmin_{w ∈ R^d}  (1/2) Σ_{i=1}^l l2(y_i w^T x_i) + (λ/2) ||w||²    (1)

where l2 is the l2-SVM loss given by l2(z) = max(0, 1 - z)², λ is a real-valued regularization parameter, and the final classifier is given by sign(w^T x). This objective function differs from the standard SVM problem in some respects. First, instead of using the hinge loss as the data fitting term, the square of the hinge loss (the so-called quadratic soft margin loss function) is used. This makes the objective function continuously differentiable, allowing easier applicability of gradient techniques. Secondly, the bias term ("b") is also regularized. In the problem formulation of Eqn. 1, it is implicitly assumed that an additional component in the weight vector and a constant feature in the example vectors have been added to indirectly incorporate the bias. This formulation combines the simplicity of a least squares aspect with algorithmic advantages associated with SVMs. We also note that all the discussion in this paper can be applied to other loss functions, such as Huber's loss and the rounded hinge loss, using the modifications outlined in [8].

We consider a version of l2-SVM-MFN where a weighted quadratic soft margin loss function is used:

    min_w  f(w) = (1/2) Σ_{i ∈ j(w)} c_i l2(y_i w^T x_i) + (λ/2) ||w||²    (2)

Here we have rewritten Eqn. 1 in terms of the support vector set j(w) = {i : y_i (w^T x_i) < 1}. Additionally, the loss associated with the i-th example has a cost c_i. f(w) refers to the objective function being minimized, evaluated at a candidate solution w. Note that if the index set j(w) were independent of w and ran over all data points, this would simply be the objective function for weighted linear regularized least squares (RLS). Following [8], we observe that f is a strictly convex, piecewise quadratic, continuously differentiable function having a unique minimizer. The gradient of f at w is given by:

    ∇f(w) = λ w + X_{j(w)}^T C_{j(w)} [X_{j(w)} w - Y_{j(w)}]

where X_{j(w)} is a matrix whose rows are the feature vectors of training points corresponding to the index set j(w), Y_{j(w)} is a column vector containing the labels for these points, and C_{j(w)} is a diagonal matrix that contains the costs c_i for these points along its diagonal.

l2-SVM-MFN is a primal algorithm that uses Newton's method for unconstrained minimization of a convex function. The classical Newton's method is based on a second order approximation of the objective function and involves updates of the following kind:

    w^{k+1} = w^k + δ^k n^k    (3)

where the step size δ^k ∈ R, and the Newton direction n^k ∈ R^d is given by n^k = -[∇²f(w^k)]^{-1} ∇f(w^k). Here, ∇f(w^k) is the gradient vector and ∇²f(w^k) is the Hessian matrix of f at w^k. However, the Hessian does not exist everywhere, since f is not twice differentiable at those weight vectors w where w^T x_i = y_i for some index i (in the neighborhood of such a w, the index i leaves or enters j(w); however, at such w, y_i w^T x_i = 1, so f is continuously differentiable in spite of these index jumps). Thus a generalized definition of the Hessian matrix is used.
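To make these definitions concrete, the following NumPy sketch (ours, not the authors' implementation; the names X, Y, c and lam are illustrative) evaluates the support vector set j(w), the weighted objective of Eqn. 2, and the gradient formula above.

import numpy as np

def active_set(w, X, Y):
    # j(w) = {i : y_i (w^T x_i) < 1}
    return (Y * (X @ w)) < 1.0

def objective(w, X, Y, c, lam):
    # f(w) = (1/2) sum_{i in j(w)} c_i (1 - y_i w^T x_i)^2 + (lam/2) ||w||^2
    J = active_set(w, X, Y)
    resid = 1.0 - Y[J] * (X[J] @ w)
    return 0.5 * np.dot(c[J], resid ** 2) + 0.5 * lam * np.dot(w, w)

def gradient(w, X, Y, c, lam):
    # grad f(w) = lam*w + X_j^T C_j (X_j w - Y_j), restricted to the active set
    J = active_set(w, X, Y)
    Xj, Yj, cj = X[J], Y[J], c[J]
    return lam * w + Xj.T @ (cj * (Xj @ w - Yj))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    Y = np.where(rng.standard_normal(100) > 0, 1.0, -1.0)
    c = np.full(100, 1.0 / 100)          # e.g. uniform costs c_i = 1/l
    w = np.zeros(20)
    print(objective(w, X, Y, c, lam=1e-2),
          np.linalg.norm(gradient(w, X, Y, c, lam=1e-2)))

With the index set held fixed, the objective reduces to the weighted RLS problem mentioned above, which is exactly what the Newton step described next exploits.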
The modified finite Newton procedure [8] proceeds as follows. The step w̄^k = w^k + n^k in the Newton direction can be seen to be given by solving the following linear system, associated with a weighted linear regularized least squares problem over the data subset defined by the indices j(w^k):

    [λ I + X_{j(w^k)}^T C_{j(w^k)} X_{j(w^k)}] w̄^k = X_{j(w^k)}^T C_{j(w^k)} Y_{j(w^k)}    (4)

where I is the identity matrix. Once w̄^k is obtained, w^{k+1} is obtained from Eqn. 3 by setting w^{k+1} = w^k + δ^k (w̄^k - w^k) after performing an exact line search for δ^k, i.e., by exactly solving the one-dimensional minimization problem:

    δ^k = argmin_{δ ≥ 0}  φ(δ) = f(w^k + δ (w̄^k - w^k))    (5)

The modified finite Newton procedure has the property of finite convergence to the optimal solution. The key features that bring scalability and numerical robustness to l2-SVM-MFN are: (a) solving the regularized least squares system of Eqn. 4 by a numerically well-behaved conjugate gradient scheme referred to as CGLS, which is designed for large, sparse data matrices X; the benefit of the least squares aspect of the loss function comes in here to provide access to a powerful set of tools in numerical computation. (b) Due to the one-sided nature of margin loss functions, these systems are required to be solved over only restricted index sets j(w), which can be much smaller than the whole dataset. This also allows additional heuristics to be developed, such as terminating CGLS early when working with a crude starting guess like 0, and allowing the following line search step to yield a point where the index set j(w) is small. Subsequent optimization steps then work on smaller subsets of the data. Below, we briefly discuss the CGLS and line search procedures. We refer the reader to [8] for full details.

2.1 CGLS

CGLS is a special conjugate-gradient solver that is designed to solve, in a numerically robust way, large, sparse, weighted regularized least squares problems such as the one in Eqn. 4. Starting with a guess solution, several specialized conjugate-gradient iterations are applied to get the w̄^k that solves Eqn. 4. The major expense in each iteration consists of two operations of the form X_{j(w^k)} p and X_{j(w^k)}^T q. If there are n_0 non-zero elements in the data matrix, these involve O(n_0) cost. It is worth noting that, as a subroutine of l2-SVM-MFN, CGLS is typically called on a small subset X_{j(w^k)} of the full data set. To compute the exact solution of Eqn. 4, r iterations are needed, where r is the rank of X_{j(w^k)}. But, in practice, such an exact solution is unnecessary: CGLS uses an effective stopping criterion for early termination. The total cost of CGLS is O(t_cgls n_0), where t_cgls is the number of iterations, which depends on the practical rank of X_{j(w^k)} and is typically found to be very small relative to the dimensions of X_{j(w^k)} (number of examples and features). The memory requirements of CGLS are also minimal: only five vectors need to be maintained, including the outputs over the currently active set of data points.
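As an illustration of step (a), the following self-contained sketch applies a plain conjugate gradient iteration to Eqn. 4, using only matrix-vector products with X_{j(w)} and its transpose. This is an assumption-laden stand-in rather than the authors' solver: the actual CGLS algorithm of [8] is a numerically more careful reformulation of the same iteration. Names (Xj, Yj, cj, lam) are illustrative.

import numpy as np

def cg_regularized_ls(Xj, Yj, cj, lam, w0=None, tol=1e-6, max_iter=200):
    """Solve (lam*I + Xj^T Cj Xj) w = Xj^T Cj Yj by conjugate gradient."""
    d = Xj.shape[1]
    w = np.zeros(d) if w0 is None else w0.copy()   # seeding: pass a warm start as w0

    def A(v):                                      # v -> (lam*I + Xj^T Cj Xj) v
        return lam * v + Xj.T @ (cj * (Xj @ v))

    b = Xj.T @ (cj * Yj)
    r = b - A(w)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:                      # early termination on a small residual
            break
        Ap = A(p)
        alpha = rs / (p @ Ap)
        w += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

Passing the previous solution as w0 corresponds to the seeding feature discussed next.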
Finally, an important feature of CGLS is worth emphasizing. Suppose the solution w of a regularized least squares problem is available, i.e., the linear system in Eqn. 4 has been solved using CGLS. If there is a need to solve a perturbed linear system, it is greatly advantageous in many settings to start the CG iterations for the new system with w as the initial guess. This is called seeding. If the starting residual is small, CGLS can converge much faster than with a guess of the 0 vector. The utility of this feature depends on the nature and degree of perturbation. In l2-SVM-MFN, the candidate solution w^k obtained after line search in iteration k is seeded for the CGLS computation of w̄^k. Also, in tuning λ over a range of values, it is valuable to seed the solution for a particular λ onto the next value. For the semi-supervised SVM implementations with l2-SVM-MFN, we will seed solutions across linear systems with slightly perturbed label vectors, data matrices and costs.

2.2 Line Search

Given the vectors w^k, w̄^k in some iteration of l2-SVM-MFN, the line search step requires us to solve Eqn. 5. The one-dimensional function φ(δ) is the restriction of the objective function f to the ray from w^k onto w̄^k. Hence, like f, φ(δ) is also a continuously differentiable, strictly convex, piecewise quadratic function with a unique minimizer. φ' is a continuous piecewise linear function whose root, δ^k, can be easily found by sorting the break points where its slope changes and then performing a sequential search on that sorted list. The cost of this operation is negligible compared to the cost of the CGLS iterations.

l2-SVM-MFN alternates between calls to CGLS and line searches. Its computational complexity is O(t_mfn t̄_cgls n_0), where t_mfn is the number of outer iterations of CGLS calls and line search, and t̄_cgls is the average number of CGLS iterations. These depend on the data set and the tolerance desired in the stopping criterion, but are typically very small. For example, on a text classification experiment involving 198788 examples and 252472 features, t_mfn = 11 and t̄_cgls = 102. Therefore, the complexity of l2-SVM-MFN is effectively linear in the number of entries in the data matrix.
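A minimal sketch of the exact line search of Section 2.2, written by us under the weighted objective of Eqn. 2 (variable names are illustrative): the breakpoints of φ' occur where an index enters or leaves j(w) along the ray, and φ' is linear between consecutive breakpoints, so once its sign change is bracketed the root follows by interpolation.

import numpy as np

def exact_line_search(w, w_bar, X, Y, c, lam):
    """Exact minimizer of phi(delta) = f(w + delta*(w_bar - w)), delta >= 0."""
    d = w_bar - w
    o = X @ w                      # current outputs
    q = X @ d                      # change of outputs along the ray
    yo, yq = Y * o, Y * q

    def dphi(delta):
        # phi'(delta) = lam*(w+delta*d).d + sum_{i active} c_i*(y_i(o_i+delta*q_i) - 1)*y_i*q_i
        margins = yo + delta * yq
        act = margins < 1.0
        return lam * np.dot(w + delta * d, d) + np.dot(c[act] * (margins[act] - 1.0), yq[act])

    # breakpoints: delta values where y_i (o_i + delta q_i) = 1
    with np.errstate(divide="ignore", invalid="ignore"):
        bp = (1.0 - yo) / yq
    bp = np.sort(bp[np.isfinite(bp) & (bp > 0)])

    # walk the sorted breakpoints to bracket the root of the increasing function phi'
    candidates = np.concatenate(([0.0], bp))
    lo, dlo = 0.0, dphi(0.0)
    if dlo >= 0:
        return 0.0
    hi = None
    for t in candidates[1:]:
        if dphi(t) >= 0:
            hi = t
            break
        lo, dlo = t, dphi(t)
    if hi is None:                 # the root lies beyond the last breakpoint
        hi = candidates[-1] + 1.0
        while dphi(hi) < 0:
            hi *= 2.0
    dhi = dphi(hi)
    # phi' is linear on the bracketing segment, so interpolation gives the exact root
    return lo + (hi - lo) * (-dlo) / (dhi - dlo)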
3. SEMI-SUPERVISED LINEAR SVMS

We now assume we have l labeled examples {x_i, y_i}_{i=1}^l and u unlabeled examples {x_j}_{j=1}^u, with x_i, x_j ∈ R^d and y_i ∈ {-1, +1}. Our goal is to construct a linear classifier sign(w^T x) that utilizes unlabeled data, typically in situations where l is much smaller than u. We present semi-supervised algorithms that provide l2-SVM-MFN the capability of dealing with unlabeled data.

3.1 Transductive SVM

Transductive SVM appends an additional term to the SVM objective function whose role is to drive the classification hyperplane towards low data density regions. Variations of this idea have appeared in the literature [5, 1, 6]. Since [5] appears to be the most natural extension of standard SVMs among these methods, and is popularly used in text classification applications, we will focus on developing its large scale implementation. The following optimization problem is set up for standard TSVM (the bias term is typically excluded from the regularizer here, but this factor is not expected to make any significant difference):

    min_{w, {y_j}_{j=1}^u}  (λ/2) ||w||² + (1/(2l)) Σ_{i=1}^l l(y_i w^T x_i) + (λ'/(2u)) Σ_{j=1}^u l(y_j w^T x_j)

    subject to:  (1/u) Σ_{j=1}^u max[0, sign(w^T x_j)] = r

where the hinge loss function, l(z) = l1(z) = max(0, 1 - z), is normally used. The labels on the unlabeled data, y_1 ... y_u, are {+1, -1}-valued variables in the optimization problem. In other words, TSVM seeks a hyperplane w and a labeling of the unlabeled examples so that the SVM objective function is minimized, subject to the constraint that a fraction r of the unlabeled data be classified positive. SVM margin maximization in the presence of unlabeled examples can be interpreted as an implementation of the cluster assumption.

In the optimization problem above, λ' is a user-provided parameter that provides control over the influence of unlabeled data. For example, if the data has distinct clusters with a large margin, but the cluster assumption does not hold, then λ' can be set to 0 and the standard SVM is retrieved. If there is enough labeled data, λ and λ' can be tuned by cross-validation. An initial estimate of r can be made from the fraction of labeled examples that belong to the positive class, and subsequent fine tuning can be done based on validation performance.

This optimization is implemented in [5] by first using an inductive SVM to label the unlabeled data and then iteratively switching labels and retraining SVMs to improve the objective function. The TSVM algorithm wraps around an SVM training procedure. The original (and widely popular) implementation of TSVM uses the SVMlight software. There, the training of SVMs in the inner loops of TSVM uses dual decomposition techniques. As shown by experiments in [8], in sparse, linear settings one can obtain significant speed improvements with l2-SVM-MFN over SVMlight. Thus, by implementing TSVM with l2-SVM-MFN, we expect similar improvements for semi-supervised learning on large, sparse datasets. Note that l2-SVM-MFN can also be used to speed up other TSVM formulations, e.g., [4], in such cases. The l2-SVM-MFN retraining steps in the inner loop of TSVM are typically executed extremely fast by using seeding techniques. Additionally, we also propose a version of TSVM where more than one pair of labels may be switched in each iteration. These speed-enhancement details are discussed in the following subsections.

3.1.1 Implementing TSVM Using l2-SVM-MFN

To develop the TSVM implementation with l2-SVM-MFN, we consider the TSVM objective function but with the L2-SVM loss function, l = l2. Note that this objective function can also be equivalently written in terms of the following loss over each unlabeled example x:

    min[l2(w^T x), l2(-w^T x)] = max[0, 1 - |w^T x|]²

Here, we pick the value of the label variable y that minimizes the loss on the unlabeled example x, and rewrite the loss in terms of the absolute value of the output of the classifier on x. This loss function is shown in Fig. 1.

[Figure 1: l2 loss function for TSVM.]

We note in passing that the l1 and l2 loss terms over unlabeled examples are very similar on the interval [-1, +1]. The non-convexity of this loss function implies that the TSVM training procedure is susceptible to local optima issues. In Section 3.2, we will outline a deterministic annealing procedure that can overcome this problem.

The TSVM algorithm with l2-SVM-MFN closely follows the presentation in [5]. A classifier is obtained by first running l2-SVM-MFN on just the labeled examples. Temporary labels are assigned to the unlabeled data by thresholding the soft outputs of this classifier so that the fraction of the total number of unlabeled examples that are temporarily labeled positive equals the parameter r. Then, starting from a small value of λ', the unlabeled data is gradually brought in by increasing λ' by a certain factor in the outer loop. This gradual increase of the influence of the unlabeled data is a way to protect TSVM from being immediately trapped in a local minimum. An inner loop identifies pairs of unlabeled examples with positive and negative temporary labels such that switching these labels would decrease the objective function. l2-SVM-MFN is then retrained with the switched labels.

3.1.2 Multiple Switching

The TSVM algorithm presented in [5] involves switching a single pair of labels at a time. We propose a variant where up to S pairs are switched such that the objective function improves. Here, S is a user-controlled parameter. Setting S = 1 recovers the original TSVM algorithm, whereas setting S = u/2 switches as many pairs as possible in the inner loop of TSVM. The implementation is conveniently done as follows (a sketch is given after this list):

1. Identify unlabeled examples with active indices and currently positive labels. Sort the corresponding outputs in ascending order. Let the sorted list be L+.

2. Identify unlabeled examples with active indices and currently negative labels. Sort the corresponding outputs in descending order. Let the sorted list be L-.

3. Pick pairs of elements, one from each list, from the top of these lists until either a pair is found such that the output from L+ is greater than the output from L-, or S pairs have been picked.

4. Switch the current labels of these pairs.
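The switching step above can be sketched as follows (a minimal sketch with illustrative names: outputs holds w^T x_j for the unlabeled examples with active indices, labels their current temporary labels in {+1, -1}, and S the maximum number of pairs to switch; S = 1 corresponds to the single-switch algorithm of [5]).

import numpy as np

def multiple_switch(outputs, labels, S):
    pos = np.where(labels == +1)[0]
    neg = np.where(labels == -1)[0]
    Lpos = pos[np.argsort(outputs[pos])]           # positive labels, outputs ascending (L+)
    Lneg = neg[np.argsort(-outputs[neg])]          # negative labels, outputs descending (L-)

    new_labels = labels.copy()
    n_switched = 0
    for i, j in zip(Lpos, Lneg):
        if n_switched >= S or outputs[i] > outputs[j]:
            break                                  # stop: no further pair qualifies for a switch
        new_labels[i], new_labels[j] = -1, +1      # switch this pair of temporary labels
        n_switched += 1
    return new_labels, n_switched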
Using arguments similar to Theorem 2 in [5], we can show that transductive l2-SVM-MFN with multiple-pair switching converges in a finite number of steps. We are unaware of any prior work that suggests and evaluates this simple multiple-pair switching heuristic. Our experimental results in Section 4 establish that this heuristic is remarkably effective in speeding up TSVM training while maintaining generalization performance.

3.1.3 Seeding

The effectiveness of l2-SVM-MFN on large sparse datasets, combined with the efficiency gained from seeding w in the re-training steps (after switching labels or after increasing λ'), makes this algorithm quite attractive. The complexity of transductive l2-SVM-MFN is O(n_switches t̄_mfn t̄_cgls n_0), where n_switches is the number of label switches. Typically, n_switches is expected to strongly depend on the data set and also on the number of labeled examples. Since it is difficult to estimate the number of switches a priori, this is an issue that is best understood from empirical observations.

3.2 Deterministic Annealing

The transductive SVM loss function over the unlabeled examples can be seen from Fig. 1 to be non-convex. This makes the TSVM optimization procedure susceptible to local minimum issues, causing a loss in its performance in many situations, e.g., as recorded in [3]. We now present a new algorithm based on deterministic annealing that can potentially overcome this problem while also being computationally very attractive for large scale applications. Deterministic Annealing (DA) [2, 9] is an established tool for combinatorial optimization that approaches the problem from information theoretic principles. The discrete variables in the optimization problem are relaxed to continuous probability variables, and a non-negative temperature parameter T is used to track the global optimum. We begin by re-writing the TSVM objective function as follows:

    w* = argmin_{w, {μ_j}_{j=1}^u}  (λ/2) ||w||² + (1/(2l)) Σ_{i=1}^l l2(y_i w^T x_i) + (λ'/(2u)) Σ_{j=1}^u [μ_j l2(w^T x_j) + (1 - μ_j) l2(-w^T x_j)]

Here, we introduce binary valued variables μ_j = (1 + y_j)/2.
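Before the relaxation below, it is easy to check numerically that minimizing this binary-variable loss over μ ∈ {0, 1} recovers the effective TSVM loss over an unlabeled example from Section 3.1.1, max[0, 1 - |w^T x|]². The snippet below is illustrative only, with o standing for the classifier output w^T x.

import numpy as np

def l2(z):
    return np.maximum(0.0, 1.0 - z) ** 2

outputs = np.linspace(-2.0, 2.0, 9)
binary_min = np.minimum(1 * l2(outputs) + 0 * l2(-outputs),    # mu = 1
                        0 * l2(outputs) + 1 * l2(-outputs))    # mu = 0
effective = np.maximum(0.0, 1.0 - np.abs(outputs)) ** 2
print(np.allclose(binary_min, effective))   # True: the two loss expressions coincide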
Let p_j ∈ [0, 1] denote the belief probability that the unlabeled example x_j belongs to the positive class. The Ising model motivates the following objective function, where we relax the binary variables μ_j to probability variables p_j and include entropy terms for the distributions defined by p_j (a multiclass extension would use the Potts glass model; there, one would have to append the entropy of the distribution over multiple classes to a multi-class objective function):

    w*_T = argmin_{w, {p_j}_{j=1}^u}  (λ/2) ||w||² + (1/(2l)) Σ_{i=1}^l l2(y_i w^T x_i) + (λ'/(2u)) Σ_{j=1}^u [p_j l2(w^T x_j) + (1 - p_j) l2(-w^T x_j)] + (T/(2u)) Σ_{j=1}^u [p_j log p_j + (1 - p_j) log (1 - p_j)]    (6)

Here, the "temperature" T parameterizes a family of objective functions. The objective function for a fixed T is minimized under the following class balancing constraint:

    (1/u) Σ_{j=1}^u p_j = r    (7)

where r is the fraction of the unlabeled examples belonging to the positive class. As in TSVM, r is treated as a user-provided parameter. It may also be estimated from the labeled examples. The solution to the optimization problem above is tracked as the temperature parameter T is lowered to 0. We monitor the value of the objective function along the optimization path and return the solution corresponding to the minimum value achieved.

To develop an intuition for the working of this method, we consider the loss term in the objective function associated with an unlabeled example as a function of the output of the classifier. Fig. 2 plots this loss term for various values of T. As the temperature is decreased, the loss function deforms from a squared-loss shape, where a global optimum is easier to achieve, to the TSVM loss function in Fig. 1. At high temperatures a global optimum is easier to obtain. The minimizer is then slowly tracked as the temperature is lowered towards zero.

[Figure 2: DA loss function parameterized by T (curves shown for decreasing T).]

The optimization is done in stages, starting with high values of T and then gradually decreasing T towards 0. For each T, the problem in Eqns. 6 and 7 is optimized by alternating the minimization over w and p = [p_1 ... p_u] respectively. Fixing p, the optimization over w is done by l2-SVM-MFN with seeding. Fixing w, the optimization over p can also be done easily, as described below. Both these problems involve convex optimization and can be done exactly and efficiently. We now provide some details.

3.2.1 Optimizing w

We describe the steps to efficiently implement the l2-SVM-MFN loop for optimizing w keeping p fixed. The call to l2-SVM-MFN is made on the data X̂ = [X^T X'^T X'^T]^T, whose first l rows (the block X) are formed by the labeled examples, and whose next 2u rows are formed by the unlabeled examples (the block X') appearing as two repeated blocks. The associated label vector and cost matrix are given by:

    Ŷ = [y_1, y_2, ..., y_l, 1, 1, ..., 1, -1, -1, ..., -1]   (u entries equal to +1 followed by u entries equal to -1)

    C = diag[1/l, ..., 1/l, λ' p_1 / u, ..., λ' p_u / u, λ' (1 - p_1) / u, ..., λ' (1 - p_u) / u]    (8)

Even though each unlabeled example contributes two terms to the objective function, effectively only one term contributes to the complexity. This is because matrix-vector products, which form the dominant expense in l2-SVM-MFN, are performed only on unique rows of a matrix; the output may be duplicated for duplicate rows. In fact, we can re-write the CGLS calls in l2-SVM-MFN so that the unlabeled examples appear only once in the data matrix.
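A sketch of how the augmented problem handed to l2-SVM-MFN in this w-step can be assembled, with the costs as reconstructed in Eqn. 8 (function and variable names are ours; a real implementation would avoid materializing the duplicated unlabeled block, as noted above).

import numpy as np

def build_augmented_problem(Xl, Yl, Xu, p, lam_prime):
    l, u = Xl.shape[0], Xu.shape[0]
    X_hat = np.vstack([Xl, Xu, Xu])                         # (l + 2u) x d
    Y_hat = np.concatenate([Yl, np.ones(u), -np.ones(u)])   # label vector of Eqn. 8
    c_hat = np.concatenate([np.full(l, 1.0 / l),            # cost 1/l for labeled rows
                            lam_prime * p / u,              # lam' * p_j / u
                            lam_prime * (1.0 - p) / u])     # lam' * (1 - p_j) / u
    return X_hat, Y_hat, c_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xl, Xu = rng.standard_normal((5, 8)), rng.standard_normal((20, 8))
    Yl = np.where(rng.standard_normal(5) > 0, 1.0, -1.0)
    p = np.full(20, 0.5)
    X_hat, Y_hat, c_hat = build_augmented_problem(Xl, Yl, Xu, p, lam_prime=1.0)
    print(X_hat.shape, Y_hat.shape, c_hat.shape)            # (45, 8) (45,) (45,)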
3.2.2 Optimizing p

For the latter problem of optimizing p for a fixed w, we construct the Lagrangian:

    L = (λ'/(2u)) Σ_{j=1}^u [p_j l2(w^T x_j) + (1 - p_j) l2(-w^T x_j)] + (T/(2u)) Σ_{j=1}^u [p_j log p_j + (1 - p_j) log (1 - p_j)] - ν [ (1/u) Σ_{j=1}^u p_j - r ]

Setting ∂L/∂p_j = 0, we get:

    p_j = 1 / (1 + e^{(g_j - 2ν)/T})    (9)

where g_j = λ' [l2(w^T x_j) - l2(-w^T x_j)]. Substituting this expression in the balance constraint in Eqn. 7, we get a one-dimensional non-linear equation in 2ν:

    (1/u) Σ_{j=1}^u 1 / (1 + e^{(g_j - 2ν)/T}) = r

The root is computed by using a hybrid combination of Newton-Raphson iterations and the bisection method, together with a carefully set initial value.

Table 1: Two-class datasets. d: data dimensionality, n̄_0: average sparsity, l + u: number of labeled and unlabeled examples, t: number of test examples, r: positive class ratio.

    Dataset    d      n̄_0    l + u    t      r
    aut-avn    20707  51.32  35588    35587  0.65
    real-sim   20958  51.32  36155    36154  0.31
    ccat       47236  75.93  17332    5787   0.46
    gcat       47236  75.93  17332    5787   0.30
    33-36      59072  26.56  41346    41346  0.49
    pcmac      7511   54.58  1460     486    0.51

3.2.3 Stopping Criteria

For a fixed T, the alternate minimization of w and p proceeds until some stopping criterion is satisfied. A natural criterion is the mean Kullback-Leibler divergence (relative entropy) KL(p, q) between the current values p_j and the values, say q_j, at the end of the last iteration. Thus the stopping criterion for fixed T is:

    KL(p, q) = (1/u) Σ_{j=1}^u [ p_j log (p_j / q_j) + (1 - p_j) log ((1 - p_j) / (1 - q_j)) ] < ε

for a small tolerance ε.
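A compact sketch of this p-step and the stopping check follows. It is a simplification of what is described above: plain bisection is used for the root in ν instead of the hybrid Newton-Raphson/bisection scheme, and all names (outputs, lam_prime, etc.) are illustrative.

import numpy as np

def l2(z):
    return np.maximum(0.0, 1.0 - z) ** 2

def optimize_p(outputs, lam_prime, T, r, tol=1e-10):
    g = lam_prime * (l2(outputs) - l2(-outputs))      # g_j in Eqn. 9

    def p_of(nu):
        # p_j = 1 / (1 + exp((g_j - 2*nu)/T)), exponent clipped for numerical stability
        return 1.0 / (1.0 + np.exp(np.clip((g - 2.0 * nu) / T, -500, 500)))

    def balance(nu):                                  # (1/u) sum_j p_j - r, increasing in nu
        return p_of(nu).mean() - r

    lo, hi = -1.0, 1.0                                # expand until the root is bracketed
    while balance(lo) > 0:
        lo *= 2.0
    while balance(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:                              # bisection on the balance equation
        mid = 0.5 * (lo + hi)
        if balance(mid) < 0:
            lo = mid
        else:
            hi = mid
    return p_of(0.5 * (lo + hi))

def mean_kl(p, q, eps=1e-12):
    # mean KL divergence used as the stopping criterion for a fixed T
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))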