Graph Transduction via Alternating Minimization

Jun Wang, Department of Electrical Engineering, Columbia University (jwang@ee.columbia.edu)
Tony Jebara, Department of Computer Science, Columbia University (jebara@cs.columbia.edu)
Shih-Fu Chang, Department of Electrical Engineering, Columbia University (sfchang@ee.columbia.edu)

Abstract

Graph transduction methods label input data by learning a classification function that is regularized to exhibit smoothness along a graph over labeled and unlabeled samples. In practice, these algorithms are sensitive to the initial set of labels provided by the user. For instance, classification accuracy drops if the training set contains weak labels, if imbalances exist across label classes, or if the labeled portion of the data is not chosen at random. This paper introduces a propagation algorithm that more reliably minimizes a cost function over both a function on the graph and a binary label matrix. The cost function generalizes prior work in graph transduction and also introduces node normalization terms for resilience to label imbalances. We demonstrate that global minimization of the function is intractable but instead provide an alternating minimization scheme that incrementally adjusts the function and the labels towards a reliable local minimum. Unlike prior methods, the resulting propagation of labels does not prematurely commit to an erroneous labeling and obtains more consistent labels. Experiments are shown for synthetic and real classification tasks including digit and text recognition. A substantial improvement in accuracy compared to state-of-the-art semi-supervised methods is achieved. The advantage is even more dramatic when labeled instances are limited.

1. Introduction

Graph transduction refers to a family of algorithms that achieve state-of-the-art performance in semi-supervised learning and classification. These methods incur a tradeoff between a classification function's accuracy on labeled examples and a regularizer term that encourages the function to remain smooth over a weighted graph connecting the data samples. The weighted graph and the minimized function ultimately propagate label information from labeled data to unlabeled data to provide the desired transductive predictions. Popular algorithms for graph transduction include the Gaussian fields and harmonic functions based method (GFHF) (Zhu et al., 2003) as well as the local and global consistency method (LGC) (Zhou et al., 2004). Other closely related methods include the manifold regularization framework proposed in (Sindhwani et al., 2005; Belkin et al., 2006), where graph Laplacian regularization terms are combined with regularized least squares (RLS) or support vector machine (SVM) function estimation criteria. These methods lead to graph-regularized variants denoted as Laplacian RLS (LapRLS) and Laplacian SVM (LapSVM), respectively.

For certain synthetic and real data problems, graph transduction approaches do achieve promising performance. However, this article identifies several realistic settings and labeling situations where this performance can be compromised. An alternative algorithm which generalizes the previous techniques is proposed by defining a joint iterative optimization over the classification function and a balanced label matrix. Even if one assumes the graph structures used in the above methods faithfully describe the data manifold, graph transduction algorithms may still be misled by problems in the label information.
Figure 1 depicts several cases where the label information leads to invalid graph transduction solutions for all the aforementioned algorithms. The top row of Figure 1 shows a separable pair of manifolds where unbalanced label information affects the propagation results. Although a clear separation region is visible between the two manifolds, the imbalance in the labels misleads the previous algorithms, which prefer assigning points to the class with the majority of labels. In the bottom row of Figure 1, a non-separable problem is shown where two otherwise separable manifolds are peppered with noisy outlier samples. Here, the outliers do not belong to either class but once again interfere with the propagation of label information. In both situations, conventional transductive learning approaches such as GFHF, LGC, LapRLS, and LapSVM fail to give acceptable labeling results.

In order to handle such situations, we extend the graph transduction optimization problem by casting it as a joint optimization over the classification function and the labels. The optimization is solved iteratively and remedies the instability previous methods seem to have with respect to the initial labeling. In our novel framework, initial labels simply act as the starting value of the label matrix variable, which is incrementally refined until convergence. The overall minimization over the continuous classification function and the binary label matrix proceeds by an alternating minimization over each variable separately and converges to a local minimum. Moreover, to handle the imbalanced label issue, a node regularizer term is introduced to balance the label matrix among different classes. These two fundamental changes to the graph transduction problem produce significantly better performance on both artificial and real datasets.

The remainder of this paper is organized as follows. In Section 2, we revisit the graph regularization framework of (Zhou et al., 2004) and extend it into a bivariate graph optimization problem. A corresponding algorithm is provided that solves the new optimization problem by iterative alternating minimization. Section 3 provides experimental validation for the algorithm on both toy and real classification datasets, including text classification and digit recognition. Comparisons with leading semi-supervised methods are made. Concluding remarks and a discussion are then provided in Section 4.

2. Graph Transduction

Consider the dataset $\mathcal{X} = (\mathcal{X}_l, \mathcal{X}_u)$ of labeled inputs $\mathcal{X}_l = \{x_1, \cdots, x_l\}$ and unlabeled inputs $\mathcal{X}_u = \{x_{l+1}, \cdots, x_n\}$ along with a small portion of corresponding labels $\{y_1, \cdots, y_l\}$, where $y_i \in \mathcal{L} = \{1, \cdots, c\}$. For transductive learning, the objective is to infer the labels $\{y_{l+1}, \cdots, y_n\}$ of the unlabeled data $\{x_{l+1}, \cdots, x_n\}$, where typically $l \ll n$. Graph transduction methods define an undirected graph represented by $G = \{\mathcal{X}, E\}$, where the set of nodes (vertices) is $\mathcal{X} = \{x_i\}$ and the set of edges is $E = \{e_{ij}\}$. Each sample $x_i$ is treated as a node on the graph and the weight of edge $e_{ij}$ is $w_{ij}$. Typically, one uses a kernel function $k(\cdot)$ over pairs of points to recover weights, in other words $w_{ij} = k(x_i, x_j)$, with the RBF kernel being a popular choice. The edge weights are collected in a weight matrix denoted by $W = \{w_{ij}\}$. Similarly, the node degree matrix $D = \mathrm{diag}([d_1, \cdots, d_n])$ is defined by $d_i = \sum_{j=1}^{n} w_{ij}$. The binary label matrix $Y$ is described as $Y \in \mathbb{B}^{n \times c}$ with $Y_{ij} = 1$ if $x_i$ has label $y_i = j$ and $Y_{ij} = 0$ otherwise.
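The graph construction just described (RBF edge weights, degree matrix, normalized Laplacian) maps directly onto a few matrix operations. The following sketch is our own illustration rather than code from the paper; the function name, the use of NumPy, the kNN sparsification, and the fixed kernel width sigma are all assumptions.

```python
import numpy as np

def build_graph(X, k=6, sigma=1.0):
    """Build a kNN graph with RBF weights and return W, D, and the
    normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    A minimal sketch; parameter choices are illustrative only."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(dist2, np.inf)          # exclude self-loops
    W = np.zeros((n, n))
    idx = np.argsort(dist2, axis=1)[:, :k]   # k nearest neighbors per node
    for i in range(n):
        W[i, idx[i]] = np.exp(-dist2[i, idx[i]] / (2 * sigma**2))
    W = np.maximum(W, W.T)                   # symmetrize the graph
    d = W.sum(axis=1)
    D = np.diag(d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
    return W, D, L
```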
This article will often refer to row and column vectors of such matrices; for instance, the $i$-th row and $j$-th column vectors of $Y$ are denoted as $Y_{i\cdot}$ and $Y_{\cdot j}$, respectively. The graph Laplacian is defined as $\Delta = D - W$ and the normalized graph Laplacian is $L = D^{-1/2} \Delta D^{-1/2} = I - D^{-1/2} W D^{-1/2}$.

2.1. Consistent Label Propagation

Graph based semi-supervised learning methods propagate label information from labeled nodes to unlabeled nodes by treating all samples as nodes in a graph and using edge-based affinity functions between all pairs of nodes to estimate the weight of each edge. Most methods then define a continuous classification function $F \in \mathbb{R}^{n \times c}$ that is estimated on the graph to minimize a cost function. The cost function typically enforces a tradeoff between the smoothness of the function on the graph over both labeled and unlabeled data and the accuracy of the function at fitting the label information for the labeled nodes. Such is the case for a large variety of graph based semi-supervised learning techniques, ranging from the mincuts method (Blum & Chawla, 2001) to the Gaussian fields and harmonic functions (GFHF) method and the local and global consistency (LGC) method. A detailed survey of these methods is available in (Zhu, 2005).

In trading off smoothness for accuracy, both the GFHF and LGC approaches attempt to preserve consistency on the data manifold during the optimization of the classification function. The loss function for both methods involves the additive contribution of two penalty terms, the global smoothness $Q_{smooth}$ and local fitness $Q_{fit}$, as shown below:

$$F^{*} = \arg\min_{F} Q(F) = \arg\min_{F} \left\{ Q_{smooth}(F) + Q_{fit}(F) \right\} \quad (1)$$

Figure 1. A demonstration with artificial data of the sensitivity graph transduction exhibits for certain initial label settings. The top row shows how imbalanced labels adversely affect even a well-separated 2D two-moon dataset. The bottom row shows a 3D two-moon dataset where graph transduction is again easily misled by the introduction of a cloud of outliers. Large markers indicate known labels and the two-color small markers represent the predicted classification results. Columns depict the results from (a) the GFHF method (Zhu et al., 2003); (b) the LGC method (Zhou et al., 2004); (c) the LapRLS method (Belkin et al., 2006); (d) the LapSVM method (Belkin et al., 2006); and (e) our method (GTAM).

In particular, recall that LGC uses an elastic regularizer framework with the following cost function (Zhou et al., 2004):

$$Q(F) = \frac{1}{2} \left( \sum_{i,j=1}^{n} w_{ij} \left\| \frac{F_{i\cdot}}{\sqrt{D_{ii}}} - \frac{F_{j\cdot}}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \left\| F_{i\cdot} - Y_{i\cdot} \right\|^2 \right) \quad (2)$$

where the coefficient $\mu$ balances the global smoothness and local fitting penalty terms. If we set $\mu = \infty$ and use a standard graph Laplacian for the smoothness term, the above framework reduces to the harmonic function formulation shown in (Zhu et al., 2003).
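To make Eq. (2) concrete, the sketch below evaluates the two penalty terms for a candidate classification function F. It is an illustrative reading of the cost, assuming the 1/2 factor covers both terms as in (Zhou et al., 2004); the names and the NumPy implementation are ours.

```python
import numpy as np

def lgc_cost(F, Y, W, mu):
    """Evaluate the LGC cost of Eq. (2): a degree-normalized smoothness
    penalty plus a label-fitting penalty weighted by mu. Illustrative only."""
    d = W.sum(axis=1)
    Fn = F / np.sqrt(d)[:, None]                 # rows F_i. / sqrt(D_ii)
    diff = Fn[:, None, :] - Fn[None, :, :]       # pairwise row differences
    Q_smooth = np.sum(W * np.sum(diff**2, axis=2))
    Q_fit = mu * np.sum((F - Y)**2)
    return 0.5 * (Q_smooth + Q_fit)
```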
While the LGC and GFHF formulations remain popular and have been empirically validated in the past, it is possible to discern some key limitations. First, the optimization can be broken up into separate parallel problems since the cost function decomposes into terms that only depend on individual columns of the matrix $F$. Because each column of $F$ indexes the labeling of a single class, such a decomposition reveals that biases may arise if the input labels are disproportionately imbalanced. In practice, both propagation algorithms tend to prefer predicting the class with the majority of labels. Second, both learning algorithms are extremely dependent on the initial labels provided in $Y$. This is seen in practice but can also be explained mathematically by the fact that $Y$ starts off extremely sparse with many unknown terms. Third, when the graph contains background noise that makes the class manifolds nonseparable, these graph transduction approaches fail to output reasonable classification results. These difficulties were illustrated in Figure 1 and seem to plague many graph transduction approaches. However, the proposed method, graph transduction via alternating minimization (GTAM), appears resilient.

To address these problems, we make modifications to the cost function in Eq. 1. The first is to explicitly show the optimization over both the classification function $F$ and the binary label matrix $Y$:

$$(F^{*}, Y^{*}) = \arg\min_{F \in \mathbb{R}^{n \times c},\, Y \in \mathbb{B}^{n \times c}} Q(F, Y) \quad (3)$$

where $\mathbb{B}^{n \times c}$ is the set of all binary matrices $Y$ of size $n \times c$ that satisfy $\sum_j Y_{ij} = 1$ and, for the labeled data $x_i \in \mathcal{X}_l$, $Y_{ij} = 1$ if $y_i = j$. More specifically, our loss function is:

$$Q(F, Y) = \frac{1}{2}\,\mathrm{tr}\left( F^{T} L F + \mu (F - VY)^{T} (F - VY) \right) \quad (4)$$

where we have introduced the matrix $V$, a node regularizer that balances the influence of labels from different classes. The matrix $V = \mathrm{diag}(v)$ is a function of the current label matrix $Y$:

$$v = \sum_{j=1}^{c} \frac{Y_{\cdot j} \odot D\mathbf{1}}{Y_{\cdot j}^{T} D \mathbf{1}} \quad (5)$$

where the symbol $\odot$ denotes the Hadamard product and the column vector $\mathbf{1} = [1 \cdots 1]^{T}$. This node regularizer permits us to work with a normalized version of the label matrix $Z$, defined as $Z = VY$. By definition, the normalized label matrix satisfies $\sum_i Z_{ij} = 1$. Using the normalized label matrix $Z$ in the graph regularization allows labeled nodes with high degree to contribute more during the graph diffusion and label propagation process. However, the total diffusion of each class is kept equal and normalized to be one. Therefore, the influence of different classes is balanced even if the given class labels are imbalanced. If class proportion information is known a priori, it can be integrated by scaling the diffusion with the prior class proportion. However, because of the nature of graph transduction and unknown class prior knowledge, equal class balancing generally leads to more reliable solutions than label proportional weighting. This intuition is in line with prior work that uses class proportion information in transductive inference, such as (Chapelle et al., 2007), where class proportion is enforced as a hard constraint on the labels, or (Mann & McCallum, 2007), where such information is used as a regularizer. We next discuss the alternating minimization procedure, which is the key modification to the overall framework.

2.2. Alternating Minimization Procedure

In our proposed graph regularization framework, the cost function involves two variables to be optimized. While simultaneously recovering both solutions is intractable due to the mixed integer programming problem over binary $Y$ and continuous $F$, we propose a greedy alternating minimization approach.
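Before detailing the two alternating updates, here is a small sketch of how the node regularizer of Eq. (5) and the normalized label matrix Z = VY might be computed; both are used by the updates that follow. The function name and looping style are our own assumptions, not the authors' code.

```python
import numpy as np

def node_regularizer(Y, D):
    """Eq. (5): v = sum_j (Y_.j * D1) / (Y_.j^T D 1).
    Returns v and the normalized label matrix Z = diag(v) Y."""
    d = np.diag(D)                       # node degrees, i.e. D1
    v = np.zeros(Y.shape[0])
    for j in range(Y.shape[1]):          # accumulate per-class contribution
        mass = Y[:, j] @ d               # total degree of nodes labeled j
        if mass > 0:
            v += (Y[:, j] * d) / mass
    Z = np.diag(v) @ Y                   # each column of Z sums to one
    return v, Z
```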
The first update, of the continuous classification function $F$, is straightforward since the resulting cost function is convex and unconstrained, allowing us to recover the optimal $F$ by setting the partial derivative $\partial Q / \partial F$ to zero. However, since $Y \in \mathbb{B}^{n \times c}$ is a binary matrix subject to linear constraints of the form $\sum_j Y_{ij} = 1$, the other step in our alternating minimization requires solving a linearly constrained max cut problem, which is NP-hard (Karp, 1972). Due to the alternating minimization outer loop, investigating guaranteed approximation schemes (Goemans & Williamson, 1995) for the constrained max cut problem over $Y$ is unjustified, because the solution depends on the dynamically varying classification function $F$ during the alternating minimization procedure. Instead, we use a greedy gradient based approach to incrementally update $Y$ while keeping the classification function $F$ at its corresponding optimal setting. Moreover, because the node regularizer $V$ normalizes the labeled data, we also interleave updates of $V$ based on the revised $Y$.

Minimization for F: The classification function $F \in \mathbb{R}^{n \times c}$ is continuous and its loss terms are convex, allowing the minimum to be recovered by zeroing the partial derivative:

$$\frac{\partial Q}{\partial F} = 0 \;\Rightarrow\; L F^{*} + \mu (F^{*} - VY) = 0 \;\Rightarrow\; F^{*} = (L/\mu + I)^{-1} VY = P VY \quad (6)$$

where we denote $P = (L/\mu + I)^{-1}$ as the propagation matrix and assume the graph is symmetrically built (i.e., $L = L^{T}$).

Greedy minimization of Y: To update $Y$, first replace $F$ in Eq. 4 by its optimal value $F^{*}$ from the solution of Eq. 6:

$$Q(Y) = \frac{1}{2}\,\mathrm{tr}\Big( Y^{T} V^{T} P^{T} L P V Y + \mu (PVY - VY)^{T} (PVY - VY) \Big) = \frac{1}{2}\,\mathrm{tr}\Big( Y^{T} V^{T} \big[ P^{T} L P + \mu (P^{T} - I)(P - I) \big] V Y \Big) \quad (7)$$

The optimization still involves the node regularizer $V$ of Eq. 5, which depends on $Y$ and normalizes the label matrix over columns. Due to the dependence on the current estimate of $F$ and $V$, only an incremental step is taken greedily to reduce $Q(Y)$. In each iteration, we find the position $(i^{*}, j^{*})$ in the matrix $Y$ and change the binary value of $Y_{i^{*}j^{*}}$ from 0 to 1. The direction with the largest negative gradient guides our choice of binary step on $Y$. Therefore, we need to evaluate $\nabla_{Y} Q$ and find its largest negative value to determine $(i^{*}, j^{*})$. Note that setting $Y_{i^{*}j^{*}} = 1$ is equivalent to a similar operation on the normalized label matrix $Z$ of setting $Z_{i^{*}j^{*}} = \varepsilon$ for some $0 < \varepsilon < 1$, and $Y$, $Z$ have a one-to-one correspondence. Thus, the greedy optimization of $Q$ with respect to $Y$ is equivalent to greedy minimization of $Q$ with respect to $Z$. More formally, writing the gradients as $\nabla_{Y} Q$ and $\nabla_{Z} Q$, straightforward algebra shows that:

$$(i^{*}, j^{*}) = \arg\min_{i,j} \nabla_{Y_{ij}} Q = \arg\min_{i,j} \nabla_{Z_{ij}} Q \quad (8)$$

We can then rewrite the loss function using the variable $Z$ as:

$$Q(Z) = \frac{1}{2}\,\mathrm{tr}\Big( Z^{T} \big[ P^{T} L P + \mu (P^{T} - I)(P - I) \big] Z \Big) = \frac{1}{2}\,\mathrm{tr}\big( Z^{T} A Z \big) \quad (9)$$

where $A = P^{T} L P + \mu (P^{T} - I)(P - I)$. Notice that $A$ is symmetric if the graph is symmetrically built. We derive the gradient of the above loss function with respect to $Z$ and recover it with respect to $Y$ as:

$$\nabla_{Z} Q = A Z = A V Y \quad (10)$$

As described earlier, we search the gradient matrix $\nabla_{Z} Q$ for the minimal element to update:

$$(i^{*}, j^{*}) = \arg\min_{x_i \in \mathcal{X}_u,\, 1 \le j \le c} \nabla_{Z_{ij}} Q \quad (11)$$

We then update the label matrix by setting $Y_{i^{*} j^{*}} = 1$. Because of the binary nature of $Y$, we simply set $Y_{i^{*} j^{*}} = 1$ instead of using a continuous gradient step. In the $(t+1)$-th iteration, the node regularizer $v^{t+1}$ can be recalculated with the updated $Y^{t+1}$. The update of $Y$ is indeed greedy. Therefore, it could oscillate and backtrack from predicted labelings in previous iterations without convergence guarantees.
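The closed-form update of Eq. (6) and the greedy label update of Eqs. (10) and (11) reduce to a few matrix operations. The sketch below performs one such greedy step, restricting the search to rows that are still unlabeled (the bookkeeping that justifies this restriction is explained next); it reuses the node_regularizer sketch above, and all names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def gtam_step(A, Y, D, unlabeled_mask):
    """One greedy GTAM update: compute grad_Z Q = A V Y (Eq. 10), pick the
    most negative entry among unlabeled rows (Eq. 11), and set that entry
    of Y to 1. Returns the updated Y, the chosen (i*, j*), and the mask."""
    _, Z = node_regularizer(Y, D)        # Z = VY, see the earlier sketch
    grad = A @ Z                          # Eq. (10)
    masked = np.where(unlabeled_mask[:, None], grad, np.inf)
    i_star, j_star = np.unravel_index(np.argmin(masked), masked.shape)
    Y = Y.copy()
    Y[i_star, :] = 0
    Y[i_star, j_star] = 1                 # commit the new label
    unlabeled_mask = unlabeled_mask.copy()
    unlabeled_mask[i_star] = False        # never revisit this node
    return Y, (i_star, j_star), unlabeled_mask
```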
We propose a straightforward way to guarantee convergence and avoid backtracking, inconsistency, or unstable oscillation in the greedy propagation of labels. Once an unlabeled point has been labeled, its labeling can no longer be changed. Thus, we remove the most recently labeled point $(i^{*}, j^{*})$ from future consideration and only permit the algorithm to search for the minimal gradient entries corresponding to the remaining unlabeled examples. To avoid changing the labeling of previous predictions, the newly labeled node $x_{i^{*}}$ is removed from $\mathcal{X}_u$ and added to $\mathcal{X}_l$. In the following, we summarize the updating rules from step $t$ to $t+1$ in the alternating minimization scheme. Although the optimal $F$ is being computed in each iteration as shown in Eq. 6, we do not explicitly need to update it; instead, it is implicitly used in Eq. 8 to directly update $Y$:

$$\begin{aligned}
\nabla_{Z} Q^{t} &= A\,\mathrm{diag}(v^{t})\,Y^{t} \\
(i^{*}, j^{*}) &= \arg\min_{x_i \in \mathcal{X}_u^{t},\, 1 \le j \le c} \nabla_{Z_{ij}} Q^{t} \\
Y^{t+1}_{i^{*} j^{*}} &= 1 \\
v^{t+1} &= \sum_{j=1}^{c} \frac{Y^{t+1}_{\cdot j} \odot D\mathbf{1}}{(Y^{t+1}_{\cdot j})^{T} D \mathbf{1}} \\
\mathcal{X}_l^{t+1} &\leftarrow \mathcal{X}_l^{t} + x_{i^{*}}, \qquad \mathcal{X}_u^{t+1} \leftarrow \mathcal{X}_u^{t} - x_{i^{*}}
\end{aligned} \quad (12)$$

The procedure above repeats until all points have been labeled.

2.3. Algorithm Summary and Convergence

From the above discussion, our method is unique in that it optimizes the loss function over both the continuous-valued $F$ space and the binary-valued $Y$ space. Starting from a few given labels, the method iteratively and greedily updates the label matrix $Y$, the node regularizer $v$, and the gradient matrix $\nabla_{Z} Q$. In each individual iteration, new labeled samples are obtained to drive a better graph propagation in the next iteration. In our approach, we directly acquire new labels instead of calculating $F$ and then mapping it to $Y$, which is the regular procedure in other graph transduction methods like LGC and GFHF. This unique feature makes the proposed algorithm very efficient since we only update the gradient matrix $\nabla_{Z} Q$ in each iteration. Furthermore, similar to the graph superposition approach introduced in (Wang et al., 2008), the calculation of the node regularizer $v$ and the gradient matrix $\nabla_{Z} Q$ can be made more efficient by incremental updating as a result of the newly gained labels. Due to the greedy assignment, the algorithm can only loop the alternating minimization (or, equivalently, the gradient computation) at most $n - l$ times. The update of the graph gradient, finding the largest element in the gradient, and the matrix algebra involved can be done efficiently by modifying only a single entry in $Y$ per loop. Each minimization step over $F$ and $Y$ thus requires $O(n^2)$ time and the total runtime of the greedy GTAM algorithm is $O(n^3)$. Empirically, the value of the loss function $Q$ decreases rapidly in the first dozen iterations and achieves steady convergence afterward. This phenomenon indicates that the label propagation loop could be stopped early by solving for the labels from the optimized $F$ (Eq. 6) after only a few iterations. Algorithm 1 summarizes the proposed GTAM method.

Algorithm 1: Graph Transduction via Alternating Minimization (GTAM)
Input: dataset $\mathcal{X} = \{x_1, \cdots, x_l, x_{l+1}, \cdots, x_n\}$, labeled subset $\mathcal{X}_l = \{x_1, \cdots, x_l\}$, unlabeled subset $\mathcal{X}_u = \{x_{l+1}, \cdots, x_n\}$, labels $\{y_1, \cdots, y_l\}$, where $y_i \in \mathcal{L} = \{1, \cdots, c\}$; affinity matrix $W = \{w_{ij}\}$; node degree matrix $D$; initial label matrix $Y^{0}$.
Initialization: iteration counter $t = 0$; normalized graph Laplacian $L = D^{-1/2} \Delta D^{-1/2}$; propagation matrix $P = (L/\mu + I)^{-1}$; matrix $A = P^{T} L P + \mu(P^{T} - I)(P - I)$; node regularizer $v^{0} = \sum_{j=1}^{c} \frac{Y^{0}_{\cdot j} \odot D\mathbf{1}}{(Y^{0}_{\cdot j})^{T} D\mathbf{1}}$.
repeat
  Compute the graph gradient: $Z^{t} = \mathrm{diag}(v^{t}) Y^{t}$, $\nabla_{Z} Q^{t} = A Z^{t}$;
  Find the optimal element in $\nabla_{Z} Q^{t}$: $(i^{*}, j^{*}) = \arg\min_{x_i \in \mathcal{X}_u^{t},\, 1 \le j \le c} \nabla_{Z_{ij}} Q^{t}$;
  Update the label matrix to obtain $Y^{t+1}$ by setting $Y^{t+1}_{i^{*} j^{*}} = 1$; also set $y_{i^{*}} = j^{*}$;
  Update the node regularizer: $v^{t+1} = \sum_{j=1}^{c} \frac{Y^{t+1}_{\cdot j} \odot D\mathbf{1}}{(Y^{t+1}_{\cdot j})^{T} D\mathbf{1}}$;
  Remove $x_{i^{*}}$ from $\mathcal{X}_u$: $\mathcal{X}_u^{t+1} \leftarrow \mathcal{X}_u^{t} - x_{i^{*}}$; add $x_{i^{*}}$ to $\mathcal{X}_l$: $\mathcal{X}_l^{t+1} \leftarrow \mathcal{X}_l^{t} + x_{i^{*}}$;
  Update the iteration counter: $t = t + 1$;
until $\mathcal{X}_u^{t} = \emptyset$
Output: the labels of the unlabeled samples $\{y_{l+1}, \cdots, y_n\}$.
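Putting the pieces together, the following compact sketch mirrors Algorithm 1: it precomputes the propagation matrix P and the matrix A, then repeats the greedy step until every point is labeled. It reuses the build_graph, node_regularizer, and gtam_step sketches above; the interface and the default mu = 99 follow our reading of the text and are assumptions, not the authors' implementation.

```python
import numpy as np

def gtam(X, labeled_idx, labels, num_classes, mu=99.0, k=6, sigma=1.0):
    """Graph Transduction via Alternating Minimization (illustrative sketch).
    labeled_idx: indices of labeled points; labels: their classes in {0..c-1}."""
    n = X.shape[0]
    W, D, L = build_graph(X, k=k, sigma=sigma)
    P = np.linalg.inv(L / mu + np.eye(n))                          # Eq. (6)
    A = P.T @ L @ P + mu * (P.T - np.eye(n)) @ (P - np.eye(n))     # Eq. (9)
    Y = np.zeros((n, num_classes))
    Y[labeled_idx, labels] = 1                                     # initial Y^0
    unlabeled = np.ones(n, dtype=bool)
    unlabeled[labeled_idx] = False
    while unlabeled.any():                                         # main loop
        Y, _, unlabeled = gtam_step(A, Y, D, unlabeled)
    return Y.argmax(axis=1)                                        # predicted labels
```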
3. Experiments

In this section, we demonstrate the superiority of the proposed GTAM method in comparison to state-of-the-art semi-supervised learning methods over both synthetic and real data. For instance, on the WebKB data, previous work shows that LapSVM and LapRLS are better than other semi-supervised approaches, such as transductive SVMs (TSVM) (Joachims, 1999). Therefore, we only compare our method with LapRLS, LapSVM, and two related methods, LapRLSjoint and LapSVMjoint (Sindhwani et al., 2005). In all experiments, we used the same parameter settings reported in the literature. The GTAM approach only requires a single parameter $\mu$, which controls the tradeoff between the global smoothness and local fitting terms in the cost function. Although our experiments show that GTAM is fairly robust to the setting of $\mu$, we set $\mu = 99$ throughout all experiments.

For all real implementations of graph-based methods, one needs a construction method that builds a graph from the training data $\mathcal{X}$, which involves a procedure for computing the weight of links via a kernel or similarity function. Typically, practitioners use RBF kernels for image recognition and cosine distances for text classification (Zhou et al., 2004; Ng et al., 2001; Chapelle et al., 2003; Hein & Maier, 2006). However, finding adequate parameters for the kernel or similarity function, such as the RBF kernel size $\sigma$, is not always straightforward, particularly if labeled data is scarce. Empirical evidence has shown that the propagation results depend strongly on the kernel parameter selection. Motivated by the approach reported in (Hein & Maier, 2006), we use an adaptive kernel size based on the mean distance of the k-nearest neighborhoods (k = 6) for the experiments on the real USPS handwritten digit data. On the WebKB data, we use the same graph construction suggested by (Sindhwani et al., 2005). For each dataset, the same graph is used for all the compared transductive learning approaches.
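One simple way to realize the adaptive kernel size mentioned above is to set the RBF width from each point's mean distance to its k nearest neighbors. The sketch below is a heuristic illustration in the spirit of (Hein & Maier, 2006); the paper does not spell out the exact formula, so the use of a single global width is our assumption.

```python
import numpy as np

def adaptive_sigma(X, k=6):
    """Estimate an RBF kernel width as the mean k-nearest-neighbor distance.
    A heuristic sketch; the exact adaptive rule used in the paper is assumed."""
    sq = np.sum(X**2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    np.fill_diagonal(dist, np.inf)
    knn_dist = np.sort(dist, axis=1)[:, :k]   # distances to the k nearest neighbors
    return knn_dist.mean()
```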
3.1. Two Moon Synthetic Data

Figure 1 illustrated synthetic experiments on 2D and 3D two-moon data. Despite the near-perfect classification results reported on such datasets in the literature (Zhou et al., 2004), we showed how small perturbations to the problem can have adverse effects on the prior algorithms. The prior methods are overly sensitive to the locations of the initial labels, the ratio of the two class labels, and the level of ambient noise or outliers.

A more thorough experimental study is also possible for the two-moon data by exploring the effect of class imbalance. We start by fixing one class to have one observed label and select $r$ labels from the other class. Here, $r$ is also the imbalance ratio and the range we explore is $1 \le r \le 20$. These experiments use the 3D noisy two-moon data, which contain 300 positive and 300 negative sample points as well as 200 additional background noise samples. Multiple rounds of tests (100 trials) are evaluated for each imbalance condition by calculating the average prediction accuracy on the relevant 600 samples. For a fair comparison, we use the same graph Laplacian, which is constructed using kNN (k = 6) neighbors with RBF weights. Moreover, the parameter $\alpha$ for LGC is set as $\alpha = 0.99$. The parameters for LapRLS and LapSVM are $\gamma_A = 1$, $\gamma_I = 1$. Figure 2 demonstrates the performance advantage of the proposed GTAM approach versus the LGC, GFHF, LapRLS, and LapSVM methods.

Figure 2. Performance comparison of LGC, GFHF, LapRLS, LapSVM, and GTAM on noisy 3D two-moon data. Only one label is given for one class, while the other class has a varying number of labels, shown as the imbalance ratio on the horizontal axis: (a) the mean of the test error; (b) the standard deviation of the test error.

From the figure, we can conclude that all four strawman approaches are extremely sensitive to the initial labels and the label class imbalance: none of them produces perfect accuracy, and the error rates of LGC and GFHF increase dramatically as the label classes become more imbalanced, even though more information is being provided to the algorithm. GTAM, however, is clearly superior, achieving the best accuracy regardless of the imbalance ratio and despite contamination with noisy samples. In fact, only 1 or 2 of the 100 trials for each individual setting of $r$ were imperfect using the GTAM method.

3.2. WebKB Dataset

For validation on real data, we first evaluated our method using the WebKB dataset, which has been widely used in semi-supervised learning experiments (Joachims, 2003; Sindhwani et al., 2005). The WebKB dataset contains two document categories, course and non-course. Each document has two types of information: the webpage text content, called the page representation, and the link or pointer representation. For a fair comparison, we applied the same feature extraction procedure as discussed in (Sindhwani et al., 2005) and obtained 1051 samples with 1840-dimensional page attributes and 3000 link attributes. The graph was built based on cosine-distance neighbors with Gaussian weights (the number of nearest neighbors is 200, as in (Sindhwani et al., 2005)). We compared our method with four of the best known approaches: LapRLS, LapSVM, and the two problem-specific methods, LapRLSjoint and LapSVMjoint, reported in (Sindhwani et al., 2005). All the compared approaches used the same graph construction procedure and all parameter settings were set according to (Sindhwani et al., 2005), in other words $\gamma_A = 10^{-6}$, $\gamma_I = 0.01$. We varied the number of labeled data points to measure the performance gain with increasing supervision. For each fixed number of labeled samples, 100 random trials were tested. The means of the test errors are shown in Figure 3.
Figure 3. Performance comparison on text classification (WebKB dataset). The horizontal axis represents the number of randomly observed labels (guaranteeing there is at least one label for each class). The vertical axis shows the average error rate over 100 random trials.

As Figure 3 reveals, the proposed GTAM method achieved significantly better accuracy than all the other methods, except for the extreme case when only four labeled samples were available. The performance gain grows rapidly as the number of labeled samples increases, although in some cases the error rate does not drop monotonically.

3.3. USPS Digit Data

We also evaluated the proposed method on an image recognition task. Specifically, we used the data in (Zhou et al., 2004) for handwritten digit classification experiments. To evaluate the algorithms, we reveal a subset of the labels (randomly chosen and guaranteeing at least one labeled example is available for each digit). We compared our method with LGC, GFHF, LapRLS, and LapSVM. The error rates are calculated as the average over 20 trials.

Figure 4. Performance comparison on handwritten digit classification (USPS database). The horizontal axis shows the total number of randomly observed labels (guaranteeing there is at least one label for each class). The vertical axis shows the average error rate over 20 random trials.

From Figure 4, we can conclude that GTAM significantly improved the classification accuracy compared to the other approaches, especially when very few labeled samples are available. The mean error rates of GTAM are consistently low for different numbers of labels and the standard deviation values are also very small (on the order of $10^{-4}$). This demonstrates that the GTAM method is insensitive to the number and locations of the initially given labels. Only 1% of the test digit images were mislabeled. These failure cases are presented in Figure 5 and are often ambiguous or extremely poorly drawn digits. Compared to the performance on the WebKB dataset shown in Figure 3, the USPS digit experiments exhibit even more promising results. One possible reason is that the USPS digit dataset has relatively more samples (3874) and a lower feature dimensionality (256), compared to the WebKB dataset (1051 samples with 4840-dimensional features). Therefore the graph construction procedure is more reliable and the estimation of graph gradients in our algorithm is more robust.

Figure 5. USPS handwritten digit samples which are incorrectly classified.

4. Conclusion and Discussion

Existing graph-based transductive learning methods hinge on good labeling information and can easily be misled if the labels are not distributed evenly across classes, if the choice of initial label locations is varied, or if excessive noise or outliers corrupt the underlying manifold structure. These degenerate settings seem to plague real world problems as well, compromising the performance of state-of-the-art graph transduction methods. Our experiments over synthetic datasets (two-moon data) and real datasets (USPS digits and WebKB) confirm the shortcomings of existing tools. This article addresses these shortcomings and proposes a novel graph-based semi-supervised learning method, graph transduction via alternating minimization (GTAM). Therein, both the classification function and the label matrix are treated as variables in a cost function that is iteratively minimized. While the optimal classification function can be estimated exactly, greedy optimization is applied to update the label matrix. The algorithm iterates an alternating minimization between both variables and is guaranteed to converge via the greedy scheme. In each individual iteration, the unlabeled node with the largest cost reduction, as measured through the graph gradient, is labeled. We gradually update the label matrix by adding more labeled samples while keeping the classification function at its optimal setting.
Furthermore, we enforce normalization of the label matrix to avoid degeneracies. This results in an algorithm that can cope with all the aforementioned degeneracies and in practice achieves significant gains in accuracy while remaining efficient and cubic in the number of samples. Future work will include out-of-sample extensions of this method such that new data points can be added to the training and test set without requiring a full retraining procedure.

5. Acknowledgments

The authors would like to thank Mr. Yongtao Su and Shouzhen Liu for their valuable comments. We also thank Mr. Wei Liu and Deli Zhao for the collection of the artificial data. TJ was supported by ONR Award N000140710507 (ModNo: 07PR04918-00) and NSF Award IIS-0347499.

References

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7, 2399-2434.

Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. Proc. 18th ICML (pp. 19-26).

Chapelle, O., Sindhwani, V., & Keerthi, S. (2007). Branch and bound for semi-supervised support vector machines. Proc. of NIPS.

Chapelle, O., Weston, J., & Scholkopf, B. (2003). Cluster kernels for semi-supervised learning. Proc. NIPS, 15, 1.

Goemans, M., & Williamson, D. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42, 1115-1145.

Hein, M., & Maier, M. (2006). Manifold denoising. Proc. NIPS, 19.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. Proc. of the ICML, 200-209.

Joachims, T. (2003). Transductive learning via spectral graph partitioning. Proc. of ICML, 290-297.

Karp, R. (1972). Reducibility among combinatorial problems. Complexity of Computer Computations, 43, 85-103.

Mann, G., & McCallum, A. (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. Proc. of the ICML, 593-600.

Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Proc. NIPS, 14, 849-856.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005). Beyond the point cloud: From transductive to semi-supervised learning. Proc. of ICML.

Wang, J., Chang, S.-F., Zhou, X., & Wong, T. C. S. (2008). Active microscopic cellular image annotation by superposable graph transduction with imbalanced labels. IEEE CVPR, Alaska, USA.

Zhou, D., Bousquet, O., Lal, T., Weston, J., & Scholkopf, B. (2004). Learning with local and global consistency. Proc. NIPS (pp. 321-328).

Zhu, X. (2005). Semi-supervised learning literature survey (Technical Report 1530). Computer Sciences, University of Wisconsin-Madison.

Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. Proc. 20th ICML.