Regularized Co-Clustering with Dual Supervision

Vikas Sindhwani, Jianying Hu, Aleksandra Mojsilovic
IBM Research, Yorktown Heights, NY 10598
{vsindhw, jyhu, aleksand}@us.ibm.com

Abstract

By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms.

1 Introduction

Consider the setting where we are given large amounts of unlabeled data together with dual supervision in the form of a few labeled examples as well as a few labeled features, and the goal is to estimate an unknown classification function. This setting arises naturally in numerous applications. Imagine, for example, the problem of inferring sentiment ("positive" versus "negative") associated with presidential candidates from online political blog posts represented as word vectors, given the following: (a) a vast collection of blog posts easily downloadable from the web (unlabeled examples), (b) a few blog posts whose sentiment for a candidate is manually identified (labeled examples), and (c) prior knowledge of words that reflect positive (e.g., 'superb') and negative (e.g., 'awful') sentiment (labeled features).

Most existing semi-supervised algorithms do not explicitly incorporate feature supervision. They typically implement the cluster assumption [3] by learning decision boundaries such that unlabeled points belonging to the same cluster are given the same label, while empirical loss over labeled examples is concurrently minimized. In situations where the classes are predominantly supported on unknown subsets of similar features, it is clear that feature supervision can potentially illuminate the true cluster structure inherent in the unlabeled examples over which the cluster assumption ought to be enforced. Even when feature supervision is not available, there is ample empirical evidence in numerous recent papers in the co-clustering literature (see, e.g., [5, 1] and references therein) suggesting that the clustering of columns (features) of a data matrix can lead to massive improvements in the quality of row (example) clustering. An intuitive explanation is that column clustering enforces a form of dimensionality reduction or implicit regularization that is responsible for the performance enhancements observed in many applications such as text clustering, microarray data analysis and video content mining [1].
In this paper, we utilize data-dependent co-clustering regularizers for semi-supervised learning in the presence of partial dual supervision. Our starting point is the spectral bipartite graph partitioning approach of [5], which we briefly review in Section 2.1. This approach effectively applies spectral clustering to a graph representation of the data matrix and is also intimately related to the Singular Value Decomposition. In Section 2.2 we review an equivalence between this approach and a matrix approximation objective function that is minimized under orthogonality constraints [6]. By dropping the orthogonality constraints but imposing non-negativity constraints, one is led to a large family of co-clustering algorithms that arise from the non-negative matrix factorization literature.

Based on the algorithmic intuitions embodied in the algorithms above, we develop two semi-supervised classification algorithms that extend the spectral bipartite graph partitioning approach and the matrix approximation approach respectively. We start with Reproducing Kernel Hilbert Spaces (RKHSs) defined over both row and column spaces. These RKHSs are then coupled through co-clustering regularizers. In the first algorithm, we directly adopt the graph Laplacian regularizer constructed from the bipartite graph of [5] and include it as a row and column smoothing term in the standard regularization objective function. The solution is obtained by solving a convex optimization problem. This approach may be viewed as a modification of the Manifold Regularization framework [2] where we now jointly learn row and column classification functions. In the second algorithm proposed in this paper, we instead add a (non-convex) matrix approximation term to the objective function, which is then minimized using a block coordinate descent procedure. Unlike their unsupervised counterparts, our methods support dual supervision and naturally possess an out-of-sample extension. In Section 4, we provide experimental results where we compare against various baseline approaches and highlight the performance benefits of feature supervision.

2 Co-Clustering Algorithms

Let X denote the data matrix with n data points and d features. The methods that we discuss in this section output a row partition function π_r : {i}_{i=1}^{n} → {j}_{j=1}^{m_r} and a column partition function π_c : {i}_{i=1}^{d} → {j}_{j=1}^{m_c} that give cluster assignments to row and column indices respectively. Here, m_r is the desired number of row clusters and m_c is the desired number of column clusters. Below, by x_i we mean the ith example (row) and by f_j we mean the jth column (feature) in the data matrix.

2.1 Bipartite Graph Partitioning

In the co-clustering technique introduced by [5], the data matrix is modeled as a bipartite graph with examples (rows) as one set of nodes and features (columns) as another. An edge (i, j) exists if feature f_j assumes a non-zero value in example x_i, in which case the edge is given a weight of X_ij. This bipartite graph is undirected and there are no inter-example or inter-feature edges. The adjacency matrix W and the normalized Laplacian [4] M of this graph are given by,

$$ W = \begin{pmatrix} 0 & X \\ X^T & 0 \end{pmatrix}, \qquad M = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}} \qquad (1) $$

where D is the diagonal degree matrix defined by $D_{ii} = \sum_j W_{ij}$ and I is the (n + d) × (n + d) identity matrix. Guided by the premise that column clustering induces row clustering while row clustering induces column clustering, [5] propose to find an optimal partitioning of the nodes of the bipartite graph.
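As a concrete illustration of Eqn. 1, the following sketch (ours, not the authors' code; the function name and the dense representation are our own choices) builds W and M from a non-negative data matrix X:

```python
import numpy as np

def bipartite_laplacian(X, eps=1e-12):
    """Adjacency matrix W and normalized Laplacian M of the bipartite
    row/column graph of a non-negative data matrix X (Eqn. 1)."""
    n, d = X.shape
    # Rows and columns form the two node sets; edge (i, j) has weight X_ij.
    W = np.zeros((n + d, n + d))
    W[:n, n:] = X
    W[n:, :n] = X.T
    # Degrees and symmetric normalization M = I - D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, eps))
    M = np.eye(n + d) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    return W, M
```

For large, sparse text matrices one would of course use a sparse representation of W and M rather than the dense arrays shown here.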
This method is restricted to obtaining co-clusterings where m_r = m_c = m. The m-partitioning is obtained by minimizing a relaxation of the normalized cut objective function using standard spectral clustering techniques. This reduces to first constructing a spectral representation of rows and columns given by the smallest eigenvectors of M, and then performing standard k-means clustering on this representation, to finally obtain the partition functions π_r, π_c. Due to the special structure of Eqn. 1, it can be shown that the spectral representation used in this algorithm is related to the singular vectors of a normalized version of X.

2.2 Matrix Approximation Formulation

In [6] it is shown that bipartite spectral graph partitioning is closely related to solving the following matrix approximation problem,

$$ (F_r^\star, F_c^\star) = \operatorname{argmin}_{F_r^T F_r = I,\; F_c^T F_c = I} \; \| X - F_r F_c^T \|_{fro}^2 $$

where F_r is an n × m matrix and F_c is a d × m matrix. Once the minimization is performed, π_r(i) = argmax_j (F_r^\star)_{ij} and π_c(i) = argmax_j (F_c^\star)_{ij}. In a non-negative matrix factorization approach, the orthogonality constraints are dropped to make the optimization easier, while non-negativity constraints F_r, F_c ≥ 0 are introduced with the goal of lending better interpretability to the solutions. There are numerous multiplicative update algorithms for NMF which essentially have the flavor of alternating non-convex optimization. In our empirical comparisons in Section 4, we use the Alternating Constrained Least Squares (ACLS) approach of [12]. In Section 3.2 we consider a 3-factor non-negative matrix approximation to incorporate unequal values of m_r and m_c, and to improve the quality of the approximation. See [7, 13] for more details on matrix tri-factorization based formulations for co-clustering.

3 Objective Functions for Regularized Co-clustering with Dual Supervision

Let us consider examples x to be elements of a row space R ⊆ ℝ^d, and the vector of values taken by a feature f to be a data point in a column space C ⊆ ℝ^n. Our goal is to learn partition functions defined over the entire row and column spaces (as opposed to matrix indices), i.e., π_r : R → {i}_{i=1}^{m_r} and π_c : C → {i}_{i=1}^{m_c}. For this purpose, let us introduce k_r : R × R → ℝ to be the row kernel that defines an associated RKHS H_r. Similarly, k_c : C × C → ℝ denotes the column kernel whose associated RKHS is H_c. Below, we define π_r, π_c using these real-valued function spaces.

Consider a simultaneous assignment of rows into m_r classes and columns into m_c classes. For any data point x, denote $F_r(x) = [f_r^1(x) \cdots f_r^{m_r}(x)]^T \in \mathbb{R}^{m_r}$ to be a vector whose elements are soft class assignments, where $f_r^j \in H_r$ for all j. For the given n data points, denote F_r to be the n × m_r class assignment matrix. Correspondingly, F_c(f) is defined for any feature f ∈ C, and F_c denotes the associated d × m_c column class assignment matrix. Additionally, we are given dual supervision in the form of label matrices $Y_r \in \mathbb{R}^{n \times m_r}$ and $Y_c \in \mathbb{R}^{d \times m_c}$, where (Y_r)_{ij} = 1 specifies that the ith example is labeled with class j (similarly for the feature label matrix Y_c). The associated row sum for a labeled point is 1. Unlabeled points have all-zero rows, and their row sums are therefore 0. Let J_r (J_c) denote a diagonal matrix of size n × n (d × d) whose diagonal entry is 1 for labeled examples (features) and 0 otherwise. By I_s we will denote an identity matrix of size s × s. We use the notation tr(A) to mean the trace of the matrix A.
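To make these labeling conventions concrete, here is a small sketch (ours; the -1 encoding for unlabeled points and the function name are assumptions) of how Y_r and J_r can be assembled from partially labeled rows; Y_c and J_c are built identically from feature labels:

```python
import numpy as np

def build_label_matrices(labels, num_classes):
    """labels: length-n array with class indices in {0, ..., num_classes-1}
    for labeled points and -1 for unlabeled points.
    Returns Y (one-hot rows for labeled points, all-zero rows otherwise)
    and the diagonal selector matrix J (1 for labeled, 0 for unlabeled)."""
    n = len(labels)
    Y = np.zeros((n, num_classes))
    J_diag = np.zeros(n)
    for i, y in enumerate(labels):
        if y >= 0:
            Y[i, y] = 1.0      # row sum is 1 for a labeled point
            J_diag[i] = 1.0    # unlabeled points contribute no loss
    return Y, np.diag(J_diag)
```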
3.1 Manifold Regularization with Bipartite Graph Laplacian (MR)

In this approach, we set up the following optimization problem,

$$ \operatorname{argmin}_{F_r \in H_r^{m_r},\, F_c \in H_c^{m_c}} \;\; \frac{\gamma_r}{2} \sum_{i=1}^{m_r} \| f_r^i \|_{H_r}^2 + \frac{\gamma_c}{2} \sum_{i=1}^{m_c} \| f_c^i \|_{H_c}^2 + \frac{1}{2} \mathrm{tr}\!\left[ (F_r - Y_r)^T J_r (F_r - Y_r) \right] + \frac{1}{2} \mathrm{tr}\!\left[ (F_c - Y_c)^T J_c (F_c - Y_c) \right] + \frac{\mu}{2} \mathrm{tr}\!\left[ \begin{pmatrix} F_r \\ F_c \end{pmatrix}^T M \begin{pmatrix} F_r \\ F_c \end{pmatrix} \right] \qquad (2) $$

The first two terms impose the usual RKHS norm on the class indicator functions for rows and columns. The middle two terms measure squared loss on labeled data. The final term measures the smoothness of the row and column indicator functions with respect to the bipartite graph introduced in Section 2.1; this term also incorporates unlabeled examples and features. Here γ_r, γ_c, µ are real-valued parameters that trade off the various regularization terms. Clearly, by the Representer Theorem the solution has the form,

$$ f_r^j(x) = \sum_{i=1}^{n} \alpha_{ij}\, k_r(x, x_i), \;\; 1 \le j \le m_r, \qquad f_c^j(f) = \sum_{i=1}^{d} \beta_{ij}\, k_c(f, f_i), \;\; 1 \le j \le m_c \qquad (3) $$

Let α, β denote the corresponding optimal expansion coefficient matrices. Then, plugging Eqn. 3 into Eqn. 2 and solving the optimization problem, the solution is easily seen to be given by,

$$ \left[ \begin{pmatrix} \gamma_r I_n & 0 \\ 0 & \gamma_c I_d \end{pmatrix} + \left( \begin{pmatrix} J_r & 0 \\ 0 & J_c \end{pmatrix} + \mu M \right) \begin{pmatrix} K_r & 0 \\ 0 & K_c \end{pmatrix} \right] \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} J_r Y_r \\ J_c Y_c \end{pmatrix} \qquad (4) $$

where K_r, K_c are Gram matrices over data points and features respectively. The partition functions are then defined by

$$ \pi_r(x) = \operatorname{argmax}_{1 \le j \le m} \sum_{i=1}^{n} \alpha_{ij}\, k_r(x, x_i), \qquad \pi_c(f) = \operatorname{argmax}_{1 \le j \le m} \sum_{i=1}^{d} \beta_{ij}\, k_c(f, f_i) \qquad (5) $$

As in Section 2.1, we assume m_r = m_c = m. If the linear system above is solved by explicitly computing the matrix inverse, the computational cost is O((n + d)^3 + (n + d)^2 m). This approach is closely related to the Manifold Regularization framework of [2], and may be viewed as a modification of the Laplacian Regularized Least Squares (LAPRLS) algorithm, which uses a Euclidean nearest-neighbor row similarity graph to capture the manifold structure in the data. Instead of using the squared loss, one can develop variants using the SVM hinge loss or the logistic loss function. One can also use a large family of graph regularizers derived from the graph Laplacian [3]. In particular, we use the iterated Laplacian of the form M^p where p is an integer.

3.2 Matrix Approximation under Dual Supervision (MA)

We now consider an alternative objective function where, instead of the graph Laplacian regularizer, we add a penalty term that measures how well the data matrix is approximated by a tri-factorization F_r Q F_c^T,

$$ \operatorname{argmin}_{F_r \in H_r^{m_r},\, F_c \in H_c^{m_c},\, Q \in \mathbb{R}^{m_r \times m_c}} \;\; \frac{\gamma_r}{2} \sum_{i=1}^{m_r} \| f_r^i \|_{H_r}^2 + \frac{\gamma_c}{2} \sum_{i=1}^{m_c} \| f_c^i \|_{H_c}^2 + \frac{1}{2} \mathrm{tr}\!\left[ (F_r - Y_r)^T J_r (F_r - Y_r) \right] + \frac{1}{2} \mathrm{tr}\!\left[ (F_c - Y_c)^T J_c (F_c - Y_c) \right] + \frac{\mu}{2} \| X - F_r Q F_c^T \|_{fro}^2 \qquad (6) $$

As before, the first two terms above enforce smoothness, the third and fourth terms measure squared loss over labels, and the final term enforces co-clustering. The classical Representer Theorem (Eqn. 3) can again be applied since the above objective function only depends on point evaluations and RKHS norms of functions in H_r, H_c. The optimal expansion coefficient matrices α, β in this case are obtained by solving,

$$ \operatorname{argmin}_{\alpha, \beta, Q} \; \mathcal{J}(\alpha, \beta, Q) = \frac{\gamma_r}{2} \mathrm{tr}\!\left[ \alpha^T K_r \alpha \right] + \frac{\gamma_c}{2} \mathrm{tr}\!\left[ \beta^T K_c \beta \right] + \frac{1}{2} \mathrm{tr}\!\left[ (K_r \alpha - Y_r)^T J_r (K_r \alpha - Y_r) \right] + \frac{1}{2} \mathrm{tr}\!\left[ (K_c \beta - Y_c)^T J_c (K_c \beta - Y_c) \right] + \frac{\mu}{2} \| X - K_r \alpha Q \beta^T K_c \|_{fro}^2 \qquad (7) $$

This problem is not convex in α, β, Q. We propose a block coordinate descent algorithm for the problem above: keeping two of the variables fixed, the optimization over the other is a convex problem with a unique solution. This guarantees monotonic decrease of the objective function and convergence to a stationary point.
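To make the alternating scheme concrete, the following is a rough, self-contained sketch (ours, not the authors' implementation) of the block coordinate descent for Eqn. 7. It anticipates the update equations (8)-(13) derived next, and for readability it solves the small Kronecker-structured systems by direct dense factorization; the paper instead uses conjugate gradients. The initialization is our own assumption.

```python
import numpy as np

def vec(A):                            # column-major vectorization (MATLAB A(:))
    return A.reshape(-1, order='F')

def unvec(a, shape):
    return a.reshape(shape, order='F')

def ma_objective(alpha, beta, Q, Kr, Kc, X, Yr, Yc, Jr, Jc, gr, gc, mu):
    """The objective J(alpha, beta, Q) of Eqn. 7."""
    Fr, Fc = Kr @ alpha, Kc @ beta
    return (0.5 * gr * np.trace(alpha.T @ Kr @ alpha)
            + 0.5 * gc * np.trace(beta.T @ Kc @ beta)
            + 0.5 * np.trace((Fr - Yr).T @ Jr @ (Fr - Yr))
            + 0.5 * np.trace((Fc - Yc).T @ Jc @ (Fc - Yc))
            + 0.5 * mu * np.linalg.norm(X - Fr @ Q @ Fc.T, 'fro') ** 2)

def update_Q(alpha, beta, Kr, Kc, X):
    """Closed-form Q update (Eqn. 8)."""
    A = alpha.T @ Kr @ Kr @ alpha
    B = beta.T @ Kc @ Kc @ beta
    C = alpha.T @ Kr @ X @ Kc @ beta
    return np.linalg.solve(A, np.linalg.solve(B.T, C.T).T)   # A^-1 C B^-1

def update_alpha(beta, Q, Kr, Kc, X, Yr, Jr, gr, mu):
    """Sylvester-type update for alpha (Eqns. 9 and 12), dense version."""
    n, mr = Kr.shape[0], Q.shape[0]
    Zc = Q @ beta.T @ Kc @ Kc @ beta @ Q.T                    # Eqn. 11
    A = np.kron(np.eye(mr), gr * np.eye(n) + Jr @ Kr) + mu * np.kron(Zc, Kr)
    rhs = vec(Jr @ Yr + mu * X @ Kc @ beta @ Q.T)
    return unvec(np.linalg.solve(A, rhs), (n, mr))

def update_beta(alpha, Q, Kr, Kc, X, Yc, Jc, gc, mu):
    """Symmetric update for beta (Eqns. 10 and 13), dense version."""
    d, mc = Kc.shape[0], Q.shape[1]
    Zr = Q.T @ alpha.T @ Kr @ Kr @ alpha @ Q                  # Eqn. 11
    A = np.kron(np.eye(mc), gc * np.eye(d) + Jc @ Kc) + mu * np.kron(Zr, Kc)
    rhs = vec(Jc @ Yc + mu * X.T @ Kr @ alpha @ Q)
    return unvec(np.linalg.solve(A, rhs), (d, mc))

def ma_fit(Kr, Kc, X, Yr, Yc, Jr, Jc, gr, gc, mu, tol=1e-4, max_outer=40):
    alpha, beta = Yr.copy(), Yc.copy()        # assumed initialization
    Q = np.eye(Yr.shape[1], Yc.shape[1])
    prev = ma_objective(alpha, beta, Q, Kr, Kc, X, Yr, Yc, Jr, Jc, gr, gc, mu)
    for _ in range(max_outer):
        Q = update_Q(alpha, beta, Kr, Kc, X)
        alpha = update_alpha(beta, Q, Kr, Kc, X, Yr, Jr, gr, mu)
        beta = update_beta(alpha, Q, Kr, Kc, X, Yc, Jc, gc, mu)
        curr = ma_objective(alpha, beta, Q, Kr, Kc, X, Yr, Yc, Jr, Jc, gr, gc, mu)
        if (prev - curr) / max(abs(prev), 1e-12) < tol:
            break
        prev = curr
    return alpha, beta, Q
```

The dense Kronecker solves cost O((n m_r)^3) and O((d m_c)^3) and are only viable for small problems; the iterative solver described below avoids ever forming these matrices.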
We get the simple update equations given below, obtained by setting ∂J/∂Q = 0, ∂J/∂α = 0 and ∂J/∂β = 0 respectively,

$$ Q = (\alpha^T K_r^2 \alpha)^{-1} (\alpha^T K_r X K_c \beta) (\beta^T K_c^2 \beta)^{-1} \qquad (8) $$
$$ [\gamma_r I_n + J_r K_r]\, \alpha + \mu K_r \alpha Z_c = J_r Y_r + \mu X K_c \beta Q^T \qquad (9) $$
$$ [\gamma_c I_d + J_c K_c]\, \beta + \mu K_c \beta Z_r = J_c Y_c + \mu X^T K_r \alpha Q \qquad (10) $$

where

$$ Z_c = Q \beta^T K_c^2 \beta Q^T, \qquad Z_r = Q^T \alpha^T K_r^2 \alpha Q \qquad (11) $$

In Eqn. 8, we assume that the appropriate matrix inverses exist. Eqns. 9 and 10 are generalized Sylvester matrix equations of the form AXB + CXD = E, whose unique solution X under certain regularity conditions can be exactly obtained by an extended version of the classical Bartels-Stewart method [9], whose complexity is O((p + q)^3) for a p × q-sized matrix variable X. Alternatively, one can solve the linear system [10]: $(B^T \otimes A + D^T \otimes C)\, \mathrm{vec}(X) = \mathrm{vec}(E)$, where ⊗ denotes the Kronecker product and vec(X) vectorizes X in a column-oriented way (it behaves like the MATLAB operator X(:)). Thus, the solutions to Eqns. 9 and 10 are as follows,

$$ \left[ I_{m_r} \otimes (\gamma_r I_n + J_r K_r) + \mu\, Z_c \otimes K_r \right] \mathrm{vec}(\alpha) = \mathrm{vec}(J_r Y_r + \mu X K_c \beta Q^T) \qquad (12) $$
$$ \left[ I_{m_c} \otimes (\gamma_c I_d + J_c K_c) + \mu\, Z_r \otimes K_c \right] \mathrm{vec}(\beta) = \mathrm{vec}(J_c Y_c + \mu X^T K_r \alpha Q) \qquad (13) $$

These linear systems are of size n m_r × n m_r and d m_c × d m_c respectively. It is computationally prohibitive to solve these systems by direct matrix inversion. We use an iterative conjugate gradient (CG) technique instead, which can exploit hot-starts from the previous solution, and the fact that the matrix-vector products can be computed relatively efficiently as follows,

$$ \left[ I_{m_r} \otimes (\gamma_r I_n + J_r K_r) + \mu\, Z_c \otimes K_r \right] \mathrm{vec}(\alpha) = \mu\, \mathrm{vec}(K_r \alpha Z_c) + \gamma_r\, \mathrm{vec}(\alpha) + \mathrm{vec}(J_r K_r \alpha) $$

To optimize α (β) given fixed Q and β (α), we run CG with a stringent tolerance of 10^{-10} and a maximum of 200 iterations, starting from the α (β) of the previous iteration. In an outer loop, we monitor the relative decrease in the objective function and terminate when the relative improvement falls below 0.0001. We use a maximum of 40 outer iterations, where each iteration performs one round of α, β, Q optimization. Empirically, we find that the block coordinate descent approach often converges surprisingly quickly (see Section 4.2). The final classification is given by Eqn. 5.

4 Empirical Study

In this section, we present an empirical study aimed at comparing the proposed algorithms with several baselines: (i) unsupervised co-clustering with spectral bipartite graph partitioning (BIPARTITE) and non-negative matrix factorization (NMF), (ii) supervised performance of standard regularized least squares classification (RLS) that ignores unlabeled data, and (iii) one-sided semi-supervised performance obtained with Laplacian RLS (LAPRLS), which uses a Euclidean nearest-neighbor row similarity graph. The goal is to observe whether dual supervision, particularly along features, can help improve classification performance, and whether joint RKHS regularization along both rows and columns, as formulated in our algorithms (abbreviated MR for the manifold regularization based method of Section 3.1 and MA for the matrix approximation method of Section 3.2), leads to good quality out-of-sample prediction. In the experiments below, the performance of RLS and LAPRLS is optimized for best performance on the unlabeled set over a grid of hyperparameters. We use Gaussian kernels with width σ_r for rows and σ_c for columns. These were set to 2^k σ_{0r} and 2^k σ_{0c} respectively, where σ_{0r}, σ_{0c} are the (1/m)-quantiles of pairwise Euclidean distances among rows and columns respectively for an m-class problem, and k is tuned over {-2, -1, 0, 1, 2} to optimize 3-fold cross-validation performance of fully supervised RLS.
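For concreteness, a small sketch (ours; the Gaussian parameterization exp(-||x - z||^2 / (2 σ^2)) and the function names are our assumptions) of this width heuristic:

```python
import numpy as np

def pairwise_sq_dists(Z):
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)

def base_width(Z, m):
    """sigma_0: the (1/m)-quantile of pairwise Euclidean distances among the
    rows of Z (use Z = X for row widths and Z = X.T for column widths)."""
    dists = np.sqrt(pairwise_sq_dists(Z))
    return np.quantile(dists[np.triu_indices(len(Z), k=1)], 1.0 / m)

def gaussian_gram(Z, sigma):
    """Gram matrix of a Gaussian kernel with width sigma."""
    return np.exp(-pairwise_sq_dists(Z) / (2.0 * sigma ** 2))

# The widths are then sigma = 2**k * base_width(Z, m), with k chosen from
# {-2, -1, 0, 1, 2} by cross-validating fully supervised RLS.
```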
The values γ_r, γ_c, µ are loosely tuned for MA and MR with respect to a single random split of the data into training and validation sets; more careful hyperparameter tuning may further improve the results presented below. We focus on performance in predicting row labels. To enable comparison with the unsupervised co-clustering methods, we use the popularly used F-measure defined on pairs of examples as follows:

Precision = (Number of Pairs Correctly Predicted) / (Number of Pairs Predicted to be in the Same Cluster or Class)
Recall = (Number of Pairs Correctly Predicted) / (Number of Pairs in the Same Cluster or Class)
F-measure = (2 · Precision · Recall) / (Precision + Recall)    (14)

4.1 A Toy Dataset

We generated a toy 2-class dataset with 200 examples per class and 100 features to demonstrate the main observations. The feature vector for a positive example is of the form [2u - 0.1, 2u + 0.1], and for a negative example is of the form [2u + 0.1, 2u - 0.1], where u is a 50-dimensional random vector whose entries are uniformly distributed over the unit interval. It is clear that there is substantial overlap between the two classes. Given a column partitioning π_c, consider the following transformation,

$$ T(x) = \left( \frac{\sum_{i : \pi_c(i) = 1} x_i}{|\{i : \pi_c(i) = 1\}|},\; \frac{\sum_{i : \pi_c(i) = -1} x_i}{|\{i : \pi_c(i) = -1\}|} \right) $$

that maps examples in ℝ^100 to the plane ℝ^2 by collapsing each column partition into a single feature whose value equals the mean of all features in the same partition. For the correct column partitioning, π_c(i) = 1 for 1 ≤ i ≤ 50 and π_c(i) = -1 for 50 < i ≤ 100, the examples under the action of T are shown in Figure 1 (left). It is clear that T renders the data almost separable. It is therefore natural to attempt to (effectively) learn T in a semi-supervised manner. In Figure 1 (right), we plot the learning curves of various algorithms with respect to an increasing number of row and column labels. On this dataset, the co-clustering techniques (BIPARTITE, NMF) perform fairly well, and even significantly better than RLS, which has an optimized F-measure of 67% with 25 row labels. With increasing amounts of column labels, the learning curves of MR and MA steadily lift, eventually outperforming the unsupervised techniques. The hyperparameters used in this experiment are: σ_r = 2.1, σ_c = 4.1, γ_r = γ_c = 0.001, µ = 10 for MR and µ = 0.001 for MA.

Figure 1: left: Examples in the toy dataset under the transformation defined by the correct column partitioning. right: Performance comparison (F-measure versus the number of row labels); the number of column labels used by MR and MA is marked on each curve.

4.2 Text Categorization

We performed experiments on document-word matrices drawn from the 20-newsgroups dataset, preprocessed as in [15]. The preprocessed data has been made publicly available by the authors of [15] (at http://www.princeton.edu/nslonim/data/20NG data 74000.mat.gz). For each word w and class c, we computed a score as follows:

$$ \mathrm{score}(w, c) = -P(Y = c) \log P(Y = c) + P(W = w)\, P(Y = c \,|\, W = w) \log P(Y = c \,|\, W = w) + P(W = \bar{w})\, P(Y = c \,|\, W = \bar{w}) \log P(Y = c \,|\, W = \bar{w}), $$

where P(Y = c) is the fraction of documents whose category is c, P(W = w) is the fraction of times word w is encountered, and P(Y = c | W = w) (P(Y = c | W = \bar{w})) is the fraction of documents with class c when w is present (absent). It is easy to see that the mutual information between the indicator random variable for w and the class variable is $\sum_c \mathrm{score}(w, c)$. We simulated manual labeling of words by associating w with the class argmax_c score(w, c).
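As an illustration (ours, not the authors' code) of this scoring scheme, assuming a binarizable document-word matrix X and integer document labels y:

```python
import numpy as np

def word_scores(X, y, num_classes, eps=1e-12):
    """score(w, c) for every word w and class c, given a document-word
    matrix X (n_docs x n_words) and an integer label array y."""
    present = (X > 0).astype(float)                  # word presence indicators
    p_w = present.mean(axis=0)                       # P(W = w)
    n_with = np.maximum(present.sum(axis=0), eps)
    n_without = np.maximum((1.0 - present).sum(axis=0), eps)
    scores = np.zeros((X.shape[1], num_classes))
    for c in range(num_classes):
        in_c = (y == c).astype(float)
        p_c = in_c.mean()                            # P(Y = c)
        p_c_w = present.T @ in_c / n_with            # P(Y = c | W = w)
        p_c_nw = (1.0 - present).T @ in_c / n_without  # P(Y = c | W = not w)
        scores[:, c] = (-p_c * np.log(p_c + eps)
                        + p_w * p_c_w * np.log(p_c_w + eps)
                        + (1.0 - p_w) * p_c_nw * np.log(p_c_nw + eps))
    return scores

# Simulated word labels: the class with the highest score,
# e.g., word_labels = word_scores(X, y, 5).argmax(axis=1)
```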
Finally, we restricted attention to the 631 words with highest overall mutual information and 2000 documents that belong to the following 5 classes: comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast. Since words of talk.politics.mideast accounted for more than half the vocabulary, we used the class normalization prescribed in [11] to handle the imbalance in the labeled data.

Results presented in Table 1 are averaged over 10 runs. In each run, we randomly split the documents into training and test sets in the ratio 1 : 3. The training set is then further split into labeled and unlabeled sets by randomly selecting 75 labeled documents. We experimented with an increasing number of randomly chosen word labels. The hyperparameters are as follows: σ_r = 0.43, σ_c = 0.69; γ_r = γ_c = µ = 1 for MR and γ_r = γ_c = 0.0001, µ = 0.01 for MA.

We observe that even without any word supervision MR outperforms all the baseline approaches: unsupervised co-clustering with BIPARTITE and NMF, standard RLS that only uses labeled documents, and also LAPRLS, which uses a graph Laplacian based on document similarity for semi-supervised learning. This validates the effectiveness of the bipartite document-word graph regularizer. As the amount of word supervision increases, the performance of both MR and MA improves gracefully. The out-of-sample extension to test data is of good quality, considering that our test sets are much larger than our training sets. We also observed that the mean number of (outer) iterations required for convergence of MA decreases as word labels are increased from 0 to 500: 28.7 (0), 12.2 (100), 12.7 (200), 9.3 (350), 7.8 (500). In Figure 2, we show the top unlabeled words for each class, sorted by the real-valued prediction score assigned by MR (in one run trained with 100 labeled words). Intuitively, the main words associated with each class are retrieved.

Table 1: Performance on 5-Newsgroups Dataset with 75 row labels

(a) F-measure on Unlabeled Set                          (b) F-measure on Test Set
BIPARTITE    NMF          RLS          LAPRLS           RLS          LAPRLS
54.8 (7.8)   54.4 (6.2)   62.2 (3.1)   62.5 (3.0)       61.2 (1.7)   61.9 (1.4)

(c) F-measure on Unlabeled Set                          (d) F-measure on Test Set
# col labs   MR           MA                # col labs   MR           MA
0            64.7 (1.3)   60.4 (5.6)        0            57.1 (2.1)   60.3 (7.0)
100          72.3 (2.2)   59.6 (5.7)        100          60.9 (2.4)   60.9 (5.0)
200          77.0 (2.5)   69.2 (7.1)        200          66.2 (2.8)   66.2 (6.2)
350          78.6 (2.1)   75.1 (4.1)        350          68.1 (1.9)   70.3 (4.4)
500          79.3 (1.6)   77.1 (5.8)        500          69.1 (2.4)   71.0 (6.0)

Figure 2: Top unlabeled words categorized by MR
COMP.GRAPHICS: polygon, gifs, conversion, shareware, graphics, rgb, vesa, viewers, gif, format, viewer, amiga, raster, ftp, jpeg, manipulation
REC.MOTORCYCLES: biker, archive, dogs, yamaha, plo, wheel, riders, motorcycle, probes, ama, rockies, neighbors, saudi, kilometers
REC.SPORT.BASEBALL: clemens, morris, pitched, hr, batters, dodgers, offense, reds, rbi, wins, mets, innings, ted, defensive, sox, inning
SCI.SPACE: oo, servicing, solar, scispace, scheduled, atmosphere, missions, telescope, bursts, orbiting, energy, observatory, island, hst, dark
TALK.POLITICS.MIDEAST: turkish, greek, turkey, hezbollah, armenia, territory, ohanus, appressian, sahak, melkonian, civilians, greeks

4.3 Project Categorization

We also considered a problem that arises in a real business-intelligence setting.
The dataset is composed of 1169 projects tracked by the Integrated Technology Services division of IBM. These projects need to be categorized into 8 predefined product categories within IBM's Server Services product line, with the eventual goal of performing various follow-up business analyses at the granularity of categories. Each project is represented as a 112-dimensional vector specifying the distribution of skills required for its delivery. Therefore, each feature is associated with a particular job role/skill set (JR/SS) combination, e.g., "data-specialist (oracle database)". Domain experts validated project (row) labels and additionally provided category labels for 25 features deemed to be important skills for delivering projects in the corresponding category. By demonstrating our algorithms on this dataset, we are able to validate a general methodology with which to approach project categorization across all service product lines (SPLs) on a regular basis. The amount of dual supervision available in other SPLs is indeed severely limited, as both the project categories and skill definitions are constantly evolving due to the highly dynamic business environment.

Results presented in Table 2 are averaged over 10 runs. In each run, we randomly split the projects into training and test sets in the ratio 3 : 1. The training set is then further split into labeled and unlabeled sets by randomly selecting 30 labeled projects. We experimented with an increasing number of randomly chosen column labels, from none to all 25 available labels. The hyperparameters are as follows: γ_r = γ_c = 0.0001, σ_r = 0.69, σ_c = 0.27, chosen as described earlier. Results in Tables 2(c) and 2(d) are obtained with µ = 10 for MR and µ = 0.001 for MA.

We observe that BIPARTITE performs significantly better than NMF on this dataset, and is competitive with supervised RLS performance that relies only on labeled data. By using LAPRLS, performance can be slightly boosted. We find that MR outperforms all approaches significantly, even with very few column labels. We conjecture that the comparatively lower mean and higher variance in the performance of MA on this dataset is due to suboptimal local minima, which may be alleviated using annealing techniques or multiple random starts, as commonly used for Transductive SVMs [3]. From Tables 2(c) and 2(d) we also observe that both methods give a high quality out-of-sample extension on this problem.

Table 2: Performance on IBM Project Categorization Dataset with 30 row labels

(a) F-measure on Unlabeled Set                          (b) F-measure on Test Set
BIPARTITE    NMF          RLS          LAPRLS           RLS          LAPRLS
89.1 (2.7)   56.5 (1.1)   88.1 (7.3)   90.2 (5.8)       87.8 (8.4)   90.2 (6.0)

(c) F-measure on Unlabeled Set                          (d) F-measure on Test Set
# col labs   MR           MA                # col labs   MR           MA
0            92.7 (4.6)   90.7 (4.8)        0            89.2 (5.5)   90.0 (5.5)
5            94.9 (1.8)   87.8 (6.4)        5            93.3 (1.7)   87.4 (6.6)
10           93.0 (4.2)   89.0 (8.0)        10           91.9 (4.2)   89.1 (8.3)
15           92.3 (7.0)   89.1 (7.4)        15           92.2 (5.2)   89.2 (8.8)
25           98.0 (0.5)   92.2 (6.0)        25           96.4 (1.6)   92.1 (6.8)

5 Conclusion

We have developed semi-supervised kernel methods that support partial supervision along both dimensions of the data. Empirical studies show promising results and highlight the previously untapped benefits of feature supervision in semi-supervised settings. For an application of closely related algorithms to blog sentiment classification, we point the reader to [14]. For recent work on text categorization with labeled features instead of labeled examples, see [8].

References

[1] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919-1986, 2007.
[2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399-2434, 2006.
[3] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.
[4] F. Chung. Spectral Graph Theory. AMS, 1997.
[5] I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, 2001.
[6] C. Ding, X. He, and H. D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In SDM, 2005.
[7] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, 2006.
[8] G. Druck, G. Mann, and A. McCallum. Learning from labeled features using generalized expectation criteria. In SIGIR, 2008.
[9] J. Gardiner, A. J. Laub, J. J. Amato, and C. B. Moler. Solution of the Sylvester matrix equation AXB^T + CXD^T = E. ACM Transactions on Mathematical Software, 18(2):223-231, 1992.
[10] D. Harville. Matrix Algebra From a Statistician's Perspective. Springer, New York, 1997.
[11] T. M. Huang and V. Kecman. Semi-supervised learning from unbalanced labeled data: an improvement. Lecture Notes in Computer Science, 3215:765-771, 2004.
[12] A. Langville, C. Meyer, and R. Albright. Initializations for the non-negative matrix factorization. In KDD, 2006.
[13] T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In ICDM, 2006.
[14] V. Sindhwani and P. Melville. Document-word co-regularization for semi-supervised sentiment analysis. In ICDM, 2008.
[15] N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In SIGIR, 2000.