Discriminative K-means for Clustering

Jieping Ye, Arizona State University, Tempe, AZ 85287. jieping.ye@asu.edu
Zheng Zhao, Arizona State University, Tempe, AZ 85287. zhaozheng@asu.edu
Mingrui Wu, MPI for Biological Cybernetics, Tübingen, Germany. mingrui.wu@tuebingen.mpg.de

Abstract

We present a theoretical study of the discriminative clustering framework, recently proposed for simultaneous subspace selection via linear discriminant analysis (LDA) and clustering. Empirical results have shown its favorable performance in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection and clustering in this framework is not well understood, due to the iterative nature of the algorithm. We show in this paper that this iterative subspace selection and clustering is equivalent to kernel K-means with a specific kernel Gram matrix. This provides significant new insights into the nature of this subspace selection procedure. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter estimation procedure. We also present the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation. The connection between DisKmeans and several other clustering algorithms is also analyzed. The presented theories and algorithms are evaluated through experiments on a collection of benchmark data sets.

1 Introduction

Applications in various domains such as text/web mining and bioinformatics often lead to very high-dimensional data. Clustering such high-dimensional data sets is a contemporary challenge, due to the curse of dimensionality. A common practice is to project the data onto a low-dimensional subspace through unsupervised dimensionality reduction such as Principal Component Analysis (PCA) [9] and various manifold learning algorithms [1, 13] before clustering. However, the projection may not necessarily improve the separability of the data for clustering, due to the inherent separation between subspace selection (via dimensionality reduction) and clustering.

One natural way to overcome this limitation is to integrate dimensionality reduction and clustering in a joint framework. Several recent works [5, 10, 16] incorporate supervised dimensionality reduction such as Linear Discriminant Analysis (LDA) [7] into the clustering framework, performing clustering and LDA dimensionality reduction simultaneously. The algorithm, called Discriminative Clustering (DisCluster) in the following discussion, works in an iterative fashion, alternating between LDA subspace selection and clustering. In this framework, clustering generates the class labels for LDA, while LDA provides the subspace for clustering. Empirical results have shown the benefits of clustering in a low-dimensional discriminative space rather than in the (generative) principal component space. However, the integration between subspace selection and clustering in DisCluster is not well understood, due to the intertwined and iterative nature of the algorithm.

In this paper, we analyze this discriminative clustering framework by studying several fundamental and important issues: (1) What do we really gain by performing clustering in a low-dimensional discriminative space?
(2) What is the nature of its iterative process alternating between subspace selection and clustering? (3) Can this iterative process be simplified and improved? (4) How can the parameter involved in the algorithm be estimated?

The main contributions of this paper are summarized as follows: (1) We show that the LDA projection can be factored out from the integrated LDA subspace selection and clustering formulation. This results in a simple trace maximization problem associated with a regularized Gram matrix of the data, which is controlled by a regularization parameter $\lambda$; (2) The solution to this trace maximization problem leads to the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering. DisKmeans is shown to be equivalent to kernel K-means, where discriminative subspace selection essentially constructs a kernel Gram matrix for clustering. This provides new insights into the nature of this subspace selection procedure; (3) The DisKmeans algorithm depends on the value of the regularization parameter $\lambda$. We propose an automatic parameter tuning process (model selection) for the estimation of $\lambda$; (4) We propose the nonlinear extension of DisKmeans using kernels. We show that the learning of the kernel matrix over a convex set of pre-specified kernel matrices can be incorporated into the clustering formulation, resulting in a semidefinite program (SDP) [15]. We evaluate the presented theories and algorithms through experiments on a collection of benchmark data sets.

2 Linear Discriminant Analysis and Discriminative Clustering

Consider a data set consisting of $n$ data points $\{x_i\}_{i=1}^n \subset \mathbb{R}^m$. For simplicity, we assume the data is centered, that is, $\sum_{i=1}^n x_i / n = 0$. Denote $X = [x_1, \cdots, x_n]$ as the data matrix whose $i$-th column is given by $x_i$. In clustering, we aim to group the data $\{x_i\}_{i=1}^n$ into $k$ clusters $\{C_j\}_{j=1}^k$. Let $F \in \mathbb{R}^{n \times k}$ be the cluster indicator matrix defined as follows:

    F = \{f_{i,j}\}_{n \times k}, \quad \text{where } f_{i,j} = 1 \text{ iff } x_i \in C_j.    (1)

We can define the weighted cluster indicator matrix as follows [4]:

    L = [L_1, L_2, \cdots, L_k] = F (F^T F)^{-1/2}.    (2)

It follows that the $j$-th column of $L$ is given by

    L_j = (0, \ldots, 0, 1, \ldots, 1, 0, \ldots, 0)^T / n_j^{1/2},    (3)

where $n_j$ is the sample size of the $j$-th cluster $C_j$. Denote $\mu_j = \sum_{x \in C_j} x / n_j$ as the mean of the $j$-th cluster $C_j$. The within-cluster scatter, between-cluster scatter, and total scatter matrices are defined as follows [7]:

    S_w = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)(x_i - \mu_j)^T, \quad S_b = \sum_{j=1}^{k} n_j \mu_j \mu_j^T = X L L^T X^T, \quad S_t = X X^T.    (4)

It follows that trace$(S_w)$ captures the intra-cluster distance, and trace$(S_b)$ captures the inter-cluster distance. It can be shown that $S_t = S_w + S_b$. Given the cluster indicator matrix $F$ (or $L$), Linear Discriminant Analysis (LDA) aims to compute a linear transformation (projection) $P \in \mathbb{R}^{m \times d}$ that maps each $x_i$ in the $m$-dimensional space to a vector $\hat{x}_i = P^T x_i$ in the $d$-dimensional space ($d < m$), such that the following objective function is maximized [7]: trace$\big((P^T S_w P)^{-1} P^T S_b P\big)$. Since $S_t = S_w + S_b$, the optimal transformation matrix $P$ is also given by maximizing the following objective function:

    \mathrm{trace}\big((P^T S_t P)^{-1} P^T S_b P\big).    (5)
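To make the definitions in Eqs. (1)-(4) concrete, the following sketch (ours, not part of the paper; NumPy, toy data) builds the weighted cluster indicator matrix $L$ and the scatter matrices for a small centered data set and checks the identity $S_t = S_w + S_b$.

```python
import numpy as np

def weighted_indicator(labels, k):
    """Weighted cluster indicator L = F (F^T F)^{-1/2}, Eqs. (1)-(3)."""
    F = np.eye(k)[labels]                      # one-hot indicator: f_ij = 1 iff x_i in C_j
    return F / np.sqrt(F.sum(axis=0))          # divide column j by sqrt(n_j)

# toy centered data: columns of X are the points x_i (m = 2, n = 6, k = 2)
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6))
X -= X.mean(axis=1, keepdims=True)
labels = np.array([0, 0, 0, 1, 1, 1])
L = weighted_indicator(labels, k=2)

# scatter matrices of Eq. (4)
S_t = X @ X.T
S_b = X @ L @ L.T @ X.T                        # equals sum_j n_j mu_j mu_j^T
mu = np.stack([X[:, labels == j].mean(axis=1) for j in range(2)], axis=1)
D = X - mu[:, labels]                          # deviations x_i - mu_{c(i)}
S_w = D @ D.T
assert np.allclose(S_t, S_w + S_b)             # S_t = S_w + S_b for centered data
```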
For high-dimensional data, the estimation of the total scatter (covariance) matrix is often not reliable. The regularization technique [6] is commonly applied to improve the estimation as follows:

    \tilde{S}_t = S_t + \lambda I_m = X X^T + \lambda I_m,    (6)

where $I_m$ is the identity matrix of size $m$ and $\lambda > 0$ is a regularization parameter.

In Discriminative Clustering (DisCluster) [5, 10, 16], the transformation matrix $P$ and the weighted cluster indicator matrix $L$ are computed by maximizing the following objective function:

    f(L, P) = \mathrm{trace}\big((P^T \tilde{S}_t P)^{-1} P^T S_b P\big) = \mathrm{trace}\big((P^T (X X^T + \lambda I_m) P)^{-1} P^T X L L^T X^T P\big).    (7)

The algorithm works in an intertwined and iterative fashion, alternating between the computation of $L$ for a given $P$ and the computation of $P$ for a given $L$. More specifically, for a given $L$, $P$ is given by the standard LDA procedure. Since trace$(AB)$ = trace$(BA)$ for any two matrices [8], for a given $P$, the objective function $f(L, P)$ can be expressed as:

    f(L, P) = \mathrm{trace}\big(L^T X^T P (P^T (X X^T + \lambda I_m) P)^{-1} P^T X L\big).    (8)

Note that $L$ is not an arbitrary matrix, but a weighted cluster indicator matrix, as defined in Eq. (3). The optimal $L$ can be computed by applying the gradient descent strategy [10] or by solving a kernel K-means problem [5, 16] with $X^T P (P^T (X X^T + \lambda I_m) P)^{-1} P^T X$ as the kernel Gram matrix [4]. The algorithm is guaranteed to converge in terms of the value of the objective function $f(L, P)$, as the value of $f(L, P)$ monotonically increases and is bounded from above. Experiments [5, 10, 16] have shown the effectiveness of DisCluster in comparison with several other popular clustering algorithms. However, the inherent relationship between subspace selection via LDA and clustering is not well understood, and further investigation is needed. We show in the next section that the iterative subspace selection and clustering in DisCluster is equivalent to kernel K-means with a specific kernel Gram matrix. Based on this equivalence relationship, we propose the Discriminative K-means (DisKmeans) algorithm for simultaneous LDA subspace selection and clustering.
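For reference, here is a condensed sketch of the DisCluster alternation, under our own simplifications (not the authors' code): the LDA step solves the generalized eigenproblem behind Eq. (7) with SciPy, and the clustering step runs ordinary K-means on the whitened projection, which is equivalent to kernel K-means with the Gram matrix appearing in Eq. (8), up to the usual K-means local optima.

```python
import numpy as np
from scipy.linalg import eigh, sqrtm
from sklearn.cluster import KMeans

def discluster(X, k, lam, n_iter=10, seed=0):
    """Sketch of the DisCluster alternation of Eqs. (7)-(8); X is m x n and centered."""
    m, n = X.shape
    St_reg = X @ X.T + lam * np.eye(m)                 # regularized total scatter, Eq. (6)
    labels = KMeans(k, n_init=10, random_state=seed).fit_predict(X.T)   # initialization
    for _ in range(n_iter):
        # LDA step: P maximizes trace((P^T St_reg P)^{-1} P^T S_b P) for the current labels
        F = np.eye(k)[labels]
        L = F / np.sqrt(F.sum(axis=0))                 # weighted indicator, Eq. (2)
        Sb = X @ L @ L.T @ X.T
        _, vecs = eigh(Sb, St_reg)                     # generalized eigenvectors, ascending
        P = vecs[:, -(k - 1):]                         # top k-1 discriminant directions
        # Clustering step: kernel K-means with kernel X^T P (P^T St_reg P)^{-1} P^T X,
        # i.e. plain K-means on the whitened projection Y
        W = np.real(np.linalg.inv(sqrtm(P.T @ St_reg @ P)))
        Y = W @ P.T @ X
        labels = KMeans(k, n_init=10, random_state=seed).fit_predict(Y.T)
    return labels
```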
3 DisKmeans: Discriminative K-means with a Fixed $\lambda$

Assume that $\lambda$ is a fixed positive constant. Let us consider the maximization of the function in Eq. (7):

    f(L, P) = \mathrm{trace}\big((P^T (X X^T + \lambda I_m) P)^{-1} P^T X L L^T X^T P\big).    (9)

Here, $P$ is a transformation matrix and $L$ is a weighted cluster indicator matrix as in Eq. (3). It follows from the Representer Theorem [14] that the optimal transformation matrix $P \in \mathbb{R}^{m \times d}$ can be expressed as $P = X H$, for some matrix $H \in \mathbb{R}^{n \times d}$. Denote $G = X^T X$ as the Gram matrix, which is symmetric and positive semidefinite. It follows that

    f(L, P) = \mathrm{trace}\big((H^T (G G + \lambda G) H)^{-1} H^T G L L^T G H\big).    (10)

We show that the matrix $H$ can be factored out from the objective function in Eq. (10), thus dramatically simplifying the optimization problem in the original DisCluster algorithm. The main result is summarized in the following theorem:

Theorem 3.1. Let $G$ be the Gram matrix defined as above and let $\lambda > 0$ be the regularization parameter. Let $L^*$ and $P^*$ be the optimal solution to the maximization of the objective function $f(L, P)$ in Eq. (7). Then $L^*$ solves the following maximization problem:

    L^* = \arg\max_L \; \mathrm{trace}\Big(L^T \big(I_n - (I_n + \tfrac{1}{\lambda} G)^{-1}\big) L\Big).    (11)

Proof. Let $G = U \Sigma U^T$ be the Singular Value Decomposition (SVD) [8] of $G$, where $U \in \mathbb{R}^{n \times n}$ is orthogonal, $\Sigma = \mathrm{diag}(\sigma_1, \cdots, \sigma_t, 0, \cdots, 0) \in \mathbb{R}^{n \times n}$ is diagonal, and $t = \mathrm{rank}(G)$. Let $U_1 \in \mathbb{R}^{n \times t}$ consist of the first $t$ columns of $U$ and $\Sigma_t = \mathrm{diag}(\sigma_1, \cdots, \sigma_t) \in \mathbb{R}^{t \times t}$. Then

    G = U \Sigma U^T = U_1 \Sigma_t U_1^T.    (12)

Denote $R = (\Sigma_t^2 + \lambda \Sigma_t)^{-1/2} \Sigma_t U_1^T L$ and let $R = M \Sigma_R N^T$ be the SVD of $R$, where $M$ and $N$ are orthogonal and $\Sigma_R$ is diagonal, with $\mathrm{rank}(\Sigma_R) = \mathrm{rank}(R) = q$. Define the matrix $Z$ as $Z = U \, \mathrm{diag}\big((\Sigma_t^2 + \lambda \Sigma_t)^{-1/2} M, \; I_{n-t}\big)$, where $\mathrm{diag}(A, B)$ is a block diagonal matrix. It follows that

    Z^T (G G + \lambda G) Z = \begin{pmatrix} I_t & 0 \\ 0 & 0 \end{pmatrix}, \qquad Z^T G L L^T G Z = \begin{pmatrix} \tilde{\Sigma} & 0 \\ 0 & 0 \end{pmatrix},    (13)

where $\tilde{\Sigma} = \Sigma_R^2$ is diagonal with non-increasing diagonal entries. It can be verified that

    f(L, P) \leq \mathrm{trace}\big(\tilde{\Sigma}\big) = \mathrm{trace}\big((G G + \lambda G)^{+} G L L^T G\big) = \mathrm{trace}\big(L^T G (G G + \lambda G)^{+} G L\big) = \mathrm{trace}\Big(L^T \big(I_n - (I_n + \tfrac{1}{\lambda} G)^{-1}\big) L\Big),    (14)

where the equality holds when $P = X H$ and $H$ consists of the first $q$ columns of $Z$.

3.1 Computing the Weighted Cluster Matrix L

The weighted cluster indicator matrix $L$ solving the maximization problem in Eq. (11) can be computed by solving a kernel K-means problem [5] with the kernel Gram matrix given by

    \tilde{G} = I_n - \big(I_n + \tfrac{1}{\lambda} G\big)^{-1}.    (15)

Thus, DisCluster is equivalent to a kernel K-means problem. We call the algorithm Discriminative K-means (DisKmeans).

3.2 Constructing the Kernel Gram Matrix via Subspace Selection

The kernel Gram matrix in Eq. (15) can be expressed as

    \tilde{G} = U \, \mathrm{diag}\big(\sigma_1/(\lambda + \sigma_1), \; \sigma_2/(\lambda + \sigma_2), \; \cdots, \; \sigma_n/(\lambda + \sigma_n)\big) \, U^T.    (16)

Recall that the original DisCluster algorithm involves alternating LDA subspace selection and clustering. The analysis above shows that the LDA subspace selection in DisCluster essentially constructs a kernel Gram matrix for clustering. More specifically, all the eigenvectors of $G$ are kept unchanged, while the transformation $\sigma \mapsto \sigma/(\lambda + \sigma)$ is applied to the eigenvalues. This elucidates the nature of the subspace selection procedure in DisCluster. The clustering algorithm is dramatically simplified by removing the iterative subspace selection. We thus address issues (1)-(3) in Section 1. The last issue will be addressed in Section 4 below.

3.3 Connection with Other Clustering Approaches

Consider the limiting case when $\lambda \rightarrow \infty$. It follows from Eq. (16) that $\tilde{G} \approx G/\lambda$. The optimal $L$ is thus given by solving the following maximization problem:

    \arg\max_L \; \mathrm{trace}\big(L^T G L\big).

The solution is given by the standard K-means clustering [4, 5]. Consider the other extreme case when $\lambda \rightarrow 0$. It follows from Eq. (16) that $\tilde{G} \rightarrow U_1 U_1^T$. Note that the columns of $U_1$ form the full set of (normalized) principal components [9]. Thus, the algorithm is equivalent to clustering in the (full) principal component space.
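A minimal sketch of DisKmeans (ours, not the authors' code): form the Gram matrix of Eq. (15) and solve the kernel K-means problem, here by embedding the positive semidefinite kernel through its eigendecomposition and running ordinary K-means on the embedding, which is an equivalent formulation up to K-means local optima. The toy data and the choices of $k$ and $\lambda$ below are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def diskmeans(X, k, lam, seed=0):
    """DisKmeans as kernel K-means with the Gram matrix of Eq. (15); X is m x n and centered."""
    n = X.shape[1]
    G = X.T @ X                                                 # linear Gram matrix
    G_tilde = np.eye(n) - np.linalg.inv(np.eye(n) + G / lam)    # Eq. (15)
    # Eq. (16): same eigenvectors as G, eigenvalues mapped to sigma / (lam + sigma)
    w, U = np.linalg.eigh(G_tilde)
    Y = U * np.sqrt(np.clip(w, 0.0, None))                      # rows of Y embed the points
    return KMeans(k, n_init=10, random_state=seed).fit_predict(Y)

# limiting behaviour of Section 3.3 on toy data: a very large lambda behaves like
# K-means on X, a very small lambda like K-means in the full principal component space
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))
X -= X.mean(axis=1, keepdims=True)
labels_kmeans_like = diskmeans(X, k=3, lam=1e6)
labels_pca_like = diskmeans(X, k=3, lam=1e-6)
```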
4 DisKmeans$_\lambda$: Discriminative K-means with Automatically Tuned $\lambda$

Our experiments show that the value of the regularization parameter $\lambda$ has a significant impact on the performance of DisKmeans. In this section, we show how to incorporate the automatic tuning of $\lambda$ into the optimization framework, thus addressing issue (4) in Section 1. The maximization problem in Eq. (11) is equivalent to the minimization of the following function:

    \mathrm{trace}\Big(L^T \big(I_n + \tfrac{1}{\lambda} G\big)^{-1} L\Big).    (17)

It is clear that a small value of $\lambda$ leads to a small value of the objective function in Eq. (17). To overcome this problem, we include an additional penalty term to control the eigenvalues of the matrix $I_n + \frac{1}{\lambda} G$. This leads to the following optimization problem:

    \min_{L, \lambda} \; g(L, \lambda) \equiv \mathrm{trace}\Big(L^T \big(I_n + \tfrac{1}{\lambda} G\big)^{-1} L\Big) + \log \det\Big(I_n + \tfrac{1}{\lambda} G\Big).    (18)

Note that the objective function in Eq. (18) is closely related to the negative log marginal likelihood function in Gaussian Processes [12] with $I_n + \frac{1}{\lambda} G$ as the covariance matrix. We have the following main result for this section:

Theorem 4.1. Let $G$ be the Gram matrix defined above and let $L$ be a given weighted cluster indicator matrix. Let $G = U \Sigma U^T = U_1 \Sigma_t U_1^T$ be the SVD of $G$ with $\Sigma_t = \mathrm{diag}(\sigma_1, \cdots, \sigma_t)$ as in Eq. (12), and let $a_i$ be the $i$-th diagonal entry of the matrix $U_1^T L L^T U_1$. Then for a fixed $L$, the optimal $\lambda$ solving the optimization problem in Eq. (18) is given by minimizing the following objective function:

    \sum_{i=1}^{t} \Big( \frac{a_i}{1 + \sigma_i/\lambda} + \log\big(1 + \sigma_i/\lambda\big) \Big).    (19)

Proof. Let $U = [U_1, U_2]$, that is, $U_2$ is the orthogonal complement of $U_1$. It follows that

    \log \det\Big(I_n + \tfrac{1}{\lambda} G\Big) = \log \det\Big(I_t + \tfrac{1}{\lambda} \Sigma_t\Big) = \sum_{i=1}^{t} \log\big(1 + \sigma_i/\lambda\big),    (20)

    \mathrm{trace}\Big(L^T \big(I_n + \tfrac{1}{\lambda} G\big)^{-1} L\Big) = \mathrm{trace}\Big(L^T U_1 \big(I_t + \tfrac{1}{\lambda} \Sigma_t\big)^{-1} U_1^T L\Big) + \mathrm{trace}\big(L^T U_2 U_2^T L\big) = \sum_{i=1}^{t} (1 + \sigma_i/\lambda)^{-1} a_i + \mathrm{trace}\big(L^T U_2 U_2^T L\big).    (21)

The result follows as the second term in Eq. (21), trace$\big(L^T U_2 U_2^T L\big)$, is a constant.

We can thus solve the optimization problem in Eq. (18) iteratively as follows: for a fixed $\lambda$, we update $L$ by minimizing the objective function in Eq. (17), which is equivalent to the DisKmeans algorithm; for a fixed $L$, we update $\lambda$ by minimizing the objective function in Eq. (19), which is a single-variable optimization and can be solved efficiently using a line search method. We call the algorithm DisKmeans$_\lambda$, whose solution depends on the initial value of $\lambda$, as the optimization problem is non-convex.
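For a fixed $L$, the $\lambda$-update of Theorem 4.1 is a one-dimensional search. Below is a minimal sketch (ours; the log-scale parameterization, the rank cut-off, and the search interval are our own choices, not prescribed by the paper):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def update_lambda(G, L, lam_range=(1e-6, 1e6)):
    """One lambda-update of DisKmeans_lambda: minimize Eq. (19) for a fixed L."""
    sigma, U = np.linalg.eigh(G)                    # eigenvalues/eigenvectors of G
    keep = sigma > 1e-10                            # keep sigma_1, ..., sigma_t (t = rank(G))
    sigma, U1 = sigma[keep], U[:, keep]
    a = np.sum((U1.T @ L) ** 2, axis=1)             # a_i = [U1^T L L^T U1]_{ii}
    def obj(log_lam):                               # search over log(lambda) for stability
        r = sigma / np.exp(log_lam)
        return np.sum(a / (1.0 + r) + np.log1p(r))  # objective of Eq. (19)
    res = minimize_scalar(obj, bounds=(np.log(lam_range[0]), np.log(lam_range[1])),
                          method="bounded")
    return np.exp(res.x)
```

DisKmeans$_\lambda$ then alternates this update with the DisKmeans step of Section 3 (recomputing $L$ for the new $\lambda$); one natural stopping rule is to iterate until the objective in Eq. (18) stops decreasing.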
5 Kernel DisKmeans: Nonlinear Discriminative K-means Using Kernels

The DisKmeans algorithm can be easily extended to deal with nonlinear data using the kernel trick. Kernel methods [14] work by mapping the data into a high-dimensional feature space $\mathcal{F}$ equipped with an inner product through a nonlinear mapping $\phi: \mathbb{R}^m \rightarrow \mathcal{F}$. The nonlinear mapping can be implicitly specified by a symmetric kernel function $K$, which computes the inner product of the images of each data pair in the feature space. For a given training data set $\{x_i\}_{i=1}^n$, the kernel Gram matrix $G_K$ is defined as follows: $G_K(i,j) = \langle \phi(x_i), \phi(x_j) \rangle$. For a given $G_K$, the weighted cluster indicator matrix $L = [L_1, \cdots, L_k]$ in kernel DisKmeans is given by minimizing the following objective function:

    \mathrm{trace}\Big(L^T \big(I_n + \tfrac{1}{\lambda} G_K\big)^{-1} L\Big) = \sum_{j=1}^{k} L_j^T \big(I_n + \tfrac{1}{\lambda} G_K\big)^{-1} L_j.    (22)

The performance of kernel DisKmeans depends on the choice of the kernel Gram matrix. Following [11], we assume that $G_K$ is restricted to be a convex combination of a given set of kernel Gram matrices $\{G_i\}_{i=1}^{\ell}$ as $G_K = \sum_{i=1}^{\ell} \theta_i G_i$, where the coefficients $\{\theta_i\}_{i=1}^{\ell}$ satisfy $\sum_{i=1}^{\ell} \theta_i \, \mathrm{trace}(G_i) = 1$ and $\theta_i \geq 0$ for all $i$. If we relax the entries of $L$ to be any continuous values, the optimal coefficients $\{\theta_i\}_{i=1}^{\ell}$ and $L$ may be computed by solving a semidefinite programming (SDP) problem, as summarized in the following theorem:

Theorem 5.1. Let $G_K$ be constrained to be a convex combination of a given set of kernel matrices $\{G_i\}_{i=1}^{\ell}$ as $G_K = \sum_{i=1}^{\ell} \theta_i G_i$ satisfying the constraints defined above. Then the optimal $G_K$ minimizing the objective function in Eq. (22) is given by solving the following SDP problem:

    \min_{L, t_1, \cdots, t_k, \theta} \; \sum_{j=1}^{k} t_j
    \text{s.t.} \;\; \begin{pmatrix} I_n + \frac{1}{\lambda} \sum_{i=1}^{\ell} \theta_i G_i & L_j \\ L_j^T & t_j \end{pmatrix} \succeq 0, \; \text{for } j = 1, \cdots, k, \qquad \theta_i \geq 0 \;\; \forall i, \qquad \sum_{i=1}^{\ell} \theta_i \, \mathrm{trace}(G_i) = 1.    (23)

Proof. It follows as $L_j^T \big(I_n + \frac{1}{\lambda} G_K\big)^{-1} L_j \leq t_j$ is equivalent (by the Schur complement) to

    \begin{pmatrix} I_n + \frac{1}{\lambda} \sum_{i=1}^{\ell} \theta_i G_i & L_j \\ L_j^T & t_j \end{pmatrix} \succeq 0.

The parameter $\lambda$ can also be incorporated into the SDP formulation by treating the identity matrix $I_n$ as one of the kernel Gram matrices, as in [11]. The algorithm is named Kernel DisKmeans$_\lambda$. Note that unlike the kernel learning in [11], class label information is not available in our formulation.
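The SDP above requires a dedicated solver; for intuition only, the sketch below (ours, and deliberately naive) evaluates the objective of Eq. (22) over a grid of convex combinations of two assumed RBF kernel matrices, rescaled to satisfy the trace constraint, rather than solving Eq. (23).

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian kernel matrix for the columns of X (an assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.exp(-gamma * d2)

def kernel_objective(G_K, L, lam):
    """Objective of Eq. (22) for a fixed weighted cluster indicator L."""
    n = G_K.shape[0]
    return np.trace(L.T @ np.linalg.inv(np.eye(n) + G_K / lam) @ L)

def best_combination(kernels, L, lam, grid=21):
    """Brute-force stand-in for the SDP of Eq. (23) with two candidate kernels."""
    best_val, best_theta = np.inf, None
    for w in np.linspace(0.0, 1.0, grid):
        theta = np.array([w, 1.0 - w])
        scale = sum(t * np.trace(G) for t, G in zip(theta, kernels))
        theta = theta / scale                       # enforce sum_i theta_i trace(G_i) = 1
        G_K = sum(t * G for t, G in zip(theta, kernels))
        val = kernel_objective(G_K, L, lam)
        if val < best_val:
            best_val, best_theta = val, theta
    return best_theta, best_val
```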
6 Empirical Study

In this section, we empirically study the properties of DisKmeans and its variants, and evaluate the performance of the proposed algorithms in comparison with several other representative algorithms, including Locally Linear Embedding (LLE) [13] and Laplacian Eigenmap (Leigs) [1].

Experiment Setup: All algorithms were implemented using Matlab and experiments were conducted on a PENTIUM IV 2.4G PC with 1.5GB RAM. We test these algorithms on eight benchmark data sets. They are five UCI data sets [2]: banding, soybean, segment, satimage, pendigits; one biological data set: leukemia (http://www.upo.es/eps/aguilar/datasets.html); and two image data sets: ORL (http://www.uk.research.att.com/facedatabase.html, sub-sampled to a size of 100*100 = 10000, from 10 persons) and USPS (ftp://ftp.kyb.tuebingen.mpg.de/pub/bs/data/). See Table 1 for more details. To make the results of different algorithms comparable, we first run K-means, and the clustering result of K-means is used to construct the set of k initial centroids, for all experiments. This process is repeated 50 times with different sub-samples from the original data sets. We use two standard measurements, the accuracy (ACC) and the normalized mutual information (NMI), to measure the performance.

Table 1: Summary of benchmark data sets
  Data Set    # DIM (m)   # INST (n)   # CL (k)
  banding     29          238          2
  soybean     35          562          15
  segment     19          2309         7
  pendigits   16          10992        10
  satimage    36          6435         6
  leukemia    7129        72           2
  ORL         10304       100          10
  USPS        256         9298         10

[Figure 1: The effect of the regularization parameter $\lambda$ on DisKmeans and DisCluster. Eight panels (banding, soybean, segment, pendigits, satimage, leukemia, ORL, USPS) plot ACC against $\lambda$ in the range 10^-6 to 10^6 for K-means, DisCluster, and DisKmeans.]

Effect of the regularization parameter $\lambda$: Figure 1 shows the accuracy (y-axis) of DisKmeans and DisCluster for different $\lambda$ values (x-axis). We can observe that $\lambda$ has a significant impact on the performance of DisKmeans. This justifies the development of an automatic parameter tuning process in Section 4. We can also observe from the figure that when $\lambda \rightarrow \infty$, the performance of DisKmeans approaches that of K-means on all eight benchmark data sets. This is consistent with our theoretical analysis in Section 3.3. It is clear that in many cases, $\lambda = 0$ is not the best choice.

Effect of parameter tuning in DisKmeans$_\lambda$: Figure 2 shows the accuracy of DisKmeans$_\lambda$ on 4 data sets. In the figure, the x-axis denotes the different $\lambda$ values used as the starting point for DisKmeans$_\lambda$. The result of DisKmeans (without parameter tuning) is also presented for comparison.

[Figure 2: The effect of the parameter tuning in DisKmeans$_\lambda$ on 4 data sets (satimage, pendigits, ORL, USPS). The x-axis denotes the different $\lambda$ values used as the starting point for DisKmeans$_\lambda$; the y-axis denotes ACC.]

We can observe from the figure that in many cases the tuning process is able to significantly improve the performance. We observe similar trends on the other four data sets and the results are omitted. Figure 2 also shows that the tuning process is dependent on the initial value of $\lambda$ due to its non-convex optimization, and when $\lambda \rightarrow \infty$, the effect of the tuning process becomes less pronounced. Our results show that a value of $\lambda$ which is neither too large nor too small works well.

[Figure 3: Comparison of the trace value achieved by DisKmeans and the trace value achieved in each iteration of DisCluster on 4 data sets (satimage, pendigits, segment, USPS). The x-axis denotes the number of iterations in DisCluster; the y-axis denotes the trace value. The trace value of DisCluster is bounded from above by that of DisKmeans.]

DisKmeans versus DisCluster: Figure 3 compares the trace value achieved by DisKmeans and the trace value achieved in each iteration of DisCluster on 4 data sets for a fixed $\lambda$. It is clear that the trace value of DisCluster increases in each iteration but is bounded from above by that of DisKmeans. We observe a similar trend on the other four data sets and the results are omitted. This is consistent with our analysis in Section 3 that both algorithms optimize the same objective function, and DisKmeans is a direct approach to the trace maximization without the iterative process.

Clustering evaluation: Table 2 presents the accuracy (ACC) and normalized mutual information (NMI) results of various algorithms on all eight data sets. In the table, DisKmeans (or DisCluster) with "max" and "ave" stands for the maximal and average performance achieved by DisKmeans and DisCluster using $\lambda$ from a wide range between 10^-6 and 10^6. We can observe that DisKmeans$_\lambda$ is competitive with the other algorithms. It is clear that the average performance of DisKmeans$_\lambda$ is robust against different initial values of $\lambda$. We can also observe that the average performance of DisKmeans and DisCluster is quite similar, while DisCluster is less sensitive to the value of $\lambda$.

7 Conclusion

In this paper, we analyze the discriminative clustering (DisCluster) framework, which integrates subspace selection and clustering. We show that the iterative subspace selection and clustering in DisCluster is equivalent to kernel K-means with a specific kernel Gram matrix. We then propose the DisKmeans algorithm for simultaneous LDA subspace selection and clustering, as well as an automatic parameter tuning procedure. The connection between DisKmeans and several other clustering algorithms is also studied.
The presented analysis and algorithms are verified through experiments on a collection of benchmark data sets. We present the nonlinear extension of DisKmeans in Section 5. Our preliminary studies have shown the effectiveness of Kernel DisKmeans in learning the kernel Gram matrix. However, the SDP formulation is limited to small-sized problems. We plan to explore efficient optimization techniques for this problem. Partial label information may be incorporated into the proposed formulations, which leads to semi-supervised clustering [3]. We plan to examine various semi-supervised learning techniques within the proposed framework and their effectiveness for clustering from both labeled and unlabeled data.

Table 2: Accuracy (ACC) and Normalized Mutual Information (NMI) results on 8 data sets. "max" and "ave" stand for the maximal and average performance achieved by DisKmeans and DisCluster using $\lambda$ from a wide range of values between 10^-6 and 10^6. We present the result of DisKmeans$_\lambda$ with different initial $\lambda$ values. LLE stands for Locally Linear Embedding and LEI for Laplacian Eigenmap. "AVE" stands for the mean of ACC or NMI over the 8 data sets for each algorithm.

ACC
                DisKmeans         DisCluster        DisKmeans_lambda (initial lambda)       LLE      LEI
  Data Sets     max      ave      max      ave      10^-2    10^-1    10^0     10^1
  banding       0.771    0.768    0.771    0.767    0.771    0.771    0.771    0.771       0.648    0.764
  soybean       0.641    0.634    0.633    0.632    0.639    0.639    0.638    0.637       0.630    0.649
  segment       0.687    0.664    0.676    0.672    0.664    0.659    0.671    0.680       0.594    0.663
  pendigits     0.699    0.690    0.696    0.690    0.700    0.696    0.696    0.697       0.599    0.697
  satimage      0.701    0.651    0.654    0.642    0.696    0.712    0.696    0.683       0.627    0.663
  leukemia      0.775    0.763    0.738    0.738    0.738    0.753    0.738    0.738       0.714    0.686
  ORL           0.744    0.738    0.739    0.738    0.749    0.743    0.748    0.748       0.733    0.317
  USPS          0.712    0.628    0.692    0.683    0.684    0.702    0.680    0.684       0.631    0.700
  AVE           0.716    0.692    0.700    0.695    0.705    0.709    0.705    0.705       0.647    0.642

NMI
  banding       0.225    0.221    0.225    0.219    0.225    0.225    0.225    0.225       0.093    0.213
  soybean       0.707    0.701    0.698    0.696    0.706    0.707    0.704    0.704       0.691    0.709
  segment       0.632    0.612    0.615    0.608    0.629    0.625    0.628    0.632       0.539    0.618
  pendigits     0.669    0.656    0.660    0.654    0.661    0.658    0.658    0.660       0.577    0.645
  satimage      0.593    0.537    0.551    0.541    0.597    0.608    0.596    0.586       0.493    0.548
  leukemia      0.218    0.199    0.163    0.163    0.163    0.185    0.163    0.163       0.140    0.043
  ORL           0.794    0.789    0.789    0.788    0.800    0.795    0.801    0.800       0.784    0.327
  USPS          0.647    0.544    0.629    0.613    0.612    0.637    0.609    0.612       0.569    0.640
  AVE           0.561    0.532    0.541    0.535    0.549    0.555    0.548    0.548       0.486    0.468

Acknowledgments

This research is sponsored by the National Science Foundation Grant IIS-0612069.

References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. In NIPS, 2003.
[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
[3] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 2006.
[4] I. S. Dhillon, Y. Guan, and B. Kulis. A unified view of kernel k-means, spectral clustering and graph partitioning. Technical report, Department of Computer Sciences, University of Texas at Austin, 2005.
[5] C. Ding and T. Li. Adaptive dimension reduction using discriminant analysis and k-means clustering. In ICML, 2007.
[6] J. H. Friedman. Regularized discriminant analysis. JASA, 84(405):165-175, 1989.
[7] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press.
[8] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins Univ. Press, 1996.
[9] I. T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002.
[10] F. De la Torre Frade and T. Kanade. Discriminative cluster analysis. In ICML, pages 241-248, 2006.
[11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27-72, 2004.
[12] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[13] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[14] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
[15] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38:49-95, 1996.
[16] J. Ye, Z. Zhao, and H. Liu. Adaptive distance metric learning for clustering. In CVPR, 2007.