Efficient MultiClass Maximum Margin Clustering

Bin Zhao  zhaobinhere@hotmail.com
Fei Wang  feiwang03@mails.tsinghua.edu.cn
Changshui Zhang  zcs@mail.tsinghua.edu.cn
State Key Laboratory of Intelligent Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China

Abstract

This paper presents a cutting plane algorithm for multiclass maximum margin clustering (MMC). The proposed algorithm constructs a nested sequence of successively tighter relaxations of the original MMC problem, and each optimization problem in this sequence can be solved efficiently using the constrained concave-convex procedure (CCCP). Experimental evaluations on several real world datasets show that our algorithm converges much faster than existing MMC methods with guaranteed accuracy, and can thus handle much larger datasets efficiently.

1. Introduction

Clustering (Duda et al., 2001; Shi & Malik, 2000; Ding et al., 2001) aims at dividing data into groups of similar objects, i.e. clusters. Recently, motivated by the success of large margin methods in supervised learning, Xu et al. (2004) proposed maximum margin clustering (MMC), which borrows ideas from support vector machine theory and seeks the maximum margin hyperplane that separates the data from different classes in an unsupervised way. Technically, MMC finds a way to label the samples by running an SVM implicitly, such that the resulting SVM margin is maximized over all possible labelings (Xu et al., 2004). However, unlike supervised large margin methods, which are usually formulated as convex optimization problems, maximum margin clustering is a non-convex integer optimization problem, which is much more difficult to solve.

Several attempts have been made to solve the maximum margin clustering problem in polynomial time. (Xu et al., 2004) and (Valizadegan & Jin, 2007) made several relaxations to the original MMC problem and reformulated it as a semi-definite programming (SDP) problem. However, even with recent advances in interior point methods, solving SDPs is still computationally very expensive; consequently, these algorithms can only handle very small datasets containing several hundred samples. More recently, Zhang et al. (2007) utilized alternating optimization techniques to solve the MMC problem, in which the maximum margin clustering result is obtained by solving a series of SVM training problems. However, there is no guarantee on how fast this procedure converges, and the algorithm is still time demanding on large scale datasets. Moreover, the methods described above can only handle binary clustering problems (Zhao et al., 2008), and there are significant complications in deriving an effective maximum margin clustering approach for the multiclass scenario. Therefore, how to solve the multiclass MMC problem efficiently enough to cluster large scale datasets is a very challenging research topic. In this paper, we propose a cutting plane multiclass maximum margin clustering algorithm, CPM3C.
Specifically, the algorithm constructs a nested sequence of successively tighter relaxations of the original multiclass MMC problem, and each optimization problem in this sequence can be solved efficiently using the constrained concave-convex procedure (CCCP). Moreover, we show that the computational time of CPM3C scales roughly linearly with the dataset size. Our experimental evaluations on several real world datasets show that CPM3C performs better than existing MMC methods, both in efficiency and accuracy. It should be noted that Xu and Schuurmans (2005) proposed a multiclass extension of MMC; however, their algorithm has a time complexity of O(n^7), which renders it impractical for real world datasets.

The rest of this paper is organized as follows. Section 2 presents the CPM3C algorithm in detail, and the time complexity analysis of CPM3C is given in section 3. Section 4 presents experimental results on several real world datasets, followed by conclusions in section 5.

2. Cutting Plane Multiclass Maximum Margin Clustering

We formally present the cutting plane multiclass maximum margin clustering (CPM3C) algorithm in this section.

2.1. Multiclass Maximum Margin Clustering

Maximum margin clustering (MMC) extends the theory of the support vector machine (SVM) to the unsupervised scenario. Specifically, given a point set X = {x_1, ..., x_n} and labels y = (y_1, ..., y_n) ∈ {1, ..., k}^n, the multiclass SVM defines a weight vector w_p for each class p ∈ {1, ..., k} and classifies a sample x by y = argmax_{y ∈ {1,...,k}} w_y^T x, with the weight vectors obtained as follows (Crammer & Singer, 2001):

  \min_{w_1,\ldots,w_k,\,\xi}\quad \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \frac{1}{n}\sum_{i=1}^{n}\xi_i    (1)
  s.t.\quad w_{y_i}^T x_i + \delta_{y_i,r} - w_r^T x_i \ge 1 - \xi_i, \quad i=1,\ldots,n,\ r=1,\ldots,k

where δ_{y_i,r} equals 1 if y_i = r and 0 otherwise. (Although we focus on the multiclass SVM formulation of Crammer & Singer (2001), our method can be directly applied to other multiclass SVM formulations.) Here the data samples in X are mapped into a high (possibly infinite) dimensional feature space, and by using the kernel trick this mapping can be done implicitly. In cases where the kernel trick cannot be applied, it is possible to compute the coordinates of each sample in the kernel PCA basis (Schölkopf et al., 1999) according to the kernel K. Therefore, throughout the rest of this paper, we use x_i to denote the sample mapped by the kernel function.

Instead of finding a large margin classifier given labels on the data, as in SVM, MMC seeks a labeling that results in a large margin classifier (Xu et al., 2004). That is to say, if we subsequently run an SVM with the labeling obtained from MMC, the margin is maximal among all possible labelings. Multiclass MMC can therefore be formulated as follows:

  \min_{w_1,\ldots,w_k,\,\xi,\,y}\quad \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \frac{1}{n}\sum_{i=1}^{n}\xi_i    (2)
  s.t.\quad w_{y_i}^T x_i + \delta_{y_i,r} - w_r^T x_i \ge 1 - \xi_i, \quad i=1,\ldots,n,\ r=1,\ldots,k

where \sum_i \xi_i is divided by n to better capture how the regularization scales with the dataset size. Different from SVM, where the class labels are given and the only variables are (w_1, ..., w_k), MMC seeks not only the optimal weight vectors but also the optimal labeling vector y.
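To make the decision rule and the role of the slack variables concrete, here is a minimal NumPy sketch (ours, not from the paper); the function names and the dense-matrix representation are our own choices, and W stands for the stacked weight vectors (w_1, ..., w_k).

```python
import numpy as np

def assign_labels(W, X):
    """Decision rule of Section 2.1: y_i = argmax_p w_p^T x_i.
    W is a (k, d) array of class weight vectors, X an (n, d) data matrix."""
    return (X @ W.T).argmax(axis=1)

def per_sample_slack(W, X, y):
    """Smallest feasible slack xi_i in problems (1)/(2) for a fixed labeling y:
    xi_i = max_r [ 1 - (w_{y_i}^T x_i + delta_{y_i,r} - w_r^T x_i) ]."""
    n, k = X.shape[0], W.shape[0]
    scores = X @ W.T                           # (n, k) matrix of w_p^T x_i
    delta = np.zeros((n, k))
    delta[np.arange(n), y] = 1.0               # Kronecker delta delta_{y_i, r}
    margins = 1.0 - (scores[np.arange(n), y][:, None] + delta - scores)
    return margins.max(axis=1)                 # always >= 0 (the r = y_i term is 0)
```

Problem (2) additionally minimizes over the labeling y; Theorem 1 below shows that the optimal labeling is exactly this argmax rule.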
2.2. Cutting Plane Algorithm

In this section, we reformulate problem (2) to reduce the number of variables involved. Specifically,

Theorem 1  Problem (2) is equivalent to

  \min_{w_1,\ldots,w_k,\,\xi}\quad \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \frac{1}{n}\sum_{i=1}^{n}\xi_i    (3)
  s.t.\quad \sum_{p=1}^{k}\Big[\prod_{q=1,q\neq p}^{k} I(w_p^T x_i > w_q^T x_i)\Big] w_p^T x_i + \prod_{q=1,q\neq r}^{k} I(w_r^T x_i > w_q^T x_i) - w_r^T x_i \ge 1 - \xi_i,
        i=1,\ldots,n,\ r=1,\ldots,k

where I(·) is the indicator function, and the label for sample x_i is determined as

  y_i = \sum_{p=1}^{k} p \prod_{q=1,q\neq p}^{k} I(w_p^T x_i > w_q^T x_i).    (4)

Proof. We show that for every (w_1, ..., w_k) the smallest feasible \sum_{i=1}^{n}\xi_i is identical for problem (2) and problem (3), and that the corresponding labeling vectors are the same.

For a given (w_1, ..., w_k), the ξ_i in problem (2) can be optimized individually. According to the constraints in problem (2),

  \xi_i \ge 1 - (w_{y_i}^T x_i + \delta_{y_i,r} - w_r^T x_i), \quad r=1,\ldots,k.    (5)

As the objective is to minimize \frac{1}{n}\sum_{i=1}^{n}\xi_i, the optimal value for ξ_i is

  \xi_i^{(1)} = \min_{y_i\in\{1,\ldots,k\}} \max_{r=1,\ldots,k} \{1 - (w_{y_i}^T x_i + \delta_{y_i,r} - w_r^T x_i)\},    (6)

and we denote the corresponding class label by y_i^{(1)}. Without loss of generality, assume the ordering

  w_{i_1}^T x_i \ge w_{i_2}^T x_i \ge \cdots \ge w_{i_k}^T x_i,    (7)

where (i_1, i_2, ..., i_k) is a permutation of (1, 2, ..., k). For y_i = i_1, max_{r=1,...,k} {1 - (w_{y_i}^T x_i + δ_{y_i,r} - w_r^T x_i)} ≤ 1, while for y_i ≠ i_1 the same maximum is ≥ 1. Therefore y_i^{(1)} = i_1 and

  \xi_i^{(1)} = \max_{r=1,\ldots,k}\{1 - (w_{i_1}^T x_i + \delta_{i_1,r} - w_r^T x_i)\} = \max\{0,\ 1 - (w_{i_1}^T x_i - w_{i_2}^T x_i)\}.    (8)

Similarly, for problem (3) the optimal value for ξ_i is

  \xi_i^{(2)} = \max_{r=1,\ldots,k}\Big\{1 - \Big(\sum_{p=1}^{k}\Big[\prod_{q\neq p} I(w_p^T x_i > w_q^T x_i)\Big] w_p^T x_i + \prod_{q\neq r} I(w_r^T x_i > w_q^T x_i) - w_r^T x_i\Big)\Big\}
             = \max\{0,\ 1 - (w_{i_1}^T x_i - w_{i_2}^T x_i)\},    (9)

and the class label can be calculated as

  y_i^{(2)} = \sum_{p=1}^{k} p \prod_{q=1,q\neq p}^{k} I(w_p^T x_i > w_q^T x_i) = i_1.    (10)

Therefore, the objective functions of the two optimization problems are equal for any (w_1, ..., w_k) at the same optimal ξ_i, and consequently so are their optima. Moreover, the corresponding labeling vectors y are the same. Hence, problem (2) is equivalent to problem (3). □

By reformulating problem (2) as problem (3), the number of variables involved is reduced by n, but there are still n slack variables ξ_i in problem (3). Define e_p as the k × 1 vector whose p-th element is 1 and whose other elements are 0, e_0 as the k × 1 zero vector, and e as the all-one vector. To further reduce the number of variables in the optimization problem, we have the following theorem.

Theorem 2  Problem (3) can be equivalently formulated as problem (11), with ξ = \frac{1}{n}\sum_{i=1}^{n}\xi_i:

  \min_{w_1,\ldots,w_k,\,\xi\ge 0}\quad \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \xi    (11)
  s.t.\quad \forall\, c_i\in\{e_0,e_1,\ldots,e_k\},\ i=1,\ldots,n:
      \frac{1}{n}\sum_{i=1}^{n}\Big[c_i^T e \sum_{p=1}^{k} w_p^T x_i z_{ip} + \sum_{p=1}^{k} c_{ip}(z_{ip} - w_p^T x_i)\Big] \ge \frac{1}{n}\sum_{i=1}^{n} c_i^T e - \xi

where z_{ip} = \prod_{q=1,q\neq p}^{k} I(w_p^T x_i > w_q^T x_i) for i = 1,...,n and p = 1,...,k, and each constraint c is represented as a k × n matrix c = (c_1, ..., c_n).

Proof. To justify the above theorem, we show that problem (3) and problem (11) have the same objective value and an equivalent set of constraints. Specifically, we prove that for every (w_1, ..., w_k), the smallest feasible ξ and \sum_i \xi_i are related by ξ = \frac{1}{n}\sum_{i=1}^{n}\xi_i. This means that, with (w_1, ..., w_k) fixed, (w_1, ..., w_k, ξ) and (w_1, ..., w_k, ξ_i) are optimal solutions to problem (11) and problem (3) respectively, and they result in the same objective function value.

For any given (w_1, ..., w_k), the ξ_i in problem (3) can be optimized individually and the optimum is achieved at

  \xi_i^{(2)} = \max\{0,\ 1 - (w_{i_1}^T x_i - w_{i_2}^T x_i)\},    (12)

where we assume the ordering in (7) holds. Similarly, for problem (11) the optimal ξ is

  \xi^{(3)} = \max_{c_1,\ldots,c_n\in\{e_0,\ldots,e_k\}}\Big\{\frac{1}{n}\sum_{i=1}^{n} c_i^T e - \frac{1}{n}\sum_{i=1}^{n}\Big[c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip} + \sum_{p=1}^{k} c_{ip}(z_{ip} - w_p^T x_i)\Big]\Big\}.    (13)

Since the c_i are independent in Eq. (13), they can be optimized individually. Therefore,

  \xi^{(3)} = \frac{1}{n}\sum_{i=1}^{n}\max_{c_i}\Big\{c_i^T e - c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip} - \sum_{p=1}^{k} c_{ip}(z_{ip} - w_p^T x_i)\Big\}
            = \frac{1}{n}\sum_{i=1}^{n}\max\Big\{0,\ \max_{p=1,\ldots,k}\big[1 - (w_{i_1}^T x_i + \delta_{i_1,p} - w_p^T x_i)\big]\Big\}
            = \frac{1}{n}\sum_{i=1}^{n}\max\{0,\ 1 - (w_{i_1}^T x_i - w_{i_2}^T x_i)\}
            = \frac{1}{n}\sum_{i=1}^{n}\xi_i^{(2)}.

Hence, for any (w_1, ..., w_k), the objective functions of problem (3) and problem (11) have the same value at the optimal ξ and ξ_i. Therefore, the optima of the two optimization problems are the same. □
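For readers who prefer code to vector notation, the following sketch (ours; it assumes NumPy and encodes a constraint c = (c_1, ..., c_n) as an n × k 0/1 matrix whose rows are e_0 or some e_p) evaluates the smallest ξ for which a single constraint of problem (11) holds.

```python
import numpy as np

def indicator_z(W, X):
    """z_ip = prod_{q != p} I(w_p^T x_i > w_q^T x_i): a one-hot row marking the
    strictly best-scoring class for each sample (all zeros in case of a tie)."""
    S = X @ W.T                                       # (n, k) scores w_p^T x_i
    n, k = S.shape
    top = S.argmax(axis=1)
    strict = (S[np.arange(n), top][:, None] > S).sum(axis=1) == k - 1
    Z = np.zeros_like(S)
    Z[np.arange(n)[strict], top[strict]] = 1.0
    return Z

def slack_for_constraint(W, X, C):
    """Smallest xi >= 0 such that the constraint encoded by C holds in problem (11):
    (1/n) sum_i [ c_i^T e - c_i^T e sum_p w_p^T x_i z_ip - sum_p c_ip (z_ip - w_p^T x_i) ]."""
    S = X @ W.T
    Z = indicator_z(W, X)
    ce = C.sum(axis=1)                                # c_i^T e, either 0 or 1
    best = (S * Z).sum(axis=1)                        # sum_p z_ip w_p^T x_i
    per_sample = ce - ce * best - (C * (Z - S)).sum(axis=1)
    return max(0.0, per_sample.mean())
```

Checking all (k+1)^n choices of C this way is of course intractable; the cutting plane algorithm described next only ever touches a small number of them.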
Putting Theorems 1 and 2 together, we can therefore solve problem (11) instead and obtain the same maximum margin clustering solution, with the number of variables reduced by 2n - 1. Although the number of variables in problem (11) is greatly reduced, the number of constraints increases from nk to (k+1)^n. The algorithm we propose in this paper finds a small subset of constraints from the whole constraint set of problem (11) that ensures a sufficiently accurate solution. Specifically, we employ an adaptation of the cutting plane algorithm (Kelley, 1960) to solve problem (11), constructing a nested sequence of successively tighter relaxations of it. Moreover, we can prove theoretically (see section 3) that we can always find a polynomially sized subset of constraints with which the solution of the relaxed problem fulfills all constraints of problem (11) up to a precision of ε. That is to say, the remaining exponential number of constraints are guaranteed to be violated by no more than ε, without the need to add them explicitly to the optimization problem (Tsochantaridis et al., 2005).

Specifically, the CPM3C algorithm keeps a subset Ω of working constraints and computes the optimal solution to problem (11) subject to the constraints in Ω. The algorithm then adds the most violated constraint of problem (11) into Ω. In this way, a successively strengthening approximation of the original MMC problem is constructed by a cutting plane that cuts off the current optimal solution from the feasible set (Kelley, 1960). The algorithm stops when no constraint in (11) is violated by more than ε. Here, the feasibility of a constraint is measured by the corresponding value of ξ; therefore, the most violated constraint is the one that results in the largest ξ. Since each constraint in problem (11) is represented by a k × n matrix c, we have the following theorem.

Theorem 3  Define p^* = argmax_p (w_p^T x_i) and r^* = argmax_{r ≠ p^*} (w_r^T x_i) for i = 1,...,n. The most violated constraint can be calculated as follows:

  c_i = e_{r^*}  if  w_{p^*}^T x_i - w_{r^*}^T x_i < 1,  and  c_i = e_0  otherwise,  for i = 1,...,n.    (14)

Proof. The most violated constraint is the one that results in the largest ξ. As each c_i in a constraint is independent, in order to fulfill all constraints in problem (11) the value of ξ must be

  \xi = \frac{1}{n}\sum_{i=1}^{n}\max_{c_i}\Big\{c_i^T e - c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip} - \sum_{p=1}^{k} c_{ip}(z_{ip} - w_p^T x_i)\Big\}
      = \frac{1}{n}\sum_{i=1}^{n}\max_{c_i}\big\{c_i^T\big[e - w_{p^*}^T x_i\, e - z_i + t_i\big]\big\},

where t_i = (w_1^T x_i, ..., w_k^T x_i)^T and z_i = (z_{i1}, ..., z_{ik})^T. Since c_i ∈ {e_0, ..., e_k}, c_i selects the largest element of the vector e - w_{p^*}^T x_i e - z_i + t_i, which equals 1 - (w_{p^*}^T x_i - w_{r^*}^T x_i), or 0 (attained by c_i = e_0) when that quantity is negative. Therefore, the most violated constraint c, i.e. the one that results in the largest ξ, can be calculated as in Eq. (14). □
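Eq. (14) is cheap to evaluate: it only needs the two largest class scores per sample. A possible NumPy sketch (ours, reusing the (n, k) constraint encoding from the previous snippet):

```python
import numpy as np

def most_violated_constraint(W, X):
    """Constraint selection of Eq. (14): c_i = e_{r*} if the runner-up class r*
    is within margin 1 of the best class p*, and c_i = e_0 otherwise."""
    S = X @ W.T                                   # (n, k) scores w_p^T x_i
    n, k = S.shape
    order = np.argsort(-S, axis=1)                # classes sorted by score, per sample
    p_star, r_star = order[:, 0], order[:, 1]
    margin = S[np.arange(n), p_star] - S[np.arange(n), r_star]
    C = np.zeros((n, k))
    hit = margin < 1.0
    C[np.arange(n)[hit], r_star[hit]] = 1.0
    return C
```

The selected constraint is added to the working set Ω; feeding it to slack_for_constraint gives the largest attainable ξ, which is exactly the quantity the stopping test of Algorithm 1 below compares against ξ + ε.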
The CPM3C algorithm thus iteratively selects the most violated constraint under the current weight vectors and adds it into the working constraint set, until no violation of a constraint is detected. Moreover, if a point (w_1, ..., w_k, ξ) fulfills all constraints up to precision ε, i.e.

  \forall\, c_i\in\{e_0,e_1,\ldots,e_k\},\ i=1,\ldots,n:
  \frac{1}{n}\sum_{i=1}^{n}\Big[c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip} + \sum_{p=1}^{k} c_{ip}(z_{ip} - w_p^T x_i)\Big] \ge \frac{1}{n}\sum_{i=1}^{n} c_i^T e - \xi - \varepsilon,    (15)

then the point (w_1, ..., w_k, ξ + ε) is feasible for problem (11). Furthermore, since the objective function of problem (11) contains a single slack variable ξ that measures the clustering loss, we can simply take as stopping criterion that all samples satisfy inequality (15); the approximation accuracy ε of this approximate solution is then directly related to the training loss.

2.3. Enforcing the Class Balance Constraint

In 2-class maximum margin clustering, a trivially "optimal" solution is to assign all patterns to the same class, which makes the resulting margin infinite (Xu et al., 2004). Similarly, in the multiclass scenario a large margin can always be achieved by eliminating classes (Xu & Schuurmans, 2005). Therefore, we add the following class balance constraints to avoid such trivially "optimal" solutions:

  -\ell \le \sum_{i=1}^{n} w_p^T x_i - \sum_{i=1}^{n} w_q^T x_i \le \ell, \quad p, q = 1,\ldots,k,    (16)

where ℓ ≥ 0 controls the class imbalance. Therefore, multiclass maximum margin clustering with working constraint set Ω can be formulated as follows:

  \min_{w_1,\ldots,w_k,\,\xi\ge 0}\quad \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \xi    (17)
  s.t.\quad \forall\, [c_1,\ldots,c_n]\in\Omega:\ \frac{1}{n}\sum_{i=1}^{n}\Big[c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip} + \sum_{p=1}^{k} c_{ip}(z_{ip} - w_p^T x_i)\Big] \ge \frac{1}{n}\sum_{i=1}^{n} c_i^T e - \xi
        -\ell \le \sum_{i=1}^{n} w_p^T x_i - \sum_{i=1}^{n} w_q^T x_i \le \ell, \quad p, q = 1,\ldots,k.

Before getting into the details of solving problem (17), we first present the CPM3C approach in Algorithm 1.

Algorithm 1  Cutting Plane Multiclass MMC
  Initialize Ω = ∅.
  repeat
    Solve problem (17) for (w_1, ..., w_k, ξ) under the current working constraint set Ω, and select the most violated constraint c with Eq. (14).
    Set Ω = Ω ∪ {c}.
  until (w_1, ..., w_k, ξ) satisfies c up to precision ε.

2.4. Optimization via the CCCP

In each iteration of the CPM3C algorithm, we need to solve problem (17) to obtain the optimal separating hyperplanes under the current working constraint set Ω. Although the objective function in (17) is convex, the constraints are not, and this makes problem (17) difficult to solve. Fortunately, the constrained concave-convex procedure (CCCP) is designed exactly for optimization problems with a concave-convex objective function under concave-convex constraints (Smola et al., 2005). In the following, we show how to utilize the CCCP to solve problem (17).

The objective function in (17) and the second (class balance) constraint are convex. Moreover, the first constraint is, although non-convex, a difference of two convex functions. Hence, we can solve (17) with the CCCP. Notice that while \frac{1}{n}\sum_{i=1}^{n}\big[c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip} + \sum_{p=1}^{k} c_{ip} z_{ip}\big] is convex, it is a non-smooth function of (w_1, ..., w_k). To use the CCCP, we need to calculate its subgradients:
  \frac{\partial}{\partial w_r}\ \frac{1}{n}\sum_{i=1}^{n}\Big[c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip} + \sum_{p=1}^{k} c_{ip} z_{ip}\Big]\Big|_{w=w^{(t)}} = \frac{1}{n}\sum_{i=1}^{n} c_i^T e\, z_{ir}^{(t)} x_i, \quad r=1,\ldots,k.    (18)

Given an initial point (w_1^{(0)}, ..., w_k^{(0)}), the CCCP computes (w_1^{(t+1)}, ..., w_k^{(t+1)}) from (w_1^{(t)}, ..., w_k^{(t)}) by replacing \frac{1}{n}\sum_{i=1}^{n}\big[c_i^T e\sum_{p} w_p^T x_i z_{ip} + \sum_{p} c_{ip} z_{ip}\big] in the constraint with its first-order Taylor expansion at (w_1^{(t)}, ..., w_k^{(t)}), i.e.

  \frac{1}{n}\sum_{i=1}^{n}\Big[c_i^T e\sum_{p=1}^{k} w_p^{(t)T} x_i z_{ip}^{(t)} + \sum_{p=1}^{k} c_{ip} z_{ip}^{(t)}\Big] + \frac{1}{n}\sum_{i=1}^{n} c_i^T e\sum_{p=1}^{k}(w_p - w_p^{(t)})^T x_i z_{ip}^{(t)}
  = \frac{1}{n}\sum_{i=1}^{n}\Big[c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip}^{(t)} + \sum_{p=1}^{k} c_{ip} z_{ip}^{(t)}\Big].    (19)

By substituting the above first-order Taylor expansion into problem (17), we obtain the following quadratic programming (QP) problem:

  \min_{w_1,\ldots,w_k,\,\xi\ge 0}\quad \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \xi    (20)
  s.t.\quad \forall\,[c_1,\ldots,c_n]\in\Omega:
      \frac{1}{n}\sum_{i=1}^{n} c_i^T e - \xi + \frac{1}{n}\sum_{i=1}^{n}\sum_{p=1}^{k} c_{ip} w_p^T x_i - \frac{1}{n}\sum_{i=1}^{n}\Big[c_i^T e\sum_{p=1}^{k} w_p^T x_i z_{ip}^{(t)} + \sum_{p=1}^{k} c_{ip} z_{ip}^{(t)}\Big] \le 0
        -\ell \le \sum_{i=1}^{n} w_p^T x_i - \sum_{i=1}^{n} w_q^T x_i \le \ell, \quad p, q = 1,\ldots,k.

Moreover, the dual problem of (20) is a QP with |Ω| + 2 variables and can be solved in polynomial time, where |Ω| denotes the total number of constraints in Ω.

Putting everything together, according to the formulation of the CCCP (Smola et al., 2005), we solve problem (17) with the approach presented in Algorithm 2. The stopping criterion of the CCCP loop is that the relative difference between the objective values of two successive iterations drops below a threshold, which we set to 0.01; that is, we stop once the current objective value is larger than (1 - 0.01) times the objective value of the previous iteration, which is a sensible test since the CCCP decreases the objective function monotonically.

Algorithm 2  Solve problem (17) with the CCCP
  Initialize w_p^{(0)} for p = 1, ..., k; set t = 0.
  repeat
    Find (w_1^{(t+1)}, ..., w_k^{(t+1)}) as the solution to the quadratic programming problem (20).
    Set t = t + 1.
  until the stopping criterion is satisfied.

2.5. Theoretical Analysis

We provide the following theorem regarding the correctness of the CPM3C algorithm.

Theorem 4  For any dataset X = (x_1, ..., x_n) and any ε > 0, if (w_1^*, ..., w_k^*, ξ^*) is the optimal solution to problem (11) with the class balance constraint, then our CPM3C algorithm returns a point (w_1, ..., w_k, ξ) for which (w_1, ..., w_k, ξ + ε) is feasible in problem (11) and satisfies the class balance constraint, and whose objective value is better (i.e., not larger) than the one corresponding to (w_1^*, ..., w_k^*, ξ^*).

Based on the above theorem, ε indicates how close one wants to be to the error rate of the best separating hyperplanes, and it can thus be used as the stopping criterion (Joachims, 2006).
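To summarize sections 2.2-2.4, the following is a schematic sketch (ours, not the authors' code) of how the outer cutting-plane loop (Algorithm 1) wraps the CCCP inner loop (Algorithm 2). It reuses the helper functions sketched earlier; solve_qp stands for any off-the-shelf convex QP solver applied to problem (20) and is deliberately left abstract, and the initialization and tolerances are placeholders.

```python
import numpy as np

def cpm3c(X, k, eps=0.01, ell=1.0, cccp_tol=0.01, solve_qp=None, max_outer=100):
    """Schematic CPM3C driver. `solve_qp(X, omega, Z, ell)` must solve the convex
    QP (20) for the frozen indicators Z and working set omega, returning (W, xi)."""
    n, d = X.shape
    W = 0.01 * np.random.randn(k, d)       # arbitrary small initial weight vectors
    omega = []                             # working constraint set: list of (n, k) matrices

    for _ in range(max_outer):
        # Algorithm 2: CCCP on problem (17) under the current working set.
        prev_obj, xi = None, 0.0
        while True:
            Z = indicator_z(W, X)          # freeze z_ip at the current iterate (Eq. 19)
            W, xi = solve_qp(X, omega, Z, ell)
            obj = 0.5 * (W ** 2).sum() + xi
            if prev_obj is not None and prev_obj - obj <= cccp_tol * prev_obj:
                break                      # relative decrease below the tolerance
            prev_obj = obj

        # Algorithm 1: add the most violated constraint, stop at precision eps.
        C = most_violated_constraint(W, X)
        if slack_for_constraint(W, X, C) <= xi + eps:
            break                          # every constraint satisfied up to eps
        omega.append(C)

    return W, assign_labels(W, X)
```

Each QP here stays small: as noted after problem (20), its dual has |Ω| + 2 variables, so the per-iteration work is dominated by computing scores and constraints over the data.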
3. Time Complexity Analysis

In this section, we analyze the time complexity of CPM3C. For the high-dimensional (say, d-dimensional) sparse data commonly encountered in applications such as text mining and bioinformatics, we assume each sample has only s ≪ d non-zero features, i.e., s measures the sparsity; for non-sparse data, all our theorems still hold by simply setting s = d.

Theorem 5  Each iteration of CPM3C takes time O(snk) for a constant working set size |Ω|.

Moreover, for the binary clustering scenario, we have the following theorem.

Theorem 6  For any ε > 0, ℓ > 0, and any dataset X = {x_1, ..., x_n} with samples belonging to two different classes, the CPM3C algorithm terminates after adding at most R/ε² constraints, where R is a constant independent of n and s.

It is true that the number of constraints can potentially explode for small values of ε; however, experience with CPM3C shows that relatively large values of ε are sufficient without loss of clustering accuracy.

Since the number of iterations of CPM3C (with k = 2) is bounded by R/ε², a constant independent of n and s, and each iteration of the algorithm takes time O(snk) (O(sn) in the binary clustering scenario), we arrive at the following theorem.

Theorem 7  For any dataset X = {x_1, ..., x_n} with n samples belonging to 2 classes and sparsity s, and any fixed values of ε > 0 and ℓ > 0, the CPM3C algorithm takes time O(sn) to converge.

For the multiclass scenario, the experimental results in section 4 likewise demonstrate that the computational time of CPM3C scales roughly linearly with the dataset size n.

4. Experiments

In this section, we validate the accuracy and efficiency of the CPM3C algorithm on several real world datasets. Moreover, we analyze the scaling behavior of CPM3C with the dataset size and the sensitivity of CPM3C to ε, both in accuracy and efficiency. All experiments are performed with MATLAB 7.0 on a 1.66GHz Intel Core 2 Duo PC running Windows XP with 1.5GB main memory.
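Two evaluation measures are used throughout this section: the majority-class clustering accuracy described in Section 4.2 and the Rand index (Rand, 1971). For concreteness, a small NumPy sketch of both (ours; it assumes non-negative integer class labels and is written for clarity rather than memory efficiency, since the pairwise version is O(n²)):

```python
import numpy as np

def clustering_accuracy(true_labels, cluster_ids):
    """Label every cluster with the majority true class of its members,
    then report the fraction of correctly classified samples."""
    y = np.asarray(true_labels)
    c = np.asarray(cluster_ids)
    correct = sum(np.bincount(y[c == cid]).max() for cid in np.unique(c))
    return correct / len(y)

def rand_index(true_labels, cluster_ids):
    """Rand index: fraction of sample pairs on which the true classes and the
    clustering agree (same/same or different/different)."""
    y = np.asarray(true_labels)
    c = np.asarray(cluster_ids)
    same_y = y[:, None] == y[None, :]
    same_c = c[:, None] == c[None, :]
    iu = np.triu_indices(len(y), k=1)            # each unordered pair once
    return (same_y[iu] == same_c[iu]).mean()
```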
4.1. Datasets

We use eight datasets in our experiments, selected to cover a wide range of properties: Digits, Letter and Satellite from the UCI repository, MNIST (http://yann.lecun.com/exdb/mnist/), 20 newsgroups (http://people.csail.mit.edu/jrennie/20Newsgroups/), WebKB (http://www.cs.cmu.edu/WebKB/), Cora (McCallum et al., 2000) and RCVI (Lewis et al., 2004). In order to compare CPM3C with other MMC algorithms that can only perform binary clustering, we choose the first two classes from Letter and Satellite. For the 20 newsgroups dataset, we choose the topic rec, which contains autos, motorcycles, baseball and hockey, from the version 20-news-18828. For WebKB, we select a subset consisting of about 6000 web pages from computer science departments of four schools (Cornell, Texas, Washington, and Wisconsin). For Cora, we select a subset containing the research papers of the subfields data structure (DS), hardware and architecture (HA), machine learning (ML), operating system (OS) and programming language (PL). For RCVI, we use the data samples with the highest four topic codes (CCAT, ECAT, GCAT and MCAT) in the "Topic Codes" hierarchy in the training set. Table 1 summarizes the datasets.

Table 1. Descriptions of the datasets.

  Data      Size (n)   Features (N)   Classes   Sparsity
  Letter    1555       16             2         98.9%
  UCIDig    1797       64             10        51.07%
  UCISat    2236       36             2         100%
  MNIST     70000      784            10        19.14%
  Cora-DS   751        6234           9         0.68%
  Cora-HA   400        3989           7         1.1%
  Cora-ML   1617       8329           7         0.58%
  Cora-OS   1246       6737           4         0.75%
  Cora-PL   1575       7949           9         0.56%
  WK-CL     827        4134           7         2.32%
  WK-TX     814        4029           7         1.97%
  WK-WT     1166       4165           7         2.05%
  WK-WC     1210       4189           7         2.16%
  20-news   3970       8014           4         0.75%
  RCVI      21251      47152          4         0.16%

4.2. Comparisons and Clustering Results

Besides our CPM3C algorithm, we also implement several other competitive algorithms and report their results for comparison. Specifically, we use k-means (KM) and Normalized Cut (NC) as baselines, and also compare with Maximum Margin Clustering (MMC) (Xu et al., 2004), Generalized Maximum Margin Clustering (GMC) (Valizadegan & Jin, 2007) and Iterative Support Vector Regression (SVR) (Zhang et al., 2007), which all aim at clustering data with the maximum margin hyperplane. Technically, for k-means the cluster centers are initialized randomly. For NC, the implementation is the same as in (Shi & Malik, 2000), and the width of the Gaussian kernel is set by exhaustive search over the grid {0.1σ_0, 0.2σ_0, ..., σ_0}, where σ_0 is the range of distances between any two data points in the dataset. Moreover, for MMC and GMC, the implementations are the same as in (Xu et al., 2004; Xu & Schuurmans, 2005) and (Valizadegan & Jin, 2007), respectively. Furthermore, the implementation of SVR is downloaded from http://www.cse.ust.hk/twinsen; its initialization is based on k-means with randomly selected initial centers, and the width of the Gaussian kernel is set in the same way as for NC. In the experiments, we set the number of clusters equal to the true number of classes k for all clustering algorithms.

To assess clustering accuracy, we follow the strategy used in (Xu et al., 2004): we take a set of labeled data, remove the labels, run the clustering algorithms, label each of the resulting clusters with the majority class according to the original training labels, and finally measure the number of correct classifications made by each clustering. Moreover, we also calculate the Rand index (Rand, 1971) for each clustering result. The clustering accuracy and Rand index results are summarized in Table 2 and Table 3 respectively, where the results for k-means and iterative SVR are averaged over 50 independent runs and "-" means the corresponding algorithm cannot handle the dataset in reasonable time.

Since GMC and iterative SVR can only handle binary clustering problems, we also provide experiments on several 2-class problems: Letter, Satellite, autos vs. motorcycles (Text-1) and baseball vs. hockey (Text-2). Moreover, for the UCI-Digits and MNIST datasets, we enumerate all 45 possible class pairs and report the average clustering results. Furthermore, as the MMC and GMC algorithms can only handle datasets with no more than a few hundred samples, we perform experiments on UCI Digits and focus on those pairs (3 vs 8, 1 vs 7, 2 vs 7, 8 vs 9, 0689 and 1279) that are difficult to differentiate.

Table 2. Clustering accuracy (%) comparisons.

  Data      KM      NC      MMC     GMC     SVR     CPM3C
  Dig 3-8   94.68   65.00   90.00   94.40   96.64   96.92
  Dig 1-7   94.45   55.00   68.75   97.8    99.45   100.0
  Dig 2-7   96.91   66.00   98.75   99.50   100.0   100.0
  Dig 8-9   90.68   52.00   96.25   84.00   96.33   97.74
  Letter    82.06   76.80   -       -       92.80   94.47
  UCISat    95.93   95.79   -       -       96.82   98.48
  Text-1    50.53   93.79   -       -       96.82   95.00
  Text-2    50.38   91.35   -       -       93.99   96.28
  UCIDig    96.38   97.57   -       -       98.18   99.38
  MNIST     89.21   89.92   -       -       92.41   95.71
  Dig 0689  42.23   93.13   94.83   -       -       96.63
  Dig 1279  40.42   90.11   91.91   -       -       94.01
  Cora-DS   28.24   36.88   -       -       -       43.75
  Cora-HA   34.02   42.00   -       -       -       59.75
  Cora-ML   27.08   31.05   -       -       -       45.58
  Cora-OS   23.87   23.03   -       -       -       58.89
  Cora-PL   33.80   33.97   -       -       -       46.83
  WK-CL     55.71   61.43   -       -       -       71.95
  WK-TX     45.05   35.38   -       -       -       69.29
  WK-WT     53.52   32.85   -       -       -       77.96
  WK-WC     49.53   33.31   -       -       -       73.88
  20-news   35.27   41.89   -       -       -       70.63
  RCVI      27.05   -       -       -       -       61.97

Table 3. Rand index comparisons.

  Data      KM      NC      MMC     GMC     SVR     CPM3C
  Dig 3-8   0.904   0.545   0.823   0.899   0.940   0.945
  Dig 1-7   0.995   0.504   0.571   0.962   0.995   1.00
  Dig 2-7   0.940   0.550   0.978   0.994   1.00    1.00
  Dig 8-9   0.835   0.500   0.929   0.733   0.934   0.956
  Letter    0.706   0.644   -       -       0.867   0.897
  UCISat    0.922   0.919   -       -       0.939   0.971
  Text-1    0.500   0.884   -       -       0.939   0.905
  Text-2    0.500   0.842   -       -       0.887   0.929
  UCIDig    0.933   0.956   -       -       0.967   0.989
  MNIST     0.808   0.818   -       -       0.860   0.921
  Dig 0689  0.696   0.939   0.941   -       -       0.968
  Dig 1279  0.681   0.909   0.913   -       -       0.943
  Cora-DS   0.589   0.744   -       -       -       0.735
  Cora-HA   0.385   0.659   -       -       -       0.692
  Cora-ML   0.514   0.720   -       -       -       0.754
  Cora-OS   0.518   0.522   -       -       -       0.721
  Cora-PL   0.643   0.675   -       -       -       0.703
  WK-CL     0.603   0.602   -       -       -       0.728
  WK-TX     0.604   0.514   -       -       -       0.707
  WK-WT     0.616   0.581   -       -       -       0.747
  WK-WC     0.581   0.509   -       -       -       0.752
  20-news   0.581   0.496   -       -       -       0.782
  RCVI      0.471   -       -       -       -       0.698

From the tables we can clearly observe that our CPM3C algorithm beats the other competitive algorithms on almost all the datasets.
4.3. Speed of CPM3C

Table 4 compares the CPU-time of CPM3C with the other competitive algorithms. According to the table, CPM3C is at least 18 times faster than SVR and 200 times faster than GMC. As reported in (Valizadegan & Jin, 2007), GMC is about 100 times faster than MMC; hence, CPM3C is faster than MMC by about four orders of magnitude. Moreover, as the sample size increases, the CPU-time of CPM3C grows much more slowly than that of iterative SVR, which indicates that CPM3C scales much better with the sample size than SVR. Finally, CPM3C also runs much faster than conventional k-means, which is a very appealing result. As for the NCut method, since the calculation of the similarity matrix is very time consuming and usually takes several hours on the text datasets, we do not report its running time here.

Table 4. CPU-time (seconds) comparisons.

  Data      KM       GMC      SVR     CPM3C
  Dig 3-8   0.51     276.16   19.72   1.10
  Dig 1-7   0.54     289.53   20.49   0.95
  Dig 2-7   0.50     304.81   19.69   0.75
  Dig 8-9   0.49     277.26   19.41   0.85
  Letter    0.08     -        2133    0.87
  UCISat    0.19     -        6490    4.54
  Text-1    66.09    -        930.0   19.75
  Text-2    52.32    -        913.8   16.16
  Dig 0689  34.28    -        -       9.66
  Dig 1279  17.78    -        -       17.47
  Cora-DS   839.67   -        -       35.31
  Cora-HA   204.43   -        -       24.35
  Cora-ML   22781    -        -       69.04
  Cora-OS   47931    -        -       13.98
  Cora-PL   7791.4   -        -       165.0
  WK-CL     672.69   -        -       9.534
  WK-TX     766.77   -        -       10.53
  WK-WT     4135.2   -        -       10.67
  WK-WC     1578.2   -        -       9.041
  20-news   2387.8   -        -       215.6
  RCVI      428770   -        -       587.9

4.4. Dataset size n vs. Speed

In the theoretical analysis section, we state that the computational time of CPM3C scales linearly with the number of samples. We present numerical demonstration for this statement in Figure 1, which gives a log-log plot of how the computational time increases with the size of the dataset. In such a plot, lines correspond to polynomial growth O(n^d), where d is the slope of the line. Figure 1 shows that the CPU-time of CPM3C scales roughly as O(n), which is consistent with the statement in section 3.

[Figure 1. CPU-time (seconds) of CPM3C as a function of dataset size n. Two log-log panels ("Cora & 20News" and "WebKB & RCVI") plot CPU-time against the number of samples for each dataset, together with an O(n) reference line.]

4.5. ε vs. Accuracy & Speed

Theorem 6 states that the total number of iterations involved in CPM3C is at most R/ε², which means that with a larger ε the algorithm may converge faster. However, as ε is directly related to the training loss in CPM3C, we need to determine how small ε should be to guarantee sufficient accuracy. We present in Figure 2 and Figure 3 how clustering accuracy and computational time scale with ε. According to Figure 2, ε = 0.01 is small enough to guarantee clustering accuracy. The log-log plot in Figure 3 verifies that the CPU-time of CPM3C decreases as ε increases. Moreover, the empirical scaling of roughly O(1/ε^0.5) is much better than the O(1/ε²) in the bound from Theorem 6.

[Figure 2. Clustering accuracy of CPM3C vs. ε. Two panels ("Cora & 20News" and "WebKB & RCVI") plot accuracy against ε for each dataset.]

[Figure 3. CPU-time (seconds) of CPM3C vs. ε. Two log-log panels ("Cora & 20News" and "WebKB & RCVI") plot CPU-time against ε for each dataset, together with an O(ε^-0.5) reference line.]

5. Conclusions

We propose the cutting plane multiclass maximum margin clustering (CPM3C) algorithm in this paper, to cluster data samples with the maximum margin hyperplane. Preliminary theoretical analysis of the algorithm is provided, where we show that the computational time of CPM3C scales linearly with the sample size n with guaranteed accuracy. Moreover, experimental evaluations on several real world datasets show that CPM3C performs better than existing MMC methods, both in efficiency and accuracy.

Acknowledgments

This work is supported by projects (60721003) and (60675009) of the National Natural Science Foundation of China.

References

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.

Ding, C., He, X., Zha, H., Gu, M., & Simon, H. D. (2001). A min-max cut algorithm for graph partitioning and data mining. ICDM (pp. 107-114).

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. John Wiley & Sons, Inc.

Joachims, T. (2006). Training linear SVMs in linear time. SIGKDD 12.

Kelley, J. E. (1960). The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8, 703-712.

Lewis, D. D., Yang, Y., Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. JMLR, 5, 361-397.

McCallum, A., Nigam, K., Rennie, J., & Seymore, K. (2000). Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3, 127-163.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. JASA, 66, 846-850.

Schölkopf, B., Smola, A. J., & Müller, K. R. (1999). Kernel principal component analysis. Advances in kernel methods: support vector learning, 327-352.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. PAMI.

Smola, A. J., Vishwanathan, S., & Hofmann, T. (2005). Kernel methods for missing variables. AISTATS 10.

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. JMLR, 6, 1453-1484.

Valizadegan, H., & Jin, R. (2007). Generalized maximum margin clustering and unsupervised kernel learning. NIPS 19 (pp. 1417-1424).

Xu, L., Neufeld, J., Larson, B., & Schuurmans, D. (2004). Maximum margin clustering. NIPS 17.

Xu, L., & Schuurmans, D. (2005). Unsupervised and semi-supervised multi-class support vector machines. AAAI.

Zhang, K., Tsang, I. W., & Kwok, J. T. (2007). Maximum margin clustering made practical. ICML 24.

Zhao, B., Wang, F., & Zhang, C. (2008). Efficient maximum margin clustering via cutting plane algorithm. SDM (pp. 751-762).