Fast Evolutionary Maximum Margin Clustering

Fabian Gieseke, Faculty of Computer Science, TU Dortmund, Germany, fabian.gieseke@cs.tu-dortmund.de
Tapio Pahikkala, Turku Centre for Computer Science, Department of Information Technology, University of Turku, Finland, tapio.pahikkala@utu.fi
Oliver Kramer, Faculty of Computer Science, TU Dortmund, Germany, oliver.kramer@cs.tu-dortmund.de

Abstract

The maximum margin clustering approach is a recently proposed extension of the concept of support vector machines to the clustering problem. Briefly stated, it aims at finding an optimal partition of the data into two classes such that the margin induced by a subsequent application of a support vector machine is maximal. We propose a method based on stochastic search to address this hard optimization problem. While a direct implementation would be infeasible for large data sets, we present an efficient computational shortcut for assessing the "quality" of intermediate solutions. Experimental results show that our approach outperforms existing methods in terms of clustering accuracy.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

1. Introduction

The task of clustering a given set of objects into groups of "similar" items is one of the most investigated problems in both data mining and machine learning and arises in many real-world settings such as computer vision, information retrieval, and marketing (Jain & Dubes, 1988). Recently, a new clustering technique called maximum margin clustering (MMC) has been proposed by Xu et al. (2005). This technique can be seen as an extension of support vector machines (SVMs) (Vapnik, 1998) to unsupervised learning scenarios: having no class labels at hand, the aim is to find a partition of the objects into two classes such that the margin computed by a subsequent application of a support vector machine is maximal. Experimental results show that this technique often outperforms common clustering methods with respect to accuracy. However, applying the approach requires solving a non-convex integer problem, which is, due to its combinatorial nature, a difficult task.

Aiming at practical solutions, Zhang et al. (2007) proposed a method that is based on iteratively applying an SVM to improve an initial "guess" obtained by a k-means (Hartigan & Wong, 1979) preprocessing step. Our method is inspired by this approach: starting with an initial set of (random) candidate solutions, we iteratively improve these candidates by means of a stochastic search heuristic. Our key contribution is a computational shortcut for assessing the quality of a candidate solution. More precisely, we show how to efficiently update some auxiliary information and, based on this information, how to compute the quality of a candidate in linear time. Compared to standard methods for evaluating the quality, this constitutes a runtime reduction by a quadratic factor and makes our approach capable of testing a massive number of candidate solutions. Our experimental results show that our approach yields a better clustering accuracy than conventional techniques in most cases.

Notations. We use [n] to denote the set {1, . . . , n}. Further, the set of all n × m matrices with real coefficients is denoted by R^{n×m}. Given a matrix M ∈ R^{n×m}, we denote the element in the i-th row and j-th column by [M]_{i,j}. For two sets R = {i_1, . . . , i_r} ⊆ [n] and S = {k_1, . . . , k_s} ⊆ [m] of indices, we use M_{RS} to denote the matrix that contains only the rows and columns of M that are indexed by R and S, respectively. Moreover, we set M_{R,[m]} = M_R. At last, we use y_i to denote the i-th coordinate of a vector y ∈ R^n.

2. Maximum Margin Clustering

2.1.
Standard Formulation

Given a training set T = {(x_1, y_1), . . . , (x_n, y_n)} with patterns x_i belonging to a set X and class labels y_i ∈ Y = {−1, +1}, the aim of the SVM learning process is to find a hyperplane f(x) = ⟨w, φ(x)⟩ + b in a feature space H_0 induced by a feature mapping φ : X → H_0. Both the feature map and the feature space stem from a kernel function k : X × X → R with k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. This leads to the optimization problem

  minimize_{w ∈ H_0, ξ ∈ R^n, b ∈ R}  (1/2)||w||² + C Σ_{i=1}^n ξ_i            (1)
  s.t.  y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,

where the ξ_i values are called slack variables and where C > 0 is a manually chosen constant. The maximum margin clustering approach aims at finding a partition of the patterns such that the margin yielded by a subsequent application of an SVM is maximal. Originally, this task was formulated in terms of the following optimization problem (Xu et al., 2005):

  minimize_{y ∈ {−1,+1}^n, w, ξ, b}  (1/2)||w||² + C Σ_{i=1}^n ξ_i            (2)
  s.t.  y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,
        −l ≤ Σ_{i=1}^n y_i ≤ l,

where C > 0 and l ≥ 0 are manually chosen constants.

2.2. Least-Squares Variant

The concept of support vector machines can also be considered to be a special case of regularization problems of the form

  inf_{f ∈ H}  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ||f||²_H,                     (3)

where λ > 0 is a fixed real number, L : Y × R → [0, ∞) is a loss function measuring the "quality" of the prediction function f on the training set, and ||f||²_H is the squared norm in a reproducing kernel Hilbert space (RKHS) H ⊆ R^X = {f : X → R} induced by a kernel function. Using the so-called hinge loss L_hinge(y, t) = max{0, 1 − yt} with y ∈ {−1, +1} leads to the support vector machine approach depicted above (Schölkopf et al., 2001; Steinwart & Christmann, 2008).¹ Central for our approach is the square loss L_LS(y, t) = (y − t)², leading to

  inf_{f ∈ H}  (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ||f||²_H                     (4)

(Rifkin et al., 2003; Suykens & Vandewalle, 1999). By the representer theorem (Schölkopf et al., 2001), any minimizer f ∈ H of (4) has the form

  f(·) = Σ_{i=1}^n c_i k(x_i, ·)                                               (5)

with appropriate coefficients c = (c_1, . . . , c_n)ᵗ ∈ R^n. Hence, by using ||f||²_H = cᵗKc, where K ∈ R^{n×n} is the (symmetric) kernel matrix with entries [K]_{i,j} = k(x_i, x_j), we can rewrite optimization problem (4) as

  minimize_{c ∈ R^n}  J(y, c) = (1/n)(y − Kc)ᵗ(y − Kc) + λcᵗKc.                (6)

Our approach is based on the replacement of the hinge loss by its least-squares variant, i.e. instead of solving problem (2) we aim at finding a solution for

  minimize_{y ∈ {−1,+1}^n, c ∈ R^n}  (1/n)(y − Kc)ᵗ(y − Kc) + λcᵗKc            (7)
  s.t.  −l ≤ Σ_{i=1}^n y_i ≤ l.

¹ In the latter formulation, the offset b is omitted. From the theoretical as well as from the practical point of view, the offset term yields no known advantages for more complex kernel functions like the Gaussian RBF kernel (Rifkin, 2002; Steinwart & Christmann, 2008). Furthermore, when using a linear kernel, a regularized bias effect can be obtained by adding a dimension of 1's to the data.

2.3. Related Work

The first attempts at coping with problem (2) consisted in relaxing the definition to form semidefinite programming problems (Xu et al., 2005; Valizadegan & Jin, 2007). However, these approaches resort to solving SDP problems, which is computationally expensive. The iterative method was the first approach capable of dealing with large data sets (Zhang et al., 2007). One of its key insights consisted in the replacement of the hinge loss by several other loss functions including the square loss. The cutting plane algorithm is one of the most recent techniques (Zhao et al., 2008a). It is based on constructing a sequence of successively tighter relaxations of (2), and each of the intermediate tasks is solved using the so-called constrained concave-convex procedure. In addition to these methods, extensions to multiclass scenarios have been proposed in the literature (Xu & Schuurmans, 2005; Zhao et al., 2008b).

The MMC technique is related to the concept of transductive support vector machines (TSVM) (Vapnik, 1998), and the corresponding optimization problem is approached by several heuristics including genetic algorithms (Silva et al., 2005; Adankon & Cheriet, 2007). Another related technique is given by Bach and Harchaoui (2008). Their method also uses the square loss instead of the hinge loss. Based upon this replacement, they show how to solve the resulting problem by approximating a discrete set of equivalence matrices by a convex set (Bach & Harchaoui, 2008).

3. Fast Evolutionary Approach

In this section, we outline our evolutionary approach for optimization problem (7). Evolutionary algorithms are inspired by biological evolution (Beyer & Schwefel, 2002): they start with an initial set of candidate solutions, called population. For each candidate, also named individual, the fitness measuring the quality is computed. After the initialization phase, the iteration over all generations is started. For each generation, some individuals of the current population are mutated and the fitness values for the mutated individuals are computed. At the end of each generation, the population is updated by selecting the best performing candidates.

The key insight of our approach is the way we obtain the fitness values. More precisely, we show how to compute the fitness value for a mutated individual based on some auxiliary information given for the parental individual. Further, we describe how to efficiently update this auxiliary information such that it is available for further generations. This computational shortcut greatly reduces the runtime of our approach.

3.1. Evolutionary Optimization Approach

The framework of our evolutionary approach is given by Algorithm 1. The starting point is the population P_0 = {y_1, .
. . , y_µ} ⊆ {−1, +1}^n consisting of µ randomly generated individuals. Each of these individuals (along with a corresponding vector c ∈ R^n) constitutes a possible solution for optimization problem (7). Throughout our algorithm, we ensure that only valid individuals are created, i.e. individuals y ∈ {−1, +1}^n fulfilling the balance constraint −l ≤ Σ_{i=1}^n y_i ≤ l. In Step 2, the fitness F(y) is computed for each of the initial individuals, where

  F(y) = minimize_{c ∈ R^n} J(y, c).                                           (8)

Algorithm 1 Evolutionary Optimizer
Require: Training set T = {x_1, . . . , x_n} ⊆ X, constants C > 0, l ≥ 0, and µ, ν, τ ∈ N.
Ensure: (ŷ, ĉ) ∈ {−1, +1}^n × R^n
 1: Initialize P_0 = {y_1, . . . , y_µ} ⊆ {−1, +1}^n
 2: Compute the fitness F(y_j) for each y_j ∈ P_0
 3: t = 0
 4: while t ≤ τ do
 5:   for i = 1 to ν do
 6:     Randomly select parent y ∈ P_t
 7:     Generate valid mutated individual y_{µ+i}
 8:     Compute fitness F(y_{µ+i})
 9:   end for
10:   Compute sorted sequence y_{i_1}, . . . , y_{i_{µ+ν}}
11:   P_{t+1} = {y_{i_1}, . . . , y_{i_µ}}
12:   t = t + 1
13: end while
14: Compute solution ĉ for minimize_{c ∈ R^n} J(y_{i_1}, c)
15: return (y_{i_1}, ĉ)

Note that, in our case, individuals with a lower fitness represent better solutions. The iteration over all generations 1, . . . , τ is started in Step 4. For each generation t, we randomly select ν parental individuals to produce ν mutated individuals. Each of these mutated individuals is created by flipping s = max{1, n/(t+1)} coordinates of the parental individual.² After the computation of the fitness values for the mutated individuals, all resulting individuals are sorted upwards by their fitness values, yielding a sorted sequence y_{i_1}, . . . , y_{i_{µ+ν}}. Finally, the population P_t is updated by selecting the µ individuals with the best fitness values. When generation τ is reached, the best individual ŷ along with its corresponding vector ĉ is returned.

² If a flip violates the balance constraint, a coordinate with a reverse label is selected and both labels are swapped.

3.2. Fast Recurrent Fitness Computations

For fixed y, the function J(y, ·) is convex and differentiable. Thus, the computation of the fitness F(y_{µ+i}) in Step 8 can be performed by solving (d/dc) J(y, c) = 0 with respect to c, which leads to

  c = Gy                                                                       (9)

with G = (K + λnI)^{−1}, where I ∈ R^{n×n} denotes the identity matrix (Rifkin, 2002). This requires inverting an n × n matrix, which takes O(n³) time. Assuming that G is stored in memory, a fitness value F(ȳ) for a new individual ȳ can be obtained in O(n²) time. Unfortunately, the quadratic runtime would still render our approach infeasible.

Our key contribution is the following computational shortcut for obtaining the fitness values: Fix a parental individual y ∈ P_t as selected in Step 6 and observe that a mutated individual ȳ = y_{µ+i} differs from y in only a small number s of coordinates.³ Further, by substituting (9) into (6), the value F(ȳ) becomes

  F(ȳ) = (1/n)(ȳ − KGȳ)ᵗ(ȳ − KGȳ) + λȳᵗGKGȳ.                                  (10)

Now, let K = VΛVᵗ be the eigendecomposition of the kernel matrix, where V ∈ R^{n×n} contains the eigenvectors of K and where the diagonal matrix Λ contains the corresponding eigenvalues. Using the latter decomposition of the kernel matrix, we can write G as G = VΛ̃Vᵗ, where Λ̃ = (Λ + λnI)^{−1}. Moreover, using Gᵗ = G, the fitness value can then be reformulated as

  F(ȳ) = (1/n)(ȳ − KGȳ)ᵗ(ȳ − KGȳ) + λȳᵗGKGȳ
       = ȳᵗ((1/n)I − (2/n)KG + (1/n)GKKG + λGKG)ȳ
       = 1 + ȳᵗV(−(2/n)ΛΛ̃ + (1/n)Λ²Λ̃² + λΛΛ̃²)Vᵗȳ
       =: 1 + ȳᵗV D Vᵗȳ,

where D := −(2/n)ΛΛ̃ + (1/n)Λ²Λ̃² + λΛΛ̃² is diagonal. Furthermore, let S = {k_1, . . . , k_s} ⊆ [n] denote the set of indices such that ȳ differs from y for all coordinates k ∈ S and agrees for all coordinates k ∉ S. Then,

  ȳᵗV = yᵗV + (ȳ_S − y_S)ᵗ V_S                                                 (11)

can be computed in O(sn) time if yᵗV ∈ R^{1×n} is precomputed. As the remaining multiplications can each be performed in O(n) time, it follows that the overall runtime for computing F(ȳ) is O(sn), assuming that yᵗV and the diagonal matrix D ∈ R^{n×n} are available.

Theorem 1. Assuming that yᵗV ∈ R^{1×n} and D ∈ R^{n×n} are precomputed and stored in memory, one can compute the fitness F(ȳ) of a mutated individual ȳ with (1/2) Σ_{j=1}^n |y_j − ȳ_j| = s in O(sn) time.

The latter result permits an efficient implementation of Algorithm 1, which is based on iteratively updating the auxiliary information yᵗV along with the shortcut for computing the fitness: The diagonal matrix D and the auxiliary information y_1ᵗV, . . . , y_µᵗV for the initial population P_0 can be obtained in O(n³ + µn²) time. Given this auxiliary information, the fitness value for each y ∈ P_0 can be computed in O(n) time. For each generation, the auxiliary information can be updated in O(sn) time. Moreover, due to Theorem 1, a fitness computation in Step 8 can be performed in O(sn) time. Hence, taking into account that Step 14 takes O(n³) time, the overall runtime of Algorithm 1 is O(n³ + µn² + τνsn). Furthermore, the algorithm obviously needs O(n² + µn) = O(n²) space to store all matrices, the auxiliary information, and the individuals. To sum up, for reasonable values of the parameters τ, µ, and ν, the overall runtime of our algorithm is essentially the same as training a single classifier via (9).

³ The number s decreases rapidly. Also, the stochastic search works well with s = 1 for all generations; for ease of notation, we therefore assume s to be a small constant for the further analysis.

3.3. Kernel Matrix Approximation

The computational shortcut described above already offers a promising runtime reduction compared to the direct O(n²) time approach for computing the fitness values. However, when dealing with large data sets, the cubic runtime for obtaining the diagonal matrix D as well as the quadratic space consumption for storing the kernel matrix form bottlenecks. In this section, we propose a way to alleviate both drawbacks.

One way to deal with the above bottlenecks consists in approximating the kernel matrix: Fix an individual y and consider optimization problem (6). Here, the first term corresponds to the loss evaluated over all training patterns and the second term is the regularizer penalizing "complex" solutions. Now assume that only a subset of the coefficients c_1, . . . , c_n in (5) is allowed to be nonzero. More formally, let R = {i_1, . . . , i_r} ⊆ [n] be a subset of indices such that only c_{i_1}, . . . , c_{i_r} are nonzero in (5). Then, we search for minimizers f̂(·) = Σ_{j=1}^r ĉ_{i_j} k(x_{i_j}, ·) ∈ H solving

  F̂(y) = minimize_{ĉ ∈ R^r}  (1/n)(y − (K_R)ᵗĉ)ᵗ(y − (K_R)ᵗĉ) + λĉᵗK_{RR}ĉ.   (12)

Note that, by applying this approximation scheme, the loss is still evaluated over all training patterns. The benefit of this approach consists in a runtime and space reduction, i.e. the solution for (12) can be obtained in O(nr²) time and O(nr) space (Rifkin, 2002) via

  ĉ = (K_R(K_R)ᵗ + λnK_{RR})^{−1} K_R y.                                       (13)

In the remainder of this section, we show that the above approximation scheme can be integrated into our approach for obtaining the fitness values: The optimal prediction vector f̂ = (K_R)ᵗĉ ∈ R^n for a mutated individual ȳ can be written as

  f̂ = (K_R)ᵗ(K_{RR})^{−1}K_R u = K̃u = K̃G̃ȳ

using K̃ = (K_R)ᵗ(K_{RR})^{−1}K_R, G̃ = (K̃ + λnI)^{−1}, u = G̃ȳ, and ĉ = (K_{RR})^{−1}K_R u; we refer to the thesis of Rifkin (2002) for the derivation of the last equation.⁴ Thus, using

  ĉᵗK_{RR}ĉ = uᵗK̃u = ȳᵗG̃K̃G̃ȳ,                                                 (14)

we can rewrite F̂(ȳ) for the mutated individual ȳ as

  F̂(ȳ) = (1/n)(ȳ − K̃G̃ȳ)ᵗ(ȳ − K̃G̃ȳ) + λȳᵗG̃K̃G̃ȳ.

The latter representation shows that applying the approximation approach corresponds to exchanging the matrices K and G by K̃ and G̃ in (10). The remaining parts of the computational shortcut leading to Theorem 1 can essentially be carried over directly: Let (K_{RR})^{−1} = BBᵗ be the Cholesky decomposition of (K_{RR})^{−1}, where B ∈ R^{r×r} is a lower triangular matrix. Then, we can write K̃ as K̃ = (K_R)ᵗBBᵗK_R. Further, let (K_R)ᵗB = VΣUᵗ be the reduced singular value decomposition⁵ of (K_R)ᵗB, where V ∈ R^{n×r}, Σ ∈ R^{r×r}, and U ∈ R^{r×r} contain the first r left singular vectors, the nonzero singular values, and the right singular vectors, respectively. Then, since UᵗU = I, the modified kernel matrix can be expressed as K̃ = VΛVᵗ, where Λ = ΣᵗΣ contains the eigenvalues. Note that we can obtain the above decomposition in O(nr²) time, as performing the Cholesky decomposition and the reduced singular value decomposition takes O(r³) and O(nr²) time, respectively (Golub & Van Loan, 1989).

Using the decomposition of K̃ and matrix calculations as in the non-approximation case, one can derive a decomposition of the fitness F̂(ȳ) having the form

  F̂(ȳ) = 1 + ȳᵗV D̂ Vᵗȳ,                                                       (15)

where D̂ ∈ R^{r×r} is a diagonal matrix. Given the auxiliary information yᵗV ∈ R^{1×r} for the parental individual y, the according information for the mutated individual can be obtained in O(sr) time via

  ȳᵗV = yᵗV + (ȳ_S − y_S)ᵗ V_S.                                                (16)

Hence, given the diagonal matrix D̂ and the auxiliary information for the parental individual y, one can obtain the fitness F̂(ȳ) for ȳ in O(sr) time.

Theorem 2. Assuming that yᵗV ∈ R^{1×r} and D̂ ∈ R^{r×r} are precomputed and stored in memory, one can compute the fitness F̂(ȳ) of a mutated individual ȳ with (1/2) Σ_{j=1}^n |y_j − ȳ_j| = s in O(sr) time.

The evolutionary algorithm induced by Theorem 2 spends O(nr² + µnr + τνsn) time and uses O(nr + µn) space.⁶ Thus, both drawbacks, the cubic preprocessing time as well as the quadratic space consumption, are reduced at the cost of approximate fitness values.

⁴ The details are depicted on page 111. We also assume that R is selected such that K_{RR} is strictly positive definite.
⁵ Only such singular vectors of (K_R)ᵗB are calculated that correspond to nonzero singular values.

4. Experiments

We denote the evolutionary algorithms induced by the computational shortcuts depicted in Theorem 1 and Theorem 2 by EvoMMC and FastEvoMMC, respectively. Both algorithms are implemented in Python 2.5.2 using the numpy package. The runtime analyses are performed on a 3 GHz Intel Core™ Duo PC running Ubuntu 8.10.

4.1. Data Sets

Various data sets are used to test the clustering accuracy as well as the runtime performance. Following previous works (Zhang et al., 2007; Zhao et al., 2008a), we apply our approach to data sets from the UCI repository⁷ (digits, ionosphere, letter, satellite) and from the MNIST database⁸. More specifically, for the letter and satellite data sets, each containing several classes, we use the first two classes, i.e. A vs. B and 1 vs. 2, respectively.

4.2. Parameters

For both algorithms, a couple of parameters need to be tuned, where we use a different number of initial solutions as well as a different number of mutations for EvoMMC (µ = 10, ν = 10) and FastEvoMMC (µ = 1, ν = 1), respectively. Further, we stop a run when no more fitness changes occur for a longer period of generations. In addition to these parameters, the number r determining the size of the index set R has to be selected for FastEvoMMC, which we fix to r = 0.1n for all data sets except for the MNIST digits, where we use r = 0.01n. The corresponding set R of indices is selected randomly. While there are more sophisticated methods, this choice is based upon the observation of Rifkin et al. (2003), who reported that random selection yields very good results in practice.

We use the RBF kernel k(x, x′) = exp(−||x − x′||²/(2σ²)) with kernel width σ for all experiments. As reported in related works, the parameters determining the optimization problems have a significant influence on the results. To select them, we set l to 0.03n, 0.1n, 0.3n, or 0.4n depending on the balance ratio of the specific data set. The parameters σ and λ are tuned via grid search.⁹

⁶ The term τνsn stems from copy operations which are necessary for generating mutated individuals if µ > 1 and ν > 1. For the remaining cases, a single individual can be updated efficiently, resulting in a τνsr term. In these cases, one can afford τν ≈ nr/s generations until this term dominates the runtime.
⁷ http://archive.ics.uci.edu/ml/
⁸ http://yann.lecun.com/exdb/mnist/

4.3. Clustering Accuracy

Following Xu et al. (2005), we evaluate the clustering accuracy for our algorithms as follows: First, all class labels available in the data sets are removed. Afterwards, the algorithms are applied, resulting in a prediction vector y for each data set. Finally, the accuracy is measured in terms of the clustering error, which is obtained by counting the number of discrepancies between the real labels and those predicted by y.

Evolutionary algorithms are susceptible to the problem of local optima. Typically, this problem is faced by running the algorithms multiple times and taking the best result out of these runs afterwards. In our experiments, we apply each algorithm 10 times to each data set and take the partition vector with the smallest clustering error out of the reported vectors. To test the significance of such a result, we repeat this procedure 10 times and average the resulting errors.

Except for the MNIST data set, all experiments are performed on the complete data sets in the way described above. As a pairwise comparison of MNIST digits involves roughly 14,000 samples, we apply EvoMMC only to subsets containing 1,000 digits. As mentioned in Section 3, applying EvoMMC to (much) larger data sets becomes infeasible due to the cubic preprocessing time and the quadratic storage requirement.
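To make the preceding analysis concrete, the exact fitness shortcut of Theorem 1 can be sketched in numpy as follows. This is an illustrative reconstruction, not the authors' original code; the toy data, the kernel width σ = 1, and the regularization value λ = 0.1 below are arbitrary choices for demonstration only.

```python
import numpy as np

# Sketch of the Theorem-1 shortcut: F(y) = 1 + (y^T V) D (V^T y) with D
# diagonal, plus the O(sn) update of the auxiliary vector y^T V, eq. (11).
rng = np.random.RandomState(0)
n, lam = 40, 0.1                       # toy size and regularization (arbitrary)
X = rng.randn(n, 2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                  # RBF kernel matrix, sigma = 1 (arbitrary)

evals, V = np.linalg.eigh(K)           # K = V diag(evals) V^T
lt = 1.0 / (evals + lam * n)           # diagonal of Lambda~ = (Lambda + lam*n*I)^-1
ll = evals * lt                        # diagonal of Lambda * Lambda~
D = -(2.0 / n) * ll + (1.0 / n) * ll ** 2 + lam * evals * lt ** 2

def fitness_direct(y):
    """F(y) = J(y, Gy), evaluated naively via c = (K + lam*n*I)^{-1} y."""
    c = np.linalg.solve(K + lam * n * np.eye(n), y)
    return ((y - K @ c) ** 2).sum() / n + lam * c @ K @ c

def fitness_shortcut(vty):
    """F(y) = 1 + sum_i D_i (y^T V)_i^2; O(n) given the auxiliary vector."""
    return 1.0 + (vty ** 2 * D).sum()

y = rng.choice([-1.0, 1.0], size=n)
vty = y @ V                            # auxiliary information y^T V
assert np.isclose(fitness_shortcut(vty), fitness_direct(y))

# Mutation: flip s = 2 coordinates and update y^T V in O(sn) via eq. (11).
S = np.array([3, 17])
y_bar = y.copy(); y_bar[S] *= -1.0
vty_bar = vty + (y_bar[S] - y[S]) @ V[S]
assert np.isclose(fitness_shortcut(vty_bar), fitness_direct(y_bar))
```

After the one-time O(n³) eigendecomposition, every fitness evaluation in the evolutionary loop touches only the s changed coordinates plus one length-n vector, which is the source of the quadratic-factor speedup claimed above.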
The latter two drawbacks are alleviated by FastEvoMMC, and we apply this algorithm to the complete sets of digits. Further, for the UCI as well as for the MNIST data set, the averaged clustering error on all 45 pairs of digits is reported in addition to four particular comparisons of digits.

Table 1 shows the resulting clustering accuracies. Previous results, including those of the k-means (KM) algorithm, are simply transferred from the according sources. It can clearly be seen that both algorithms consistently produce good clustering results and that, except for subsets of the UCI digits, either EvoMMC or FastEvoMMC yields the most accurate ones.

Figure 1. Runtime comparison between the different fitness computation methods on artificial data sets whose distribution is shown in the upper left corner.

Table 2. CPU time in seconds for single runs along with the average number of performed generations (in brackets).

Data              | EvoMMC          | FastEvoMMC
UCI Digits 3-8    | 20.80 (4923)    | 3.41 (8254)
UCI Digits 1-7    | 16.61 (3600)    | 2.92 (6978)
UCI Digits 2-7    | 18.40 (4374)    | 3.93 (9636)
UCI Digits 8-9    | 14.46 (3320)    | 3.09 (7383)
Ionosphere        | 14.18 (3288)    | 4.12 (10248)
Letter            | 172.09 (26867)  | 44.16 (66297)
Satellite         | 394.95 (58107)  | 70.44 (157971)
MNIST Digits 1-2  | -               | 274.60 (565929)
MNIST Digits 1-7  | -               | 304.00 (605932)

4.4. Runtime Performance

We start by comparing the efficiency of our computational shortcuts with two standard ways of obtaining the fitness values, i.e. we use Algorithm 1 and resort to these standard techniques. The first alternative (EvoMMC(n²)) is the quadratic-time approach described at the beginning of Section 3.2. The second way (EvoMMC(SVM)) consists in applying a support vector machine to obtain fitness values, where we use the LIBSVM package (Chang & Lin, 2001) as implementation. To compare the efficiency, we apply all algorithms to a sequence of two-dimensional artificial data sets of varying sizes, see Figure 1.
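The approximate fitness evaluation of Section 3.3, eqs. (12)-(13), which underlies FastEvoMMC, can be sketched similarly. Again, this is an illustrative reconstruction rather than the authors' code, with arbitrary toy sizes (n = 60, r = 12) and regularization λ = 0.1.

```python
import numpy as np

# Sketch of the subset-of-regressors fitness: the expansion coefficients are
# restricted to a random index set R and solved via eq. (13).
rng = np.random.RandomState(1)
n, r, lam = 60, 12, 0.1                   # toy sizes and regularization (arbitrary)
X = rng.randn(n, 2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                     # RBF kernel matrix, sigma = 1 (arbitrary)
y = rng.choice([-1.0, 1.0], size=n)

R = rng.choice(n, size=r, replace=False)  # random index subset, as in the paper
K_R = K[R, :]                             # r x n block K_R = K_{R,[n]}
K_RR = K[np.ix_(R, R)]                    # r x r block

# Eq. (13): c^ = (K_R K_R^T + lam*n*K_RR)^{-1} K_R y, O(nr^2) time.
c_hat = np.linalg.solve(K_R @ K_R.T + lam * n * K_RR, K_R @ y)
F_hat = ((y - K_R.T @ c_hat) ** 2).sum() / n + lam * c_hat @ K_RR @ c_hat

# Sanity check: the loss is still evaluated on all n patterns, so restricting
# the expansion to R can never beat the exact minimum of problem (6).
c_full = np.linalg.solve(K + lam * n * np.eye(n), y)
F_full = ((y - K @ c_full) ** 2).sum() / n + lam * c_full @ K @ c_full
assert F_hat >= F_full - 1e-9
```

Only the r × n block K_R of the kernel matrix is ever formed, which is how the quadratic storage requirement is avoided at the cost of approximate fitness values.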
For all induced evolutionary algorithms, various parameters need to be tuned. In order to evaluate the performance of the computational shortcuts, we set µ = 1 and ν = 1 for all four algorithms. Moreover, we set l = 0.03n, σ = 1s, λ = 1/(1000n), and, for FastEvoMMC, r = 0.1n. The average runtime of a single run (measured over 10 runs) in dependence on the size of the data sets is shown in Figure 1 for each algorithm. Clearly, both EvoMMC and FastEvoMMC offer a significant runtime improvement over the two standard alternatives.

In addition to these experiments, we evaluate the performance of our approach on the real-world data sets. As assignments for the parameters, we use the best performing values leading to the clustering errors shown in Table 1. The average runtimes as well as the average number of performed generations for single runs (measured over 10 runs) are depicted in Table 2. They demonstrate that especially FastEvoMMC can handle large data sets within a reasonable amount of time. The latter results also show that both methods can efficiently perform a huge number of generations in order to converge to (local) optima.¹⁰

Both IterSVR and CPMMC have a superior runtime performance compared to previous approaches (Zhao et al., 2008a). Hence, we will focus on indicating the (theoretical) runtime of our approach with respect to these techniques. The runtime of IterSVR depends on the specific implementation for solving the intermediate optimization problems; for instance, Zhang et al. (2007) resort to LIBSVM. The CPMMC approach is guaranteed to converge after CR/ε² iterations, where ε is a user-supplied constant and where R is a constant independent of n. Further, the runtime per iteration is bounded by O(sn), where s is the number of nonzero features. The results for CPMMC reported by Zhao et al. are based on linear kernels. Nonlinear kernels, however, can only be applied by decomposing the kernel matrix to form an empirical feature map beforehand, spending O(n³) time in the worst case. This can be accelerated, for example, by using a similar approximation as in Section 3.3, spending O(nr²) time, where r determines the number of features in the map.

In our experiments we observed that even a slight variation of the parameters of IterSVR and CPMMC may lead to significant differences in the actual runtimes. The dependence on both different kernel functions and parameter selections renders a detailed runtime comparison beyond the scope of this paper. A systematic investigation of the runtimes in relation to the clustering errors will be subject of future work.

⁹ Similarly to Zhang et al. (2007), we process (λ, σ) ∈ {1/(2n), 1/(200n), 1/(1000n)} × {1s, 3s, 5s}, where the value s = (Σ_{k=1}^d (max{X^k} − min{X^k})²)^{1/2} is a rough estimate of the maximum distance between any pair of samples; here, {X^k} denotes the set of all k-th attribute values.
¹⁰ Each generation is performed efficiently, but a huge number of generations is "wasted" for fine-tuning (i.e. for the correct assignment of the last few elements). The probability of switching wrong cluster assignments becomes extremely small at the end of the search. Heuristic extensions like intermediate classification steps or a k-means preprocessing phase could be used to accelerate the search.

Table 1. Clustering errors in % on the various data sets of KM (Hartigan & Wong, 1979), MMC (Xu et al., 2005), GMMC (Valizadegan & Jin, 2007), IterSVR (Zhang et al., 2007), CPMMC (Zhao et al., 2008a), EvoMMC, and FastEvoMMC. Bold numbers indicate the method with the lowest clustering error on the particular data set.

Data            | Size   | KM        | MMC   | GMMC | IterSVR     | CPMMC | EvoMMC       | FastEvoMMC
UCI Digits 3-8  | 357    | 5.32 ± 0  | 10    | 5.6  | 3.36 ± 0    | 3.08  | 2.52 ± 0     | 2.58 ± 0.5
UCI Digits 1-7  | 361    | 0.55 ± 0  | 31.25 | 2.2  | 0.55 ± 0    | 0     | 0.0 ± 0      | 0.0 ± 0
UCI Digits 2-7  | 356    | 3.09 ± 0  | 1.25  | 0.5  | 0.0 ± 0     | 0.0   | 0.0 ± 0      | 0.0 ± 0
UCI Digits 8-9  | 354    | 9.32 ± 0  | 3.75  | 16.0 | 3.67 ± 0    | 2.26  | 3.22 ± 0.40  | 3.78 ± 4.13
Ionosphere      | 351    | 32 ± 17.9 | 21.25 | 23.5 | 32.3 ± 16.6 | 27.64 | 17.94 ± 6.84 | 18.75 ± 3.19
Letter          | 1,555  | 17.94 ± 0 | -     | -    | 7.2 ± 0     | 5.53  | 3.67 ± 0.33  | 3.27 ± 1.08
Satellite       | 2,236  | 4.07 ± 0  | -     | -    | 3.18 ± 0    | 1.52  | 1.16 ± 0.45  | 1.14 ± 0.65
UCI Digits      | 1,797  | 2.62      | -     | -    | 1.82        | 0.62  | 0.72         | 0.81
MNIST Digits    | 70,000 | 10.79     | -     | -    | 7.59        | 4.29  | (3.23)       | 3.45

4.5. Number of Restarts

To assess how many restarts are needed for satisfying results, we investigate the number of restarts against the clustering error, i.e. m runs are performed and the lowest obtained clustering error is taken as result. For each m, this procedure is repeated 10 times and the averaged results are reported. Again, we select the values leading to the results given in Table 1 as values for the parameters. The plots given in Figure 2 show that a small number of restarts is sufficient to yield good clustering errors. Further, they demonstrate that both methods perform similarly on the data sets, i.e. the different assignments for the parameters µ and ν seem to have little influence on the clustering error.

5. Conclusions and Discussion

Maximum margin clustering methods turn out to be an effective approach to the clustering problem.
Aiming at practical methods, we proposed a stochastic optimization approach along with an efficient computational shortcut making the overall algorithm capable of dealing with large data sets. Our experimental analyses reveal that our approach yields better clustering accuracies than competitive methods in most cases.

We expect our approach to be extendible in various situations. For instance, our stochastic search along with the computational shortcuts could also be applied as a fine-tuning step after a k-means preprocessing phase. As the partitions obtained by such a preprocessing would already constitute reasonable solutions, we expect a considerable runtime improvement by the preprocessing. We plan to investigate such extensions.

Figure 2. Restarts vs. clustering error. (a) EvoMMC. (b) FastEvoMMC.

Acknowledgments

We thank the anonymous reviewers for their careful reading and detailed comments. This work was supported in part by the Academy of Finland.

References

Adankon, M. M., & Cheriet, M. (2007). Learning semi-supervised SVM with genetic algorithm. Proc. International Joint Conference on Neural Networks (pp. 1825-1830).

Bach, F., & Harchaoui, Z. (2008). Diffrac: A discriminative and flexible framework for clustering. Adv. in Neural Information Proc. Systems 20 (pp. 49-56).

Beyer, H.-G., & Schwefel, H.-P. (2002). Evolution strategies - a comprehensive introduction. Natural Computing, 1, 3-52.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Golub, G. H., & Van Loan, C. (1989). Matrix computations. Second edition. Baltimore and London: Johns Hopkins University Press.

Hartigan, J. A., & Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100-108.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.

Rifkin, R., Yeo, G., & Poggio, T. (2003). Regularized least-squares classification. In Adv. in Learning Theory: Methods, Models and Applications. IOS Press.

Rifkin, R. M. (2002). Everything old is new again: A fresh look at historical approaches in machine learning. Doctoral dissertation, MIT.

Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. Proc. 14th Annual Conference on Computational Learning Theory (pp. 416-426).

Silva, M. M., Maia, T. T., & Braga, A. P. (2005). An evolutionary approach to transduction in support vector machines. Proc. 5th International Conference on Hybrid Intelligent Systems (pp. 329-334).

Steinwart, I., & Christmann, A. (2008). Support vector machines. New York, NY, USA: Springer-Verlag.

Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293-300.

Valizadegan, H., & Jin, R. (2007). Generalized maximum margin clustering and unsupervised kernel learning. Adv. in Neural Information Proc. Systems 19 (pp. 1417-1424).

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Xu, L., Neufeld, J., Larson, B., & Schuurmans, D. (2005). Maximum margin clustering. Adv. in Neural Information Proc. Systems 17 (pp. 1537-1544).

Xu, L., & Schuurmans, D. (2005). Unsupervised and semi-supervised multi-class support vector machines. Proc. National Conference on Artificial Intelligence (pp. 904-910).

Zhang, K., Tsang, I. W., & Kwok, J. T. (2007). Maximum margin clustering made practical. Proc. International Conference on Machine Learning (pp. 1119-1126).

Zhao, B., Wang, F., & Zhang, C. (2008a). Efficient maximum margin clustering via cutting plane algorithm. Proc. SIAM International Conference on Data Mining (pp. 751-762).

Zhao, B., Wang, F., & Zhang, C. (2008b). Efficient multiclass maximum margin clustering. Proc. International Conference on Machine Learning (pp. 1248-1255).