Spectral Clustering based on the graph p-Laplacian

Thomas Bühler (tb@cs.uni-sb.de) and Matthias Hein (hein@cs.uni-sb.de)
Saarland University, Computer Science Department, Campus E1 1, 66123 Saarbrücken, Germany

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We present a generalized version of spectral clustering using the graph p-Laplacian, a nonlinear generalization of the standard graph Laplacian. We show that the second eigenvector of the graph p-Laplacian interpolates between a relaxation of the normalized cut and the Cheeger cut. Moreover, we prove that in the limit as p → 1 the cut found by thresholding the second eigenvector of the graph p-Laplacian converges to the optimal Cheeger cut. Furthermore, we provide an efficient numerical scheme to compute the second eigenvector of the graph p-Laplacian. The experiments show that the clustering found by p-spectral clustering is at least as good as that of standard spectral clustering, and often significantly better.

1. Introduction

In recent years, spectral clustering has become one of the major clustering methods. The reasons are its generality, its efficiency and its rich theoretical foundation. Spectral clustering can be applied to any kind of data with a suitable similarity measure, and the clustering can be computed for millions of points. The theoretical background includes motivations based on balanced graph cuts, random walks and perturbation theory.
We refer to (von Luxburg, 2007) and references therein for a detailed introduction to various aspects of spectral clustering. In this paper our focus lies on the motivation of spectral clustering as a relaxation of balanced graph cut criteria. It is well known that the second eigenvectors of the unnormalized and normalized graph Laplacians correspond to relaxations of the ratio cut (Hagen & Kahng, 1991) and the normalized cut (Shi & Malik, 2000). There are also relaxations of balanced graph cut criteria based on semi-definite programming (De Bie & Cristianini, 2006), which turn out to be better than the standard spectral ones but are computationally more expensive.

In this paper we establish a connection between the Cheeger cut and the second eigenvector of the graph p-Laplacian, a nonlinear generalization of the graph Laplacian. A p-Laplacian which differs slightly from the one used in this paper has been used for semi-supervised learning by Zhou and Schölkopf (2005). Our main motivation for the use of eigenvectors of the graph p-Laplacian is the generalized isoperimetric inequality of Amghibech (2003), which relates the second eigenvalue of the graph p-Laplacian to the optimal Cheeger cut. The isoperimetric inequality becomes tight as p → 1, so that the second eigenvalue converges to the optimal Cheeger cut value. In this article we extend the isoperimetric inequality of Amghibech to the unnormalized graph p-Laplacian. However, our key result is to show that the cut obtained by thresholding the second eigenvector of the p-Laplacian converges to the optimal Cheeger cut as p → 1, which provides theoretical evidence that p-spectral clustering is superior to the standard case. Moreover, we provide an efficient algorithmic scheme for the (approximate) computation of the second eigenvector of the p-Laplacian and the resulting clustering. This allows us to do p-spectral clustering also for large-scale problems. Our experimental results show that as one varies p from 2 (standard spectral clustering) towards 1, the value of the Cheeger cut obtained by thresholding the second eigenvector of the graph p-Laplacian is always decreasing.

In Section 2, we review balanced graph cut criteria. In Section 3, we introduce the graph p-Laplacian, followed by the definition of eigenvectors of nonlinear operators. In Section 4, we provide the theoretical key result relating the cut found by thresholding the second eigenvector of the graph p-Laplacian to the optimal Cheeger cut. The algorithmic scheme is presented in Section 5, and extensive experiments on various datasets, including large-scale ones, are given in Section 6.

2. Balanced graph cut criteria

Given a set of points in a feature space and a similarity measure, the data can be transformed into a weighted, undirected graph G, where the vertices V represent the points in the feature space and the positive edge weights W encode the similarity of pairs of points. A clustering of the points is then equivalent to a partition of V into subsets C_1, ..., C_k (which will be called clusters in the following). The usual objective for such a partitioning is to have high within-cluster similarity and low inter-cluster similarity. Additionally, the clusters should be balanced in the sense that the "sizes" of the clusters should not differ too much. All the graph cut criteria presented in this section implement these objectives with slightly different emphasis on the individual properties.

Before defining the balanced graph cut criteria, we have to introduce some notation. The number of points is denoted by n = |V|, and the complement of a set A ⊂ V is written as A̅ = V \ A. The degree function d: V → ℝ of the graph is given as d_i = Σ_{j=1}^n w_ij, and the cut of A ⊂ V and A̅ is defined as

    cut(A, A̅) = Σ_{i∈A, j∈A̅} w_ij.

Moreover, we denote by |A| the cardinality of the set A and by vol(A) = Σ_{i∈A} d_i the volume of A. The balanced graph cut criteria either try to balance the cardinality or the volume of the clusters. The ratio cut RCut(C, C̅) (Hagen & Kahng, 1991) and the normalized cut NCut(C, C̅) (Shi & Malik, 2000) for a partition of V into C, C̅ are defined as

    RCut(C, C̅) = cut(C, C̅)/|C| + cut(C, C̅)/|C̅|,
    NCut(C, C̅) = cut(C, C̅)/vol(C) + cut(C, C̅)/vol(C̅).

A slightly different balancing behavior is induced by the corresponding ratio Cheeger cut RCC(C, C̅) and normalized Cheeger cut NCC(C, C̅), defined as

    RCC(C, C̅) = cut(C, C̅) / min{|C|, |C̅|},
    NCC(C, C̅) = cut(C, C̅) / min{vol(C), vol(C̅)}.

One has the following simple relation between the normalized cut NCut(C, C̅) and the normalized Cheeger cut NCC(C, C̅):

    NCC(C, C̅) ≤ NCut(C, C̅) ≤ 2 NCC(C, C̅).

The analogous result holds for the ratio cut RCut(C, C̅) and the ratio Cheeger cut RCC(C, C̅). It is known that finding the global optimum of all these balanced graph cut criteria is NP-hard, see (von Luxburg, 2007). In Section 4, we will show how spectral relaxations of these criteria are related to the eigenproblem of the graph p-Laplacian.

Up to now the cuts are defined only for a partition of V into two sets. For a partition of V into k sets C_1, ..., C_k, the ratio and normalized cut can be generalized (von Luxburg, 2007) as

    RCut(C_1, ..., C_k) = Σ_{i=1}^k cut(C_i, C̅_i) / |C_i|,      (1)
    NCut(C_1, ..., C_k) = Σ_{i=1}^k cut(C_i, C̅_i) / vol(C_i).   (2)

There seems to exist no generally accepted multi-partition version of the Cheeger cuts. We come back to this issue in Section 5, when we discuss how to get multiple clusters using the second eigenvector of the graph p-Laplacian.
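As a concrete illustration of these two-way criteria, the following minimal numpy sketch evaluates RCut, NCut, RCC and NCC for a partition; the dense weight matrix and the boolean mask encoding C are our own illustrative choices, not code from the paper.

    import numpy as np

    def balanced_cuts(W, mask):
        """Two-way RCut, NCut, RCC, NCC for the partition C = {i : mask[i]}.
        W is a symmetric (n, n) weight matrix with zero diagonal."""
        d = W.sum(axis=1)                                 # degrees d_i = sum_j w_ij
        c = W[np.ix_(mask, ~mask)].sum()                  # cut(C, C-bar)
        size_C, size_Cb = mask.sum(), (~mask).sum()       # cardinalities
        vol_C, vol_Cb = d[mask].sum(), d[~mask].sum()     # volumes
        rcut = c / size_C + c / size_Cb
        ncut = c / vol_C + c / vol_Cb
        rcc = c / min(size_C, size_Cb)
        ncc = c / min(vol_C, vol_Cb)
        return rcut, ncut, rcc, ncc

On random examples one can directly verify the relation NCC ≤ NCut ≤ 2 NCC stated above.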
3. The graph p-Laplacian

It is well known, see e.g. (Hein et al., 2007), that the standard graph Laplacian Δ_2 can be defined as the operator which induces the following quadratic form for a function f: V → ℝ:

    ⟨f, Δ_2 f⟩ = (1/2) Σ_{i,j=1}^n w_ij (f_i − f_j)^2.

For the standard inner product one gets the unnormalized graph Laplacian Δ_2^(u), which in matrix notation is given as Δ_2^(u) = D − W, and for the weighted inner product ⟨f, g⟩ = Σ_{i=1}^n d_i f_i g_i one obtains the normalized graph Laplacian Δ_2^(n), given as Δ_2^(n) = I − D^{-1} W. (Note that our notation differs from the one in (Hein et al., 2007), where our normalized graph Laplacian is called the "random walk graph Laplacian".) One can now ask whether there exists an operator Δ_p which induces the general form (for p > 1)

    ⟨f, Δ_p f⟩ = (1/2) Σ_{i,j=1}^n w_ij |f_i − f_j|^p.

It turns out that this question can be answered positively, see (Amghibech, 2003). The resulting operator Δ_p is the graph p-Laplacian (which we abbreviate as p-Laplacian if no confusion is possible). As for the graph Laplacian we obtain, depending on the choice of the inner product, the unnormalized and the normalized p-Laplacian, Δ_p^(u) and Δ_p^(n). For i ∈ V,

    (Δ_p^(u) f)_i = Σ_{j∈V} w_ij φ_p(f_i − f_j),
    (Δ_p^(n) f)_i = (1/d_i) Σ_{j∈V} w_ij φ_p(f_i − f_j),

where φ_p: ℝ → ℝ is defined for x ∈ ℝ as φ_p(x) = |x|^{p−1} sign(x). Note that φ_2(x) = x, so that we recover the standard graph Laplacians for p = 2. In general, the p-Laplacian is a nonlinear operator: Δ_p(αf) ≠ α Δ_p f for α ∈ ℝ.

3.1. Eigenvalues and eigenvectors of the graph p-Laplacian

Since our goal is to use the p-Laplacian for spectral clustering, the natural question arises how one can define eigenvectors and eigenvalues for such a nonlinear operator. For notational simplicity we restrict ourselves in this section to the unnormalized p-Laplacian Δ_p^(u), but all definitions and results carry over to the normalized version Δ_p^(n).

Definition 3.1 The real number λ_p is called an eigenvalue of the p-Laplacian Δ_p^(u) if there exists a function v: V → ℝ such that

    (Δ_p^(u) v)_i = λ_p φ_p(v_i),   i = 1, ..., n.

The function v is called a p-eigenfunction of Δ_p^(u) corresponding to the eigenvalue λ_p.

The origin of this definition of an eigenvector for nonlinear operators lies in the Rayleigh-Ritz principle, a variational characterization of eigenvalues and eigenvectors for linear operators. For a symmetric matrix A ∈ ℝ^{n×n}, it is well known that one can obtain the smallest eigenvalue λ^(1) and the corresponding eigenvector v^(1) satisfying A v^(1) = λ^(1) v^(1) via the variational characterization

    v^(1) = arg min_{f∈ℝ^n} ⟨f, A f⟩ / ‖f‖_2^2.

Note that this characterization implies that (up to rescaling) v^(1) is the global minimizer of ⟨f, Af⟩ subject to ‖f‖_2 = 1. This variational characterization can now be carried over to nonlinear operators. We define for the unnormalized p-Laplacian Δ_p^(u)

    Q_p(f) := ⟨f, Δ_p^(u) f⟩ = (1/2) Σ_{i,j=1}^n w_ij |f_i − f_j|^p,

and similarly the functional F_p: ℝ^V → ℝ,

    F_p(f) := Q_p(f) / ‖f‖_p^p,

where the p-norm is defined as ‖f‖_p^p := Σ_{i=1}^n |f_i|^p.

Theorem 3.1 The functional F_p has a critical point at v ∈ ℝ^V if and only if v is a p-eigenfunction of Δ_p^(u). The corresponding eigenvalue λ_p is given as λ_p = F_p(v). Moreover, we have F_p(αf) = F_p(f) for all f ∈ ℝ^V and α ∈ ℝ.

Proof: One can check that the condition for a critical point of F_p at v can be rewritten as

    Δ_p^(u) v − (Q_p(v) / ‖v‖_p^p) φ_p(v) = 0.

Thus, by Definition 3.1, v is an eigenvector of Δ_p^(u). Moreover, the equation implies that a given eigenvector v for the eigenvalue λ_p is a critical point of F_p if λ_p = F_p(v); summing up the eigenvector equation of Definition 3.1 shows this equality. The last statement follows directly from the definition.

This theorem shows that in order to get all eigenvectors and eigenvalues of Δ_p^(u) we have to find all critical points of the functional F_p. Moreover, with F_p(αf) = F_p(f), we observe that the usual property of linear operators that eigenvectors are invariant under scaling carries over to the nonlinear case.
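To make the operator concrete, here is a small numpy sketch (ours, not the authors') of φ_p, the application of Δ_p^(u), and the functionals Q_p and F_p defined above, for a dense symmetric weight matrix W.

    import numpy as np

    def phi_p(x, p):
        """phi_p(x) = |x|^(p-1) * sign(x); phi_2 is the identity."""
        return np.abs(x) ** (p - 1) * np.sign(x)

    def p_laplacian(W, f, p):
        """(Delta_p^(u) f)_i = sum_j w_ij * phi_p(f_i - f_j)."""
        return (W * phi_p(f[:, None] - f[None, :], p)).sum(axis=1)

    def Q_p(W, f, p):
        """Q_p(f) = 1/2 * sum_ij w_ij |f_i - f_j|^p, the induced p-form."""
        return 0.5 * (W * np.abs(f[:, None] - f[None, :]) ** p).sum()

    def F_p(W, f, p):
        """Rayleigh-type functional F_p(f) = Q_p(f) / ||f||_p^p."""
        return Q_p(W, f, p) / (np.abs(f) ** p).sum()

For p = 2, p_laplacian(W, f, 2) coincides with (D − W) f, matching the matrix form of the unnormalized graph Laplacian above.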
The following proposition is a generalization of a result by Fiedler (1973) to the graph p-Laplacian. It relates the connectivity of the graph to properties of the first eigenvalue λ_p^(1) of the p-Laplacian. We denote by 1_A ∈ ℝ^V the function which is one on A and zero elsewhere.

Proposition 3.1 The multiplicity of the first eigenvalue λ_p^(1) = 0 of the p-Laplacian Δ_p^(u) is equal to the number K of connected components C_1, ..., C_K of the graph. The corresponding eigenspace for λ_p^(1) = 0 is given as {Σ_{i=1}^K α_i 1_{C_i} | α_i ∈ ℝ, i = 1, ..., K}.

Proof: We have Q_p(f) ≥ 0, so that all eigenvalues λ_p of Δ_p^(u) are non-negative. Similar to the case p = 2, one can check that Σ_{i,j=1}^n w_ij |f_i − f_j|^p = 0 if and only if f is constant on each connected component.

In spectral clustering the graph is usually assumed to be connected, so that v_p^(1) = c 1 for some c ∈ ℝ; otherwise spectral clustering is trivial. In the following we assume that the graph is connected. The previous proposition suggests that, as in the standard case p = 2, we need at least the second eigenvector to construct a partitioning of the graph. For p = 2, we get the second eigenvector again by the variational Rayleigh-Ritz principle,

    v^(2) = arg min_{f∈ℝ^n} ⟨f, Δ_2^(u) f⟩ / ‖f‖_2^2   subject to ⟨f, 1⟩ = 0.

This form is not suited for the p-Laplacian, since its eigenvectors are not necessarily orthogonal. However, for a function with ⟨f, 1⟩ = 0 one has

    ‖f‖_2^2 = ‖f − (1/n)⟨f, 1⟩ 1‖_2^2 = min_{c∈ℝ} ‖f − c 1‖_2^2.

Thus, we can write equivalently

    v^(2) = arg min_{f∈ℝ^n} ⟨f, Δ_2^(u) f⟩ / min_{c∈ℝ} ‖f − c 1‖_2^2.

This motivates the definition of F_p^(2): ℝ^V → ℝ,

    F_p^(2)(f) = Q_p(f) / min_{c∈ℝ} ‖f − c 1‖_p^p.

Theorem 3.2 The second eigenvalue λ_p^(2) of the graph p-Laplacian Δ_p^(u) is equal to the global minimum of the functional F_p^(2). The corresponding eigenvector v_p^(2) of Δ_p^(u) is then given as v_p^(2) = u − c* 1 for any global minimizer u of F_p^(2), where c* = arg min_{c∈ℝ} Σ_{i=1}^n |u_i − c|^p. Furthermore, the functional F_p^(2) satisfies F_p^(2)(t u + c 1) = F_p^(2)(u) for all t, c ∈ ℝ.

Proof: Can be found in (Bühler & Hein, 2009).
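Evaluating F_p^(2) requires the inner problem min_{c∈ℝ} ‖f − c 1‖_p^p, which is convex in c and, as noted later in Section 5, can be solved by bisection. A minimal evaluation sketch (routine names and iteration count are our own choices):

    import numpy as np

    def center_p(f, p, iters=100):
        """Minimize sum_i |f_i - c|^p over c by bisection on the monotone
        quantity sum_i phi_p(f_i - c), whose root is the minimizer."""
        lo, hi = f.min(), f.max()
        for _ in range(iters):
            c = 0.5 * (lo + hi)
            if np.sum(np.abs(f - c) ** (p - 1) * np.sign(f - c)) > 0:
                lo = c    # stationary point lies to the right of c
            else:
                hi = c
        return 0.5 * (lo + hi)

    def F_p2(W, f, p):
        """F_p^(2)(f) = Q_p(f) / min_c ||f - c 1||_p^p."""
        g = f - center_p(f, p)
        Qp = 0.5 * (W * np.abs(f[:, None] - f[None, :]) ** p).sum()
        return Qp / (np.abs(g) ** p).sum()

By Theorem 3.2, minimizing F_p2 over f yields λ_p^(2), and subtracting the optimal shift c* gives the second eigenvector v_p^(2).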
Thus, instead of solving the complicated nonlinear equation of Definition 3.1 to obtain the second eigenvector of the graph p-Laplacian, we just have to find the global minimum of the functional F_p^(2). In the next section, we discuss the relation between the second eigenvalue λ_p^(2) of the graph p-Laplacian and the balanced graph cuts of Section 2. In Section 5, we provide an algorithmic framework to compute the second eigenvector of the p-Laplacian efficiently.

4. Spectral properties of the graph p-Laplacian and the Cheeger cut

Now that we have discussed the variational characterization of the second eigenvector of the p-Laplacian, we will provide the relation to the relaxation of balanced graph cut criteria, as can be done for the standard graph Laplacian.

4.1. Spectral relaxation of balanced graph cuts

It is well known that the second eigenvector of the unnormalized and normalized standard graph Laplacians (p = 2) is the solution of a relaxation of the ratio cut RCut(C, C̅) and the normalized cut NCut(C, C̅), see e.g. (von Luxburg, 2007). We will now show that the second eigenvector v_p^(2) of the p-Laplacian can also be seen as a relaxation of balanced graph cuts.

Theorem 4.1 For p > 1 and every partition of V into C, C̅ there exists a function f_{p,C} ∈ ℝ^V such that the functional F_p^(2) associated to the unnormalized p-Laplacian satisfies

    F_p^(2)(f_{p,C}) = cut(C, C̅) (1/|C|^{1/(p−1)} + 1/|C̅|^{1/(p−1)})^{p−1},

with the special cases

    F_2^(2)(f_{2,C}) = RCut(C, C̅),   lim_{p→1} F_p^(2)(f_{p,C}) = RCC(C, C̅).

Moreover, one has F_p^(2)(f_{p,C}) ≤ 2^{p−1} RCC(C, C̅). Equivalent statements hold for a function g_{p,C} for the normalized cut and the normalized p-Laplacian Δ_p^(n).

Proof: Let p > 1. For a partition C, C̅ of V we define the function f_{p,C}: V → ℝ as

    (f_{p,C})_i = 1/|C|^{1/(p−1)} for i ∈ C,   (f_{p,C})_i = −1/|C̅|^{1/(p−1)} for i ∈ C̅.

One has

    Q_p(f_{p,C}) = Σ_{i∈C, j∈C̅} w_ij (1/|C|^{1/(p−1)} + 1/|C̅|^{1/(p−1)})^p.

Moreover, one has

    min_{c∈ℝ} ‖f_{p,C} − c 1‖_p^p = ‖f_{p,C}‖_p^p = 1/|C|^{1/(p−1)} + 1/|C̅|^{1/(p−1)}.

With F_p^(2)(f_{p,C}) = Q_p(f_{p,C}) / min_{c∈ℝ} ‖f_{p,C} − c 1‖_p^p, we get

    F_p^(2)(f_{p,C}) = cut(C, C̅) (1/|C|^{1/(p−1)} + 1/|C̅|^{1/(p−1)})^{p−1}
                     ≤ cut(C, C̅) (2 / min{|C|, |C̅|}^{1/(p−1)})^{p−1} = 2^{p−1} RCC(C, C̅).

The first equality shows the general result and simplifies to the ratio cut for p = 2. The limit p → 1 follows with lim_{α→∞} (a^α + b^α)^{1/α} = max{a, b}.

Thus, since one minimizes over all functions in the eigenproblem for the second eigenvector of the p-Laplacians Δ_p^(u) and Δ_p^(n), it is a relaxation of the ratio/normalized cut for p = 2 and of the ratio/normalized Cheeger cut in the limit p → 1. In the interval 1 < p < 2 the eigenproblem can be seen as a relaxation of an interpolation between the ratio/normalized cut and the ratio/normalized Cheeger cut: for the functional F_p^(2) of Δ_p^(u) we get

    F_p^(2)(f_{p,C}) = cut(C, C̅) (1/|C|^{1/(p−1)} + 1/|C̅|^{1/(p−1)})^{p−1},

which can be understood using the inequalities between l_p-norms, namely that for α ≥ 1 one has

    max{1/|C|, 1/|C̅|} ≤ (1/|C|^α + 1/|C̅|^α)^{1/α} ≤ 1/|C| + 1/|C̅|,

with α = 1/(p−1), so that for 1 < p < 2 one has ∞ > α > 1.

The spectral relaxation of the ratio cut (Hagen & Kahng, 1991) and the normalized cut (Shi & Malik, 2000) was one of the main motivations for standard spectral clustering. There exist other possibilities to relax the ratio and normalized cut problem, see (De Bie & Cristianini, 2006), which lead to a semi-definite program. These relaxations give better bounds on the true cut than the standard spectral relaxation (p = 2), though they are computationally expensive. However, to our knowledge the bounds achievable by semi-definite programming are not as tight as the ones we provide in the next section for the p-Laplacian as p → 1.
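The proof of Theorem 4.1 is constructive, so its first statement can be checked numerically. A small sketch (ours; the random graph and partition are arbitrary) builds f_{p,C} and compares the direct evaluation of F_p^(2), using the fact from the proof that the inner minimum over c is attained at c = 0, against the closed form:

    import numpy as np

    def f_pC(mask, p):
        """The function f_{p,C} from the proof of Theorem 4.1."""
        f = np.empty(mask.shape[0])
        f[mask] = 1.0 / mask.sum() ** (1.0 / (p - 1))
        f[~mask] = -1.0 / (~mask).sum() ** (1.0 / (p - 1))
        return f

    rng = np.random.default_rng(0)
    W = rng.random((8, 8)); W = np.triu(W, 1); W = W + W.T   # random symmetric weights
    mask, p = np.arange(8) < 3, 1.5                          # partition C = {0, 1, 2}
    f = f_pC(mask, p)
    Qp = 0.5 * (W * np.abs(f[:, None] - f[None, :]) ** p).sum()
    F2 = Qp / (np.abs(f) ** p).sum()                         # inner min over c is at c = 0
    cut = W[np.ix_(mask, ~mask)].sum()
    a = 1.0 / mask.sum() ** (1.0 / (p - 1))
    b = 1.0 / (~mask).sum() ** (1.0 / (p - 1))
    print(np.isclose(F2, cut * (a + b) ** (p - 1)))          # expect True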
4.2. Isoperimetric inequality: the second eigenvalue λ_p^(2) and the Cheeger cut

The isoperimetric inequality (Chung, 1997) for the graph Laplacian (p = 2) provides additional theoretical backup for the spectral relaxation. It provides upper and lower bounds on the ratio/normalized Cheeger cut in terms of the second eigenvalue of the graph p-Laplacian. We define the optimal ratio and normalized Cheeger cut values h_RCC and h_NCC as

    h_RCC = inf_C RCC(C, C̅)   and   h_NCC = inf_C NCC(C, C̅).

The standard isoperimetric inequality for p = 2 (see Chung, 1997) is given as

    h_NCC^2 / 2 ≤ λ_2^(2) ≤ 2 h_NCC,

where λ_2^(2) is the second eigenvalue of the standard normalized graph Laplacian (p = 2). The isoperimetric inequality for the normalized p-Laplacian has been proven by Amghibech (2003).

Theorem 4.2 (Amghibech, 2003) Denote by λ_p^(2) the second eigenvalue of the normalized p-Laplacian Δ_p^(n). Then for any p > 1,

    2^{p−1} (h_NCC / p)^p ≤ λ_p^(2) ≤ 2^{p−1} h_NCC.

We extend the result of Amghibech to the unnormalized p-Laplacian.

Theorem 4.3 Denote by λ_p^(2) the second eigenvalue of the unnormalized p-Laplacian Δ_p^(u). For p > 1,

    (2 / max_i d_i)^{p−1} (h_RCC / p)^p ≤ λ_p^(2) ≤ 2^{p−1} h_RCC.

Proof: Can be found in (Bühler & Hein, 2009).

Note that h_NCC < 1 and h_RCC / max_i d_i < 1, so that in both cases the left-hand side of the bound is smaller than h_NCC resp. h_RCC. When considering the limit p → 1, one observes that the bounds on λ_p^(2) become tight. Thus, in the limit p → 1, the second eigenvalue of the unnormalized/normalized p-Laplacian approximates the optimal ratio/normalized Cheeger cut arbitrarily well.

Still the problem remains how to transform the real-valued second eigenvector of the p-Laplacian into a partitioning of the graph. We use the standard procedure and threshold the second eigenvector v_p^(2) to obtain the partitioning; the optimal threshold is determined by minimizing the corresponding Cheeger cut. For the second eigenvector v_p^(2) of the unnormalized graph p-Laplacian Δ_p^(u) we determine

    arg min_{C_t = {i∈V | v_p^(2)(i) > t}} RCC(C_t, C̅_t),    (3)

and similarly for the second eigenvector v_p^(2) of the normalized graph p-Laplacian Δ_p^(n) we compute

    arg min_{C_t = {i∈V | v_p^(2)(i) > t}} NCC(C_t, C̅_t).    (4)

The obvious question is how good the cut values obtained by thresholding the second eigenvector of the p-Laplacian are compared to the optimal Cheeger cut values. The following theorem answers this question and provides the key motivation for p-spectral clustering.

Theorem 4.4 Denote by h*_RCC and h*_NCC the ratio/normalized Cheeger cut values obtained by thresholding the second eigenvector v_p^(2) of the unnormalized/normalized p-Laplacian via (3) for Δ_p^(u) resp. (4) for Δ_p^(n). Then for p > 1,

    h_RCC ≤ h*_RCC ≤ p (max_{i∈V} d_i)^{(p−1)/p} (h_RCC)^{1/p},
    h_NCC ≤ h*_NCC ≤ p (h_NCC)^{1/p}.

Proof: Can be found in (Bühler & Hein, 2009).

One observes that in the limit p → 1 both inequalities become tight, which implies that the cut found by thresholding the second eigenvector of the p-Laplacian converges to the optimal Cheeger cut as p → 1.
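Thresholding via (3) or (4) only requires scanning the distinct entries of the eigenvector as candidate thresholds. A minimal sketch for a dense weight matrix (the function name is ours):

    import numpy as np

    def optimal_threshold(W, v, normalized=False):
        """Return the partition C_t = {i : v_i > t} minimizing RCC (or NCC
        if normalized=True) over all thresholds t, cf. (3) and (4)."""
        d = W.sum(axis=1)
        best_val, best_mask = np.inf, None
        for t in np.unique(v)[:-1]:          # the largest entry would give an empty side
            mask = v > t
            c = W[np.ix_(mask, ~mask)].sum()
            denom = (min(d[mask].sum(), d[~mask].sum()) if normalized
                     else min(mask.sum(), (~mask).sum()))
            if c / denom < best_val:
                best_val, best_mask = c / denom, mask
        return best_mask, best_val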
5. p-Spectral Clustering

The algorithmic scheme for p-spectral clustering is shown in Algorithm 1. More than two clusters are obtained by consecutive splitting of clusters until the desired number of clusters is reached. As multi-partition criterion we use the established generalized versions of the ratio cut (1) and the normalized cut (2); however, one could also think about multi-partition versions of the Cheeger cut.

Algorithm 1 p-Laplacian-based spectral clustering
1: Input: weight matrix W, number of desired clusters k, choice of p-Laplacian.
2: Initialization: cluster C_1 = V, number of clusters s = 1.
3: repeat
4:   Minimize F_p^(2): ℝ^{C_i} → ℝ for the chosen p-Laplacian for each cluster C_i, i = 1, ..., s.
5:   Compute the optimal threshold for dividing each cluster C_i via (3) for Δ_p^(u) or (4) for Δ_p^(n).
6:   Choose to split the cluster C_i such that the total multi-partition cut criterion is minimized (ratio cut (1) for Δ_p^(u) and normalized cut (2) for Δ_p^(n)).
7:   s ← s + 1
8: until number of clusters s = k

The sequential splitting of clusters is the more "traditional" way to do spectral clustering. Alternatively, for the standard graph Laplacian one uses the first k eigenvectors to define a new representation of the data, and in this new k-dimensional representation one then applies a standard clustering algorithm like k-means. This alternative is not possible in our case, since at the moment we are not able to compute higher-order eigenvectors of the p-Laplacian. However, as Theorem 4.4 shows, there is also no need to go this way, since thresholding will yield the optimal Cheeger cut in the limit p → 1.

The functional F_p^(2): ℝ^V → ℝ is non-convex, and thus we cannot guarantee to reach the global minimum. Indeed, a direct minimization for small values of p often converges very quickly to a non-optimal local minimum. Thus we use a different procedure, exploiting the fact that for p = 2 we can easily compute the global minimizer of F_2^(2): it is just the second eigenvector of the standard graph Laplacian, which can be computed efficiently for sparse matrices, e.g. using ARPACK. Since the functional F_p^(2)(f) is continuous in p, we can hope that for close values p_1 and p_2 the global minimizers of F_{p_1}^(2) and F_{p_2}^(2) are also close (at least the local minimizers should be close). Moreover, it is well known that Newton-like methods have superlinear convergence close to the local optima (Bertsekas, 1999). These two facts suggest to minimize F_p^(2) by minimizing a sequence of functionals

    F_{p_0}^(2), F_{p_1}^(2), ..., F_p^(2),   with p_0 = 2 > p_1 > ... > p,

where each step is initialized with the solution of the previous step and the initialization is done with p_0 = 2. In the experiments we found that the update rule p_{t+1} = 0.9 p_t yields a good trade-off between decreasing too fast, with the danger that the optimum of F_{p_t}^(2) is far away from the optimum of F_{p_{t+1}}^(2), and decreasing too slowly, which yields fast convergence of the Newton method but needs a lot of iterations.

The minimization of the functionals F_{p_t}^(2) is done using a mixture of gradient and Newton steps. However, the Hessian of F_{p_t}^(2) is not sparse, which causes problems for large-scale data. It can be decomposed as H = A + a b^T + b a^T + b b^T, where a, b ∈ ℝ^n and the matrix A is sparse; thus H is a sparse matrix plus low-rank updates. We therefore discard the low-rank updates and use A as a surrogate for the true Hessian. We use the minimal residual method (Paige & Saunders, 1975) for solving the linear system of the Newton step, as the matrix A is symmetric but not necessarily positive definite, and we add a small ridge to avoid problems with an ill-conditioned matrix A. Note that the term min_{c∈ℝ} ‖f − c 1‖_p^p in the functional F_p^(2)(f) is itself a (convex) optimization problem, which can be solved very quickly using bisection.
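The following is a deliberately simplified sketch of this scheme: the continuation p ← 0.9 p, initialized from the exact p = 2 eigenvector, follows the description above, but plain gradient descent with backtracking replaces the paper's mixture of gradient and Newton steps (with the sparse Hessian surrogate, MINRES and ridge), and a dense eigendecomposition replaces ARPACK. All names and constants below are our own choices.

    import numpy as np

    def center_p(f, p, iters=100):
        """Inner convex problem min_c ||f - c 1||_p^p, solved by bisection."""
        lo, hi = f.min(), f.max()
        for _ in range(iters):
            c = 0.5 * (lo + hi)
            if np.sum(np.abs(f - c) ** (p - 1) * np.sign(f - c)) > 0:
                lo = c
            else:
                hi = c
        return 0.5 * (lo + hi)

    def value_and_grad(W, f, p):
        """F_p^(2)(f) and its gradient (envelope theorem for the inner min)."""
        g = f - center_p(f, p)
        diff = f[:, None] - f[None, :]
        Q = 0.5 * (W * np.abs(diff) ** p).sum()
        m = (np.abs(g) ** p).sum()
        lap = (W * np.abs(diff) ** (p - 1) * np.sign(diff)).sum(axis=1)  # Delta_p^(u) f
        grad = (p * lap * m - Q * p * np.abs(g) ** (p - 1) * np.sign(g)) / m ** 2
        return Q / m, grad

    def second_eigvec_p(W, p_target=1.1, rho=0.9, inner_steps=200):
        """Continuation from p = 2 down to p_target via p <- rho * p."""
        D = np.diag(W.sum(axis=1))
        f = np.linalg.eigh(D - W)[1][:, 1]   # exact second eigenvector for p = 2
        p = 2.0
        while p > p_target:
            p = max(rho * p, p_target)       # continuation step
            for _ in range(inner_steps):
                val, grad = value_and_grad(W, f, p)
                lr = 1.0
                while lr > 1e-12 and value_and_grad(W, f - lr * grad, p)[0] > val:
                    lr *= 0.5                # backtracking line search
                f = f - lr * grad
                f = f / np.abs(f).max()      # rescaling leaves F_p^(2) invariant
        return f

The final clustering is then obtained by applying the thresholding routine of Section 4.2 to the returned vector.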
6. Experimental evaluation

In all experiments, we used a symmetric K-NN graph with K = 10 and weights w_ij defined as

    w_ij = max{s_i(j), s_j(i)},   with s_i(j) = exp(−4 ‖x_i − x_j‖^2 / σ_i^2),

where σ_i is the Euclidean distance of x_i to its K-th nearest neighbor. We evaluate the clustering on datasets with known number of classes k. We cluster the data into k clusters and check the agreement of the found clusters C_1, ..., C_k with the class structure using the error measure

    error(C_1, ..., C_k) = (1/|V|) Σ_{i=1}^k Σ_{j∈C_i} I_{Y_j ≠ Y'_i},   (5)

where Y_j is the true label of point j and Y'_i is the dominant true label in cluster C_i.
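A dense numpy sketch of this setup, covering the graph construction and the error measure (5); the O(n^2) dense realization and the function names are our own simplification of the description above.

    import numpy as np

    def knn_graph(X, K=10):
        """w_ij = max{s_i(j), s_j(i)} with s_i(j) = exp(-4 ||x_i - x_j||^2 / sigma_i^2),
        where sigma_i is the distance of x_i to its K-th nearest neighbor."""
        n = X.shape[0]
        D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
        order = np.argsort(D2, axis=1)                       # column 0 is the point itself
        sigma2 = D2[np.arange(n), order[:, K]]               # squared K-th neighbor distance
        S = np.zeros((n, n))
        for i in range(n):
            for j in order[i, 1:K + 1]:                      # K nearest neighbors of i
                S[i, j] = np.exp(-4.0 * D2[i, j] / sigma2[i])
        return np.maximum(S, S.T)                            # symmetrize via the max rule

    def clustering_error(y_true, clusters):
        """Error measure (5): fraction of points whose true label differs
        from the dominant label of their cluster."""
        n = sum(len(idx) for idx in clusters)
        wrong = 0
        for idx in clusters:                                 # idx: indices of one cluster
            labels = y_true[np.asarray(idx)]
            wrong += (labels != np.bincount(labels).argmax()).sum()
        return wrong / n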
6.1. High-dimensional noisy two moons

The two moons dataset is generated as two half-circles in ℝ^2, embedded into a d-dimensional space, to which Gaussian noise N(0, σ^2 I_d) is added. When varying d, n and σ, we always made the same observation: for decreasing values of p, unnormalized and normalized p-spectral clustering lead to cuts with decreasing values of the Cheeger cuts RCC and NCC. In Figure 1 we illustrate this for the case d = 100, n = 2000 and σ^2 = 0.02. Note that this dataset is far from trivial, since the high-dimensional noise has corrupted the graph (see the edge structure in Figure 1). The histograms of the values of the second eigenvector for p equal to 2, 1.7, 1.4 and 1.1 show strong differences: for p = 2 the values are scattered over the interval, whereas for p = 1.1 they are almost concentrated on two peaks. This suggests that for p = 1.1 the p-eigenvector is quite close to the function f_{p,C} defined in Theorem 4.1. The third row of Figure 1 shows the resulting clusters found by p-spectral clustering with Δ_p^(n); for p → 1 the clustering is almost perfect despite the difficulty of this dataset. In order to illustrate that this result is representative, we repeated the experiment 100 times. The plot at the bottom left of Figure 1 shows the mean of the normalized Cheeger cut, the second eigenvalue λ_p^(2), the normalized cut and the error as p → 1. One observes that, despite some variance, the results of p-spectral clustering are significantly better than those of standard spectral clustering.

Figure 1. Results for the two moons dataset (2000 points in 100 dimensions, noise variance 0.02). First row, from left to right: second eigenvector of the p-Laplacian for p = 2.0, 1.7, 1.4, 1.1. Second row: histogram of the values of the second eigenvector. Last row: resulting clustering after finding the optimal threshold according to the NCC criterion; the corresponding cut values are NCut 0.1461 / NCC 0.07374 (p = 2.0), NCut 0.1408 / NCC 0.07050 (p = 1.7), NCut 0.1155 / NCC 0.05791 (p = 1.4) and NCut 0.1035 / NCC 0.05330 (p = 1.1). First column, top: the values of NCC, the eigenvalue λ_p^(2), NCut and the error for the example shown on the right; middle: plot of the edge structure; bottom: average values plus standard deviation of NCC, NCut, λ_p^(2) and the error for varying p. [Plots not reproducible in this text version.]

6.2. UCI datasets

In Table 2 we show results of p-spectral clustering on several UCI datasets, both for the normalized and the unnormalized p-Laplacian. The corresponding Cheeger cuts (second row for each dataset) are consistently decreasing as p → 1. For most of the datasets this also implies that the ratio/normalized cut decreases. Note that the error is often constant despite the fact that the cut is still decreasing: in contrast to the other examples, minimizing the cut does not necessarily lead to a smaller error here.

Table 2. Results of normalized (left) and unnormalized (right) p-spectral clustering on UCI datasets. For each dataset, the three rows give NCut, NCC and error resp. RCut, RCC and error.

                       Normalized                 Unnormalized
    p                  2.0     1.4     1.1        2.0     1.4     1.1
    Breast    cut      0.0254  0.0229  0.0289     0.0467  0.0332  0.0332
              CC       0.0209  0.0135  0.0174     0.0300  0.0220  0.0220
              error    0.293   0.293   0.293      0.293   0.293   0.293
    Heart     cut      0.118   0.0796  0.0796     0.108   0.0946  0.0946
              CC       0.0621  0.0579  0.0579     0.0550  0.0473  0.0473
              error    0.215   0.356   0.356      0.204   0.219   0.219
    Ringnorm  cut      0.443   0.420   0.420      0.219   0.210   0.210
              CC       0.222   0.210   0.210      0.109   0.105   0.105
              error    0.281   0.288   0.287      0.290   0.310   0.309
    Twonorm   cut      0.0821  0.0813  0.0811     0.0392  0.0388  0.0387
              CC       0.0411  0.0407  0.0406     0.0196  0.0194  0.0193
              error    0.0257  0.0259  0.0261     0.0255  0.0261  0.0269
    Waveform  cut      0.101   0.0857  0.0828     0.0485  0.0410  0.0396
              CC       0.0552  0.0460  0.0438     0.0265  0.0221  0.0210
              error    0.227   0.211   0.201      0.225   0.212   0.201

6.3. USPS and MNIST

We perform unnormalized p-spectral clustering on the full USPS and MNIST datasets (n = 9298 and n = 70000). In Table 1 one observes that for p → 1 the ratio cut as well as the error decreases for both datasets. The error is even misleading, since the class separation is quite good but one class has been split, which implies that two classes have been merged. This happens for both datasets; in Table 1 we also provide the confusion matrix for MNIST for p-spectral clustering with p = 1.2. For larger values of the number of clusters k we thus expect better results.

Table 1. Top: results of unnormalized p-spectral clustering with k = 10 for USPS and MNIST using the ratio multi-partition criterion (1). In both cases the RCut and the error decrease significantly as p decreases. Bottom: confusion matrix for MNIST of the clusters found by p-spectral clustering for p = 1.2. Class 1 has been split into two clusters and classes 4 and 9 have been merged; thus no cluster corresponds to class 9. Apart from the merged classes, the clustering reflects the class structure quite well. [Confusion matrix entries not recoverable from this extraction.]

    p      USPS RCut   USPS error   MNIST RCut   MNIST error
    2.0    0.819       0.233        0.225        0.189
    1.9    0.741       0.142        0.209        0.172
    1.8    0.718       0.141        0.186        0.170
    1.7    0.698       0.139        0.170        0.169
    1.6    0.684       0.134        0.164        0.170
    1.5    0.676       0.133        0.161        0.133
    1.4    0.693       0.141        0.158        0.132
    1.3    0.684       0.138        0.155        0.131
    1.2    0.679       0.137        0.153        0.129

The following table shows the runtime behavior (in seconds) of p-spectral clustering for USPS:

    p      2.0   1.9   1.8   1.7   1.6   1.5   1.4    1.3    1.2
    t      10    81    99    144   224   456   1147   2266   4660

As p → 1 the problem becomes more difficult, which is to be expected since one asymptotically approximates the optimal Cheeger cut. However, there is still room for improvement to speed up our current implementation.

Acknowledgments

This work has been supported by the Excellence Cluster on Multimodal Computing and Interaction at Saarland University.

References

Amghibech, S. (2003). Eigenvalues of the discrete p-Laplacian for graphs. Ars Combin., 67, 283–302.

Bertsekas, D. (1999). Nonlinear programming. Athena Scientific.

Bühler, T., & Hein, M. (2009). Supplementary material. http://www.ml.uni-saarland.de/Publications/BueHei09tech.pdf

Chung, F. (1997). Spectral graph theory. AMS.

De Bie, T., & Cristianini, N. (2006). Fast SDP relaxations of graph cut clustering, transduction, and other combinatorial problems. J. Mach. Learn. Res., 7, 1409–1436.

Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Math. J., 23, 298–305.

Hagen, L., & Kahng, A. B. (1991). Fast spectral methods for ratio cut partitioning and clustering. Proc. IEEE Intl. Conf. on Computer-Aided Design, 10–13.

Hein, M., Audibert, J.-Y., & von Luxburg, U. (2007). Graph Laplacians and their convergence on random neighborhood graphs. J. Mach. Learn. Res., 8, 1325–1368.

Paige, C., & Saunders, M. (1975). Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal., 12, 617–629.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans. Patt. Anal. Mach. Intell., 22, 888–905.

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17, 395–416.
Zhou, D., & Schölkopf, B. (2005). Regularization on discrete spaces. Deutsche Arbeitsgemeinschaft für Mustererkennung-Symposium (pp. 361–368).