Spectral Clustering with Perturbed Data

Ling Huang (Intel Research, ling.huang@intel.com), Donghui Yan (UC Berkeley, dhyan@stat.berkeley.edu), Michael I. Jordan (UC Berkeley, jordan@cs.berkeley.edu), Nina Taft (Intel Research, nina.taft@intel.com)

Abstract

Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc.) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance.

1 Introduction

A critical problem in machine learning is that of scaling: algorithms should be effective computationally and statistically as various dimensions of a problem are scaled. One general tool for approaching large-scale problems is that of clustering or partitioning, in essence an appeal to the principle of divide-and-conquer. However, while the output of a clustering algorithm may yield a set of smaller-scale problems that may be easier to tackle, clustering algorithms can themselves be complex, and large-scale clustering often requires the kinds of preprocessing steps that are invoked for other machine learning algorithms [1], including proto-clustering steps such as quantization, downsampling and compression. Such preprocessing steps also arise in the distributed sensing and distributed computing setting, where communication and storage limitations may preclude transmitting the original data to centralized processors. A number of recent works have begun to tackle the issue of determining the tradeoffs that arise under various "perturbations" of data, including quantization and downsampling [2, 3, 4]. Most of these analyses have been undertaken in the context of well-studied domains such as classification, regression and density estimation, for which there are existing statistical analyses of the effect of noise on performance. Although extrinsic noise differs conceptually from perturbations to data imposed by a data analyst to cope with resource limitations, the mathematical issues arising in the two cases are similar, and the analyses of noise have provided a basis for the study of the tradeoffs arising from perturbations.

In this paper we focus on spectral clustering, a class of clustering methods that are based on eigendecompositions of affinity, dissimilarity or kernel matrices [5, 6, 7, 8]. These algorithms often outperform traditional clustering algorithms such as the K-means algorithm or hierarchical clustering. To date, however, their impact on real-world, large-scale problems has been limited; in particular, a distributed or "in-network" version of spectral clustering has not yet appeared.
Moreover, there has been little work on the statistical analysis of spectral clustering, and thus there is little theory to guide the design of distributed algorithms. There is an existing literature on numerical techniques for scaling spectral clustering (including downsampling [9, 10] and the relaxation of precision requirements for the eigenvector computation [7]), but this literature does not provide end-to-end, practical bounds on error rates as a function of data perturbations.

Procedure SpectralClustering($x_1, \ldots, x_n$)
Input: $n$ data samples $\{x_i\}_{i=1}^n$, $x_i \in R^d$
Output: Bipartition $S$ and $\bar S$ of the input data
1. Compute the similarity matrix $K$: $K_{ij} = \exp\big(-\|x_i - x_j\|^2 / (2\sigma_k^2)\big)$, $\forall\, x_i, x_j$
2. Compute the diagonal degree matrix $D$: $D_i = \sum_{j=1}^n K_{ij}$
3. Compute the normalized Laplacian matrix: $L = I - D^{-1}K$
4. Find the second eigenvector $v_2$ of $L$
5. Obtain the two partitions using $v_2$: $S = \{[i] : v_{2i} > 0\}$, $\bar S = \{[i] : v_{2i} \le 0\}$

Figure 1: A spectral bipartitioning algorithm.

Figure 2: Perturbation analysis: from clustering error to data perturbation error. [Diagram: under Assumption A, the data error $\sigma^2$ propagates to the similarity matrix error $dK$ (Lemma 3 or 4), to the Laplacian matrix error $dL$ (Lemma 2 and Eqns. (7)-(13)), to the eigenvector error $\|\tilde v_2 - v_2\|^2$ (Eqns. (5), (6)), and finally to the mis-clustering rate $\rho$ (Proposition 1).]

In this paper we present the first end-to-end analysis of the effect of data perturbations on spectral clustering. Our focus is quantization, but our analysis is general and can be used to treat other kinds of data perturbation. Indeed, given that our approach is based on treating perturbations as random variables, we believe that our methods will also prove useful in developing statistical analyses of spectral clustering (although that is not our focus in this paper).

The paper is organized as follows. In Section 2, we provide a brief introduction to spectral clustering. Section 3 contains the main results of the paper; specifically, we introduce the mis-clustering rate $\rho$ and present upper bounds on $\rho$ due to data perturbations. In Section 4, we present an empirical evaluation of our analyses. Finally, in Section 5 we present our conclusions.

2 Spectral clustering and data perturbation

2.1 Background on spectral clustering algorithms

Given a set of data points $\{x_i\}_{i=1}^n$, $x_i \in R^{1 \times d}$, and some notion of similarity between all pairs of data points $x_i$ and $x_j$, spectral clustering attempts to divide the data points into groups such that points in the same group are similar and points in different groups are dissimilar. The point of departure of a spectral clustering algorithm is a weighted similarity graph $G(V, E)$, where the vertices correspond to data points and the weights correspond to the pairwise similarities. Based on this weighted graph, spectral clustering algorithms form the graph Laplacian and compute an eigendecomposition of this Laplacian [5, 6, 7]. While some algorithms use multiple eigenvectors and find a $k$-way clustering directly, the most widely studied algorithms form a bipartitioning of the data by thresholding the second eigenvector of the Laplacian (the eigenvector with the second smallest eigenvalue). Larger numbers of clusters are found by applying the bipartitioning algorithm recursively. We present a specific example of a spectral bipartitioning algorithm in Fig. 1.
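As a concrete illustration of the procedure in Fig. 1, here is a minimal NumPy sketch. The function name, the default bandwidth, and the use of a dense general-purpose eigensolver are our own illustrative choices, not part of the original specification.

```python
import numpy as np

def spectral_bipartition(X, sigma_k=1.0):
    """Sketch of the spectral bipartitioning algorithm of Fig. 1.

    X       : (n, d) data matrix, one sample per row
    sigma_k : Gaussian-kernel bandwidth (illustrative default)
    Returns a boolean array of length n: True for S, False for S-bar.
    """
    # Step 1: similarity matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma_k^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma_k ** 2))
    # Step 2: degree D_i = sum_j K_ij
    D = K.sum(axis=1)
    # Step 3: normalized Laplacian L = I - D^{-1} K
    L = np.eye(X.shape[0]) - K / D[:, None]
    # Step 4: eigenvector of L with the second-smallest eigenvalue.
    # L is not symmetric, so use the general eigensolver and keep real parts.
    eigvals, eigvecs = np.linalg.eig(L)
    order = np.argsort(eigvals.real)
    v2 = eigvecs[:, order[1]].real
    # Step 5: threshold the second eigenvector at zero
    return v2 > 0
```

For this normalized Laplacian the smallest eigenvalue is zero with a constant eigenvector (since $D^{-1}K$ is row-stochastic), so the partition information is carried by the eigenvector with the second-smallest eigenvalue.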
2.2 Input data perturbation

Let the data matrix $X \in R^{n \times d}$ be formed by stacking the $n$ data samples in rows. To this data matrix we assume that a perturbation $W$ is applied, such that we obtain a perturbed version $\tilde X$ of the original data. We assume that a spectral clustering algorithm is applied to $\tilde X$, and we wish to compare the results of this clustering with respect to the spectral clustering of $X$. This analysis captures a number of data perturbation methods, including data filtering, quantization, lossy compression and synopsis-based data approximation [11]. The multi-scale clustering algorithms that use "representative" samples to approximate the original data can be treated using our analysis as well [12].

3 Mis-clustering rate and effects of data perturbation

Let $K$ and $L$ be the similarity and Laplacian matrix on the original data $X$, and let $\tilde K$ and $\tilde L$ be those on the perturbed data. We define the mis-clustering rate $\rho$ as the proportion of samples that have different cluster memberships when computed on the two different versions of the data, $X$ and $\tilde X$. We wish to bound $\rho$ in terms of the "magnitude" of the error matrix $W = \tilde X - X$, which we now define. We make the following general stochastic assumption on the error matrix $W$:

A. All elements of the error matrix $W$ are i.i.d. random variables with zero mean, bounded variance $\sigma^2$ and bounded fourth central moment $\mu_4$; and are independent of $X$.

Remark. (i) Note that we do not make i.i.d. assumptions on the elements of the similarity matrix; rather, our assumption refers to the input data only. (ii) This assumption is distribution free, and captures a wide variety of practical data collection and quantization schemes. (iii) Certain data perturbation schemes may not satisfy the independence assumption. We have not yet conducted an analysis of the robustness of our bounds to lack of independence, but in our empirical work we have found that the bounds are robust to relatively small amounts of correlation.

We aim to produce practically useful bounds on $\rho$ in terms of $\sigma$ and the data matrix $X$. The bounds should be reasonably tight so that in practice they could be used to determine the degree of perturbation given a desired level of clustering performance, or to provide a clustering error guarantee on the original data even though we have access only to its approximate version.

Fig. 2 outlines the steps in our theoretical analysis. Briefly, when we perturb the input data (e.g., by filtering, quantization or compression), we introduce a perturbation $W$ to the data which is quantified by $\sigma^2$. This induces an error $dK := \tilde K - K$ in the similarity matrix, and in turn an error $dL := \tilde L - L$ in the Laplacian matrix. This further yields an error in the second eigenvector of the Laplacian matrix, which results in mis-clustering error. Overall, we establish an analytical relationship between the mis-clustering rate $\rho$ and the data perturbation error $\sigma^2$, i.e., a functional relationship of the form $\rho = f(\sigma^2)$, where $f$ is usually monotonically increasing. Our goal is to allow practitioners to specify a mis-clustering rate $\rho$, and by inverting this relationship, to determine the magnitude of the perturbation allowed. That is, our work can provide a practical method to determine the tradeoff between data perturbation and the loss of clustering accuracy due to the use of $\tilde X$ instead of $X$. When the data perturbation can be related to computational or communications savings, our analysis yields a practical characterization of the overall resource/accuracy tradeoff.
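To make the quantities $\rho$, $W$ and $\sigma$ concrete, the following sketch (reusing the spectral_bipartition function above) applies i.i.d. zero-mean Gaussian noise, one simple perturbation satisfying Assumption A, and measures the resulting mis-clustering rate. Function names and defaults are our own illustrative choices.

```python
import numpy as np

def misclustering_rate(labels, labels_tilde):
    """Fraction of samples whose cluster membership differs between the
    clustering of X and that of the perturbed data (rho in the text).
    Cluster labels are defined only up to swapping the two clusters, so
    take the smaller of the two possible disagreement rates."""
    disagree = np.mean(labels != labels_tilde)
    return min(disagree, 1.0 - disagree)

def empirical_rho(X, sigma, sigma_k=1.0, seed=0):
    """Sketch: apply an error matrix W with i.i.d. N(0, sigma^2) entries
    (one perturbation satisfying Assumption A) and measure rho."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma, size=X.shape)              # error matrix W
    labels = spectral_bipartition(X, sigma_k)             # clustering of X
    labels_tilde = spectral_bipartition(X + W, sigma_k)   # clustering of X-tilde
    return misclustering_rate(labels, labels_tilde)
```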
Practical applications. Consider in particular a clustering task in a distributed networking system that allows an application to specify a desired clustering error $\rho^*$ on the distributed data (which is not available to the coordinator). Through a communication protocol similar to that in [4], the coordinator (e.g., a network operation center) gets access to the perturbed data $\tilde X$ for spectral clustering. The coordinator can compute a clustering error bound $\rho_C$ using our method. By setting $\rho_C \le \rho^*$, it determines the tolerable data perturbation error and instructs the distributed devices to use appropriate numbers of bits to quantize their data. Thus we can provide guarantees on the achieved error, $\rho \le \rho_C \le \rho^*$, with respect to the original distributed data even with access only to the perturbed data.

3.1 Upper bounding the mis-clustering rate

Little is currently known about the connection between clustering error and perturbations to the Laplacian matrix in the spectral clustering setting. [5] presented an upper bound for the clustering error; however, this bound is usually quite loose and is not viable for practical applications. In this section we propose a new approach based on a water-filling argument that yields a tighter, practical bound. Let $v_2$ and $\tilde v_2$ be the unit-length second eigenvectors of $L$ and $\tilde L$, respectively. We derive a relationship between the mis-clustering rate $\rho$ and $\eta^2 := \|\tilde v_2 - v_2\|^2$.

The intuition behind our derivation is suggested in Fig. 3. Let $a$ and $b$ denote the sets of components of $v_2$ corresponding to clusters of size $k_1$ and $k_2$, respectively, and similarly for $\tilde a$ and $\tilde b$ in the case of $\tilde v_2$. If $v_2$ is changed to $\tilde v_2$ due to the perturbation, an incorrect clustering happens whenever a component of $v_2$ in set $a$ jumps to set $\tilde b$, denoted as $a \to \tilde b$, or a component in set $b$ jumps to set $\tilde a$, denoted as $b \to \tilde a$. The key observation is that each flipping of cluster membership in either $a \to \tilde b$ or $b \to \tilde a$ contributes a fairly large amount to the value of $\eta^2$, compared to the short-range drifts in $a \to \tilde a$ or $b \to \tilde b$. Given a fixed value of $\eta^2$, the maximum possible number of flippings (i.e., mis-clusterings) is therefore constrained, and this translates into an upper bound for $\rho$.

Figure 3: The second eigenvector $v_2$ and its perturbed counterpart $\tilde v_2$ (denoted by dashed lines), illustrating components whose perturbation flips their cluster membership.

Figure 4: An example of the tightness of the upper bound for $\rho$ in Eq. (1) (Wisconsin breast cancer data): the upper bound of Kannan et al. [5], our upper bound, and the mis-clustering rate, as a function of the $\sigma$ of the noise.

We make the following assumptions on the data $X$ and its perturbation:

B1. The components of $v_2$ form two clusters (with respect to the spectral bipartitioning algorithm in Fig. 1). The size of each cluster is comparable to $n$.

B2. The perturbation is small, with the total number of mis-clusterings $m < \min(k_1, k_2)$, and the components of $\tilde v_2$ form two clusters. The size of each cluster is comparable to $n$.

B3. The perturbations of individual components of $v_2$ in each of the sets $a \to \tilde a$, $a \to \tilde b$, $b \to \tilde a$ and $b \to \tilde b$ have identical (not necessarily independent) distributions with bounded second moments, respectively, and they are uncorrelated with the components of $v_2$.

Our perturbation bound can now be stated as follows:

Proposition 1. Under assumptions B1, B2 and B3, the mis-clustering rate $\rho$ of the spectral bipartitioning algorithm under the perturbation satisfies
\[ \rho \le \eta^2 = \|\tilde v_2 - v_2\|^2. \tag{1} \]
If we further assume that all components of $\tilde v_2 - v_2$ are independent, then $\rho \le (1 + o_p(1))\, E\|\tilde v_2 - v_2\|^2$.
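The bound of Proposition 1 can be checked numerically: compute the second eigenvectors on the original and perturbed data, align their signs (an eigenvector is defined only up to sign), and compare $\rho$ with $\eta^2$. The sketch below is purely illustrative; the helper second_eigenvector mirrors steps 1-4 of Fig. 1 and is not part of the paper's own code.

```python
import numpy as np

def second_eigenvector(X, sigma_k=1.0):
    """Steps 1-4 of Fig. 1: unit-length second eigenvector of L = I - D^{-1}K."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma_k ** 2))
    L = np.eye(X.shape[0]) - K / K.sum(axis=1)[:, None]
    eigvals, eigvecs = np.linalg.eig(L)
    v2 = eigvecs[:, np.argsort(eigvals.real)[1]].real
    return v2 / np.linalg.norm(v2)

def check_proposition1(X, sigma, sigma_k=1.0, seed=0):
    """Compare the mis-clustering rate rho with eta^2 = ||v2_tilde - v2||^2."""
    rng = np.random.default_rng(seed)
    v2 = second_eigenvector(X, sigma_k)
    v2t = second_eigenvector(X + rng.normal(0.0, sigma, size=X.shape), sigma_k)
    if np.dot(v2, v2t) < 0:          # resolve the arbitrary sign of the eigenvector
        v2t = -v2t
    eta_sq = np.sum((v2t - v2) ** 2)
    rho = min(np.mean((v2 > 0) != (v2t > 0)), np.mean((v2 > 0) == (v2t > 0)))
    # For small perturbations, rho should not exceed eta_sq (Proposition 1).
    return rho, eta_sq
```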
The proof of the proposition is provided in the Appendix.

Remarks. (i) Assumption B3 was motivated by our empirical work. Although it is difficult to establish general necessary and sufficient conditions for B3 to hold, in the Appendix we present some special cases that allow B3 to be verified a priori. It is also worth noting that B3 appears to hold (approximately) across the range of experiments presented in Section 4. (ii) If we assume piecewise constancy for $v_2$, then we can relax the uncorrelatedness assumption in B3. (iii) Our bound has a different flavor than that obtained in [5]. Although the bound in Theorem 4.3 of [5] works for $k$-way clustering, it assumes a block-diagonal Laplacian matrix and requires the gap between the $k$-th and $(k+1)$-th eigenvalues to be greater than 1/2, which is unrealistic in many data sets. In the setting of 2-way spectral clustering and a small perturbation, our bound is much tighter than that derived in [5]; see Fig. 4 in particular.

3.2 Perturbation of the second eigenvector of the Laplacian matrix

We now turn to the relationship between the perturbation of the eigenvectors of a matrix and the perturbation of the matrix itself. One approach is to simply draw on the classical domain of matrix perturbation theory; in particular, applying Theorem V.2.8 from [13], we have the following bound on the (small) perturbation of the second eigenvector:
\[ \|\tilde v_2 - v_2\| \le \frac{4\,\|dL\|_F}{\delta - \sqrt{2}\,\|dL\|_F}, \tag{2} \]
where $\delta$ is the gap between the second and the third eigenvalue. However, in our experimental evaluation we found that $\delta$ can be quite small in many data sets (e.g., $\delta$ ranges from 0.009 to 0.2 in a large portion of the data sets we used in our experiments), and in these cases the right-hand side of (2) can be quite large even for a small perturbation (or can be negative for a large perturbation). Thus the bound given by (2) is often not useful in practical applications.

Figure 5: Experimental examples of the fidelity of the approximation in Eq. (5) on (a) Wisconsin breast cancer data, (b) Waveform data, and (c) Pen-digits data. We add i.i.d. zero-mean Gaussian noise to the input data with different $\sigma$, and we see that the right-hand side (RHS) of (5) approximately upper bounds the left-hand side (LHS).

To derive a more practically useful bound, we begin with a well-known first-order Taylor expansion to compute the perturbation of the second eigenvector of a Laplacian matrix:
\[
\tilde v_2 - v_2 = \sum_{j=1, j\ne 2}^{n} \frac{v_j^T\, dL\, v_2}{\lambda_2 - \lambda_j}\, v_j + O(\|dL\|^2)
\approx \sum_{j=1, j\ne 2}^{n} \frac{\sum_{p=1}^{n}\sum_{q=1}^{n} v_{pj}\, v_{q2}\, dL_{pq}}{\lambda_2 - \lambda_j}\, v_j
= \sum_{p=1}^{n} \Big( \sum_{q=1}^{n} v_{q2}\, dL_{pq} \Big) \cdot \sum_{j=1, j\ne 2}^{n} \frac{v_{pj}\, v_j}{\lambda_2 - \lambda_j}
= \sum_{p=1}^{n} \gamma_p\, u_p, \tag{3}
\]
where $\gamma_p = \sum_{q=1}^{n} v_{q2}\, dL_{pq}$ is a random variable determined by the effect of the perturbation on the Laplacian matrix $L$, and the vector $u_p = \sum_{j=1, j\ne 2}^{n} \frac{v_{pj}}{\lambda_2 - \lambda_j}\, v_j$ is a constant determined by the eigendecomposition of the Laplacian matrix $L$. Then we have
\[
E\|\tilde v_2 - v_2\|^2 \approx E\Big\| \sum_{p=1}^{n} \gamma_p u_p \Big\|^2 = \sum_{p=1}^{n} E\|\gamma_p u_p\|^2 + 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} E\big( \gamma_i \gamma_j\, u_i^T u_j \big). \tag{4}
\]
In our experimental work we have found that for $i \ne j$, $\gamma_i u_i$ is either very weakly correlated with $\gamma_j u_j$ (i.e., the total sum of all cross terms is typically one or two orders of magnitude less than that of the squared terms), or negatively correlated with $\gamma_j u_j$ (i.e., the total sum of all cross terms is less than zero). This empirical evidence suggests the following approximate bound:
\[
E\|\tilde v_2 - v_2\|^2 \lesssim \sum_{p=1}^{n} E(\gamma_p^2) \cdot \|u_p\|^2. \tag{5}
\]
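As a sanity check on the expansion in Eq. (3), the sketch below compares the exact change in an eigenvector with its first-order prediction. For clarity it uses a generic symmetric matrix, where the eigenvectors are orthonormal and the classical first-order formula applies directly; the matrix size and perturbation scale are arbitrary illustrative choices.

```python
import numpy as np

def first_order_eigvec_change(A, dA, which=1):
    """First-order prediction of the change in the eigenvector of the symmetric
    matrix A associated with its (which)-th smallest eigenvalue, under a small
    symmetric perturbation dA (the expansion used in Eq. (3))."""
    eigvals, eigvecs = np.linalg.eigh(A)          # ascending eigenvalues
    v = eigvecs[:, which]
    dv = np.zeros_like(v)
    for j in range(len(eigvals)):
        if j == which:
            continue
        dv += (eigvecs[:, j] @ dA @ v) / (eigvals[which] - eigvals[j]) * eigvecs[:, j]
    return dv

# Tiny check on a random symmetric matrix and a small symmetric perturbation.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8)); A = (A + A.T) / 2
dA = 1e-3 * rng.normal(size=(8, 8)); dA = (dA + dA.T) / 2
w, V = np.linalg.eigh(A)
wp, Vp = np.linalg.eigh(A + dA)
v2, v2p = V[:, 1], Vp[:, 1]
if v2 @ v2p < 0:                                  # fix the arbitrary sign
    v2p = -v2p
# The residual should be second order in ||dA||, i.e., much smaller than ||dA||.
print(np.linalg.norm((v2p - v2) - first_order_eigvec_change(A, dA, which=1)))
```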
Examples of the fidelity of this approximation for particular data sets are shown in Fig. 5. Finally, $E(\gamma_p^2)$ is related to $dL_{pq}$, and can be upper bounded by
\[
E(\gamma_p^2) = E\Big( \sum_{q=1}^{n} v_{q2}\, dL_{pq} \Big)^2 \le \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ v_{i2}\, v_{j2}\, E(dL_{pi})\, E(dL_{pj}) + |v_{i2}\, v_{j2}|\, \sigma_{pi}\, \sigma_{pj} \Big], \tag{6}
\]
where $\sigma_{pi}^2$ is the variance of $dL_{pi}$.

Remark. Through Eqs. (5) and (6), we can bound the squared norm of the perturbation of the second eigenvector in expectation, which in turn bounds the mis-clustering rate. To compute the bound, we need to estimate the first two moments of $dL$, which we discuss next.

3.3 Perturbation of the Laplacian matrix

Let $\tilde D$ be the diagonal matrix with $\tilde D_i = \sum_j \tilde K_{ij}$. We define the normalized Laplacian matrix as $\tilde L = I - \tilde D^{-1} \tilde K$. Letting $\Delta = \tilde D - D$ and $dK = \tilde K - K$, we have the following approximation for $dL = \tilde L - L$:

Lemma 2. If the perturbation $dK$ is small compared to $K$, then
\[
dL = (1 + o(1)) \big( D^{-2} \Delta K - D^{-1} dK \big). \tag{7}
\]

Then, element-wise, the first two moments of $dL$ can be estimated as
\[
E(dL) \approx D^{-2} E(\Delta)\, K - D^{-1} E(dK), \tag{8}
\]
\[
E(dL \circ dL) \approx E\big[ (D^{-2}\Delta K)\circ(D^{-2}\Delta K) - 2\,(D^{-1}dK)\circ(D^{-2}\Delta K) + (D^{-1}dK)\circ(D^{-1}dK) \big]
= D^{-4} E(\Delta^2)\,(K \circ K) + D^{-2}\, E(dK \circ dK) - 2\, D^{-3}\, E(\Delta\, dK) \circ K, \tag{9}
\]
where $\circ$ denotes the element-wise product and the diagonal factors act row-wise; that is, entry $(i,j)$ of (9) reads $E(dL_{ij}^2) \approx D_i^{-4} E(\Delta_i^2) K_{ij}^2 + D_i^{-2} E(dK_{ij}^2) - 2 D_i^{-3} E(\Delta_i\, dK_{ij}) K_{ij}$. The quantities needed to estimate $E(dL)$ and $E(dL^2)$ can be obtained from moments of and correlations among the elements of the similarity matrix $\tilde K_{ij}$. In particular, we have
\[
E(dK_{ij}) = E(\tilde K_{ij}) - K_{ij}, \qquad E(dK_{ij}^2) = E(\tilde K_{ij}^2) - 2 K_{ij}\, E(\tilde K_{ij}) + K_{ij}^2, \tag{10}
\]
\[
E(\Delta_i) = E(\tilde D_i) - D_i, \qquad E(\tilde D_i) = \sum_{j=1}^{n} E(\tilde K_{ij}), \qquad E(\Delta_i^2) = E(\tilde D_i^2) - 2 D_i\, E(\tilde D_i) + D_i^2, \tag{11}
\]
\[
E(\tilde D_i^2) = E\Big( \sum_{j=1}^{n} \tilde K_{ij} \Big)^2 = \sum_{j=1}^{n} E(\tilde K_{ij}^2) + 2 \sum_{j=1}^{n} \sum_{q=j+1}^{n} \Big[ E(\tilde K_{ij})\, E(\tilde K_{iq}) + k_i^{jq}\, \sigma^k_{ij}\, \sigma^k_{iq} \Big], \tag{12}
\]
\[
E(\Delta\, dK)_{ij} = E\big[ (\tilde D_i - D_i)(\tilde K_{ij} - K_{ij}) \big] = E(\tilde D_i \tilde K_{ij}) - D_i\, E(\tilde K_{ij}) - K_{ij}\, E(\Delta_i)
= E\Big( \tilde K_{ij}^2 + \sum_{q=1, q\ne j}^{n} \tilde K_{ij} \tilde K_{iq} \Big) - D_i\, E(\tilde K_{ij}) - K_{ij}\, E(\Delta_i)
= E(\tilde K_{ij}^2) + \sum_{q=1, q\ne j}^{n} \Big[ E(\tilde K_{ij})\, E(\tilde K_{iq}) + k_i^{jq}\, \sigma^k_{ij}\, \sigma^k_{iq} \Big] - D_i\, E(\tilde K_{ij}) - K_{ij}\, E(\Delta_i), \tag{13}
\]
where $\sigma^k_{ij}$ is the standard deviation of $\tilde K_{ij}$ and $-1 \le k_i^{jq} \le 1$ is the correlation coefficient between $\tilde K_{ij}$ and $\tilde K_{iq}$. Estimating all the $k_i^{jq}$'s would require an intensive effort. For simplicity, we could set $k_i^{jq}$ to 1 in Eq. (12) and to $-1$ in Eq. (13), and obtain an upper bound for $E(dL^2)$. This bound could optionally be tightened by using a simulation method to estimate the values of $k_i^{jq}$. However, in our experimental work we have found that our results are insensitive to the values of $k_i^{jq}$, and setting $k_i^{jq} = 0.5$ usually achieves good results.

Remark. Eqs. (8)-(13) allow us to estimate (i.e., to upper bound) the first two moments of $dL$ using those of $dK$, which are computed using Eq. (15) or (16) in Section 3.4.
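A quick numerical check of Lemma 2's first-order approximation: the discrepancy between the exact $dL$ and the approximation should shrink quadratically in the size of $dK$. The data size, bandwidth and noise scale below are arbitrary illustrative choices.

```python
import numpy as np

def laplacian(K):
    """Normalized Laplacian L = I - D^{-1} K for a given similarity matrix."""
    return np.eye(K.shape[0]) - K / K.sum(axis=1)[:, None]

def dL_first_order(K, dK):
    """First-order approximation of Lemma 2: dL ~= D^{-2} Delta K - D^{-1} dK,
    where D_i = sum_j K_ij and Delta_i = sum_j dK_ij (the change in degree)."""
    D = K.sum(axis=1)
    Delta = dK.sum(axis=1)
    return (Delta / D ** 2)[:, None] * K - dK / D[:, None]

# Tiny numerical check: the approximation error should be second order in dK.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
dK = 1e-3 * rng.normal(size=K.shape)
dK = (dK + dK.T) / 2                      # keep the perturbed matrix symmetric
exact = laplacian(K + dK) - laplacian(K)
print(np.abs(exact - dL_first_order(K, dK)).max())
```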
3.4 Perturbation of the similarity matrix

The similarity matrix $\tilde K$ on the perturbed data $\tilde X$ is
\[
\tilde K_{ij} = \exp\Big( - \frac{\|x_i - x_j + \epsilon_i - \epsilon_j\|^2}{2 \sigma_k^2} \Big), \tag{14}
\]
where $\sigma_k$ is the kernel bandwidth and $\epsilon_i$ denotes the perturbation of $x_i$ (the $i$-th row of $W$). Then, given the data $X$, the first two moments of $dK_{ij} = \tilde K_{ij} - K_{ij}$, the error in the similarity matrix, can be determined by one of the following lemmas.

Lemma 3. Given $X$, if all components of $\epsilon_i$ and $\epsilon_j$ are i.i.d. Gaussian $N(0, \sigma^2)$, then
\[
E(\tilde K_{ij}) = M_{ij}\Big( -\frac{\sigma^2}{\sigma_k^2} \Big), \qquad E(\tilde K_{ij}^2) = M_{ij}\Big( -\frac{2\sigma^2}{\sigma_k^2} \Big), \tag{15}
\]
where $M_{ij}(t) = \exp\big( \frac{\theta_{ij}\, t}{1 - 2t} \big)\, (1 - 2t)^{-d/2}$ and $\theta_{ij} = \|x_i - x_j\|^2 / (2\sigma^2)$.

Figure 6: Synthetic data sets illustrated in two dimensions: (a) Gaussian data, (b) Sin-sep data, (c) Concentric data.

Lemma 4. Under Assumption A, given $X$ and for large values of the dimension $d$, the first two moments of $\tilde K_{ij}$ can be computed approximately as follows:
\[
E(\tilde K_{ij}) \approx M_{ij}\Big( -\frac{1}{2\sigma_k^2} \Big), \qquad E(\tilde K_{ij}^2) \approx M_{ij}\Big( -\frac{1}{\sigma_k^2} \Big), \tag{16}
\]
where $M_{ij}(t) = \exp\big( (\theta_{ij} + 2d\sigma^2)\, t + (d\mu_4 + d\sigma^4 + 4\sigma^2 \theta_{ij})\, t^2 \big)$ and $\theta_{ij} = \|x_i - x_j\|^2$.

Remark. (i) Given the data perturbation error $\sigma$, the kernel bandwidth $\sigma_k$ and the data $X$, the first two moments of $dK_{ij}$ can be estimated directly using (15) or (16). (ii) Through Eqs. (1)-(16), we have established a relationship between the mis-clustering rate $\rho$ and the data perturbation magnitude $\sigma$. By inverting this relationship (e.g., using binary search), we can determine a $\sigma$ for a given $\rho$.
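Since the bound is monotonically increasing in $\sigma$, the inversion mentioned in the remark can be carried out by a standard bisection. The sketch below assumes a hypothetical callable rho_bound(sigma) that implements the bound of Sections 3.1-3.4 for a given data set; it is a placeholder, not part of the paper.

```python
def max_sigma_for_target(rho_bound, rho_target, sigma_hi=1.0, tol=1e-4):
    """Largest perturbation level sigma whose predicted mis-clustering bound
    stays below rho_target, found by bisection on the (assumed monotone)
    hypothetical bound function rho_bound(sigma)."""
    lo, hi = 0.0, sigma_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rho_bound(mid) <= rho_target:
            lo = mid          # still within the permitted clustering error
        else:
            hi = mid          # too much perturbation; shrink the interval
    return lo
```

With such a $\sigma$ in hand, a quantization level can be chosen accordingly; for instance, uniform quantization with $b$ bits of data scaled to $[0, 1]$ has step $2^{-b}$ and error standard deviation roughly $2^{-b}/\sqrt{12}$, consistent with the $\sigma$ values shown above the axis in Fig. 7.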
4 Evaluation

In this section we present an empirical evaluation of our analysis on 3 synthetic data sets (see Fig. 6) and 6 real data sets from the UCI repository [14]. The data domains are diverse, including image, medicine, agriculture, etc., and the different data sets impose different difficulty levels on the underlying spectral clustering algorithm, demonstrating the wide applicability of our analysis. In the experiments, we use data quantization as the perturbation scheme to evaluate the upper bound provided by our analysis on the clustering error.

Fig. 7 plots the mis-clustering rate and the upper bound for data sets subject to varying degrees of quantization. As expected, the mis-clustering rate increases as one decreases the number of quantization bits. We find that the error bounds are correct and remarkably tight, which validates the assumptions we make in the analysis. It is also interesting to note that even when using as few as 3-4 bits, the clustering degrades very little, both in actual error and as assessed by our bound. The effectiveness of our bound should allow the practitioner to determine the right number of quantization bits given a permitted loss in clustering performance.

Figure 7: Upper bounds of clustering error on approximate data obtained from quantization, as a function of the number of bits. (a-c) Simulated data sets (1000 samples; 2, 2 and 10 features, respectively); (d) Statlog image segmentation data (2310 samples, 19 features); (e) Handwritten digits data (10992 samples, 16 features); (f) Wine data (178 samples, 13 features); (g) Iris data (150 samples, 4 features); (h) Wisconsin breast cancer data (569 samples, 30 features); (i) Waveform data (5000 samples, 21 features). The x-axis shows the number of quantization bits and (above the axis) the corresponding data perturbation error $\sigma$. Error bars are derived from 25 replications. In the experiments, all data values are normalized to the range [0, 1]. For data sets with more than two clusters, we choose two of them for the experiments.

5 Conclusion

In this paper, we proposed a theoretical analysis of the clustering error for spectral clustering in the face of stochastic perturbations. Our experimental evaluation has provided support for the assumptions made in the analysis, showing that the bound is tight under conditions of practical interest. We believe that our work, which provides an analytical relationship between the mis-clustering rate and the variance of the perturbation, constitutes a critical step towards enabling a large class of applications that seek to perform clustering of objects, machines, data, etc. in a distributed environment. Many networks are bandwidth constrained, and our methods can guide the process of data thinning so as to limit the amount of data transmitted through the network for the purpose of clustering.

References

[1] L. Bottou and O. Bousquet, "The tradeoffs of large scale learning," in Advances in Neural Information Processing Systems 20, 2007.
[2] A. Silberstein, G. Puggioni, A. Gelfand, K. Munagala, and J. Yang, "Suppression and failures in sensor networks: A Bayesian approach," in Proceedings of VLDB, 2007.
[3] X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Nonparametric decentralized detection using kernel methods," IEEE Transactions on Signal Processing, vol. 53, no. 11, pp. 4053-4066, 2005.
[4] L. Huang, X. Nguyen, M. Garofalakis, A. D. Joseph, M. I. Jordan, and N. Taft, "In-network PCA and anomaly detection," in Advances in Neural Information Processing Systems (NIPS), 2006.
[5] R. Kannan, S. Vempala, and A. Vetta, "On clusterings: Good, bad and spectral," Journal of the ACM, vol. 51, no. 3, pp. 497-515, 2004.
[6] A. Y. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems (NIPS), 2002.
[7] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
[8] U. von Luxburg, M. Belkin, and O. Bousquet, "Consistency of spectral clustering," Annals of Statistics, vol. 36, no. 2, pp. 555-586, 2008.
[9] P. Drineas and M. W. Mahoney, "On the Nyström method for approximating a Gram matrix for improved kernel-based learning," in Proceedings of COLT, 2005, pp. 323-337.
[10] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, "Spectral grouping using the Nyström method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, 2004.
Malik, "Spectral grouping using the Nystr¨m method," IEEE o Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, 2004. [11] G. Cormode and M. Garofalakis, "Sketching streams through the net: Distributed approximate query tracking," in Proceedings of VLDB, 2005, pp. 13­24. [12] D. Kushnir, M. Galun, and A. Brandt, "Fast multiscale clustering and manifold identification," Pattern Recognition, vol. 39, no. 10, pp. 1876­1891, 2006. [13] G. W. Stewart and J. Guang Sun, Matrix Perturbation Theory. Academic Press, 1990. [14] A. Asuncion and D. Newman, "UCI Machine Learning Repository, Department of Information and Computer Science," 2007, http://www.ics.uci.edu/ mlearn/MLRepository.html. 8