Model Selection and Stability in k-means Clustering

Ohad Shamir and Naftali Tishby
School of Computer Science and Engineering
Interdisciplinary Center for Neural Computation
The Hebrew University, Jerusalem 91904, Israel
{ohadsh,tishby}@cs.huji.ac.il

Abstract

Clustering stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, not much is known theoretically about why or when they work, or even what kind of assumptions they make in choosing an 'appropriate' model. Moreover, recent theoretical work has shown that they might 'break down' for large enough samples. In this paper, we focus on the behavior of clustering stability in k-means clustering. Our main technical result is an exact characterization of the distribution to which suitably scaled measures of instability converge, based on a sample drawn from any distribution in $\mathbb{R}^n$ satisfying mild regularity conditions. From this, we show that clustering stability does not 'break down' even for arbitrarily large samples, in the k-means framework that we study. Moreover, the characterization allows us to identify the factors which influence the behavior of clustering stability for any sample size. This leads to some interesting preliminary observations about the kind of assumptions that are made when using these methods. While often reasonable, these assumptions might also lead to unexpected consequences.

1 Introduction

The important and difficult problem of model selection in data clustering has been the focus of an extensive literature spanning several research communities in the natural and social sciences. Since clustering is often used as a first step in the data analysis process, the questions of what type of clusters, or how many clusters, are in the data can be crucial. An important family of model selection methods, whose popularity has grown in the past few years, is based on clustering stability. The unifying theme of these methods is that an appropriate model for the data should result in a clustering which is robust with respect to various kinds of perturbations. In other words, if we choose an appropriate clustering algorithm and feed it with the 'correct' parameters (such as the number of clusters, the metric used, etc.), the clustering returned by the algorithm should not be overly sensitive to the exact structure of the data. In particular, we focus on clustering stability methods which compare the discrepancy or 'distance' between clusterings of different random subsets of the data. These methods seek a 'stable' model, in the sense that the value of such distance measures should tend to be small. Although these methods have been shown to be rather effective in practice (cf. [2],[4],[7],[9]), little theory exists so far to explain their success, or to characterize the cases for which they are best suited. Over the past few years, a theoretical study of these methods has been initiated, in a framework where the data are assumed to be an i.i.d. sample. However, a fundamental hurdle was the observation [1] that under mild conditions and for any model choice, the clustering algorithm should tend to converge to a single solution which is optimal with respect to the underlying distribution.
As a result, clustering stability might 'break down' for large enough samples, since we get approximately the same clustering hypothesis based on each random subsample, and thus achieve stability regardless of whether the model fits the data or not (this problem was also pointed out in [6]). A possible solution to this difficulty was proposed in [15]. In a nutshell, that paper showed that the important factor in the way these clustering stability methods work may not be the asymptotic stability of the model, but rather how fast it converges to this stability. With this more refined analysis, it was argued that differences in the stability of different models should usually be discernible for any sample size, no matter how large, despite the universal convergence to absolute stability. Although it provided the necessary groundwork, that paper only rigorously proved this assertion for a single toy example, as a proof of concept.

In this paper, we formally investigate the application of clustering stability to the well known and popular k-means clustering framework, when the goal is to determine the value of k, the number of clusters in the data. Assuming an algorithm which minimizes the k-means objective function, we consider arbitrary distributions in $\mathbb{R}^n$ satisfying certain mild regularity conditions, and analyze the behavior of the clustering distance measure, scaled by the square root of the sample size. Rather than converging to zero in probability as the sample size increases to infinity, this scaled measure converges to a non-degenerate distribution which depends on the choice of k. From this we can show that clustering stability does not 'break down' even for arbitrarily large samples, in the sense described earlier, at least for the k-means framework that we study. The asymptotic distribution is also interesting for two additional reasons. The first is that it can be seen as an approximation which improves as the sample size increases. The second and more profound reason is that if we are interested in discovering what fundamental assumptions are implicit in performing model selection with clustering stability, these should not be overly dependent on the sample size used. Therefore, as we look at larger samples, noisy and hard-to-analyze finite sample effects diminish, and what remains are the fundamental characteristics, which should be relevant for any sample size. As a result, the analysis leads to some preliminary observations about the factors influencing clustering stability in k-means, of both theoretical and practical interest.

2 Problem Setting and Notation

We refer the reader to Fig. 1 for a graphical illustration of the basic setting and some of the notation introduced below. We denote $\{1, \ldots, k\}$ by $[k]$. Vectors will be denoted by bold-face characters, and $\|\cdot\|$ will denote the Euclidean norm unless stated otherwise. $\mathcal{N}(\mu, \Sigma)$ denotes the multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. We will use the stochastic order notation $O_p(\cdot)$ and $o_p(\cdot)$ (cf. [18]). Let $\{X_m\}$ and $\{Y_m\}$ be sequences of random vectors, defined on the same probability space. We write $X_m = O_p(Y_m)$ to mean that for each $\epsilon > 0$ there exists a real number $M$ such that $\Pr(\|X_m\| \geq M\|Y_m\|) < \epsilon$ if $m$ is large enough. We write $X_m = o_p(Y_m)$ to mean that $\Pr(\|X_m\| \geq \epsilon\|Y_m\|) \to 0$ for each $\epsilon > 0$. Notice that $\{Y_m\}$ may also be non-random.
For example, $X_m = o_p(1)$ means that $X_m \to 0$ in probability.

Let $D$ be a probability distribution on $\mathbb{R}^n$, with a bounded probability density function $p(\cdot)$ which is continuous as a function on $\mathbb{R}^n$. Assume that the following two regularity conditions hold:

· $\int_{\mathbb{R}^n} p(x)\|x\|^2\,dx < \infty$ (in words, $D$ has bounded variance).

· There exists a bounded, monotonically decreasing function $g(\cdot) : \mathbb{R} \to \mathbb{R}$, such that $p(x) \leq g(\|x\|)$ for all $x \in \mathbb{R}^n$, and $\int_{r=0}^{\infty} r^n g(r)\,dr < \infty$.

The second requirement is needed in order to apply the main theorem of [13] (it is a slightly stronger version of condition (iv) there), and can probably be improved. Nevertheless, it is quite mild, and holds in particular for any distribution that is not heavy-tailed or has bounded support. As to the continuity requirement on $p(\cdot)$, it should be noted that our results hold even if we assume continuity solely in some neighborhood of the optimal cluster boundaries, but we adopt the stronger assumption for simplicity.

Let $\mu = (\mu_1, \ldots, \mu_k) \in \mathbb{R}^{nk}$ be an optimal k-means solution with respect to $D$, defined as a minimizer of
$$W(c) := \int_{\mathbb{R}^n} p(x) \min_{j \in [k]} \|c_j - x\|^2\,dx.$$
We assume that such a minimizer exists, is unique up to permutation of the centroids, and that all centroids are distinct (for all $i \neq j$, $\mu_i \neq \mu_j$). To avoid ambiguities involving permutation of the centroids, we assume that the numbering of the centroids follows some uniform canonical ordering (for example, sorting with respect to the coordinates). For a set of centroids $c = (c_1, \ldots, c_k)$, and for each cluster centroid $c_i$, we denote the interior of its corresponding cluster by $C_{c,i}$, defined as
$$C_{c,i} := \Big\{x \in \mathbb{R}^n : \arg\min_{j \in [k]} \|c_j - x\|^2 = \{i\}\Big\}.$$
From the continuity assumptions on $p$, the set of points not in the interior of some cluster has zero measure with respect to $p$. We can therefore neglect the issue of how points along cluster boundaries are assigned.

Let $A_k$ denote an 'ideal' version of the standard k-means algorithm, which is given a sample $S = \{x_i\}_{i=1}^m \subseteq \mathbb{R}^n$, sampled i.i.d. from $D$, and a required number of clusters $k$, and returns a set of centroids $c = (c_1, \ldots, c_k) \in \mathbb{R}^{nk}$ which is a global minimizer of the objective function
$$\hat{W}(c) := \frac{1}{m}\sum_{i=1}^m \min_{j \in [k]} \|c_j - x_i\|^2.$$
We write $A_k(S)(x_1, x_2)$ for the indicator function of whether the instances $x_1, x_2$ are in the same cluster according to the clustering given by $A_k(S)$. The (scaled) distance between two clusterings $A_k(S_1)$ and $A_k(S_2)$, where $S_1, S_2$ are samples of size $m$, is defined as
$$d_D^m(A_k(S_1), A_k(S_2)) := \sqrt{m}\,\Pr_{x_1, x_2 \sim D}\big(A_k(S_1)(x_1, x_2) \neq A_k(S_2)(x_1, x_2)\big).$$
This definition follows that of [1] and [15], with the additional scaling by $\sqrt{m}$ (the 'correct' scaling factor, as will become evident later on). A typical way to measure instability in practice is to cluster independent subsamples of the data, and empirically estimate the distance between the resulting clusterings. Thus, understanding the behavior of $d_D^m(A_k(S_1), A_k(S_2))$ (over drawing and clustering independent samples) is of much interest in analyzing the behavior of clustering stability.

Any choice of cluster centroids $c$ induces a Voronoi partition on $\mathbb{R}^n$. We denote by $F_{c,i,j}$, for $i \neq j$, the boundary face between clusters $i$ and $j$; namely, the points in $\mathbb{R}^n$ whose two closest cluster centroids are $c_i$ and $c_j$, and which are equidistant from them:
$$F_{c,i,j} := \Big\{x \in \mathbb{R}^n : \arg\min_{a \in [k]} \|c_a - x\|^2 = \{i, j\}\Big\}.$$
Assuming $c_i, c_j$ are distinct, $F_{c,i,j}$ is a (possibly empty) subset of the hyperplane $H_{c,i,j}$, defined as
$$H_{c,i,j} := \Big\{x \in \mathbb{R}^n : \Big(x - \frac{c_i + c_j}{2}\Big) \cdot (c_i - c_j) = 0\Big\}.$$
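The empirical protocol just described is easy to simulate. The sketch below estimates the scaled distance by Monte Carlo: scikit-learn's KMeans (with several restarts) stands in for the ideal minimizer $A_k$, and a fresh sample approximates the probability over pairs $x_1, x_2 \sim D$. The mixture used as $D$ and all constants here are illustrative assumptions, not part of the analysis.

```python
# Minimal sketch: Monte-Carlo estimate of the scaled clustering distance
# d_D^m = sqrt(m) * Pr_{x1,x2~D}[A_k(S1)(x1,x2) != A_k(S2)(x1,x2)].
# sklearn's KMeans (restarted) approximates the 'ideal' minimizer A_k.
import numpy as np
from sklearn.cluster import KMeans

def sample_mixture(m, rng):
    """Draw m points from an illustrative 1-D mixture of three unit-variance Gaussians."""
    centers = np.array([-4.0, 0.0, 4.0])
    comp = rng.integers(0, 3, size=m)
    return (centers[comp] + rng.standard_normal(m)).reshape(-1, 1)

def scaled_distance(k, m, rng, n_eval=20000):
    """Cluster two independent samples of size m and estimate sqrt(m) * d_D."""
    km1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample_mixture(m, rng))
    km2 = KMeans(n_clusters=k, n_init=10, random_state=1).fit(sample_mixture(m, rng))
    # Fresh evaluation points approximate the population probability over pairs.
    x, y = sample_mixture(n_eval, rng), sample_mixture(n_eval, rng)
    same1 = km1.predict(x) == km1.predict(y)   # co-membership under clustering 1
    same2 = km2.predict(x) == km2.predict(y)   # co-membership under clustering 2
    return np.sqrt(m) * np.mean(same1 != same2)

rng = np.random.default_rng(0)
for k in (2, 3, 4):
    print(k, scaled_distance(k, m=500, rng=rng))
```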
In our discussion, we use integrals with respect to both the $n$-dimensional Lebesgue measure and the $(n-1)$-dimensional Lebesgue measure. The type of integral we are using should be clear from the context, depending on the set over which we are integrating. For example, integrals over some $C_{c,i}$ are of the first type, while integrals over some $F_{c,i,j}$ are of the second type.

Let $\Gamma$ be the $kn \times kn$ matrix which is the Hessian of the mapping $W(\cdot)$ at the optimal solution $\mu$. This matrix is composed of $k \times k$ blocks $\Gamma_{i,j}$ for $i, j \in [k]$. Each block $\Gamma_{i,j}$ can be shown to be equal to
$$\Gamma_{i,j} := 2\int_{C_{\mu,i}} p(x)\,dx \cdot I_n \;-\; 2\sum_{a \neq i} \frac{1}{\|\mu_i - \mu_a\|}\int_{F_{\mu,i,a}} p(x)(x - \mu_i)(x - \mu_i)^\top dx$$
if $i = j$, and for $i \neq j$ it is defined as
$$\Gamma_{i,j} := \frac{2}{\|\mu_i - \mu_j\|}\int_{F_{\mu,i,j}} p(x)(x - \mu_i)(x - \mu_j)^\top dx.$$
We will use the same block notation later for its inverse $\Gamma^{-1}$. The existence of these integrals can be shown to follow from the assumptions on $p(\cdot)$. We assume that the matrix $\Gamma$ is positive definite. This is in fact an almost redundant requirement, since the optimality of $\mu$ entails that $\Gamma$ is always positive semidefinite. Therefore, cases where $\Gamma$ is not positive definite correspond to singularities which are apparently pathological (for more discussion on this, see [14]).

Let $V$ be a $kn \times kn$ matrix which represents (up to a constant) the covariance matrix of $D$ with respect to each cluster, assuming the optimal clustering induced by $\mu$. More specifically, $V$ is composed of $k$ diagonal blocks $V_i$ of size $n \times n$ for $i \in [k]$ (all other elements of $V$ are zero), where
$$V_i := 4\int_{C_{\mu,i}} p(x)(x - \mu_i)(x - \mu_i)^\top dx.$$
We shall assume that $V_i \neq 0$ for any $i$.

Figure 1: An illustrative drawing of the setting and notation used. Thicker lines represent the optimal k-means clustering partition (for k = 3 clusters) with respect to the underlying distribution. Clustering two independent random samples gives us two random centroid sets $c$ and $c'$. These induce two different Voronoi partitions of $\mathbb{R}^n$, and the distance measure is intimately related to the probability mass in the area which switches between clusters when we compare these two partitions (gray area).
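To make these definitions concrete, here is a minimal one-dimensional sketch (k = 2) that numerically evaluates the blocks of $\Gamma$ and $V$ for an assumed two-component Gaussian mixture. The mixture itself, and the population Lloyd iteration used to locate the optimal centroids $\mu$, are illustrative choices rather than part of the paper's construction; in one dimension each boundary 'face' is a single point, so the face integrals reduce to point evaluations.

```python
# Illustrative 1-D sketch (k = 2): numerically evaluate the Hessian blocks
# Gamma_{i,j} and covariance blocks V_i of Sec. 2 for an assumed mixture.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def p(x):  # assumed density: 0.5*N(-2,1) + 0.5*N(2,1)
    return 0.5 * norm.pdf(x, loc=-2, scale=1) + 0.5 * norm.pdf(x, loc=2, scale=1)

mu = np.array([-2.0, 2.0])
for _ in range(100):  # population Lloyd step: centroid -> conditional mean of its cell
    b = mu.mean()
    mass = np.array([quad(p, -np.inf, b)[0], quad(p, b, np.inf)[0]])
    mom = np.array([quad(lambda x: x * p(x), -np.inf, b)[0],
                    quad(lambda x: x * p(x), b, np.inf)[0]])
    mu = mom / mass

b, gap = mu.mean(), abs(mu[1] - mu[0])
mass = np.array([quad(p, -np.inf, b)[0], quad(p, b, np.inf)[0]])
# The boundary face F_{mu,1,2} is the single point b, so face integrals are
# just the integrand evaluated at b.
Gamma = np.array([
    [2 * mass[0] - 2 * p(b) * (b - mu[0]) ** 2 / gap,
     2 * p(b) * (b - mu[0]) * (b - mu[1]) / gap],
    [2 * p(b) * (b - mu[1]) * (b - mu[0]) / gap,
     2 * mass[1] - 2 * p(b) * (b - mu[1]) ** 2 / gap]])
V = np.diag([4 * quad(lambda x: p(x) * (x - mu[0]) ** 2, -np.inf, b)[0],
             4 * quad(lambda x: p(x) * (x - mu[1]) ** 2, b, np.inf)[0]])
print("mu =", mu)
print("Gamma =\n", Gamma)
print("V =\n", V)
# Asymptotic covariance of sqrt(m)(c - mu) from Pollard's CLT (Lemma 1 in Sec. 4).
print("Gamma^-1 V Gamma^-1 =\n", np.linalg.inv(Gamma) @ V @ np.linalg.inv(Gamma))
```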
Theorem 1. Assume that $A_k$ minimizes $\hat{W}(\cdot)$ for any $k$ of interest, and assume that $c$ converges in probability to some set of $k$ distinct centroids $\mu$ which are the unique global minimizer of $W(\cdot)$. Furthermore, assume that $\Gamma$ is invertible and that $V_i \neq 0$ for any $i \in [k]$. Then $d_D^m(A_k(S_1), A_k(S_2))$ converges in distribution to a generally non-degenerate limit distribution, determined explicitly by $p(\cdot)$, $\mu$, $\Gamma$ and $V$; we denote the expected value of this limit distribution by instab$(A_k, D)$.

The empirical stability estimators $\hat{\eta}_{k_s,4m}$ and $\hat{\eta}_{k_u,4m}$ below are computed from a sample of size $4m$, where $(x_1, \ldots, x_m)$ is a random permutation of $S_3$.

Theorem 2. For any distribution $D$ satisfying the conditions of Thm. 1, assume that for some two values of $k$, $k_s \neq k_u$, the ratio of instab$(A_{k_u}, D)$ and instab$(A_{k_s}, D)$ (as defined in Thm. 1) is at least $R$, for some $R > 3$. Then we have that
$$\Pr\big(\hat{\eta}_{k_s,4m} \geq \hat{\eta}_{k_u,4m}\big) \;\leq\; \frac{0.3 + 3\log(R)}{R} + o(1),$$
where the probability is over a sample of size $4m$ used for both estimators, and $o(1)$ converges to 0 as $m \to \infty$.

The theorem implies the following. Suppose we are considering two possible values of $k$, designated $k_s$ and $k_u$, such that the ratio between instab$(A_{k_u}, D)$ and instab$(A_{k_s}, D)$ is some reasonably large constant (one can think of this as a relatively unstable model corresponding to $k_u$, vs. a relatively stable model corresponding to $k_s$). Then the probability of not empirically detecting $k_s$ as the most stable model has an upper bound which actually decreases with the sample size, converging to a constant value dependent on the ratio of instab$(A_{k_s}, D)$ and instab$(A_{k_u}, D)$. In this sense, according to the bound, clustering stability does not 'break down' in the large sample regime, and the asymptotic reliability of its empirical estimation is determined by instab$(A_k, D)$. We emphasize that the theorem deals with the reliability of detecting the most stable model, not with whether a stable model is really a 'good' model in any other sense. We note that our proof actually produces an entire range of bounds, which provides a trade-off between the minimality requirement on $R$ and the tightness of the constants. When instab$(A_{k_s}, D) = 0$ (corresponding to $R = \infty$), it is easy to show that the probability of detecting $k_s$ as the most stable model converges to 1 as $m \to \infty$.

3.2 Factors Influencing Stability of Clustering Models

According to Thm. 1, for any distribution satisfying the necessary conditions, the distance between clusterings (after scaling by $\sqrt{m}$) converges to a generally non-degenerate distribution, which depends on the underlying distribution and the number of clusters $k$. As Thm. 2 shows, this implies that clustering stability does not 'break down' in the large sample regime, and its choice of the most 'appropriate' value of $k$ seems to depend essentially on instab$(A_k, D)$. Thm. 1 provides an explicit formula for instab$(A_k, D)$. Although one can always calculate it for specific cases, it is of much more interest to try and understand the governing factors influencing its value. These factors eventually determine what is considered by clustering stability as the 'correct' model, namely one with a low value of instab$(A_k, D)$. Therefore, analyzing these factors can explain what sample-size-free assumptions correspond to the use of clustering stability, at least in the k-means setting that we study. Since a rigorous analysis is a complex endeavor in itself, we will limit ourselves to some preliminary and non-formal observations, which should be taken as such.

According to Thm. 1, the value of instab$(A_k, D)$ is asymptotically determined by three factors:

· The probability density along the cluster boundaries.
· The Hessian $\Gamma$ of the objective function $W(\cdot)$ at $\mu$.
· The variance $V$ and mass of the clusters with respect to the underlying distribution.

A fourth factor appearing in the formula is $\|\mu_i - \mu_j\|$, but this can be seen simply as a normalization term, eliminating the dependence on the norm of $x$.

The probability density along the cluster boundaries seems to play an interesting role. For example, when the density at the boundaries is exactly 0, we get that instab$(A_k, D) = 0$. Although this density is multiplied by a term depending on $x$, $i$ and $j$, note that this term actually becomes 'nicer' when the boundary density is lower (since $\Gamma^{-1}$ approaches a diagonal matrix with entries proportional to the inverse of the mass of the clusters, hence having well-controlled eigenvalues, assuming reasonably balanced clusters). Therefore, we might expect low instability even when the boundary density is low but not exactly 0.

As to the Hessian $\Gamma$, an exact analysis of its influence on instab$(A_k, D)$ is problematic in the general case, but a useful rough characterization is the spectrum of $\Gamma$. If all the eigenvalues of $\Gamma^{-1}$ are 'large', then we might expect the corresponding term divided by $\|\mu_i - \mu_j\|$ to be relatively large as well, leading to a higher value of instab$(A_k, D)$. On the other hand, small eigenvalues might lead to lower values of instab$(A_k, D)$. Thus, we see that a small spectral radius of the Hessian $\Gamma$, representing a 'locally shallow' optimal solution, may result in more instability.
It is interesting to note that shallow, ill-defined minima in terms of the objective function are often a sign of a mismatch between the model and the data, and therefore clustering stability seems to be doing a good thing in that regard. When will the spectral radius of $\Gamma$ be small, contributing to instability? By inspecting the formula for $\Gamma$, and assuming all clusters have equal sizes, we see that the diagonal elements of $\Gamma$ are at most $2/k$, and can become smaller if the density along the boundary points is larger. Since the main diagonal majorizes the spectrum of the symmetric matrix $\Gamma$ (cf. [5]), it seems that a small spectral radius might correspond to larger values of $k$, as well as to high density along the cluster boundaries. A similar analysis for $V$ seems to indicate that high cluster variance increases instability as well.

These observations also imply that clustering instability might tend to be larger for higher values of $k$. As $k$ becomes larger, instab$(A_k, D)$ is the result of integrating over a larger area (all cluster boundaries), and the Hessian might tend to have a smaller spectral radius, especially if the boundaries have high density. This is somewhat compensated in the formula by the mass and variance of each cluster becoming smaller, but these seem to scale down more slowly than the area (and number) of the cluster boundaries scale up, especially in high dimensions. This matches a well known experimental phenomenon, in which clusterings tend to be less stable for higher $k$, even in hierarchical clustering settings where more than one value of $k$ is acceptable. When the 'correct' model has a very low boundary density and nice structure compared to competing models, this might overcome the general tendency of instability to increase with $k$. However, when this is not the case, normalization procedures might be called for, as in [7].

3.3 Examples

To illustrate some of the observations from the previous subsection, we empirically evaluated the instability measure on a few simple toy examples, where everything is well controlled and easy to analyze. The results are displayed in Fig. 2. We emphasize that these are just simple illustrations of possible expected and unexpected characteristics of clustering stability in some very limited cases, which can be gleaned from the theoretical results above, and are not meant to represent more realistic or higher dimensional settings.

First of all, the average value of $d_D^m(A_k(S_1), A_k(S_2))$ tends to converge to a constant value, which differs based on the choice of the model order $k$, and clustering stability does not seem to 'break down' as the sample size increases. The three leftmost plots demonstrate how, for these particular examples, the density along the cluster boundaries seems to play an important role in determining instab$(A_k, D)$. For each distribution, $k = 3$ emerges as the most stable model, since the boundaries between the clusters with $k = 3$ have low density. However, $k = 3$ becomes less stable as the Gaussians get closer to each other, leading to higher densities at the boundaries between them. At some point, when the density along the cluster boundaries for $k = 3$ becomes large enough, $k = 2$ becomes more stable than $k = 3$.

A different manifestation of this behavior can be seen in the rightmost plot, which simulates a hierarchical clustering setting. In this case, all three Gaussians are separated, but one of them is relatively more separated than the other two. As before, $k = 4$ is less stable than $k = 3$ and $k = 2$, but now $k = 2$ is the most stable model. This is primarily because the sum of the boundary densities for $k = 3$ is larger than the density at the boundary point for $k = 2$. Deciding on $k = 2$ as the number of clusters in the data is not unreasonable (recall that clustering stability makes no explicit generative assumption on what the clusters look like). However, it does indicate that in a hierarchical clustering setting, clustering stability might prefer high levels in the hierarchy, which may or may not be what we want.

Figure 2: Illustrative examples of the behavior of clustering stability. In each column, the upper plot ('Distribution') is the underlying distribution we sample from (a mixture of unit variance Gaussians on $\mathbb{R}$), while the lower plot ('Values of Instability Measure') is an empirical average of $d_D^m(A_k(S_1), A_k(S_2))$ over 1000 trials, for different sample sizes $m$ and for $k = 2, 3, 4$.
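The kind of experiment summarized in Fig. 2 is straightforward to reproduce. The sketch below averages the scaled distance over repeated trials for several sample sizes and model orders; the specific mixture, the number of trials, and the use of scikit-learn's KMeans as a stand-in for the ideal minimizer are illustrative assumptions rather than the exact setup behind the figure.

```python
# Sketch of a Fig.2-style experiment: average sqrt(m) * d_hat over repeated
# trials, for several sample sizes m and model orders k.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([-5.0, 0.0, 5.0])   # illustrative mixture of three unit-variance Gaussians

def sample(m):
    comp = rng.integers(0, 3, size=m)
    return (centers[comp] + rng.standard_normal(m)).reshape(-1, 1)

def trial(k, m, n_eval=5000):
    km1 = KMeans(n_clusters=k, n_init=5).fit(sample(m))
    km2 = KMeans(n_clusters=k, n_init=5).fit(sample(m))
    x, y = sample(n_eval), sample(n_eval)
    disagree = (km1.predict(x) == km1.predict(y)) != (km2.predict(x) == km2.predict(y))
    return np.sqrt(m) * disagree.mean()

for m in (100, 250, 500, 1000):
    means = {k: np.mean([trial(k, m) for _ in range(50)]) for k in (2, 3, 4)}
    print(m, means)
```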
3.4 Convergence Rates

After establishing the asymptotic distribution of the clustering distance measures for k-means clustering, a reasonable next step is to explore what kind of guarantees can be made on the rate of convergence to this asymptotic limit. As a first step, we establish the following negative result, which demonstrates that without additional assumptions, no universal guarantees can be given on the convergence rate. The theorem refers to the case $k = 3$, but the proof idea can easily be extended to other values of $k$.

Theorem 3. For any positive integer $m_0$, there exists a distribution $D$ such that $d_D^m(A_3(S_1), A_3(S_2))$ converges in probability to 0 as $m \to \infty$, but $\Pr\big(d_D^m(A_3(S_1), A_3(S_2)) > \sqrt{m}/4\big)$ is at least $1/3$ for some $m \geq m_0$.

The theorem does not imply that the asymptotic convergence rate is arbitrarily bad. In fact, a complicated second-order analysis (omitted from this paper due to lack of space) seems to indicate a uniform power-law convergence rate for any distribution satisfying the conditions of Thm. 1, as well as a few other conditions such as Lipschitz continuity and a bounded third moment. However, the exact constants in this power law can be arbitrarily bad, depending on various characteristics of the distribution. Finding sufficient and empirically verifiable conditions which provide finite sample guarantees is therefore of much interest.
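To make Thm. 3 concrete, the following sketch simulates the four-point construction used in its proof (Sec. 4.3): for a suitably small $\epsilon$ and a moderate $m$, two independently clustered samples frequently disagree about which pair of neighbouring support points gets split, so the scaled distance exceeds $\sqrt{m}/4$ with constant probability. The particular values of eps, delta and m, and the use of scikit-learn's KMeans for the 3-means step, are illustrative assumptions.

```python
# Sketch of the construction behind Thm. 3 (see Sec. 4.3): a four-point
# distribution on the real line for which the scaled distance eventually
# converges to 0, yet exceeds sqrt(m)/4 with constant probability at
# moderate sample sizes.
import numpy as np
from sklearn.cluster import KMeans

eps, delta, m = 0.02, 0.1, 500
support = np.array([-1 - delta, -1.0, 1.0, 1 + delta])
probs = np.array([1 - eps, 1 - eps, 1 + eps, 1 + eps]) / 4
rng = np.random.default_rng(0)

def support_labels():
    """Cluster one i.i.d. sample of size m; return labels of the 4 support points."""
    x = rng.choice(support, size=m, p=probs).reshape(-1, 1)
    return KMeans(n_clusters=3, n_init=10).fit(x).predict(support.reshape(-1, 1))

def scaled_distance():
    l1, l2 = support_labels(), support_labels()
    # Exact probability (under D_eps) that a random pair is co-clustered in
    # one clustering but not the other, scaled by sqrt(m).
    diff = sum(probs[a] * probs[b]
               for a in range(4) for b in range(4)
               if (l1[a] == l1[b]) != (l2[a] == l2[b]))
    return np.sqrt(m) * diff

draws = [scaled_distance() for _ in range(20)]
print("average scaled distance:", np.mean(draws))
print("fraction exceeding sqrt(m)/4:", np.mean([d > np.sqrt(m) / 4 for d in draws]))
```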
4 Proofs

4.1 Proof of Thm. 1

Before embarking on the proof, we briefly sketch its outline:

1. Using the central limit theorem for k-means due to Pollard [13], we can characterize the asymptotic Gaussian distribution of the cluster centroids $c$, in terms of the underlying distribution $D$ (Lemma 1).

2. The cluster boundaries are determined by the positions of the centroids. Hence, we can derive the asymptotic distribution of these boundaries. In particular, for every boundary $F_{c,i,j}$, we characterize the asymptotic distribution of the pointwise Euclidean distance between two realizations of this boundary, over drawing and clustering two independent samples. This distance is defined relative to a projection on the hyperplane $H_{\mu,i,j}$ (Lemma 2).

3. We show that the probability mass of $D$ which switches between clusters $i$ and $j$ over the two independent clusterings has an asymptotic distribution definable by an integral involving the distance function above and the values of $p(\cdot)$ on $F_{\mu,i,j}$ (Lemma 3 and Lemma 4). This allows us to formulate the asymptotic distribution of $d_D^m(A_k(S_1), A_k(S_2))$, and its expected value.

For convenience, we shall use $\theta = (\theta_1, \ldots, \theta_k)$ to denote the random element $c - \mu$.

Lemma 1. Under the notation and assumptions of the theorem, $\sqrt{m}(c - \mu)$ converges in distribution to $v$, where $v \sim \mathcal{N}(0, \Gamma^{-1} V \Gamma^{-1})$. As a result, $\|\theta\| = O_p(1/\sqrt{m})$.

This lemma is a straightforward consequence of the main theorem in [13]. Notice that it allows us to assume that for large enough values of $m$, with arbitrarily high probability and for any $i, j \in [k]$, $i \neq j$, the nearest centroid to $\mu_i$ is $c_i$, all centroids are distinct, $F_{c,i,j}$ is non-orthogonal to $F_{\mu,i,j}$, and $\|\theta\|$ is arbitrarily small. We shall tacitly use these assumptions in the remainder of the proof.

Lemma 2. For some $i, j \in [k]$, $i \neq j$, assume that $F_{\mu,i,j} \neq \emptyset$. For any $x \in H_{\mu,i,j}$, define the function
$$\delta(x, c_i, c_j) = \frac{\big(\frac{c_i + c_j}{2} - x\big) \cdot (c_i - c_j)\,\|\mu_i - \mu_j\|}{(\mu_i - \mu_j) \cdot (c_i - c_j)}.$$
Then if $\|\theta\|$ is smaller than some positive constant which depends only on $\mu$, $\delta(x, c_i, c_j)$ can be rewritten as
$$\frac{(\mu_i - x)\cdot\theta_i - (\mu_j - x)\cdot\theta_j}{\|\mu_i - \mu_j\|} + O\big((\|x\| + 1)\|\theta\|^2\big).$$

Considering the projection of $H_{c,i,j}$ to $H_{\mu,i,j}$, we have that $\delta(x, c_i, c_j)$ is the signed Euclidean distance of $x$ from the point on $H_{c,i,j}$ which projects to it (see the left half of Fig. 3). This is because $\delta(x, c_i, c_j)$ must satisfy the equation
$$\Big(x + \delta(x, c_i, c_j)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|} - \frac{c_i + c_j}{2}\Big) \cdot (c_i - c_j) = 0.$$

Proof. We will separate the expression in the definition of $\delta(x, c_i, c_j)$ into two components and analyze them separately. We have that
$$\Big(\frac{c_i + c_j}{2} - x\Big)\cdot(c_i - c_j) = \Big(\frac{\mu_i + \mu_j}{2} + \frac{\theta_i + \theta_j}{2} - x\Big)\cdot\big((\mu_i - \mu_j) + (\theta_i - \theta_j)\big)$$
$$= \Big(\frac{\mu_i + \mu_j}{2} - x\Big)\cdot(\mu_i - \mu_j) + \Big(\frac{\mu_i + \mu_j}{2} - x\Big)\cdot(\theta_i - \theta_j) + \frac{\theta_i + \theta_j}{2}\cdot(\mu_i - \mu_j) + O(\|\theta\|^2).$$
Notice that the first summand is exactly 0 (by definition of $x$ as lying on $F_{\mu,i,j}$), and can therefore be dropped. After expanding and simplifying, we get that the above is equal to
$$(\mu_i - x)\cdot\theta_i - (\mu_j - x)\cdot\theta_j + O(\|\theta\|^2). \qquad (1)$$
As to the second component in the definition of $\delta(x, c_i, c_j)$, we have that
$$\frac{\|\mu_i - \mu_j\|}{(\mu_i - \mu_j)\cdot(c_i - c_j)} = \frac{\|\mu_i - \mu_j\|}{\|\mu_i - \mu_j\|^2 + (\mu_i - \mu_j)\cdot(\theta_i - \theta_j)} = \frac{1}{\|\mu_i - \mu_j\|}\cdot\frac{1}{1 + O(\|\theta\|)} = \frac{1 + O(\|\theta\|)}{\|\mu_i - \mu_j\|}, \qquad (2)$$
assuming $\|\theta\|$ to be small enough. Multiplying Eq. (1) and Eq. (2) gives us the expression in the lemma.

In order to calculate the asymptotic distribution of $d_D^m(A_k(S_1), A_k(S_2))$, we need to characterize the distribution of the probability mass of $D$ in the 'wedges' created between the two boundaries for clusters $i, j$, based on two independent samples (see Fig. 1). For any two given boundaries, calculating the probability mass requires integration of the underlying density function $p(\cdot)$ over these wedges, making it very hard to write the distribution of this probability mass explicitly. The purpose of the next two lemmas is to derive a more tractable, asymptotically exact approximation for each such wedge, which depends only on the values of $p(\cdot)$ along the boundary $F_{\mu,i,j}$. We begin with an auxiliary lemma, required for the main Lemma 4 which follows. To state these lemmas, we will need some additional notation. For some $H_{\mu,i,j}$, fix some (possibly unbounded) polytope $F \subseteq H_{\mu,i,j}$.
For notational convenience, we shall assume w.l.o.g. that $H_{\mu,i,j}$ is aligned with the axes, in the sense that for all $x \in H_{\mu,i,j}$, its last coordinate is 0 (it can be easily shown that the regularity conditions on $p(\cdot)$ will still hold). Also, denote $\tilde F = \{y \in \mathbb{R}^{n-1} : (y, 0) \in F\}$, which is simply the $(n-1)$-dimensional representation of $F$ on the hyperplane. Finally, for ease of notation, denote $\delta((y, 0), c_i, c_j)$ for any $y \in \tilde F$ as $\tilde\delta_\theta(y)$, where $\theta = c - \mu$.

Lemma 3. Let $\theta, \theta'$ be two independent copies of $c - \mu$, each induced by clustering an independent sample of size $m$. Let $B = \{x \in \mathbb{R}^n : \|x\| \leq R\}$ be a ball of radius $R$ centered at the origin. Then we have that
$$\int_{\tilde F \cap B}\int_{\tilde\delta_\theta(y)}^{\tilde\delta_{\theta'}(y)} p(y, \psi)\,d\psi\,dy \;=\; \int_{\tilde F \cap B}\big(\tilde\delta_{\theta'}(y) - \tilde\delta_\theta(y)\big)\,p(y, 0)\,dy \;+\; o_p(1/\sqrt{m}). \qquad (3)$$

Proof. Since $p(\cdot)$ is a non-negative function, we can rewrite the expression in the lemma as
$$\int_{\tilde F \cap B}\int_{\min\{\tilde\delta_\theta(y),\tilde\delta_{\theta'}(y)\}}^{\max\{\tilde\delta_\theta(y),\tilde\delta_{\theta'}(y)\}} p(y, \psi)\,d\psi\,dy \;-\; \int_{\tilde F \cap B}\int_{\min\{\tilde\delta_\theta(y),\tilde\delta_{\theta'}(y)\}}^{\max\{\tilde\delta_\theta(y),\tilde\delta_{\theta'}(y)\}} p(y, 0)\,d\psi\,dy \;=\; \int_{\tilde F \cap B}\int_{\min\{\tilde\delta_\theta(y),\tilde\delta_{\theta'}(y)\}}^{\max\{\tilde\delta_\theta(y),\tilde\delta_{\theta'}(y)\}} \big(p(y, \psi) - p(y, 0)\big)\,d\psi\,dy.$$
By the integral mean value theorem, since $p(\cdot)$ is continuous, we have that the expression above is equal to
$$\int_{\tilde F \cap B}\big|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)\big|\,\big(p(y, \psi_y) - p(y, 0)\big)\,dy,$$
where $\psi_y$ is between the minimum and maximum of $\{\tilde\delta_\theta(y), \tilde\delta_{\theta'}(y)\}$. For simplicity of notation, we will write $\psi_y \in [\tilde\delta_\theta(y), \tilde\delta_{\theta'}(y)]$. The expression above is upper bounded in turn by
$$\int_{\tilde F \cap B}\big(|\tilde\delta_\theta(y)| + |\tilde\delta_{\theta'}(y)|\big)\sup_{\psi_y \in [\tilde\delta_\theta(y), \tilde\delta_{\theta'}(y)]}|p(y, \psi_y) - p(y, 0)|\,dy,$$
assuming the integral exists. Since $\theta, \theta'$ have the same distribution, it is enough to show existence and analyze the convergence to zero in probability of
$$\int_{\tilde F \cap B}|\tilde\delta_\theta(y)|\sup_{\psi_y \in [\tilde\delta_\theta(y), \tilde\delta_{\theta'}(y)]}|p(y, \psi_y) - p(y, 0)|\,dy. \qquad (4)$$
This integral can be upper bounded by
$$\sup_{y \in \tilde F \cap B}|\tilde\delta_\theta(y)| \cdot \sup_{y \in \tilde F \cap B,\ \psi_y \in [\tilde\delta_\theta(y), \tilde\delta_{\theta'}(y)]}|p(y, \psi_y) - p(y, 0)| \cdot \int_{\tilde F \cap B} 1\,dy. \qquad (5)$$
Since $B$ is bounded, we have according to Lemma 2 that if $\|\theta\|$ is small enough,
$$\sup_{y \in \tilde F \cap B}|\tilde\delta_\theta(y)| = O(\|\theta\| + \|\theta\|^2), \qquad (6)$$
where the constants implicit in the r.h.s. depend on $R$, and a similar equation holds for $\tilde\delta_{\theta'}(\cdot)$ with $\theta$ replaced by $\theta'$ in the r.h.s. To make the equations less cumbersome, we will ignore the higher order term $\|\theta\|^2$, since $\|\theta\|$ converges to 0 in probability anyway by Lemma 1 (it is straightforward to verify that the analysis below still holds). From Eq. (6) and the sentence which follows, we have that $\sup_{y \in \tilde F \cap B,\ \psi_y \in [\tilde\delta_\theta(y), \tilde\delta_{\theta'}(y)]}\|\psi_y\| = O(\|\theta\| + \|\theta'\|)$. Since $\|\theta\|, \|\theta'\|$ converge to zero in probability, this implies that $\psi_y$ converges to zero in probability, uniformly for any $y \in \tilde F \cap B$. Moreover, $p(\cdot)$ is uniformly continuous in the compact domain $B$, and thus $p(y, \psi_y)$ converges uniformly in probability to $p(y, 0)$. As a result, we have that
$$\sup_{y \in \tilde F \cap B}\ \sup_{\psi \in [\tilde\delta_\theta(y), \tilde\delta_{\theta'}(y)]}|p(y, \psi) - p(y, 0)| = o_p(1). \qquad (7)$$
Substituting Eq. (6) and Eq. (7) into Eq. (5), and using the fact that $\|\theta\| = O_p(1/\sqrt{m})$, we get that the expression in Eq. (5) (and hence Eq. (4)) is $o_p(1/\sqrt{m})$ as required.

Lemma 4. For some non-empty $F_{\mu,i,j}$, let $t(c, c', i, j)$ be a random variable, defined as the probability mass of $D$ which switches between clusters $i, j$ with respect to the two clusterings defined by $c, c'$, induced by independently sampling and clustering a pair of samples $S_1, S_2$, each of size $m$. More formally, define the set-valued random variable
$$Q(c, c', i, j) = \{x \in \mathbb{R}^n : (x \in C_{c,i} \wedge x \in C_{c',j}) \vee (x \in C_{c',i} \wedge x \in C_{c,j})\} \cup F_{c,i,j} \cup F_{c',i,j},$$
so that
$$t(c, c', i, j) = \int_{Q(c, c', i, j)} p(x)\,dx. \qquad (8)$$
Then $t(c, c', i, j)$ is distributed as
$$\int_{F_{\mu,i,j}} p(x)\,|l(x, c_i, c'_j)|\,dx + o_p(1/\sqrt{m}),$$
where $l(x, c_i, c'_j)$ is distributed as
$$\frac{1}{\|\mu_i - \mu_j\|}\big((\mu_i - x)\cdot(\theta_i - \theta'_i) - (\mu_j - x)\cdot(\theta_j - \theta'_j)\big).$$

Proof. The right half of Fig. 3 should help to clarify the notation and the intuition of the following proof.
Intuitively, the probability mass which switches between clusters $i$ and $j$ over the two samples is the probability mass of $D$ lying 'between' $F_{c,i,j}$ and $F_{c',i,j}$. A potential problem is that this probability mass is also affected by the positions of other neighboring boundaries. However, the fluctuations of these additional boundaries decrease as $m \to \infty$, and their effect on the probability mass in question becomes negligible. Our goal is to upper and lower bound the integral in Eq. (8) by expressions which are identical up to $o_p(1/\sqrt{m})$ terms, giving us the desired result.

As in Lemma 3, we assume that $H_{\mu,i,j}$ is aligned with the axes, such that for any $x \in H_{\mu,i,j}$, its last coordinate is 0. Define $F_{\max}(\mu, c, c', i, j) \subseteq H_{\mu,i,j}$ as the projection of $Q(c, c', i, j)$ on $H_{\mu,i,j}$. By definition of $\tilde\delta_\theta(y), \tilde\delta_{\theta'}(y)$, any point $x = (y, 0)$ in $F_{\max}(\mu, c, c', i, j)$ has the property that the width of $Q(c, c', i, j)$ relative to $H_{\mu,i,j}$ at $x$ is at most $|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)|$. Define $F_\partial(\mu, c, c', i, j) \subseteq H_{\mu,i,j}$ as the projection on $H_{\mu,i,j}$ of $\partial Q(c, c', i, j) \setminus (F_{c,i,j} \cup F_{c',i,j})$, where $\partial Q(c, c', i, j)$ is the boundary of $Q(c, c', i, j)$. In words, it is the projection of the boundaries of $Q(c, c', i, j)$, other than $F_{c,i,j}, F_{c',i,j}$, on $H_{\mu,i,j}$. Any point $x = (y, 0)$ in $F_\partial(\mu, c, c', i, j)$ has the property that the width of $Q(c, c', i, j)$, relative to $H_{\mu,i,j}$ at $x$, is less than $|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)|$. This is because the segment of the normal to $H_{\mu,i,j}$ at $x$, between $H_{c,i,j}$ and $H_{c',i,j}$, passes through other clusters besides clusters $i, j$. For notational convenience, we will drop most of the parameters from now on, as they should be clear from the context. Let $F_{\min} = F_{\max} \setminus F_\partial$. By the properties of $F_{\max}, F_\partial$, any point $x = (y, 0)$ in $F_{\min}$ has the property that the width of $Q$ relative to $H_{\mu,i,j}$ at $x$ is exactly $|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)|$. Let $\tilde F_{\max}, \tilde F_{\min}$ and $\tilde F_\partial$ be the $(n-1)$-dimensional projections of $F_{\max}, F_{\min}$ and $F_\partial$ respectively, by removing the last zero coordinate which we assume to characterize $H_{\mu,i,j}$. As a result of the previous discussion, by Fubini's theorem, we have that
$$\int_{\tilde F_{\min}}\int_{\tilde\delta_\theta(y)}^{\tilde\delta_{\theta'}(y)} p(y, \psi)\,d\psi\,dy \;\leq\; \int_{Q} p(x)\,dx \;\leq\; \int_{\tilde F_{\max}}\int_{\tilde\delta_\theta(y)}^{\tilde\delta_{\theta'}(y)} p(y, \psi)\,d\psi\,dy, \qquad (9)$$
assuming these integrals exist. Our goal will be to show that both the upper and lower bounds above are of the form
$$\int_{F_{\mu,i,j}} p(x)\,|l(x, c_i, c'_j)|\,dx + o_p(1/\sqrt{m}). \qquad (10)$$
We will prove this assertion for the upper bound only, as the proof for the lower bound is almost identical. As in Lemma 3, we let $B$ be a closed ball of radius $R$ in $\mathbb{R}^n$ centered on the origin, and separately analyze the integral in the upper bound of Eq. (9) with respect to what happens inside and outside this ball. By Lemma 2, assuming $\|\theta\|$ is small enough, there exists a constant $a > 0$ dependent only on $\mu$, such that $|\tilde\delta_\theta(y)| \leq a(\|y\| + 1)(\|\theta\| + \|\theta\|^2)$. We note for later that none of the constants implicit in the $O_p(\cdot)$ notation, other than $h(R)$, depend on $R$.

As before, to avoid making our equations too cumbersome, we shall ignore in the analysis below the higher order term $\|\theta\|^2$, since $\|\theta\|$ converges to 0 in probability and therefore the term becomes insignificant compared to $\|\theta\|$. Also, since we conveniently assume that $H_{\mu,i,j}$ passes through the origin, any normal to a point in $H_{\mu,i,j} \cap B^c$ lies outside $B$. This is not critical for our analysis (in the general case, we could have simply defined $B$ as centered on some point in $H_{\mu,i,j}$), but does simplify things a bit. With these observations, we have that
$$\int_{\tilde F_{\max} \cap B^c}\int_{\tilde\delta_\theta(y)}^{\tilde\delta_{\theta'}(y)} p(y, \psi)\,d\psi\,dy \;\leq\; \int_{\tilde F_{\max} \cap B^c}\big|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)\big|\sup_{\psi} p(y, \psi)\,dy \;\leq\; \int_{\tilde F_{\max} \cap B^c}\big(|\tilde\delta_\theta(y)| + |\tilde\delta_{\theta'}(y)|\big)\sup_{\psi} p(y, \psi)\,dy$$
$$\leq\; a(\|\theta\| + \|\theta'\|)\int_{H_{\mu,i,j} \cap B^c}(\|y\| + 1)\sup_{\psi} p(y, \psi)\,dy \;\leq\; a(\|\theta\| + \|\theta'\|)\int_{\|x\| \geq R}(\|x\| + 1)\,g(\|x\|)\,dx \;\leq\; a(\|\theta\| + \|\theta'\|)\,e\int_{r=R}^{\infty}(r + 1)\,g(r)\,r^{n-1}\,dr,$$
where the suprema over $\psi$ are taken over the relevant integration range, $g(\cdot)$ is the dominating function on $p(\cdot)$ assumed to exist by the regularity conditions (see Section 2), and $e$ is the surface area of an $n$-dimensional unit sphere. By the assumptions on $g(\cdot)$ and the fact that $\|\theta\|, \|\theta'\| = O_p(1/\sqrt{m})$, we have that
$$\int_{\tilde F_{\max} \cap B^c}\int_{\tilde\delta_\theta(y)}^{\tilde\delta_{\theta'}(y)} p(y, \psi)\,d\psi\,dy = O_p\big(h(R)/\sqrt{m}\big),$$
where $h(R) \to 0$ as $R \to \infty$. Notice that to reach this conclusion, we did not use any characteristics of $F_{\max}$, beside it being a subset of $H_{\mu,i,j}$.
Therefore, since $|l(x, c_i, c'_j)| \leq a(\|x\| + 1)(\|\theta\| + \|\theta'\|) = O_p\big((\|x\| + 1)/\sqrt{m}\big)$ for some constant $a > 0$, a very similar analysis reveals that
$$\int_{\tilde F \cap B^c} p(y, 0)\,|l(x, c_i, c'_j)|\,dy = O_p\big(h(R)/\sqrt{m}\big), \qquad (11)$$
which entails that the 'sandwiched' integral in Eq. (9) has the same form.

Turning now to what happens inside the ball, we have by Lemma 3 that
$$\int_{\tilde F_{\max} \cap B}\int_{\tilde\delta_\theta(y)}^{\tilde\delta_{\theta'}(y)} p(y, \psi)\,d\psi\,dy = \int_{\tilde F_{\max} \cap B}\big|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)\big|\,p(y, 0)\,dy + o_p(1/\sqrt{m}). \qquad (12)$$
Leaving this equation aside for later, we will now show that
$$\int_{\tilde F_{\max} \cap B}\big|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)\big|\,p(y, 0)\,dy - \int_{\tilde F_{\min} \cap B}\big|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)\big|\,p(y, 0)\,dy = o_p(1/\sqrt{m}). \qquad (13)$$
The l.h.s. can be upper bounded by
$$\int_{(\tilde F_{\max} \cap \tilde F_\partial) \cap B}\big|\tilde\delta_\theta(y) - \tilde\delta_{\theta'}(y)\big|\,p(y, 0)\,dy \;\leq\; \int_{(\tilde F_{\max} \cap \tilde F_\partial) \cap B}\big(|\tilde\delta_\theta(y)| + |\tilde\delta_{\theta'}(y)|\big)\,p(y, 0)\,dy.$$
As $\theta, \theta'$ have the same distribution, we just need to show that
$$\int_{(\tilde F_{\max} \cap \tilde F_\partial) \cap B}|\tilde\delta_\theta(y)|\,p(y, 0)\,dy = o_p(1/\sqrt{m}). \qquad (14)$$
By Lemma 2, inside the bounded domain of $B$, we have that $|\tilde\delta_\theta(y)| \leq a\|\theta\|$ for some constant $a$ dependent solely on $\mu$ and $R$ (as before, to avoid making the equations too cumbersome, we ignore terms involving higher powers of $\|\theta\|$). Moreover, since $p(y, 0)$ is bounded, we can absorb this bound into $a$ and get that
$$\int_{(\tilde F_{\max} \cap \tilde F_\partial) \cap B}|\tilde\delta_\theta(y)|\,p(y, 0)\,dy \;\leq\; a\|\theta\|\int_{(\tilde F_{\max} \cap \tilde F_\partial) \cap B} 1\,dy.$$
We now use the fact that $R$ can be picked arbitrarily. Notice that the first remainder term has implicit constants which depend on $R$, but the second remainder term depends on $R$ only through $h(R)$ (recall the development leading to Eq. (10) and Eq. (11)). Therefore, the first remainder term converges to 0 at a rate faster than $1/\sqrt{m}$ in probability for any $R$, and the second remainder term can be made arbitrarily smaller than $1/\sqrt{m}$ with high probability by picking $R$ to be large enough, since $h(R) \to 0$ as $R \to \infty$. Thus, for any $\epsilon > 0$, we can pick $R$ so that the remainder terms eventually become smaller than $\epsilon/\sqrt{m}$ with arbitrarily high probability. As a result, we can replace the remainder terms by $o_p(1/\sqrt{m})$, with implicit constants not depending on $R$, and get that Eq. (16) can be rewritten in this form. This gives us an equivalent formulation of the upper bound in Eq. (9). As discussed immediately after Eq. (9), an identical analysis can be performed for the lower bound appearing there, and this leads to the result of the lemma.

We now turn to prove Thm. 1. Let $t(c, c', i, j)$ be as defined in Lemma 4. Let $C_{c,c',i}$ denote the set of points in $\mathbb{R}^n$ which remain in the same cluster $i$ for both clusterings defined by $c, c'$. Then, by definition, $d_D^m(A_k(S_1), A_k(S_2))$ can be written in terms of the cluster masses $\int_{C_{\mu,b}} p(x)\,dx$ and the quantities $\sqrt{m}\,t(c, c', i, j)$.

Lemma 6. Let $v$ be a Gaussian random vector with a non-degenerate distribution. For any $\beta \in (1/2, 1)$ there are constants $a_\beta, b_\beta > 0$ such that, whenever $M b_\beta \geq 1$ and $\xi a_\beta < 1$, it holds that
$$\Pr\big(\|v\| > M\,\mathbb{E}\|v\|\big) \leq (1 - \beta)^{(1 + M b_\beta)/2}, \quad\text{and}\quad \Pr\big(\|v\| \leq \xi\,\mathbb{E}\|v\|\big) \leq \operatorname{erf}\big(\operatorname{erf}^{-1}(\beta)\,\xi a_\beta\big).$$

Proof. The distribution of the norm of a Gaussian random vector is continuous, except possibly at 0 (cf. [3]). For any $\beta \in (1/2, 1)$, let $\mathrm{med}_\beta$ be a positive number which satisfies $\Pr(\|v\| \leq \mathrm{med}_\beta) = \beta$. Using two results from the literature on Gaussian measures in Banach spaces (Theorem III.3 in [11], and Theorem 1 from [8]), we have that for any $M \geq 1$ and for any $\xi \in [0, 1]$, it holds that
$$\Pr\big(\|v\| > M\,\mathrm{med}_\beta\big) \leq (1 - \beta)^{(1 + M)/2}, \qquad (22)$$
$$\Pr\big(\|v\| \leq \xi\,\mathrm{med}_\beta\big) \leq \operatorname{erf}\big(\operatorname{erf}^{-1}(\beta)\,\xi\big). \qquad (23)$$
It remains to convert these bounds on the deviation from $\mathrm{med}_\beta$ to the deviation from $\mathbb{E}\|v\|$. To achieve this, we need to upper and lower bound $\mathbb{E}\|v\|/\mathrm{med}_\beta$. By a substitution of variables, we have that $\mathbb{E}\|v\|$ is equal to
$$\int_0^\infty \Pr(\|v\| > t)\,dt = \mathrm{med}_\beta\int_0^\infty \Pr\big(\|v\| > M\,\mathrm{med}_\beta\big)\,dM.$$
Using Eq. (22), this can be upper bounded by $\mathrm{med}_\beta\big(1 + \int_1^\infty (1 - \beta)^{(1 + M)/2}\,dM\big)$, which after straightforward computations leads to $\mathbb{E}\|v\| \leq \mathrm{med}_\beta\,a_\beta$, where $a_\beta$ is the constant appearing in the lemma. In a similar manner, we can write $\mathbb{E}\|v\|$ as
$$\int_0^\infty \big(1 - \Pr(\|v\| \leq t)\big)\,dt = \mathrm{med}_\beta\int_0^\infty \big(1 - \Pr(\|v\| \leq \xi\,\mathrm{med}_\beta)\big)\,d\xi,$$
which is lower bounded in turn, using Eq. (23), by $\mathrm{med}_\beta\int_0^1 \big(1 - \operatorname{erf}(\operatorname{erf}^{-1}(\beta)\,\xi)\big)\,d\xi$. Again by straightforward computations, we reach the conclusion that $\mathbb{E}\|v\| \geq \mathrm{med}_\beta\,b_\beta$, where $b_\beta$ is the constant appearing in the lemma. Therefore, we have that if $M b_\beta > 1$, then $\Pr(\|v\| > M\,\mathbb{E}\|v\|)$ is upper bounded by $\Pr(\|v\| > M b_\beta\,\mathrm{med}_\beta) \leq (1 - \beta)^{(1 + M b_\beta)/2}$. The other bound in the lemma is derived similarly.

We can now turn to the proof of Thm. 2. By Lemma 5, both $d_D^m(A_{k_s}(S_1), A_{k_s}(S_2))$ and $d_D^m(A_{k_u}(S_1), A_{k_u}(S_2))$ converge in distribution to $\|v_{k_s}\|$ and $\|v_{k_u}\|$, where $v_{k_s}, v_{k_u}$ are Gaussian random vectors (non-degenerate by the assumptions on $\Gamma$ and $V$). By Slutsky's theorem and the definition of convergence in distribution,
$$\Pr\big(d_D^m(A_{k_u}(S_1), A_{k_u}(S_2)) \leq 1.1\,d_D^m(A_{k_s}(S_1), A_{k_s}(S_2))\big) = \Pr\big(\|v_{k_u}\| \leq 1.1\|v_{k_s}\|\big) + o(1). \qquad (24)$$
The combination of Lemma 5 and Lemma 6 allows us to upper bound the probability that $\|v_{k_u}\|$ is smaller than its expectation by a factor $\xi < 1$, and to upper bound the probability that $\|v_{k_s}\|$ is larger than its expectation by some factor $M > 1$, provided that $\xi, M$ satisfy the conditions specified in Lemma 6. By a union bound argument, if we choose $M$ and $\xi$ so that $1.1 M/\xi \leq R$, where $R$ is as defined in the theorem, we get that $\Pr(\|v_{k_u}\| \leq 1.1\|v_{k_s}\|)$ is upper bounded by
$$(1 - \beta_1)^{(1 + M b_{\beta_1})/2} + \operatorname{erf}\big(\operatorname{erf}^{-1}(\beta_2)\,\xi a_{\beta_2}\big), \qquad (25)$$
for any $\beta_1, \beta_2 \in (1/2, 1)$. Choosing different values for them (as well as appropriate choices of $M, \xi$) leads to different bounds, with a trade-off between the tightness of the constants and the minimality requirements on $R$ (which stem from the requirements on $M, \xi$ in Lemma 6). Choosing $\beta_1 = 0.9$, $\beta_2 = 0.8$, $M = 2\log(R)/\big(b_{\beta_1}\log(1/(1 - \beta_1))\big)$, $\xi = 1.1M/R$, and using the fact that $\operatorname{erf}(x) \leq (2/\sqrt{\pi})x$ for any $x \geq 0$, we get that Eq. (25) is upper bounded by $(0.3 + 3\log(R))/R$ for any $R > 3$, and therefore Eq. (24) is upper bounded by $(0.3 + 3\log(R))/R + o(1)$.

Assume that the event
$$d_D^m(A_{k_u}(S_1), A_{k_u}(S_2)) > 1.1\,d_D^m(A_{k_s}(S_1), A_{k_s}(S_2)) \qquad (26)$$
occurs. Recall that the quantities in Eq. (26) depend on the unknown underlying distribution $D$, and therefore cannot be calculated directly. Instead, we empirically estimate these quantities (divided by $\sqrt{m}$, to be exact), as defined in the theorem statement, to get the stability estimators $\hat{\eta}_{k_u,4m}$ and $\hat{\eta}_{k_s,4m}$. Thus, even if Eq. (26) occurs, it is still possible that $\hat{\eta}_{k_u,4m} \leq \hat{\eta}_{k_s,4m}$. Luckily, by Thm. 2 in [15], the probability of this, conditioned on the event in Eq. (26), is $o(1)$ (namely, it converges to 0 as $m \to \infty$). Therefore, the probability that Eq. (26) does not occur, or that it does occur but the empirical comparison of these quantities fails, is at most $(0.3 + 3\log(R))/R + o(1)$, as required.

4.3 Proof of Thm. 3

To prove the theorem, we will borrow a setting discussed in [10] for a different purpose. Let $\delta$ be some small positive constant (say $\delta < 0.1$). Consider the parameterized family of distributions $\{D_\epsilon\}$ (where $\epsilon \in (0, 1/4)$) on the real line, which assigns probability mass $(1 - \epsilon)/4$ to each of $x = -1$ and $x = -1 - \delta$, and probability mass $(1 + \epsilon)/4$ to each of $x = 1$ and $x = 1 + \delta$. Any such distribution satisfies the requirements of Thm. 1, except continuity.
However, as mentioned in Sec. 2, the theorem only requires continuity in some region around the boundary points, so we may ignore this difficulty. Alternatively, we may introduce continuity by convolution with a small local smoothing operator. For any $\epsilon$, it is easily seen that $d_D^m(A_k(S_1), A_k(S_2))$ converges to 0 in probability, since the boundary points between the optimal clusters have zero density.

Let $A^1_{m,\epsilon}$ denote the event that, for a sample of size $m$ drawn i.i.d. from $D_\epsilon$, there are more instances on $\{-1 - \delta, -1\}$ than on $\{1, 1 + \delta\}$. Also, let $A^2_{m,\epsilon}$ denote the event that, for a sample of size $m$ drawn i.i.d. from $D_\epsilon$, there are more instances on $\{1, 1 + \delta\}$ than on $\{-1 - \delta, -1\}$. Finally, let $B_{m,\epsilon}$ denote the event that every point in $\{-1 - \delta, -1, 1, 1 + \delta\}$ is hit by at least one instance from the sample. Clearly, if $A^1_{m,\epsilon} \cap B_{m,\epsilon}$ occurs, then the optimal cluster centers for the sample are $\{-1 - \delta, -1, 1 + \gamma\}$ for some $\gamma \in [0, \delta]$, and if $A^2_{m,\epsilon} \cap B_{m,\epsilon}$ occurs, then the optimal cluster centers for the sample are $\{-1 - \gamma, 1, 1 + \delta\}$ for some $\gamma \in [0, \delta]$.

By Thm. 2.1 in [16], for any Bernoulli random variable $X$ such that $\mathbb{E}[X] = p \leq 1/2$, and any whole number $a$ such that $a/m \leq 1 - p$, if $X_1, \ldots, X_m$ are $m$ i.i.d. copies of $X$, then
$$\Pr\Big(\frac{1}{m}\sum_{i=1}^m X_i \geq \frac{a}{m}\Big) \geq 1 - \Phi\Big(\frac{a - mp}{\sqrt{m p(1 - p)}}\Big),$$
where $\Phi(\cdot)$ is the cumulative normal distribution function. The probability of the event $A^1_{m,\epsilon}$ is equal to the probability of a success rate of more than half in $m$ Bernoulli trials, whose probability of success is $(1 - \epsilon)/2$. Using the theorem above, we get, after a few straightforward algebraic manipulations and relaxations, a lower bound
$$\Pr(A^1_{m,\epsilon}) \geq 1 - \Phi\Big(2\epsilon\sqrt{m} + \frac{4}{\sqrt{m}}\Big), \qquad (27)$$
which approaches $1/2$ when $\epsilon\sqrt{m}$ is small and $m$ is large. The probability of the event $A^2_{m,\epsilon}$ is equal to the probability of a success rate of less than half in $m$ Bernoulli trials, whose probability of success is $(1 - \epsilon)/2$. By a standard normal approximation argument, we have that for large enough values of $m$, and for any $\epsilon \in (0, 1/4)$, it holds that
$$\Pr(A^2_{m,\epsilon}) \geq 1/2. \qquad (28)$$
Finally, it is straightforward to show that $\Pr(B_{m,\epsilon})$ is arbitrarily close to 1, uniformly for any $\epsilon$, if $m$ is large enough. Combining this with Eq. (27), Eq. (28) and the easily proven formula $\Pr(A \cap B) \geq \Pr(A) - \Pr(B^c)$ for any two events $A, B$, we get that by choosing a large enough sample size $m > m_0$, and an appropriate value $\epsilon$, it holds that
$$\Pr(A^1_{m,\epsilon} \cap B_{m,\epsilon}),\ \Pr(A^2_{m,\epsilon} \cap B_{m,\epsilon}) \geq 1/2 - \alpha$$
for an arbitrarily small $\alpha > 0$. For that choice of $m, \epsilon$, if we draw and cluster two independent samples $S_1, S_2$ of size $m$ from $D_\epsilon$, then the probability that the event $A^1_{m,\epsilon} \cap B_{m,\epsilon}$ occurs for one sample, and $A^2_{m,\epsilon} \cap B_{m,\epsilon}$ occurs for the second sample, is at least $2(1/2 - \alpha)^2$, or at least $1/3$ for a small enough $\alpha$. Note that in this case, we get the two different clusterings discussed above, and
$$d_D^m(A_3(S_1), A_3(S_2)) = \frac{\sqrt{m}(1 + \epsilon^2)}{4} > \frac{\sqrt{m}}{4}.$$
So with a probability of at least $1/3$ over drawing and clustering two independent samples, the distance between the clusterings is more than $\sqrt{m}/4$, as required.

5 Conclusions and Future Work

In this paper, we analyzed the behavior of clustering stability in the k-means framework. We were able to explicitly characterize its asymptotic behavior, concluded that it does not 'break down' in the large sample regime, and made some preliminary observations about the factors influencing it. These factors appear to be reasonable requirements from a 'correct' model, and accord with clustering stability working successfully in many situations. However, they also imply that clustering stability might sometimes behave unexpectedly, for example in hierarchical clustering situations, as illustrated in subsection 3.3.

There are several directions for future research. The most obvious perhaps is to extend our results and observations from the asymptotic domain to the finite sample size domain. Showing that clustering stability does not 'break down' in the large sample regime has theoretical and practical relevance, but leaves open the question of why clustering stability can work well for small finite samples. One route to achieve this might be through finite sample guarantees, but as demonstrated in Thm. 3, additional assumptions are needed for such results. Also, it would be interesting to perform a similar analysis for other clustering methods beyond the k-means framework.

Acknowledgements: The authors wish to thank Gideon Schechtman and Leonid Kontorovich for providing the necessary pointers for the proof of Thm. 2.

References

[1] Shai Ben-David, Ulrike von Luxburg, and David Pál. A sober look at clustering stability. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, pages 5–19, 2006.
[2] Asa Ben-Hur, André Elisseeff, and Isabelle Guyon. A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pages 6–17, 2002.
[3] V.I. Bogachev. Gaussian Measures. American Mathematical Society, 1998.
[4] S. Dudoit and J. Fridlyand. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7), 2002.
[5] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[6] A. Krieger and P. Green. A cautionary note on using internal cross validation to select the number of clusters. Psychometrika, 64(3):341–353, 1999.
[7] Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, June 2004.
[8] Rafał Latała and Krzysztof Oleszkiewicz. Gaussian measures of dilatations of convex symmetric sets. Annals of Probability, 27(4):1922–1938, 1999.
[9] Erel Levine and Eytan Domany. Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13(11):2573–2593, 2001.
[10] T. Linder. Principles of Nonparametric Learning, chapter 4: Learning-theoretic methods in vector quantization. Number 434 in CISM Courses and Lecture Notes (L. Györfi, ed.). Springer-Verlag, New York, 2002.
[11] Vitali D. Milman and Gideon Schechtman. Asymptotic Theory of Finite Dimensional Normed Spaces. Springer, 1986.
[12] David Pollard. Personal communication.
[13] David Pollard. A central limit theorem for k-means clustering. The Annals of Probability, 10(4):919–926, November 1982.
[14] Peter Radchenko. Asymptotics Under Nonstandard Conditions. PhD thesis, Yale University, 2004.
[15] Ohad Shamir and Naftali Tishby. Cluster stability for finite samples. In Advances in Neural Information Processing Systems 21, 2007.
[16] E. V. Slud. Distribution inequalities for the binomial law. The Annals of Probability, 5(3):402–412, June 1977.
[17] Y.L. Tong. The Multivariate Normal Distribution. Springer, 1990.
[18] Aad W. Van Der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.